Counting the Invisible

ESTIMATING THE NUMBER OF WAR VICTIMS THROUGH INCOMPLETE REPORTS IS A CHALLENGING STATISTICAL PROBLEM THAT GIACOMO ZANELLA HAS TACKLED FROM BOTH A THEORETICAL AND A COMPUTATIONAL POINT OF VIEW

Even in the big data era, access to information can be limited by a variety of factors, ranging from political to practical. This is also the case for casualty records in war scenarios, which often consist of multiple, incomplete and potentially inaccurate lists (for example from different NGOs) instead of a single, exhaustive official registry. Giacomo Zanella, an Assistant Professor in Statistics at Bocconi's Department of Decision Sciences, has developed advanced methodologies that can be used to estimate the total number of victims from such incomplete datasets.
 
Over the decades, statisticians have developed a broad set of methods for this problem, which is known as population size estimation. For example, capture-recapture methods estimate the population size by examining the intersection between datasets from different sources. Intuitively, if two independently collected lists have few records in common, then we expect substantial under-reporting: the total population size is probably much larger than the number of reported individuals.
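The simplest two-list version of this idea is the classical Lincoln-Petersen estimator, which divides the product of the two list sizes by the size of their overlap. The Python sketch below is only an illustration under strong assumptions (two independent lists, error-free matching, every individual equally likely to be recorded). The function names and the numbers are invented for the example and are not taken from Zanella's work.

```python
def lincoln_petersen(n1: int, n2: int, overlap: int) -> float:
    """Classical two-list estimate of the total population size.

    n1, n2  -- number of individuals recorded on each list
    overlap -- number of individuals recorded on both lists
    """
    if overlap <= 0:
        raise ValueError("the estimator is undefined when the lists do not overlap")
    return n1 * n2 / overlap


def chapman(n1: int, n2: int, overlap: int) -> float:
    """Chapman's variant, which is less biased for small overlaps."""
    return (n1 + 1) * (n2 + 1) / (overlap + 1) - 1


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    n1, n2, overlap = 400, 300, 60
    reported = n1 + n2 - overlap  # distinct individuals seen on either list
    print("reported individuals:", reported)                      # 640
    print("Lincoln-Petersen:", lincoln_petersen(n1, n2, overlap))  # 2000.0
    print("Chapman:", round(chapman(n1, n2, overlap), 1))
```

With these made-up counts only 640 distinct individuals are reported, yet the small overlap pushes the estimated total to roughly three times that figure, which is the intuition described above.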
 
To apply this capture-recapture approach to the estimation of war casualties, we first need to identify records referring to the same individual across multiple databases, a procedure known as record linkage or entity resolution. When data are potentially inaccurate and unique identifiers are not available, this task is far from trivial and requires a statistical approach. In particular, Bayesian methods are valuable for quantifying uncertainty in the record matching and hence in subsequent estimates, in this case the estimated number of victims.
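To see why the task is not trivial, consider a deliberately naive linkage rule: two records are declared a match when their names are similar enough and their birth years differ by at most one. The Python sketch below uses the standard library's difflib for string similarity. It is a deterministic toy, not a Bayesian method and not the approach developed by Zanella and coauthors; the record fields, the 0.85 threshold and the sample records are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity between two names, from 0.0 (disjoint) to 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(list_a, list_b, threshold=0.85):
    """Greedily pair each record in list_a with its best match in list_b.

    Returns (index_in_a, index_in_b) pairs. A record in list_b is used at
    most once, and candidates whose birth years differ by more than one
    are skipped. The threshold is an arbitrary illustrative choice.
    """
    links = []
    used_b = set()
    for i, rec_a in enumerate(list_a):
        best_j, best_score = None, threshold
        for j, rec_b in enumerate(list_b):
            if j in used_b:
                continue
            if abs(rec_a["birth_year"] - rec_b["birth_year"]) > 1:
                continue
            score = name_similarity(rec_a["name"], rec_b["name"])
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used_b.add(best_j)
            links.append((i, best_j))
    return links


if __name__ == "__main__":
    # Invented records: the first entry of list_a and the second entry of
    # list_b are meant to be the same person with a misspelled surname.
    list_a = [
        {"name": "Maria Rossi", "birth_year": 1980},
        {"name": "Luca Bianchi", "birth_year": 1975},
    ]
    list_b = [
        {"name": "Anna Verdi", "birth_year": 1990},
        {"name": "Maria Rosi", "birth_year": 1981},
    ]
    print(link_records(list_a, list_b))
```

A hard threshold like this returns a single answer with no measure of how reliable each match is. That missing uncertainty is what a Bayesian treatment is meant to supply, so that it can be carried through to the final population estimate.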
 
Together with an international network of coauthors, Professor Zanella has contributed to the development of Bayesian methods for entity resolution, from both a theoretical and computational point of view.
 
«Entity resolution», explains Zanella, «can be seen as a clustering task, with clusters consisting of records associated with the same person. In this context, the number of records in each cluster tends to be extremely small compared to the size of the dataset. For example, one might have hundreds of thousands of records partitioned in clusters containing at most five records each. Such a microclustering behaviour is not well captured by traditional Bayesian clustering models, which assume that each cluster contains a non-negligible fraction of the whole population».
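The contrast can be illustrated with a small simulation. The Python sketch below draws a random partition from the Chinese restaurant process, used here as one familiar example of a traditional Bayesian clustering prior. The article does not name a specific model, so this choice, like the function name and the parameter values, is an assumption made for illustration. In such draws the largest cluster typically keeps a sizeable share of the records as the dataset grows, which is the opposite of what entity resolution requires.

```python
import random


def crp_cluster_sizes(n: int, alpha: float = 1.0, seed: int = 0):
    """Simulate cluster sizes for n records under a Chinese restaurant process.

    Record number i (counting from 0) opens a new cluster with probability
    alpha / (i + alpha); otherwise it joins an existing cluster chosen with
    probability proportional to that cluster's current size.
    """
    rng = random.Random(seed)
    sizes = []
    for i in range(n):
        if rng.random() < alpha / (i + alpha):
            sizes.append(1)  # open a new cluster
        else:
            k = rng.choices(range(len(sizes)), weights=sizes)[0]
            sizes[k] += 1    # join existing cluster k
    return sizes


if __name__ == "__main__":
    for n in (1000, 4000):
        sizes = crp_cluster_sizes(n)
        share = max(sizes) / n
        print(f"n={n}: {len(sizes)} clusters, largest holds {share:.0%} of records")
```

In a microclustering setting one would instead want the largest cluster to stay at a handful of records, so that its share shrinks towards zero as n increases.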
 
This has motivated Zanella to propose new models for microclustering, study their theoretical properties and apply them to entity resolution. Moreover, since traditional computational techniques performed poorly on this new class of models, he has developed and analyzed novel Markov chain Monte Carlo algorithms that have proven to be orders of magnitude more efficient in exploring the discrete space of record linkage configurations. This opens up the possibility of performing Bayesian microclustering with big data, not only in entity resolution, but also in DNA sequencing, language processing and sparse network analysis, among other applications.
 
«This project exemplifies my research activity», says Zanella, «which is aimed at a rigorous mathematical understanding of modern statistical and computational methods motivated by real-world applications, in order to develop more effective and reliable methodologies».
 
Find out more
 
B. Betancourt, G. Zanella and R. Steorts, Random partition models for microclustering tasks, under revision.

G. Zanella, Informed proposals for local MCMC in discrete spaces, in Journal of the American Statistical Association (T&M), in press.

G. Zanella, B. Betancourt, H. Wallach, J. Miller, A. Zaidi and R. Steorts, Flexible models for microclustering with application to entity resolution, in Advances in Neural Information Processing Systems 29 (NIPS 2016).

by Sirio Legramanti
Bocconi Knowledge newsletter
