Counting the Invisible

ESTIMATING THE NUMBER OF WAR VICTIMS THROUGH INCOMPLETE REPORTS IS A CHALLENGING STATISTICAL PROBLEM THAT GIACOMO ZANELLA HAS TACKLED FROM BOTH A THEORETICAL AND A COMPUTATIONAL POINT OF VIEW

Even in the big data era, access to information can be limited by a variety of factors, ranging from political to practical. This is also the case for casualty records in war scenarios, which often consist of multiple, incomplete and potentially inaccurate lists (for example from different NGOs) instead of a single, exhaustive official registry. Giacomo Zanella, an Assistant Professor of Statistics at Bocconi's Department of Decision Sciences, has developed advanced methodologies that can be used to estimate the total number of victims from such incomplete datasets.
 
Over the decades, statisticians have developed a broad set of methods for this problem, which is known as population size estimation. For example, capture-recapture methods estimate the population size by examining the overlap between datasets from different sources. Intuitively, if two independently collected lists have few records in common, then we expect substantial under-reporting: the total population size will probably be much larger than the number of reported individuals.
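The intuition can be made concrete with the classical two-list Lincoln-Petersen estimator. The article does not name a specific estimator, so this one is used here only as the simplest illustration of capture-recapture. The sketch below assumes two independently collected lists whose records have already been matched to unique identifiers; the identifiers and counts are invented.

```python
def lincoln_petersen(list_a, list_b):
    """Classical two-list estimate: N_hat = n_a * n_b / m, with m the overlap."""
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("no records in common: the estimate is undefined")
    return len(a) * len(b) / overlap


def chapman(list_a, list_b):
    """Chapman's bias-corrected variant, which stays finite with zero overlap."""
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)
    return (len(a) + 1) * (len(b) + 1) / (overlap + 1) - 1


# Hypothetical identifiers: 600 and 500 records, 100 of them shared.
list_a = range(0, 600)
list_b = range(500, 1000)

reported = len(set(list_a) | set(list_b))      # distinct reported individuals
estimate = lincoln_petersen(list_a, list_b)    # 600 * 500 / 100
corrected = chapman(list_a, list_b)

print(reported, estimate, round(corrected, 1))
```

With these invented numbers, 1000 distinct individuals are reported but the estimated population is 3000: a small overlap between the lists pushes the estimate well above the reported count.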
 
To apply this capture-recapture approach to the estimation of war casualties, we first need to identify records referring to the same individual across multiple databases, a procedure known as record linkage or entity resolution. When data are potentially inaccurate and unique identifiers are not available, this task is far from trivial and requires a statistical approach. In particular, Bayesian methods are valuable for quantifying the uncertainty in the record matching and hence in subsequent estimates, in this case the estimated number of victims.
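To see why linkage without unique identifiers is hard, consider a deliberately naive rule-based matcher. This is not the Bayesian approach described in the article: it makes a hard yes/no decision from a string-similarity threshold and so carries no measure of uncertainty. The records, field names and the 0.85 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(rec_a, rec_b):
    """Name similarity in [0, 1]; records with different years never match."""
    if rec_a["year"] != rec_b["year"]:
        return 0.0
    name_a, name_b = rec_a["name"].lower(), rec_b["name"].lower()
    return SequenceMatcher(None, name_a, name_b).ratio()


def link(list_a, list_b, threshold=0.85):
    """Return index pairs (i, j) whose similarity reaches the threshold."""
    return [
        (i, j)
        for i, a in enumerate(list_a)
        for j, b in enumerate(list_b)
        if similarity(a, b) >= threshold
    ]


# Invented records from two hypothetical sources.
list_a = [
    {"name": "Jon Smith", "year": 2013},
    {"name": "Ana Lopez", "year": 2012},
]
list_b = [
    {"name": "John Smith", "year": 2013},   # spelling variant of the same name
    {"name": "Ana Lopez", "year": 2014},    # same name, conflicting year
    {"name": "Lee Wong", "year": 2013},
]

links = link(list_a, list_b)
print(links)
```

The spelling variant is linked, while the two "Ana Lopez" records are kept apart because their years disagree. Whether that second decision is right cannot be known from the data alone, and a different decision changes the overlap count and therefore the population estimate. This is the kind of uncertainty that a probabilistic treatment is meant to carry through to the final figure.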
 
Together with an international network of coauthors, Professor Zanella has contributed to the development of Bayesian methods for entity resolution, from both a theoretical and a computational point of view.
 
«Entity resolution», explains Zanella, «can be seen as a clustering task, with clusters consisting of records associated with the same person. In this context, the number of records in each cluster tends to be extremely small compared to the size of the dataset. For example, one might have hundreds of thousands of records partitioned into clusters containing at most five records each. Such a microclustering behaviour is not well captured by traditional Bayesian clustering models, which assume that each cluster contains a non-negligible fraction of the whole population».
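The mismatch can be seen in a short simulation. The article does not say which traditional models are meant; the sketch below assumes the Chinese restaurant process (the partition induced by a Dirichlet process mixture) as a standard example, with a hypothetical concentration parameter alpha = 1.

```python
import random


def crp_cluster_sizes(n, alpha=1.0, seed=0):
    """Simulate a Chinese restaurant process and return the cluster sizes."""
    rng = random.Random(seed)
    sizes = []
    for i in range(n):  # i records have already been assigned
        # Open a new cluster with probability alpha / (i + alpha); otherwise
        # join an existing cluster with probability proportional to its size.
        u = rng.random() * (i + alpha)
        if u < alpha:
            sizes.append(1)
            continue
        u -= alpha
        for k, size in enumerate(sizes):
            if u < size:
                sizes[k] += 1
                break
            u -= size
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sizes


for n in (1_000, 10_000):
    sizes = crp_cluster_sizes(n)
    # Number of clusters and share of records held by the largest cluster.
    print(n, len(sizes), round(max(sizes) / n, 3))
```

Under this model the number of clusters grows only logarithmically with the number of records, and the largest cluster typically holds a sizeable share of them however large the dataset becomes. Entity resolution needs the opposite regime: a number of clusters that grows roughly in proportion to the data, each with a handful of records.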
 
This has motivated Zanella to propose new models for microclustering, study their theoretical properties and apply them to entity resolution. Moreover, since traditional computational techniques performed poorly on this new class of models, he has developed and analyzed novel Markov chain Monte Carlo algorithms that have proven to be orders of magnitude more efficient in exploring the discrete space of record linkage configurations. This opens up the possibility of performing Bayesian microclustering with big data, not only in entity resolution, but also in DNA sequencing, language processing and sparse network analysis, among other applications.
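As a rough illustration of what an informed proposal in a discrete space can look like, the sketch below runs a Metropolis-Hastings sampler in which each neighbouring state is proposed with probability proportional to the square root of the target ratio, instead of uniformly at random. It is a toy under stated assumptions, not the algorithm from the article: the five-state line graph, the weights `w` and the square-root weighting are all illustrative choices, and a real record linkage state space is vastly larger.

```python
import math
import random


def neighbours(x, k):
    """States adjacent to x on the line graph 0, 1, ..., k - 1."""
    return [y for y in (x - 1, x + 1) if 0 <= y < k]


def proposal(x, w):
    """Neighbours of x with proposal probabilities proportional to sqrt(w[y] / w[x])."""
    nbrs = neighbours(x, len(w))
    weights = [math.sqrt(w[y] / w[x]) for y in nbrs]
    total = sum(weights)
    return nbrs, [g / total for g in weights]


def informed_mh(w, steps, seed=0):
    """Metropolis-Hastings on unnormalised positive weights w; returns visit frequencies."""
    rng = random.Random(seed)
    x = 0
    counts = [0] * len(w)
    for _ in range(steps):
        nbrs, probs = proposal(x, w)
        y = rng.choices(nbrs, weights=probs)[0]
        q_xy = probs[nbrs.index(y)]
        nbrs_y, probs_y = proposal(y, w)
        q_yx = probs_y[nbrs_y.index(x)]
        # Standard Metropolis-Hastings correction for the non-uniform proposal.
        accept = min(1.0, (w[y] * q_yx) / (w[x] * q_xy))
        if rng.random() < accept:
            x = y
        counts[x] += 1
    return [c / steps for c in counts]


w = [1.0, 2.0, 4.0, 2.0, 1.0]   # hypothetical unnormalised target
freqs = informed_mh(w, steps=50_000)
print([round(f, 2) for f in freqs])
```

The visit frequencies should be close to the normalised target (0.1, 0.2, 0.4, 0.2, 0.1). The proposal favours moves towards higher-probability states, and the acceptance step corrects for that bias so that the chain still targets the right distribution.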
 
«This project exemplifies my research activity», says Zanella, «which is aimed at a rigorous mathematical understanding of modern statistical and computational methods motivated by real-world applications, in order to develop more effective and reliable methodologies».
 
Find out more
 
B. Betancourt, G. Zanella and R. Steorts, Random partition models for microclustering tasks, under revision.

G. Zanella, Informed proposals for local MCMC in discrete spaces, in Journal of the American Statistical Association (T&M), in press.

G. Zanella, B. Betancourt, H. Wallack, J. Miller, A. Zaidi and R. Steorts, Flexible models for microclustering with application to entity resolution, in Advances in Neural Information Processing Systems 29 (NIPS 2016).

by Sirio Legramanti
