Scaling up Data Science: Turning an Art into Science



In the “good ol’ days” – if they ever existed – models aimed to explain phenomena via a limited number of parameters, and we could test them on small amounts of data. When we collected new data, we only had to feed it into the model and compute the outcome.
Nowadays, statistical and machine learning models can have millions of parameters, and we can collect billions of heterogeneous datapoints from different sources – more than any computer in the world can process exactly in a reasonable amount of time. That’s what computational algorithms are for: they are procedures that approximate the results of the original model in a simpler and faster way.
There are some issues, though. We don’t always understand exactly why a computational algorithm works and, even when it does, we can’t be sure it will work as well on different or considerably larger datasets.
“This lack of understanding results in the routine use of inefficient and largely suboptimal algorithms, and makes the design of efficient algorithms for practically used models something of an art,” said Giacomo Zanella, Assistant Professor at the Bocconi Department of Decision Sciences.
Zanella obtained a €1.5 million Starting Grant from the European Research Council (ERC) to better understand computational algorithms for large-scale probabilistic models, thus making their design not an art, but a science. The project (PrSc-HDBayLe - Provable scalability for high-dimensional Bayesian Learning) aims to single out the most promising algorithms using rigorous and innovative mathematical techniques, and to produce guidelines to improve them and develop new ones.

The algorithms Zanella studies have three properties: they are commonly used (“I want to develop knowledge relevant to practitioners,” he said), provably scalable and reliable. In a scalable algorithm, the computing time needed to produce a result increases only linearly, i.e. in proportion to the number of datapoints or parameters: twice the data, twice the time. Such algorithms promise to stay manageable even if the number of parameters and datapoints continues to increase.
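To see what “twice the data, twice the time” means in practice, here is a minimal illustrative sketch (not from the project itself) that counts the basic operations performed by a linear-time algorithm, such as a single pass over the data, versus a quadratic-time one, such as comparing every pair of datapoints:

```python
# Illustrative cost model: count the basic operations each algorithm performs
# as a function of the number of datapoints n.

def linear_scan_ops(n):
    """Operations for a single pass over n datapoints (e.g. a sum): O(n)."""
    return n

def pairwise_ops(n):
    """Operations for comparing every pair of datapoints: O(n^2)."""
    return n * (n - 1) // 2

for n in (1_000, 2_000, 4_000):
    print(f"n={n:>5}  linear={linear_scan_ops(n):>10,}  pairwise={pairwise_ops(n):>13,}")
```

Doubling the data doubles the cost of the linear algorithm but roughly quadruples the cost of the quadratic one, which is why only the former stays manageable as datasets keep growing.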
Reliability can only be guaranteed by a correct understanding of an algorithm’s workings. This includes a rigorous quantification of the uncertainty associated with the result of the analysis, as is commonly done in Bayesian statistical models, which will be the focus of the project.
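As a toy illustration of Bayesian uncertainty quantification (the numbers here are hypothetical, not from the project): updating a Beta prior on a success probability with binomial data yields an exact posterior distribution, so the result of the analysis comes with an uncertainty estimate rather than a single point value.

```python
import math

def beta_binomial_posterior(a, b, successes, trials):
    """Return the (a', b') parameters of a Beta(a, b) prior updated with
    binomial data: the conjugate update gives the posterior in closed form."""
    return a + successes, b + (trials - successes)

# Hypothetical data: 60 successes in 100 trials, flat Beta(1, 1) prior.
a_post, b_post = beta_binomial_posterior(1, 1, 60, 100)

mean = a_post / (a_post + b_post)
var = (a_post * b_post) / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
sd = math.sqrt(var)

# A rough 95% credible interval via a normal approximation to the posterior.
print(f"posterior mean = {mean:.3f}, "
      f"approx 95% interval = ({mean - 1.96 * sd:.3f}, {mean + 1.96 * sd:.3f})")
```

The point is that the output is a whole distribution: for larger-scale models this exact update is unavailable, and it is precisely the algorithms that approximate such posteriors whose scalability and reliability the project aims to establish.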
“My field is Computational Statistics,” Zanella said, “an intrinsically interdisciplinary field at the interface of Statistics, Machine Learning and Applied Mathematics. My research approach is at the intersection of methodology (designing algorithms that are both scalable and reliable) and theory (proving they are scalable).”
The results of the project will help deal with the statistical and computational challenges due to high-dimensionality (the increasing number of features recorded per individual); the potential presence of interactions (the virtually infinite combinations of features that could influence the actual outcome); missing data and sampling bias; and the need to combine data from different sources (e.g. multiple databases with various degrees of reliability, individual- vs aggregate-level data, etc.).
These challenges routinely arise in real-life data science problems, with examples ranging from estimating the number of war victims through incomplete reports to predicting election outcomes by combining different sources of big, wide, and dirty data.

ERC Starting Grants are awarded to talented early-career researchers of any nationality with 2-7 years of experience since completion of their PhD, a scientific track record showing real promise, and an excellent research proposal. For researchers at later stages of their careers, the ERC provides Consolidator Grants and Advanced Grants.

by Fabio Todesco