Deep Learning Advances for Social Scientists

BOCCONI STUDENTS ON THE SEMINARS ORGANIZED BY IGIER. IN THE LATEST ARTICLE OF THE SERIES, ANTONELLA BUCCIONE REPORTS ON THE WORK OF MELISSA DELL, HARVARD UNIVERSITY

Leading international scholars present their cutting-edge research at Bocconi every year, in front of faculty and students. In order to make this work accessible to a larger audience, Bocconi Knowledge publishes the summaries of the scientific and policy seminars organized by the IGIER research center, written by the students participating in the IGIER-BIDSA Visiting Students Initiative.
 
Suppose that you wanted to understand how state control affects firm productivity, and you have an archive with many books containing highly detailed information about state-controlled firms in Japan. Hiring people to collect the data, spanning thousands of pages, is far too costly and impractical. Using standard Optical Character Recognition (OCR) tools on the scanned pages is also infeasible, because the output would be too imprecise to use. Must you abandon the project? The question is highly relevant in the social sciences, where large amounts of potentially useful data are stored in extensive archives that cannot be systematically collected and analysed using standard techniques. To unlock this information, we need technological innovation!
 
In the 16 March installment of the 2020-2021 IGIER Seminar Series, Professor Melissa Dell of Harvard University explained how Artificial Intelligence techniques, in particular Deep Learning, can be a crucial asset in this endeavor, and talked about her work aimed at making such tools accessible and available for social science research.
 
Deep learning makes use of complex architectures called neural networks, which were initially developed to mimic the way a brain learns. Compared to traditional methods, where the decision rules are specified by hand, in deep learning the computer learns the decision rules on its own from examples, which makes the results more robust to noisy data and more easily generalizable.
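To make the contrast concrete, here is a minimal sketch in Python. It is only an illustration and was not shown in the seminar: it uses a single artificial neuron (a perceptron), which is far simpler than the deep networks discussed by Professor Dell, and the toy data and all function names are hypothetical. The first function encodes a decision rule by hand; the second learns an equivalent rule from labelled examples.

```python
# Toy labelled examples: the label is 1 only when both inputs are 1.
EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def hand_written_rule(x1, x2):
    """Traditional approach: the researcher writes the decision rule directly."""
    return 1 if (x1 == 1 and x2 == 1) else 0


def predict(weights, x1, x2):
    """Apply a learned rule: fire (1) when the weighted sum is positive."""
    w1, w2, bias = weights
    return 1 if w1 * x1 + w2 * x2 + bias > 0 else 0


def train_perceptron(examples, epochs=20):
    """Learn the weights from examples with the classic perceptron update."""
    w1, w2, bias = 0, 0, 0
    for _ in range(epochs):
        for (x1, x2), label in examples:
            error = label - predict((w1, w2, bias), x1, x2)
            # Nudge each weight in the direction that reduces the error.
            w1 += error * x1
            w2 += error * x2
            bias += error
    return w1, w2, bias


learned_weights = train_perceptron(EXAMPLES)
for (x1, x2), _ in EXAMPLES:
    print((x1, x2), hand_written_rule(x1, x2), predict(learned_weights, x1, x2))
```

After training, the learned rule agrees with the hand-written one on every example, even though nobody told the program what the rule was. Deep networks stack many such units, which is what lets them learn rules far too complex to write by hand.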
 
Professor Dell argued that in social science these methods can be useful for working with traditional data sources, but especially for unlocking the potential of sources that were previously deemed infeasible to analyse.
 
Imagine, for instance, that you wanted to study the evolution of political ideologies using information contained in historical newspapers. Conventional OCR methods would be highly ineffective here: they often fail when faced with complex or unusual page layouts, and they may extract text that is incomplete at best and nonsensical at worst (for example, by mixing different sections and columns together). Such inaccurate retrieval would only allow for simple queries, such as keyword search, and would make it impossible to carry out more sophisticated language and topic modelling tasks.
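The column-mixing problem can be illustrated with a small Python sketch. It is a toy example under stated assumptions, not the method presented in the seminar: the text blocks, their coordinates and the column boundary are all invented, and in practice a layout-detection model would have to supply the column regions. Reading the blocks strictly top to bottom interleaves the two articles, while assigning blocks to columns first recovers each article intact.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str
    x: float  # left edge of the block on the page
    y: float  # top edge of the block (0 = top of the page)


def naive_reading_order(blocks):
    """Read top to bottom, then left to right, ignoring the column layout."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_aware_reading_order(blocks, column_boundaries):
    """Assign each block to a column, then read each column top to bottom.

    column_boundaries holds the x-coordinates that separate columns
    (hypothetical input; a layout model would normally provide it).
    """
    def column_index(block):
        return sum(block.x >= boundary for boundary in column_boundaries)

    ordered = sorted(blocks, key=lambda b: (column_index(b), b.y))
    return [b.text for b in ordered]


# An invented two-column page with one short article per column.
page = [
    TextBlock("Parliament met", x=0, y=0),
    TextBlock("Rice prices rose", x=100, y=0),
    TextBlock("to debate the budget.", x=0, y=10),
    TextBlock("sharply last month.", x=100, y=10),
]

naive_text = " ".join(naive_reading_order(page))
aware_text = " ".join(column_aware_reading_order(page, column_boundaries=[100]))
print(naive_text)
print(aware_text)
```

The naive order yields "Parliament met Rice prices rose to debate the budget. sharply last month.", which still supports keyword search but would mislead any model that depends on sentence structure. The column-aware order yields two coherent sentences.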
 
Dell argued, instead, that customized OCR based on neural networks can be essential for accurately extracting text from scanned pages. Once the full text is in machine-readable form, state-of-the-art, fully open-source models (like BERT and RoBERTa, from Google and Facebook respectively) can achieve highly accurate performance in natural language understanding.
 
Another domain of interest is the case, discussed at the outset, of historical disaggregated data. For economists, it is a widely accepted reality that many research questions can only be understood, with sufficient granularity, by using microeconomic data. It is however hard to find disaggregated data, covering long enough periods of time, in digitized format.
 
Dell showed that in this case one can use Generative Adversarial Networks (GANs) to “clean up” noisy scans that have been warped or worn out by time. Then, to extract different elements from a page, such as the entries in accounting records, one can rely on object detection models based on Convolutional Neural Networks (CNNs). Lastly, it is possible to train customized OCR engines, for instance using Encoder-Decoder architectures, when dealing with unusual fonts that are not recognized by commercial OCR tools.
 
Deep learning can thus be a valuable addition to every step of the curation pipeline: from the layout analysis and pre-processing to the actual task of natural language processing. And, although these problems seem to be entirely different from each other, the techniques actually used to tackle them are remarkably similar.
 
These methods have large advantages not only relative to existing off-the-shelf tools, but also relative to manual data curation. Besides being very costly, manual data entry is also prone to errors, since it often relies on commercial OCR software as a first pass. It may even be simply infeasible for very large datasets, as is the case for the incredibly valuable disaggregated microeconomic data. Deep learning can of course also be costly, but after the initial investment in computing resources and human capital, it scales very well. Automated data curation can thus help democratize access to data and empirical research, and it can allow social scientists to raise new questions and study new contexts, such as those of lower-income countries for which data are available in physical form but seldom digitized.
 
In order to make these tools more accessible to social scientists who are unfamiliar with computer science, Professor Dell and her group have been working in two directions. First, they provide open-source toolkits that let researchers carry out layout detection, OCR and NLP tasks in a way that is as user-friendly as possible for any Python user. Second, she has been sharing notes and resources to allow for a gentler introduction to the field of neural networks, which is extremely vast and fast-paced.
 
She concluded by restating how deep learning can be key to unlocking the massive amount of information currently trapped in text and image data, and by encouraging researchers to become familiar with such methods and with how they can be applied to social science.

by Antonella Buccione