Finding the Perfect Number

Francesco Grossetti has released a software package that finds the number of topics that best describes the content of text corpora too large to be handled manually

If you were tasked with determining what a corpus of 200,000 pages of documents is about, you would face two challenges. The first, of course, is to identify the topics covered in a volume of text too large to read manually. The second is to decide how many topics to identify, so that the answer is neither reductive (three topics are unlikely to give even a minimally exhaustive picture of 200,000 pages) nor unmanageable (3,000 topics would probably be exhaustive, but the result would be difficult to interpret).
 
One of the best solutions to the problem of topic identification is Latent Dirichlet Allocation (LDA), a technique developed in 2003. Building on it, Francesco Grossetti (Department of Accounting) and Craig Lewis (Vanderbilt University) now propose a way to identify the optimal number of topics. They present it in a scientific paper (“A Statistical Approach for Optimal Topic Model Identification”, preprint) and implement it in OpTop, a software package.
 
“What we present,” Grossetti says, “is a statistical test, which works irrespective of the context and meaning of topics. In technical terms, each topic is an ordered collection of all the words contained in the corpus, whose order represents their importance within a particular topic. It’s up to the researcher who uses this tool to interpret the answers, assigning a label to each topic and choosing to merge topics that are very close in meaning, if appropriate.”
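To make the idea of a topic as an ordered collection of words concrete, here is a minimal sketch in Python. The vocabulary, the weights, and the function name `ranked_words` are all hypothetical and chosen for illustration; this is not OpTop's interface and not output from any real corpus.

```python
# Toy illustration only: hypothetical word weights, not output from OpTop
# or from any real corpus.
topics = {
    "topic_0": {"risk": 0.40, "market": 0.30, "loan": 0.20, "patent": 0.10},
    "topic_1": {"risk": 0.05, "market": 0.15, "loan": 0.10, "patent": 0.70},
}


def ranked_words(word_weights):
    """Return the words of one topic ordered from most to least important."""
    return sorted(word_weights, key=word_weights.get, reverse=True)


for name, weights in topics.items():
    # Every topic covers the same vocabulary; only the ordering differs.
    print(name, ranked_words(weights))
```

Both toy topics contain the same four words, and only the ranking distinguishes them. Assigning a label to each ranking, for instance "credit risk" or "intellectual property", is the interpretive step that Grossetti leaves to the researcher.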
 
For his part, Grossetti has already applied the technique, together with the interpretive judgment it calls for, in a paper on financial disclosure that identifies the risk factors companies make explicit in their financial statements.
 


by Fabio Todesco
Bocconi Knowledge newsletter
