Neural Networks Capture the Shades of Our Online Languages

A new machine learning technique allows us to capture language and dialect variations and their evolution through the analysis of what people write on social media.
In two recent works, Dirk Hovy, a computational sociolinguist and Associate Professor at Bocconi’s Department of Marketing, deploys an innovative method that processes large amounts of social media data to capture gradual differences between language varieties. The method produces a clear visual reference (a map) that can serve as input for further qualitative studies. It also has direct applications for user profiling (finding out where a social media user is located, as demonstrated in a third paper). More broadly, it has consequences for personalizing text analysis tools and for making them more robust to language variation, an important step in addressing the growing issue of algorithmic bias.
The algorithm uses a neural network technique to learn patterns from data. At the beginning, the algorithm doesn’t know anything about languages, but it observes linguistic similarities in the geotagged data, and puts them all in a 100-dimensional space. The dimensions don’t have an intuitive, interpretable meaning, but only mark the distances between datapoints as understood by the neural network. Within this space, the algorithm learns to arrange words and phrases according to their meaning (words with similar meaning are arranged closer together).
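To make the idea of "closer together" concrete, here is a minimal sketch of how similarity between points in such a space can be measured. It is not the method used in the studies: the vectors are invented toy values with four dimensions instead of 100, the names `city_a`, `city_b` and `city_c` are hypothetical, and cosine similarity is only one common choice of measure, assumed here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration (the article's space has 100 dimensions).
embeddings = {
    "city_a": [0.9, 0.1, 0.3, 0.0],
    "city_b": [0.8, 0.2, 0.4, 0.1],
    "city_c": [0.0, 0.9, 0.1, 0.7],
}

def nearest(name, table):
    """Return the other entry whose vector is most similar to the one stored under `name`."""
    others = (key for key in table if key != name)
    return max(others, key=lambda key: cosine_similarity(table[name], table[key]))

print(nearest("city_a", embeddings))
```

In this toy table, `city_a` and `city_b` point in almost the same direction, so they would count as linguistically similar, while `city_c` lies far from both. The individual numbers carry no meaning on their own; only the relative positions do.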
Once the algorithm is finished, the complexity can be mathematically reduced in order to visualize the data, moving from a 100-dimensional to a three-dimensional representation. Each dimension is then conventionally defined as a quantity of red, green and blue, and every point is represented as a mixture of these three colors. The values 0.5, 0.5 and 0.5, for example, correspond to a medium gray.
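The color step can be sketched as follows. This assumes the reduction to three dimensions has already been done by some other tool; the coordinates in `reduced` are invented, the helper names `to_rgb` and `to_hex` are hypothetical, and min-max scaling is an assumption about how values might be brought into the 0 to 1 range, not a description of the studies' actual procedure.

```python
def to_rgb(points):
    """Min-max scale each of the three dimensions to [0, 1] so it can act as a color channel."""
    lows = [min(p[i] for p in points) for i in range(3)]
    highs = [max(p[i] for p in points) for i in range(3)]
    scaled = []
    for p in points:
        scaled.append(tuple(
            # A dimension with no spread carries no information: map it to mid-range.
            (p[i] - lows[i]) / (highs[i] - lows[i]) if highs[i] > lows[i] else 0.5
            for i in range(3)
        ))
    return scaled

def to_hex(rgb):
    """Turn red, green and blue values in [0, 1] into a hex color string."""
    return "#" + "".join(f"{round(channel * 255):02x}" for channel in rgb)

# Invented three-dimensional coordinates for three locations.
reduced = [(-1.0, 0.0, 2.0), (0.0, 1.0, 3.0), (1.0, 2.0, 4.0)]
colors = [to_hex(rgb) for rgb in to_rgb(reduced)]
print(colors)
```

The middle point sits halfway along every dimension, so it scales to 0.5, 0.5 and 0.5 and is drawn as the medium gray `#808080`. Locations with similar language use get similar coordinates and therefore similar hues on the map.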
In the first study, for a chapter of a forthcoming book, Prof. Hovy makes use of 95 million geotagged tweets to draw a map of linguistic variation across Europe, as well as maps of individual countries.
The European map shows that the method clearly recovers linguistic families, with Romance, Germanic, and Slavic languages neatly distinguished by hue, as well as several intra-national boundaries: Belgium is divided along a horizontal line (Dutch-speaking in the north, French-speaking in the south), whereas Switzerland and part of Northern Italy (both with German, French, and Italian speakers) show a smoother transition. The hue of the British Isles highlights the influence of Romance languages on a Germanic root. In the former Yugoslavia, Slovenia and Bosnia and Herzegovina seem to partly depart from the Slavic linguistic tradition, perhaps marking emerging social and religious fault lines.
“The method we use is empirical, with language and individual informants transformed into numbers, cells, and colors, but the results enable new and surprising insights into regional language variation,” Prof. Hovy says. Because it is empirical, the method can easily be applied to new samples and languages.
Prof. Hovy uses a very similar methodology to study linguistic similarities and differences between cities in the German-speaking countries of Europe (Germany, Switzerland, and Austria). This time, the neural network was applied to 2.3 million conversations (or 16.8 million posts) on Jodel, an anonymous mobile chat application. Even though people usually post in High German, the resulting maps show a dialect gradient from north to south, with Switzerland as a separate entity, confirming what we know about German dialects. At the same time, the maps also highlight how sociodemographic changes are affecting the language. Würzburg, for example, is a Bavarian city that appears to speak a Western rather than a Bavarian dialect, due to the influence of a large college student population coming from the Western parts of Germany.
The study’s findings directly contradict the common perception that dialects are disappearing in modern life. While dialects no longer distinguish individual towns, the study shows that they are becoming more entrenched at a larger regional level, even on anonymous social media platforms, where people should have little reason to mark their origin. This finding has economic ramifications, too: recent studies have shown that people prefer a longer commute that keeps them within their dialect region to a job in a closer town in a different dialect area.

Christoph Purschke and Dirk Hovy, Lörres, Möppes, and the Swiss. (Re)Discovering Regional Patterns in Anonymous Social Media Data, forthcoming in Journal of Linguistic Geography.
Dirk Hovy, Afshin Rahimi, Timothy Baldwin, Julian Brooke, Visualizing Regional Language Variation Across Europe on Twitter, in Stanley D. Brunn, Roland Kehrein (eds.), Handbook of the Changing World Language Map, Springer, 2019.

by Fabio Todesco

