Neural Networks Capture the Shades of Our Online Languages
In his recent work, computational sociolinguist Dirk Hovy analyzes millions of social media posts with a machine learning algorithm that lets us track and visualize linguistic variations through colored maps
A new machine learning technique allows us to capture language and dialect variations and their evolution through the analysis of what people write on social media.
In two recent works, Dirk Hovy, a computational sociolinguist and Associate Professor at Bocconi’s Department of Marketing, deploys an innovative method that processes large amounts of social media data to capture gradual differences in language variation. The method provides a clear visual reference (a map) that can serve as input for further qualitative studies. It also has direct applications in user profiling (finding out where a social media user is located, as demonstrated in a third paper). More broadly, it has consequences for personalizing text analysis tools and making them more robust to language variation, an important step in addressing the growing issue of algorithmic bias.
The algorithm uses a neural network technique to learn patterns from data. At the beginning, the algorithm doesn’t know anything about languages, but it observes linguistic similarities in the geotagged data and places the data points in a 100-dimensional space. The dimensions don’t have an intuitive, interpretable meaning; they only mark the distances between data points as understood by the neural network. Within this space, the algorithm learns to arrange words and phrases according to their meaning (words with similar meaning are arranged closer together).
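To make "closer together" concrete, here is a minimal sketch of how closeness between two points in a 100-dimensional space can be measured. Cosine similarity is one common choice and is an assumption here; the article does not say which distance the network uses. The vectors are random stand-ins, not learned representations, and all names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-ins for learned 100-dimensional representations.
rng = np.random.default_rng(0)
base = rng.normal(size=100)                # some word or place
near = base + 0.1 * rng.normal(size=100)   # a slightly perturbed copy: "similar"
far = rng.normal(size=100)                 # an unrelated random point

sim_near = cosine_similarity(base, near)
sim_far = cosine_similarity(base, far)

print(f"similar pair:   {sim_near:.3f}")
print(f"unrelated pair: {sim_far:.3f}")
```

The perturbed copy scores much higher than the unrelated point, which is the sense in which similar items sit closer together.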
Once the algorithm is finished, the complexity can be mathematically reduced in order to visualize the data, moving from a 100-dimensional to a three-dimensional representation. Each of the three dimensions is then conventionally interpreted as a quantity of red, green, or blue, and every point is represented as a mixture of these three colors. The values 0.5, 0.5 and 0.5, for example, correspond to a medium gray.
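The reduction-and-coloring step can be sketched as follows. The article does not name the reduction technique, so principal component analysis (computed via SVD) is swapped in here purely as an illustrative assumption. The rescaling to the 0 to 1 range, the function names, and the random input data are likewise hypothetical.

```python
import numpy as np

def reduce_to_rgb(embeddings):
    """Project rows onto their top three principal components, rescaled to [0, 1]."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T
    low = projected.min(axis=0)
    span = projected.max(axis=0) - low
    span[span == 0] = 1.0  # avoid division by zero for a constant column
    return (projected - low) / span

def to_hex(rgb):
    """Convert three values in [0, 1] to a #rrggbb color string."""
    r, g, b = (int(c * 255 + 0.5) for c in rgb)
    return f"#{r:02x}{g:02x}{b:02x}"

# Hypothetical input: 50 locations, each with a 100-dimensional representation.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 100))

colors = reduce_to_rgb(embeddings)
print(colors.shape)               # (50, 3)
print(to_hex((0.5, 0.5, 0.5)))    # #808080, a medium gray
```

Each row of `colors` can then be used to fill the map cell of the corresponding location, so that places with similar representations receive similar hues.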
In the first study, for a chapter of a forthcoming book, Prof. Hovy makes use of 95 million geotagged tweets to draw a map of linguistic variation across Europe, as well as maps of individual countries.
The European map shows that the method clearly recognizes linguistic families, with Romance, Germanic, and Slavic languages neatly distinguished by hue, as well as several intra-national boundaries: Belgium is divided along a horizontal line (Dutch-speaking in the north, French-speaking in the south), whereas Switzerland and part of Northern Italy (both with German, French, and Italian speakers) mark a smoother transition. The British Isles’ hue highlights the influence of Romance languages on a Germanic root. In former Yugoslavia, Slovenia and Bosnia and Herzegovina seem to partly depart from the Slavic linguistic tradition, perhaps marking in this way emerging social and religious fault lines.
«The method we use is empirical, with language and individual informants transformed into numbers, cells, and colors, but the results enable new and surprising insights into regional language variation», Prof. Hovy says. Because it is empirical, the method can easily be applied to new samples and languages.
Prof. Hovy uses a very similar methodology to study linguistic similarities and differences between cities in the German-speaking countries of Europe (Germany, Switzerland, and Austria). This time, the neural network was applied to 2.3 million conversations (16.8 million posts) on Jodel, an anonymous mobile chat application. Even though people usually post in High German, the resulting maps show a dialect gradient from north to south, with Switzerland as a separate entity, confirming what we know about German dialects. At the same time, the maps also highlight how sociodemographic changes are affecting the language. Würzburg, for example, is a Bavarian city whose posts seem closer to a Western dialect than to a Bavarian one, due to the influence of a large college student population coming from the Western parts of Germany.
The study’s findings directly contradict the common perception that dialects are disappearing in modern life. While dialects no longer distinguish individual towns, the study shows that they are becoming more entrenched at a larger regional level, even on anonymous social media platforms, where people should have little reason to mark their origin. This finding has economic ramifications, too: recent studies have shown that people prefer commuting longer to stay within their dialect region, rather than seeking jobs in a closer town in a different dialect area.
Christoph Purschke and Dirk Hovy, Lörres, Möppes, and the Swiss. (Re)Discovering Regional Patterns in Anonymous Social Media Data, forthcoming in Journal of Linguistic Geography.
Dirk Hovy, Afshin Rahimi, Timothy Baldwin, Julian Brooke, Visualizing Regional Language Variation Across Europe on Twitter, in Stanley D. Brunn, Roland Kehrein (eds.), Handbook of the Changing World Language Map, Springer, 2019.
by Fabio Todesco