Cross-lingual supervised classification of political texts

Hauke Licht

University of Zurich, Department of Political Science

Large portions of political text collections are multilingual and, in principle, invite comparative quantitative analysis. However, established methods for cross-lingual text analysis depend on linguistically qualified human coders, human translators, or reliable machine translation, and these requirements tend to hinder comparative research. In this paper, I propose an alternative method that relies on multilingual text embedding: texts written in different languages are embedded in a joint semantic space using a publicly available multilingual language model. The resulting text embeddings are then used as inputs to train a supervised machine learning classifier. To validate the proposed approach, I conduct a series of text classification experiments on three different political text corpora. These experiments show that classifiers trained on multilingual text embeddings pass three important tests. First, they classify held-out texts as accurately as comparable classifiers trained on monolingual or translated texts. Second, they perform by and large consistently across languages. Third, they classify texts written in languages that were not present in the training data with little to no loss in predictive performance. Taken together, these results present supervised classification from multilingual text embeddings as a reliable, replicable, and cost-efficient approach to multilingual text classification. This study thus contributes to an emerging methodological literature on multilingual quantitative text analysis in political science.
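
The evaluation logic of the held-out-language test can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: the embeddings are made-up three-dimensional toy vectors standing in for the output of a multilingual language model, the labels "economy" and "migration" are hypothetical, and a simple nearest-centroid rule is swapped in for the supervised classifier. The sketch only shows how training on some languages and testing on an unseen one can be organized; it says nothing about performance on real corpora.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy data: (language, label, embedding).
# In practice each embedding would come from a multilingual language model;
# these 3-dimensional vectors are invented purely for illustration.
DOCS = [
    ("en", "economy", [0.9, 0.1, 0.0]),
    ("en", "migration", [0.1, 0.9, 0.1]),
    ("de", "economy", [0.8, 0.2, 0.1]),
    ("de", "migration", [0.2, 0.8, 0.0]),
    ("fr", "economy", [0.85, 0.15, 0.05]),
    ("fr", "migration", [0.15, 0.9, 0.05]),
]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fit_centroids(docs):
    """Average the embeddings of each label (nearest-centroid 'training')."""
    grouped = defaultdict(list)
    for _, label, vec in docs:
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, vec):
    """Assign the label whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


def held_out_language_accuracy(docs, held_out):
    """Train on all languages except `held_out`, then test on `held_out` only."""
    train = [d for d in docs if d[0] != held_out]
    test = [d for d in docs if d[0] == held_out]
    centroids = fit_centroids(train)
    hits = sum(predict(centroids, vec) == label for _, label, vec in test)
    return hits / len(test)


if __name__ == "__main__":
    for lang in ("en", "de", "fr"):
        print(lang, held_out_language_accuracy(DOCS, lang))
```

Because the classifier only ever sees vectors, the language of a text enters the procedure solely through the train/test split. This is what makes a joint embedding space a natural fit for evaluating transfer to unseen languages.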