11:20 - 13:00
P7-S186
Room: 1A.12
Chair/s:
Francisco Tomás-Valiente
Discussant/s:
Daniel Weitzel
Comparing Large Language Models for Text Classification: Model Selection Across Tasks, Texts, and Languages
P7-S186-3
Presented by: Michael Heseltine
Michael Heseltine
University of Amsterdam
Large-scale text analysis has grown rapidly in recent years as an analytic method in the social sciences and beyond. To date, text-as-data methods have relied on large volumes of human-annotated training examples, placing a premium on researcher resources. However, advances in large language models (LLMs) have made automated annotation increasingly viable. This paper tests the performance of 12 different LLMs on text classification across different tasks, text types, and languages. Using data in six languages from eight country contexts, the results show considerable variation in model performance, highlighting that researchers should treat model selection as a deliberate part of their LLM-based classification strategy. In general, GPT-4 exhibits relatively strong performance across all classification tasks, while open-source alternatives such as Llama 3 and Qwen2 show similar or even superior performance on select tasks. However, many open-source models perform relatively poorly on more complex and non-English coding tasks. The tradeoffs inherent in each model are highlighted to allow researchers to make informed model-selection decisions on a task-by-task basis.
Keywords: Large language models, text as data, political communication
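To illustrate the kind of pipeline the paper evaluates, below is a minimal zero-shot classification sketch in Python. It assumes the OpenAI Python client (v1+) and an API key in the environment; the label set, prompt wording, and model name are illustrative assumptions, not the authors' actual protocol, and open-weight models such as Llama 3 or Qwen2 would be called through a different client or a local inference server.

    # Minimal zero-shot text classification sketch using an LLM.
    # Assumptions: OpenAI Python client >= 1.0, OPENAI_API_KEY set in the
    # environment, and a hypothetical three-way sentiment label set.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    LABELS = ["positive", "negative", "neutral"]  # hypothetical label set

    def classify(text: str, model: str = "gpt-4") -> str:
        """Ask the model to assign exactly one label to a text."""
        prompt = (
            "Classify the following text into exactly one of these "
            f"categories: {', '.join(LABELS)}.\n"
            "Answer with the category name only.\n\n"
            f"Text: {text}"
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # deterministic output aids replicability
        )
        return response.choices[0].message.content.strip().lower()

    # Example usage on a non-English (German) item:
    # print(classify("Die Regierung hat heute neue Reformen angekündigt."))

In a comparison of the kind the paper describes, the same classify() call would be run per model over a human-annotated validation set, with agreement metrics (e.g., accuracy or F1 against the human labels) computed per task and language to ground model selection.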