Supervised Multimodal Classification for Political Science Research
P7-S186-5
Presented by: Andreu Casas
Political scientists often study multimodal data containing both text and visuals, such as news articles, information from websites, and social media posts. A common practice for large-scale projects is to use supervised machine learning to identify theoretical concepts of interest in the data: manually annotating a subset of the data, using the annotations to train a machine learning classifier, and using the trained classifier to predict the concept in the remaining unannotated data. Past research has mostly used textual input features when training supervised classifiers for multimodal data, yet recent advances make it possible to leverage both text and visual features. However, little is known about whether, and under what conditions, multimodal approaches boost performance and improve measurement. In this paper we use two original datasets, one of 4,000 annotated YouTube videos and one of 4,000 annotated X posts, to compare the performance, across 10 annotation tasks, of a range of supervised classifiers: text-only (SVM, BERT, Llama2, Llama3), image-only (CNN), and multimodal (CLIP, Idefics2). We find that a visual language model (Idefics2) outperforms all text-only and image-only models on most annotation tasks, including a state-of-the-art large language model (Llama3) fine-tuned for the tasks at hand.
Keywords: multimodal modeling; computational methods; supervised classifier; large visual language model; large language model
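The abstract describes a standard workflow: annotate a subset, train a classifier on the annotations, and predict the concept in the unannotated remainder, here with multimodal (text + image) features. The sketch below is not the authors' implementation; it illustrates one simple variant of that workflow under stated assumptions, using a frozen CLIP checkpoint as a joint text/image feature extractor and a logistic-regression head. The model name, the variables `train_texts`, `train_image_paths`, `train_labels`, `rest_texts`, and `rest_image_paths`, and the choice of classifier head are all hypothetical placeholders.

```python
# Minimal sketch (assumptions, not the paper's method): embed each post's text and
# image with CLIP, train a classifier head on the annotated subset, and predict
# labels for the unannotated remainder.
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def embed(texts, image_paths):
    """Return concatenated CLIP text and image embeddings for (text, image) pairs."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=texts, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.cat([txt, img], dim=1).numpy()

# Annotated subset -> train the head; unannotated remainder -> predict.
# train_texts, train_image_paths, train_labels, rest_texts, rest_image_paths are
# hypothetical lists standing in for the manually annotated and unannotated posts.
X_train = embed(train_texts, train_image_paths)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
predicted = clf.predict(embed(rest_texts, rest_image_paths))
```

This frozen-feature setup is only one point on the design space the abstract compares; the models evaluated in the paper (e.g., Llama3 or Idefics2) are fine-tuned end to end for each annotation task rather than used as fixed feature extractors.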