Cross-Domain Classification Of Political Texts: Introducing A Lean And Versatile Two-Step Workflow

PS6-2

Presented by: Sebastian Block

Sebastian Block ¹, Martin Gross ¹, Dominic Nyhuis ², Jan Velimsky ¹

¹ LMU Munich
² University of North Carolina at Chapel Hill

The automatic classification of political documents is a flourishing research field in computational social science. So far, research has mostly focused on classifying documents of one specific type. Although this approach is more cost-efficient than manually coding the entire corpus, it still requires considerable human resources, because a training dataset has to be created for every new set of documents. An alternative strategy to reduce manual labor is to train a classifier on existing labeled data and use it to code an unlabeled corpus of a different document type. We propose a resource-efficient two-step workflow to classify documents from a case that the classifier was not trained on. We use the Multiclass Ensemble Cutoff Classifier Approach (MECCA) as the foundation and add a second iteration step. We demonstrate how the two-step workflow functions in practice by labeling council questions from the German local level with a classifier trained on parliamentary questions from the German Bundestag, using the Comparative Agendas Project's coding scheme. The Bundestag dataset consists of 22,181 manually coded documents. The local-level dataset comprises 14,640 questions from the most recent full legislative period of 31 German cities with over 100,000 inhabitants. Our results show that the classifier correctly predicts more than 70 percent of all council questions with a precision above 80 percent and thus performs on par with human coding.
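The general shape of such a two-step, cutoff-based workflow can be illustrated with a short Python sketch. This is not the authors' MECCA implementation, whose details the abstract does not give. It rests on two labeled assumptions: that the "cutoff" is a minimum share of ensemble members that must agree on a label, and that the "second iteration step" retrains on the source data plus the confidently labeled target documents. All names (`overlap_trainer`, `ensemble_label`, `two_step_classify`) and the toy token-overlap models are hypothetical stand-ins for real classifiers.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence

# Hypothetical interfaces: a model maps a text to a label (or None if it has
# no evidence); a trainer builds a model from labeled texts.
Model = Callable[[str], Optional[str]]
Trainer = Callable[[Sequence[str], Sequence[str]], Model]


def words(text: str) -> List[str]:
    return text.lower().split()


def trigrams(text: str) -> List[str]:
    s = text.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]


def overlap_trainer(tokenize: Callable[[str], List[str]]) -> Trainer:
    """Toy stand-in for a real classifier: scores labels by token overlap."""
    def train(texts: Sequence[str], labels: Sequence[str]) -> Model:
        profiles = {}
        for text, label in zip(texts, labels):
            profiles.setdefault(label, Counter()).update(tokenize(text))

        def predict(text: str) -> Optional[str]:
            tokens = tokenize(text)
            # Iterating over sorted labels makes ties resolve deterministically.
            scores = {lab: sum(prof[t] for t in tokens)
                      for lab, prof in sorted(profiles.items())}
            best = max(scores, key=scores.get)
            return best if scores[best] > 0 else None
        return predict
    return train


def ensemble_label(models: Sequence[Model], text: str, cutoff: float) -> Optional[str]:
    """Return the majority label if its vote share reaches the cutoff, else None."""
    votes = Counter(m(text) for m in models)
    label, count = votes.most_common(1)[0]
    return label if count / len(models) >= cutoff else None


def two_step_classify(trainers, src_texts, src_labels, tgt_texts, cutoff=1.0):
    # Step 1 (assumed): train on the source domain only, keep confident labels.
    models = [t(src_texts, src_labels) for t in trainers]
    labels = [ensemble_label(models, x, cutoff) for x in tgt_texts]

    # Step 2 (assumed): retrain on source data plus confidently labeled
    # target documents, then revisit only the documents left undecided.
    confident = [(x, y) for x, y in zip(tgt_texts, labels) if y is not None]
    if confident:
        aug_texts = list(src_texts) + [x for x, _ in confident]
        aug_labels = list(src_labels) + [y for _, y in confident]
        models = [t(aug_texts, aug_labels) for t in trainers]
        labels = [y if y is not None else ensemble_label(models, x, cutoff)
                  for x, y in zip(tgt_texts, labels)]
    return labels


# Invented toy data, for illustration only.
trainers = [overlap_trainer(words), overlap_trainer(trigrams)]
src_texts = ["budget tax finance", "school teacher education"]
src_labels = ["economy", "education"]
tgt_texts = ["tax budget plan", "school bus route", "bus route timetable"]
final_labels = two_step_classify(trainers, src_texts, src_labels, tgt_texts)
```

In this sketch, documents that stay below the cutoff after both steps keep the label `None` and would be left for manual coding. Raising the cutoff trades coverage for precision, which is the kind of trade-off the reported coverage and precision figures describe.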