Revamping ConfliBERT: A Pre-Trained Language Model for Political Conflict and Violence

Submission 376

Panel.2-S-5

Presented by: Javier Osorio

Javier Osorio

University of Arizona

ConfliBERT is a domain-specific pre-trained Large Language Model (LLM) designed for analyzing text related to conflict and political violence. Since its deployment in 2022, ConfliBERT has become a powerful tool regularly used by computational conflict scholars. Various independent research projects have validated that ConfliBERT generally outperforms both encoder-only and encoder-decoder LLMs in a variety of complex text analysis tasks relevant to conflict research. Despite these contributions, the original architecture used to develop ConfliBERT is now outdated. This paper presents ConfliBERT.2.0 in both English and Spanish, a more powerful specialized LLM for conflict research. ConfliBERT.2.0 comprises several innovations: a much larger domain-specific training corpus (four times larger than that of the original ConfliBERT), refined masking strategies and contrastive objectives for pre-training, rotary position embeddings (RoPE) that replace BERT’s original positional embeddings, and an enhanced depth-to-width ratio that extends context length to 4,096 tokens, up from the 512 tokens of the original BERT architecture. Based on a diverse set of downstream tasks in both English and Spanish, the analysis shows that ConfliBERT.2.0 outperforms its previous version and various generic LLMs, including encoder-only models and generative AI tools.