The Anatomy of a Villain: Distant Reading and Graph RAG for VillainScope

Submission 59

SP02-03

Presented by: Gayeon Kim

Gayeon Kim

Department of Cultural Informatics, Graduate School of Korean Studies, The Academy of Korean Studies, Republic of Korea

Villains have evolved across diverse media and genres, reflecting the values of their respective eras. In contemporary narratives, villains are depicted not merely as antagonists but as independent characters with complex psychological development and backstories. Moving beyond a binary classification of good and evil, they often embody moral ambiguity, which evokes empathy and identification among audiences and makes them a central axis of the narrative. However, existing research on villains has largely been confined to close readings that focus on the character traits or backstories found in individual works. Narrative datasets such as BookCorpus [1], ROCStories [2], and ProppLearner [3] include attributes related to genre or historical background in order to support the analysis of specific narrative structures. Character-oriented datasets such as LiSCU [4], BookWorm [5], and CHATTER [6] provide descriptions of character traits and summaries, primarily for the purpose of understanding character-centered narratives. These datasets, however, are limited in their ability to illuminate the universal features of villains and the ways in which their roles have evolved in line with broader cultural and historical trends across contemporary media.

To address these limitations, this study introduces VillainScope, a dataset designed to systematically analyze villain characters. VillainScope compiles structured information on 209 villains drawn from four major media categories: animation, literature, film, and television drama. Detailed descriptions of villains’ attributes and metadata are made available through the project’s GitHub repository**. During data collection, objectivity was prioritized by grounding descriptions in scholarly references and precise documentation.

To identify villains’ universal characteristics and narrative roles, three topic modeling approaches were employed: Latent Dirichlet Allocation (LDA), Embedded Topic Model (ETM), and Contextualized Topic Model (CTM). Among these, ETM demonstrated superior performance in terms of topic coherence and diversity, effectively capturing meaningful themes across villain typologies. Notably, ETM revealed that the motives for villainous actions often stem not only from sheer malevolence but also from complex factors such as betrayal or the loss of power. This finding highlights the potential of exploring the contextual layers underlying villainous behavior. Furthermore, statistical analyses of attribute frequencies across different time periods were conducted to examine how villains have evolved in response to shifting societal values.
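As a minimal sketch of two of the measurements mentioned above, the following Python computes topic diversity and per-decade attribute shares. The abstract does not state which diversity metric was used; this sketch assumes the common definition (the share of unique words among the top-k words of all topics). The function names, the record fields `year` and `attributes`, and all toy data are hypothetical and are not taken from the VillainScope repository.

```python
from collections import Counter, defaultdict


def topic_diversity(topics, top_k=25):
    """Share of unique words among the top_k words of all topics.

    1.0 means no topic shares a top word with another topic.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


def attribute_share_by_decade(records):
    """For each decade, the fraction of villains carrying each attribute."""
    counts = defaultdict(Counter)  # decade -> attribute -> number of villains
    totals = Counter()             # decade -> number of villains
    for rec in records:
        decade = rec["year"] // 10 * 10
        totals[decade] += 1
        # set() so an attribute listed twice counts once per villain
        counts[decade].update(set(rec["attributes"]))
    return {
        decade: {attr: n / totals[decade] for attr, n in attrs.items()}
        for decade, attrs in counts.items()
    }


# Hypothetical toy data, for illustration only.
toy_topics = [
    ["betrayal", "power", "loss", "revenge"],
    ["disguise", "identity", "power", "crime"],
]
toy_records = [
    {"name": "Villain A", "year": 1994, "attributes": ["disguise", "intellect"]},
    {"name": "Villain B", "year": 1999, "attributes": ["intellect"]},
    {"name": "Villain C", "year": 2008, "attributes": ["betrayal"]},
]

print(topic_diversity(toy_topics, top_k=4))
print(attribute_share_by_decade(toy_records))
```

In the toy data, the two topics share only the word "power", so seven of the eight top words are unique. Topic coherence is omitted here because it requires a reference corpus for co-occurrence statistics.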

The topic modeling results indicate that villains commonly share traits such as high intellectual capacity, technical expertise surpassing that of ordinary individuals, mastery of disguise, and involvement in criminal activities. Disguise encompasses not only alterations in appearance or costume but also manipulations of language and social roles. Villains with psychological issues, including dissociative identity disorder or depression, often channel their feelings of alienation and helplessness into violent expressions. In some cases, such issues escalate into delusions that allow them to rationalize their actions. Villains frequently exhibit dual and contradictory psychological traits, for instance loyalty to a specific organization coupled with extreme cruelty toward others, and thereby emerge not as flat characters but as multidimensional, round characters.

Building upon this foundation, the study further proposes extending VillainScope from its current plain-text format into a knowledge graph structured as triples. This transformation would map villains' attributes to broader metadata such as works, genres, and directors, enabling complex relational analyses. Integration with external resources such as Wikidata would allow for scalable data enrichment and clearer delineation of inter-entity relationships. The resulting graph would provide a foundational layer for graph-based Retrieval-Augmented Generation (RAG), offering not only villain-related textual resources but also high-quality, structured data. Grounding generation in such data could help mitigate hallucinations in language models and open new analytical avenues for character studies from a distant reading perspective. For example, researchers could trace the narrative evolution of villains across a director's works or identify cross-media patterns among villains sharing similar attributes, thus moving beyond descriptive statistics toward relational insights.
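The proposed triple structure and the retrieval step of a graph-based RAG pipeline can be sketched roughly as follows. This is an illustrative assumption rather than the project's actual schema: the predicate names (`appearsIn`, `hasAttribute`, `directedBy`, `hasMedium`), the entities, and the helper functions are all hypothetical. The language model call is left out, and only the retrieval and context formatting are shown.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples, for illustration only.
TRIPLES = [
    ("Villain A", "appearsIn", "Work X"),
    ("Villain A", "hasAttribute", "disguise"),
    ("Villain B", "appearsIn", "Work Y"),
    ("Villain B", "hasAttribute", "disguise"),
    ("Work X", "directedBy", "Director P"),
    ("Work Y", "directedBy", "Director P"),
    ("Work X", "hasMedium", "film"),
    ("Work Y", "hasMedium", "animation"),
]


def build_index(triples):
    """Index triples by subject and by object for fast lookup."""
    by_subject = defaultdict(list)
    by_object = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((p, o))
        by_object[o].append((s, p))
    return by_subject, by_object


def villains_with_attribute(by_object, attribute):
    """Cross-media query: every subject linked to the attribute."""
    return sorted(
        s for s, p in by_object.get(attribute, []) if p == "hasAttribute"
    )


def retrieve_facts(by_subject, entity, hops=2):
    """Collect triples reachable from entity within `hops` outgoing steps."""
    facts, frontier, seen = [], [entity], {entity}
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for p, o in by_subject.get(node, []):
                facts.append((node, p, o))
                if o not in seen:
                    seen.add(o)
                    next_frontier.append(o)
        frontier = next_frontier
    return facts


def format_context(facts):
    """Render retrieved triples as plain sentences for a model prompt."""
    return "\n".join(f"{s} {p} {o}." for s, p, o in facts)


by_subject, by_object = build_index(TRIPLES)
print(villains_with_attribute(by_object, "disguise"))
print(format_context(retrieve_facts(by_subject, "Villain A", hops=2)))
```

In a full pipeline, the text returned by `format_context` would be placed in the prompt ahead of the user's question so that the model answers from retrieved facts rather than from memory alone. A production system would more likely use an RDF store or a graph database than in-memory dictionaries.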

In conclusion, this study constructs the VillainScope dataset and applies topic modeling to uncover villains’ universal characteristics. By overcoming the limitations of prior qualitative research, it introduces a novel methodological framework for villain studies. Moreover, by proposing the expansion of VillainScope into a knowledge graph, the study deepens the scholarly potential of distant reading approaches. This enriched dataset enables analyses that move beyond stereotypical villain portrayals to examine variations shaped by works, genres, and creators. Ultimately, it provides profound humanistic insights into how changing societal values are projected onto narrative characters across media.

* This work is a revised and complemented part of the author's Master's thesis, titled 'Villainscope: A Study of Emotion Analysis and Constructing a Dialogue Generation System Based on a Villain Dataset,' which was submitted to The Graduate School of Hongik University in 2025.

** GitHub: https://github.com/eiloppang/VillainScope_project

Keywords

Villain, Topic Modeling, Graph-based RAG (Retrieval-Augmented Generation), Distant Reading, Digital Humanities

References

[1] Bandy, J., & Vincent, N. (2021). Addressing “documentation debt” in machine learning research: A retrospective datasheet for BookCorpus. arXiv preprint arXiv:2105.05241.

[2] Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., & Allen, J. (2016). A corpus and cloze evaluation for deeper understanding of commonsense stories. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 839–849).

[3] Finlayson, M. A. (2017). ProppLearner: Deeply annotating a corpus of Russian folktales to enable the machine learning of a Russian formalist theory. Digital Scholarship in the Humanities, 32(2), 284–300.

[4] Brahman, F., Huang, M., Tafjord, O., Zhao, C., Sachan, M., & Chaturvedi, S. (2021). “Let Your Characters Tell Their Story”: A Dataset for Character-Centric Narrative Understanding. arXiv preprint arXiv:2109.05438.

[5] Papoudakis, A., Lapata, M., & Keller, F. (2024). BookWorm: A dataset for character description and analysis. arXiv preprint arXiv:2410.10372.

[6] Baruah, S., & Narayanan, S. (2024). CHATTER: A Character Attribution Dataset for Narrative Understanding. arXiv preprint arXiv:2411.05227.