Submission 114
Extending scholarly databases with LLMs: opportunities and challenges
SP01-03
Presented by: Donald Sturgeon
The advent of increasingly sophisticated deep learning models – in particular, practical generative Large Language Models (LLMs) capable of producing fluent, human-like text – presents opportunities and challenges for large-scale digital analysis of historical writing, as well as for digital libraries and other systems that mediate human access to historical primary source materials. Generative tasks – beginning with, but by no means limited to, translation, summarization, and question answering – open up many possibilities for automated contextual assistance with navigation and reading, as well as for extracting many types of information that would previously have required human effort or the training of special-purpose models to obtain.
This paper provides an overview of recent progress and ongoing work integrating a variety of modern AI techniques into the Chinese Text Project (https://ctext.org), a widely used digital library of premodern Chinese written works. Recent concrete applications of NLP and AI to the project include: 1) augmenting the content of a historical knowledge graph by extracting precise machine-readable data from natural language descriptions found in historical sources; 2) contextualizing texts through automated and semi-automated annotation; 3) using generative AI to produce precisely referenced, multilingual natural language contextualization of historical entities, grounded in primary source evidence; and 4) applying embedding-based retrieval for concept and similarity search, as well as text reuse identification, across large volumes of text.
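To illustrate application (4), the sketch below shows embedding-based concept and similarity search in its simplest form: passages and a query are mapped into a shared vector space and ranked by cosine similarity. The model name, example passages, and query are illustrative assumptions; the abstract does not specify the embedding model or retrieval stack the project actually uses.

```python
# Minimal sketch of embedding-based similarity search (application 4).
# Model name, passages, and query are illustrative, not project details.
import numpy as np
from sentence_transformers import SentenceTransformer

passages = [
    "子曰：學而時習之，不亦說乎？",  # Analects 1.1 (illustrative)
    "有朋自遠方來，不亦樂乎？",
    "溫故而知新，可以為師矣。",
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Encode passages once; with normalized embeddings, cosine similarity
# reduces to a dot product.
passage_emb = model.encode(passages, normalize_embeddings=True)

def search(query: str, top_k: int = 2):
    """Return the top_k passages most similar to the query by cosine similarity."""
    query_emb = model.encode([query], normalize_embeddings=True)[0]
    scores = passage_emb @ query_emb
    best = np.argsort(-scores)[:top_k]
    return [(passages[i], float(scores[i])) for i in best]

# With a multilingual model, concept search works across languages:
for passage, score in search("the joy of reviewing what one has learned"):
    print(f"{score:.3f}  {passage}")
```

The same machinery extends to text reuse identification by comparing every passage embedding against every other and flagging pairs above a similarity threshold.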
Using concrete examples encountered during development, this paper considers how more traditional digital approaches can be productively combined with modern AI techniques to improve outcomes and address limitations common to contemporary LLMs. In particular, it explores the extent to which explicit contextualization reduces hallucination when generating content from premodern primary source materials. By integrating data previously curated through primary source annotation and knowledge graph construction into LLM text generation, this work demonstrates how reducing contextual ambiguity can enable more reliable and trustworthy LLM-generated contextualization of historical primary source content.
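As a rough sketch of this grounding strategy, the example below assembles an LLM prompt from curated, referenced knowledge-graph facts about entities annotated in a passage, instructing the model to rely only on the supplied context. The entity records, passage, prompt wording, and the call_llm() helper are all hypothetical stand-ins for the project's actual knowledge graph and model interface.

```python
# Minimal sketch of grounding LLM generation in curated knowledge-graph data
# to reduce contextual ambiguity. All records, the passage, and call_llm()
# are hypothetical placeholders, not the project's actual data or API.

knowledge_graph = {
    "王安石": {
        "label": "Wang Anshi (1021-1086)",
        "facts": [
            "Song dynasty statesman and poet",
            "architect of the New Policies reforms",
        ],
        "source": "ctext.org entity record (illustrative)",
    },
}

def build_grounded_prompt(passage: str, entities: list[str]) -> str:
    """Prepend curated, referenced facts about annotated entities to the
    prompt, and instruct the model to use only the supplied context."""
    context_lines = []
    for name in entities:
        record = knowledge_graph.get(name)
        if record:
            facts = "; ".join(record["facts"])
            context_lines.append(
                f"- {name}: {record['label']}; {facts} [{record['source']}]"
            )
    context = "\n".join(context_lines) or "(no curated context available)"
    return (
        "Context (curated, with references):\n"
        f"{context}\n\n"
        f"Passage:\n{passage}\n\n"
        "Explain the passage for a general reader using only the context "
        "above; cite the bracketed references, and state explicitly if the "
        "context is insufficient rather than guessing."
    )

prompt = build_grounded_prompt("荆公新法行，天下議之。", ["王安石"])
# response = call_llm(prompt)  # hypothetical model call; any chat API would do
print(prompt)
```

The design point is that disambiguation happens before generation: the annotation layer resolves who or what a passage refers to, so the model is asked to explain rather than to recall, narrowing the space in which hallucination can occur.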