13:00 - 14:00
Location: Digital Scholarship Lab (G/F University Library)
Submission 84
Benchmarking Narrative Comprehension in Long-Context Large Language Models
Poster-07
Presented by: Tong Li
The Chinese University of Hong Kong
The proliferation of Large Language Models (LLMs) with extended context windows has spurred the development of sophisticated applications, from multi-document analysis to repository-level coding assistants, all of which demand robust long-context understanding and reasoning capabilities. However, evaluating these capabilities remains a significant challenge. Existing benchmarks, while valuable, often suffer from critical limitations that allow models to achieve high scores through shortcuts, without demonstrating true comprehension. These shortcuts include leveraging memorized parametric knowledge from pre-training data, solving tasks with localized retrieval rather than global context integration, and relying on shallow reasoning or summarization techniques. To address these shortcomings, our team, led by Yu et al. (2025), introduces a novel benchmark designed to rigorously assess an LLM’s ability to perform global comprehension and multi-step reasoning over long narratives.

This poster at the HKADH conference shares my experience contributing to the construction and validation of our benchmark dataset. The benchmark formulates a unique task: given the full text of a novel, the model must determine whether a newly generated, hypothetical prequel story for a key character is consistent or contradictory with the established narrative. This prequel entailment framework inherently mitigates the flaws of prior benchmarks. Because the prequels are novel creations, they do not exist in any model’s training data, thus neutralizing the memorization shortcut and forcing reliance on the provided context. The task is explicitly designed to require holistic understanding, as verifying consistency often involves aggregating subtle clues and tracking character arcs scattered across the entire book. Our analysis confirms this, revealing that 88% of instances require evidence from multiple, non-local parts of the narrative to be solved correctly. Furthermore, the task necessitates deep, multi-step reasoning, as the implications of a prequel on the canonical story are often indirect and require causal inference rather than simple fact retrieval.
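As a concrete illustration, the prequel-entailment task can be framed as binary classification over (novel, prequel) pairs. The sketch below is illustrative only; the function names, prompt wording, and label-parsing fallback are assumptions of this sketch, not the benchmark’s actual interface:

```python
# Minimal sketch of the prequel-entailment task format. All names and
# the prompt template here are illustrative assumptions, not the
# benchmark's real API.

LABELS = ("consistent", "contradict")

def build_prompt(novel_text: str, character: str, prequel: str) -> str:
    """Assemble one instance: the full novel is the context, and the
    model must judge a newly generated, hypothetical prequel."""
    return (
        f"Novel:\n{novel_text}\n\n"
        f"Below is a newly written, hypothetical prequel for {character}.\n"
        f"Prequel:\n{prequel}\n\n"
        "Based only on the novel above, is the prequel consistent or "
        f"contradictory with the established narrative? "
        f"Answer with one word: {' or '.join(LABELS)}."
    )

def parse_label(model_output: str) -> str:
    """Map a free-form model response to one of the two task labels."""
    text = model_output.strip().lower()
    for label in LABELS:
        if label in text:
            return label
    return "contradict"  # arbitrary fallback, chosen only for this sketch
```

Because the prequels are newly generated, the context in the prompt (the full novel) is the only admissible evidence, which is what neutralizes the memorization shortcut.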

Our dataset was meticulously constructed to ensure quality, diversity, and challenge. We selected literary works spanning multiple genres (e.g., adventure, historical fiction, crime fiction) and languages (English and Chinese). For key characters across these books, we used state-of-the-art LLMs to generate a pool of candidate prequels. These generated prequels were then subjected to a human annotation process. Annotators, each deeply familiar with the selected books, labeled each prequel according to a detailed typology that distinguishes between different forms of consistency and contradiction (e.g., Local Contradict, Global Contradict, Consistent). This process yielded a high-quality dataset of more than seven hundred instances. The annotation was guided by strict principles to ensure objectivity, such as basing judgments solely on the original novel’s content and assuming direct continuity between the prequel and the story. The quality of this process is reflected in a substantial inter-annotator agreement, with a Kappa score of 0.7828. By focusing on prequel entailment, our task design moves beyond simple information retrieval and measures a form of fluid intelligence (the ability to reason about and solve novel problems) in the natural-language space, a significant departure from benchmarks that test crystallized knowledge.
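For intuition on the reported agreement figure, here is a minimal sketch of Cohen’s kappa between two annotators (assuming the 0.7828 score is a Cohen’s-style kappa; the code below is an illustration, not our actual annotation tooling):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same instances:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement rate and p_e is the agreement expected by chance,
    estimated from each annotator's label marginals."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of instances with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of per-label marginal frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

A kappa near 0.78 indicates substantial agreement well above chance, which matters here because the typology asks annotators to separate subtle categories of contradiction rather than make a single coarse judgment.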

We conducted extensive experiments on our benchmark using a suite of state-of-the-art open-source and commercial LLMs, including the Qwen, DeepSeek, and Gemini families. We evaluated models under several paradigms: vanilla few-shot in-context learning (ICL), Retrieval-Augmented Generation (RAG), in-domain fine-tuning, many-shot ICL, and commercial agentic systems such as OpenAI’s DeepResearch. Our findings reveal a significant and persistent gap between machine and human capabilities in long-context reasoning:

(1) Significant human-machine performance gap. The best-performing model, Gemini-2.5-Pro, still lags behind human performance by over 15% in F1 score, highlighting the profound difficulty of the task for current systems.

(2) Flawed reasoning underlying correct answers. A critical contribution of our work is a manual human study of model reasoning. We found that LLMs frequently arrive at the correct binary answer (consistent or contradict) through entirely flawed or invalid reasoning, creating a reasoning-accuracy gap of over 30% relative to humans, who demonstrate consistent and valid reasoning paths. This underscores the superficial nature of current models’ understanding and shows that answer accuracy alone is a misleading metric for evaluating true comprehension.

(3) Ineffectiveness of scaling and advanced methods. Simply providing more context via RAG does not guarantee improvement and can even degrade the performance of stronger models such as Gemini-2.5-Pro, which may over-rely on their powerful parametric knowledge. Furthermore, neither in-domain fine-tuning nor many-shot ICL yielded performance gains, suggesting that the reasoning capabilities required by our benchmark are not easily elicited and represent a fundamental limitation of current architectures.

(4) Resilience to web-based shortcuts. Commercial DeepResearch systems performed poorly compared to RAG-based approaches, confirming that the problems in our dataset cannot be solved by retrieving existing literary analyses from the web. This validates our benchmark’s design as a true test of reasoning over the provided context.
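Since the headline human-machine gap is reported in F1, the sketch below shows the metric for the binary consistent/contradict labels. Treating "contradict" as the positive class is an assumption of this sketch, not necessarily the benchmark’s scoring convention:

```python
def f1_score(gold, pred, positive="contradict"):
    """Binary F1: harmonic mean of precision and recall for the
    positive class. Which label counts as positive is a choice made
    for this sketch."""
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0  # no true positives: both precision and recall are 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

F1 is the headline number, but as finding (2) shows, it can overstate comprehension: a model may land on the correct label through invalid reasoning, which is why we pair it with the manual reasoning-accuracy study.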

Selected References:

Yu, M., et al. (2025). PRELUDE: A benchmark designed to require global comprehension and reasoning over long contexts. Under review at ACL.

Agarwal, R., et al. (2024). Many-shot in-context learning. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

Bai, Y., et al. (2024). LongBench: A bilingual, multitask benchmark for long context understanding. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics.

Karpinska, M., et al. (2024). One thousand and one pairs: A "novel" challenge for long-context language models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

Kočiský, T., et al. (2018). The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6, 317-328.

Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459-9474.

Xu, X., et al. (2025). DetectiveQA: Evaluating long-context reasoning on detective novels. Workshop on Reasoning and Planning for Large Language Models.