Large Language Models for Statistical Inference: Context Augmentation with Applications to the Two-Sample and Regression Problems

P5-S125-4

Presented by: Marc Ratkovic

Marc Ratkovic, Haoyu Zhai

University of Mannheim

Statistical inference with text data poses challenges due to the high dimensionality and unstructured nature of language. We introduce context augmentation, a framework that leverages large language models (LLMs) to generate semantically rich contexts for observed text, facilitating statistical analysis. Building upon classical foundations in data augmentation and empirical process theory, our approach decomposes complex text-event relationships into tractable probabilistic components under minimal assumptions. We demonstrate the utility of this framework through applications to the two-sample problem and regression analysis. Empirical examples illustrate its practical effectiveness in real-world scenarios. By bridging statistical inference and natural language processing, our work provides tools for inference on unstructured text data and connects to recent advances in text modeling and causal inference.

Keywords: Large Language Models, Causal Inference, Political Methodology
