LLMs as Models of Whose Language? A Longitudinal Comparison of Child and LLM-Generated German Corpora

16:30 - 18:00

Computational Models of Language Generation

Room: HSZ - N2

Chair/s:

Hanna Woloszyn

Computational modeling has become an increasingly important approach for studying language generation across speaking, writing, and associative processes. Approaches range from distributional semantics and transformer-based architectures to learning-based production models, each providing different ways to formalize and investigate how meaning, structure, and behavior emerge in linguistic systems. This symposium presents five studies at the intersection of cognitive science, computational linguistics, and psycholinguistics that use these methods to better understand language production and related cognitive processes.
The symposium starts with a study investigating whether LLM-generated corpora can simulate the longitudinal development of children's texts, using several psycholinguistic variables to compare the child and model-generated language. The second talk explores whether visual characteristics of pictures, beyond their conceptual or lexical representations, contribute to interference effects in picture-word interference tasks by integrating modern vision-language embeddings with behavioral data. The third project validates centroid analysis with word embeddings as a method to infer concept representations from participants’ open-ended verbal responses, such as free associations, word substitutions, and feature generations. The fourth talk proposes a computational model that accounts for semantic interference phenomena in language production by implementing an incremental learning mechanism within an interactive production network. Finally, we present a study that investigates the psychometric capacities of language models in the verbal fluency task, an experimental paradigm used to examine human knowledge retrieval, cognitive performance, and creative abilities.
By bringing together different perspectives, the symposium encourages a discussion on what it means to "model" language production and how such modeling can contribute to our understanding of human cognition and language processing. Together, these projects will increase our understanding of language models' potential benefits and limitations.

Submission 428

LLMs as Models of Whose Language? A Longitudinal Comparison of Child and LLM-Generated German Corpora

SymposiumTalk-01

Presented by: Hanna Woloszyn

Hanna Woloszyn, Benjamin Gagl

University of Cologne, Germany

Developmental changes in linguistic skills during literacy acquisition are prominent and fast-paced. In this study, we investigate whether LLMs can reproduce such longitudinal changes in human language development. We compare child-written and LLM-generated texts produced in response to picture stories. In the chosen children’s corpus, each child described one story per year, which allows a longitudinal investigation. We generated LLM texts that reproduce the longitudinal part of the corpus, using different prompting strategies that each incorporated progressively more participant-specific data. We compared human- and model-generated texts using psycholinguistic corpus analysis, focusing on lexical and syntactic measures. We found that child texts significantly increased in length and lexical diversity with age. LLM-based corpora also showed an increase in token count with age. However, the LLM texts were consistently longer overall and exhibited higher lexical diversity than the child texts. Part-of-speech distributions also varied significantly across ages and corpora. Providing the model with varying amounts of additional information had minimal impact on the outcomes. These findings suggest that LLMs have a limited representation of child language and are unable to capture developmental changes, even when prompts are amended with detailed participant-specific information. Therefore, while LLMs may serve as useful engineering tools, for now, they cannot model child language or linguistic changes during literacy acquisition.
