Submission 428
LLMs as Models of Whose Language? A Longitudinal Comparison of Child and LLM-Generated German Corpora
SymposiumTalk-01
Presented by: Hanna Woloszyn
Developmental changes in linguistic skills during literacy acquisition are prominent and fast-paced. In this study, we investigate whether LLMs can effectively reproduce such longitudinal changes in human language development. We compare texts written by children with LLM-generated texts, both produced in response to picture stories. In the chosen children’s corpus, each child described one story each year, allowing us to conduct a longitudinal investigation. We generated LLM texts that reproduce the longitudinal part of the corpus, using several prompting strategies that incorporated increasing amounts of participant-specific data. We compared human- and model-generated texts using psycholinguistic corpus analysis, focusing on lexical and syntactic measures. We found that child texts significantly increased in length and lexical diversity with age. LLM-based corpora also showed an increase in token count with age. However, the LLM-generated texts were consistently longer overall and exhibited higher lexical diversity than the children’s texts. Part-of-speech usage also varied significantly across ages and corpora. Providing the model with varying amounts of additional information had minimal impact on the outcomes. These findings suggest that LLMs have a limited representation of child language and are unable to capture developmental changes, even when prompts are amended with detailed participant-specific information. Therefore, while LLMs may serve as useful engineering tools, for now, they cannot model child language or linguistic changes during literacy acquisition.
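As an illustration of the two lexical measures mentioned above (text length and lexical diversity), the following is a minimal sketch in Python. It is not the authors' pipeline: the abstract does not say which tokenizer or diversity measure was used, so the regex tokenizer, the type-token ratio (TTR), and the function names `tokenize` and `length_and_diversity` are all assumptions made for illustration.

```python
import re


def tokenize(text: str) -> list[str]:
    # Assumption: a token is a lowercased run of Unicode letters,
    # which keeps German umlauts and ß and drops digits and punctuation.
    return re.findall(r"[^\W\d_]+", text.lower())


def length_and_diversity(text: str) -> tuple[int, float]:
    # Returns (token count, type-token ratio).
    # TTR = number of distinct tokens / total number of tokens.
    tokens = tokenize(text)
    if not tokens:
        return 0, 0.0
    return len(tokens), len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    # Hypothetical example sentence, not taken from either corpus.
    sample = "Der Hund läuft. Der Hund bellt."
    count, ttr = length_and_diversity(sample)
    print(count, round(ttr, 3))
```

Because plain TTR tends to fall as texts get longer, a comparison between corpora of different lengths would likely need a length-corrected measure. The sketch only shows the basic quantities involved.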