Can Large Language Models (LLMs) describe pictures like children? A comparative corpus study.
Mon—HZ_9—Talks2—1203
Presented by: Hanna Woloszyn
Recent developments in Large Language Models (LLMs) have opened new opportunities for linguistic research. This study explores whether LLMs can replicate children’s language by comparing an LLM-generated corpus to the Litkey Corpus, a collection of German children’s texts describing a set of eight picture stories. First, we prompted a multimodal LLM with the same picture stories to generate the same number of texts as in the Litkey Corpus. With both the Litkey texts and the LLM-generated texts available, we then conducted a detailed comparative corpus analysis. As a first step, the study focuses on lexical-level metrics, namely word frequency and lexical richness. The results indicate that although the LLM-generated corpus contains more tokens, the Litkey Corpus shows greater lexical richness, suggesting that the children’s active vocabulary is more extensive. Word frequency analyses showed that both corpora follow natural language patterns (i.e., Zipf’s law), with few high-frequency words and many low-frequency ones. We also found a moderate correlation between the word frequencies calculated from the two corpora (r = 0.46). However, the corpora differ in the distribution of medium- and low-frequency words: the children used high-frequency words more often than the LLM did. These findings suggest that while LLMs can, to some extent, describe pictures like children, significant limitations remain. Future investigations should test systematic variations of model parameters and additional metrics to improve LLM-generated child corpora, thereby contributing to the creation of much-needed psycholinguistic resources.
Keywords: Large Language Models, corpus linguistics, word frequency measures, children’s language
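The lexical-level metrics named in the abstract (token and type counts, lexical richness as type-token ratio, a Zipf-style rank-frequency list, and a Pearson correlation of per-word frequencies across two corpora) could be sketched as follows. This is a minimal illustration on hypothetical toy corpora, not the authors' actual pipeline; the example sentences and all function names are assumptions.

```python
from collections import Counter
from math import sqrt

def lexical_metrics(tokens):
    """Token count, type count, and type-token ratio for a tokenized text."""
    freqs = Counter(tokens)
    return {"tokens": len(tokens), "types": len(freqs),
            "ttr": len(freqs) / len(tokens), "freqs": freqs}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def frequency_correlation(freqs_a, freqs_b):
    """Correlate per-word frequencies over the shared vocabulary."""
    shared = sorted(set(freqs_a) & set(freqs_b))
    return pearson([freqs_a[w] for w in shared],
                   [freqs_b[w] for w in shared])

# Hypothetical toy corpora standing in for the children's and LLM texts.
children = "der hund läuft der hund bellt der ball rollt".split()
llm = "der hund läuft der hund läuft der ball ist rund".split()

m_child = lexical_metrics(children)
m_llm = lexical_metrics(llm)

# Zipf-style rank-frequency list: most frequent word first.
ranked = m_child["freqs"].most_common()

r = frequency_correlation(m_child["freqs"], m_llm["freqs"])
```

In a real study, tokenization, lemmatization, and frequency normalization (e.g., per million tokens) would precede these steps, and lexical richness measures less sensitive to corpus size than raw TTR might be preferred.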