Submission 423
Assessing the Validity of LLM-Derived Psycholinguistic Measures: The Case of Word Familiarity
SymposiumTalk-04
Presented by: Job Schepens
Recent research suggests that LLM-generated word familiarity ratings outperform traditional frequency measures in predicting adult lexical decision times (Brysbaert et al., 2024). Here, we report a validation study that assesses whether these ratings indeed reflect lexical knowledge learned by the model, or whether they partly result from memorization of existing human rating datasets present in the LLM's training data. This distinction has important implications for the validity of LLM-derived psycholinguistic measures.
We address this training-data leakage concern by using open-source language models with publicly documented training data. First, we select two such LLMs and verify that their training data excludes existing word familiarity ratings. Second, we derive familiarity estimates from these models using appropriate prompting.
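As an illustrative sketch only (not the authors' actual pipeline), eliciting familiarity estimates via prompting might look as follows; the prompt wording, the 1–7 scale, and the parsing logic are assumptions chosen to mirror common human norming instructions.

```python
# Hypothetical sketch: build a familiarity prompt and parse a numeric
# rating from a model's free-text reply. The scale and wording are
# illustrative assumptions, not the study's actual materials.
import re
from typing import Optional


def build_prompt(word: str) -> str:
    """Ask the model for a 1-7 familiarity rating, echoing the style of
    instructions used in human familiarity norming studies."""
    return (
        f"On a scale from 1 (very unfamiliar) to 7 (very familiar), "
        f"how familiar is the German word '{word}' to an average reader? "
        f"Answer with a single number."
    )


def parse_rating(response: str) -> Optional[float]:
    """Extract the first number in the reply; return None if no number is
    found or it falls outside the 1-7 range."""
    match = re.search(r"\d+(?:\.\d+)?", response)
    if match is None:
        return None
    value = float(match.group())
    return value if 1.0 <= value <= 7.0 else None


# Example with a hypothetical model reply:
print(parse_rating("I would rate 'Haus' a 6.5 out of 7."))  # → 6.5
```

In practice the prompt would be sent to the selected open-source model and the parsed ratings averaged over repeated samples to reduce response noise.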
We evaluate these LLM-derived measures against German children's word reading times from the DeveL corpus, comparing them to traditional corpus frequencies from DWDS and childLex. Using pre-registered analyses, we test whether LLM-based familiarity and frequency measures explain variance in reading times beyond standard controls (word length, orthographic features).
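The incremental-variance logic of this test can be sketched with synthetic data: fit a baseline model with controls only, add the familiarity predictor, and compare R². The data, effect sizes, and model form below are illustrative assumptions; the actual analysis follows the pre-registered specification on the DeveL reading times.

```python
# Illustrative sketch of an incremental-variance (Delta R^2) comparison:
# baseline OLS with a length control vs. a model adding an LLM-derived
# familiarity rating. All data here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 500
length = rng.integers(3, 12, n).astype(float)     # word-length control
familiarity = rng.normal(4.0, 1.0, n)             # simulated 1-7-style rating
# Simulated reading times: longer words slower, familiar words faster.
rt = 600 + 20 * length - 30 * familiarity + rng.normal(0, 40, n)


def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()


r2_base = r_squared(length[:, None], rt)
r2_full = r_squared(np.column_stack([length, familiarity]), rt)
print(f"Delta R^2 from familiarity: {r2_full - r2_base:.3f}")
```

The same comparison extends to nested models with additional orthographic controls and to frequency-based predictors from DWDS and childLex.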
This study provides an empirical validation of LLM-derived lexical measures while addressing concerns about training-data contamination. Our findings inform best practices for using language models as tools in psycholinguistic research and clarify the conditions under which LLM-generated measures can be used for obtaining lexical variables.
Brysbaert, M., Martínez, G., & Reviriego, P. (2024). Moving beyond word frequency based on tally counting: AI-generated familiarity estimates of words and phrases are an interesting additional index of language knowledge. Behavior Research Methods, 57(1), 28.