Submission 423
Assessing the Validity of LLM-Derived Psycholinguistic Measures: The Case of Word Familiarity
SymposiumTalk-04
Presented by: Job Schepens
Recent research suggests that LLM-generated word familiarity ratings outperform traditional frequency measures in predicting adult lexical decision times (Brysbaert et al., 2024). Here, we report a validation study that assesses whether these ratings indeed reflect lexical knowledge learned by the model, or whether they partly result from memorization of existing human rating datasets present in the LLM's training data. This distinction has important implications for the validity of LLM-derived psycholinguistic measures.
We address this training-data leakage concern by using open-source language models with publicly documented training data. First, we select two such LLMs and verify that their training data excludes existing word familiarity ratings. Second, we derive familiarity estimates from these models using appropriate prompting.
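As an illustrative sketch only (not the authors' actual pipeline), eliciting familiarity estimates via prompting might look as follows; the prompt wording, the 1–7 scale, and the parsing logic are assumptions chosen to mirror common human norming instructions.

```python
# Hypothetical sketch: build a familiarity prompt and parse a numeric
# rating from a model's free-text reply. The scale and wording are
# illustrative assumptions, not the study's actual materials.
import re
from typing import Optional


def build_prompt(word: str) -> str:
    """Ask the model for a 1-7 familiarity rating, echoing the style of
    instructions used in human familiarity norming studies."""
    return (
        f"On a scale from 1 (very unfamiliar) to 7 (very familiar), "
        f"how familiar is the German word '{word}' to an average reader? "
        f"Answer with a single number."
    )


def parse_rating(response: str) -> Optional[float]:
    """Extract the first number in the reply; return None if no number is
    found or it falls outside the 1-7 range."""
    match = re.search(r"\d+(?:\.\d+)?", response)
    if match is None:
        return None
    value = float(match.group())
    return value if 1.0 <= value <= 7.0 else None


# Example with a hypothetical model reply:
print(parse_rating("I would rate 'Haus' a 6.5 out of 7."))  # → 6.5
```

In practice the prompt would be sent to the selected open-source model and the parsed ratings averaged over repeated samples to reduce response noise.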
We evaluate these LLM-derived measures against German children's word reading times from the DeveL corpus, comparing them to traditional corpus frequencies from DWDS and childLex. Using pre-registered analyses, we test whether LLM-based familiarity and frequency measures explain variance in reading times beyond standard controls (word length, orthographic features).
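The incremental-variance logic of this test can be sketched with synthetic data: fit a baseline model with controls only, add the familiarity predictor, and compare R². The data, effect sizes, and model form below are illustrative assumptions; the actual analysis follows the pre-registered specification on the DeveL reading times.

```python
# Illustrative sketch of an incremental-variance (Delta R^2) comparison:
# baseline OLS with a length control vs. a model adding an LLM-derived
# familiarity rating. All data here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 500
length = rng.integers(3, 12, n).astype(float)     # word-length control
familiarity = rng.normal(4.0, 1.0, n)             # simulated 1-7-style rating
# Simulated reading times: longer words slower, familiar words faster.
rt = 600 + 20 * length - 30 * familiarity + rng.normal(0, 40, n)


def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()


r2_base = r_squared(length[:, None], rt)
r2_full = r_squared(np.column_stack([length, familiarity]), rt)
print(f"Delta R^2 from familiarity: {r2_full - r2_base:.3f}")
```

The same comparison extends to nested models with additional orthographic controls and to frequency-based predictors from DWDS and childLex.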
This study provides an empirical validation of LLM-derived lexical measures while addressing concerns about training-data contamination. Our findings inform best practices for using language models as tools in psycholinguistic research and clarify the conditions under which LLM-generated measures can be used for obtaining lexical variables.
Brysbaert, M., Martínez, G., & Reviriego, P. (2024). Moving beyond word frequency based on tally counting: AI-generated familiarity estimates of words and phrases are an interesting additional index of language knowledge. Behavior Research Methods, 57(1), 28.