Submission 193
Of Apples and Oranges: Comparing LLM to Human Performance Requires Experimental Rigor
SymposiumTalk-01
Presented by: Fritz Günther
The rapid advancement of Large Language Models (LLMs) has sparked substantial interest in comparing their performance with that of humans. However, what constitutes a fair and meaningful comparison of their performance and underlying capacities remains an open question, especially since different comparison methods lead to substantially different conclusions.
As a starting point, we systematically elicited grammaticality judgments from three LLMs via prompting, and found low linguistic accuracy, unstable response patterns across repeated prompts, and a pervasive yes-response bias, all of which differ substantially from human responses to the very same tasks. In a response, Hu et al. (2024) argued that prompting only elicits meta-linguistic judgments and does not assess the true underlying capacities of LLMs. Comparing the probabilities LLMs assign to minimal pairs of grammatical versus ungrammatical sentences, they found lower probabilities for the ungrammatical sentences in almost all cases. However, this fundamentally changes the comparison: comparative grammaticality judgments on pairs of sentences are a different task from absolute grammaticality judgments on individual sentences. Additionally, we can show that higher probabilities in minimal-pair comparisons are not specific to grammaticality judgments, but are also found for pragmatically or semantically sound versus odd sentences. Therefore, such minimal-pair comparisons are not diagnostic of grammatical violations either.
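To make concrete what such a minimal-pair probability comparison involves, the sketch below scores a grammatical/ungrammatical sentence pair with a causal language model via the Hugging Face transformers library. This is an illustrative example only, not the procedure used in the studies above; the model name and the example sentence pair are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model chosen for illustration; any causal LM with a
# transformers interface could be substituted.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Summed log-probability the model assigns to the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    # outputs.loss is the mean negative log-likelihood per predicted token;
    # the first token is not predicted, so the number of scored tokens is length - 1.
    n_scored = inputs["input_ids"].size(1) - 1
    return -outputs.loss.item() * n_scored

# Hypothetical minimal pair differing only in subject-verb agreement.
grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."

lp_gram = sentence_log_prob(grammatical)
lp_ungram = sentence_log_prob(ungrammatical)
print(f"log p(grammatical)   = {lp_gram:.2f}")
print(f"log p(ungrammatical) = {lp_ungram:.2f}")
print("Higher probability for grammatical sentence:", lp_gram > lp_ungram)

Note that this procedure yields only a relative preference within the pair; it is a different measurement from prompting the model for an absolute grammaticality judgment on a single sentence, which is precisely the contrast drawn above.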
Together, these studies illustrate that empirical comparisons between humans and LLMs require rigorous experimental design, aligning independent variables, dependent variables, and testing conditions.