Submission 193
Of Apples and Oranges: Comparing LLM to Human Performance Requires Experimental Rigor
SymposiumTalk-01
Presented by: Fritz Günther
The rapid advancement of Large Language Models (LLMs) has sparked substantial interest in comparing their performance with that of humans. However, what constitutes a fair and meaningful comparison of their performance and underlying capacities remains an open question, especially since different comparison methods lead to substantially different conclusions.
As a starting point, we systematically elicited grammaticality judgments from three LLMs via prompting, and found low linguistic accuracy, unstable response patterns across repeated prompts, and a pervasive yes-response bias, all of which differ substantially from human responses to the very same tasks. In a response, Hu et al. (2024) argued that prompting only elicits meta-linguistic judgments and does not assess the true underlying capacities of LLMs. Comparing the probabilities LLMs assign to minimal pairs of grammatical versus ungrammatical sentences, they found lower probabilities for the ungrammatical sentences in almost all cases. However, this fundamentally changes the comparison: comparative grammaticality judgments on pairs of sentences are a different task from absolute grammaticality judgments on individual sentences. Additionally, we can show that higher probabilities in minimal-pair comparisons are not specific to grammaticality judgments, but are also found for pragmatically or semantically sound versus odd sentences. Therefore, such minimal-pair comparisons are not diagnostic of grammatical violations either.
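To make concrete what such a minimal-pair probability comparison involves, the sketch below scores a grammatical/ungrammatical sentence pair with a causal language model via the Hugging Face transformers library. This is an illustrative example only, not the procedure used in the studies above; the model name and the example sentence pair are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model chosen for illustration; any causal LM with a
# transformers interface could be substituted.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Summed log-probability the model assigns to the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    # outputs.loss is the mean negative log-likelihood per predicted token;
    # the first token is not predicted, so the number of scored tokens is length - 1.
    n_scored = inputs["input_ids"].size(1) - 1
    return -outputs.loss.item() * n_scored

# Hypothetical minimal pair differing only in subject-verb agreement.
grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."

lp_gram = sentence_log_prob(grammatical)
lp_ungram = sentence_log_prob(ungrammatical)
print(f"log p(grammatical)   = {lp_gram:.2f}")
print(f"log p(ungrammatical) = {lp_ungram:.2f}")
print("Higher probability for grammatical sentence:", lp_gram > lp_ungram)

Note that this procedure yields only a relative preference within the pair; it is a different measurement from prompting the model for an absolute grammaticality judgment on a single sentence, which is precisely the contrast drawn above.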
Together, these studies illustrate that empirical comparisons between humans and LLMs require rigorous experimental design, aligning independent variables, dependent variables, and testing conditions.