08:30 - 10:00
Tue—HZ_10—Talks4—38
Room: HZ_10
Chair/s: Dirk U Wulff
Evaluating language model alignment with free associations
Tue—HZ_10—Talks4—3804
Presented by: Dirk Wulff
Dirk Wulff 1,2*, Zak Hussain 2,1, Samuel Aeschbach 1,2, Rui Mata 2
1 Max Planck Institute for Human Development, 2 University of Basel
The alignment between large language models and human knowledge and preferences is central to the safe and fair deployment of such tools. A number of approaches to quantifying alignment exist, but current work is fragmented, preventing an overview across categories of stimuli and demographic groups. We propose that free associations from massive citizen-science projects can advance representational alignment research by helping evaluate both content and demographic inclusivity. We assess the representational alignment between multiple closed (e.g., GPT-4) and open (e.g., Llama-3.1) models and data from the English Small World of Words Study (ca. 80,000 respondents, 3.7 million responses). Our results indicate that while language models can capture some procedural signatures of human responses, they show heterogeneous alignment across stimulus categories, poor representational alignment for controversial topics (e.g., religion, nationality), and differential representation of demographic groups (e.g., males, females). Overall, our work suggests that free associations can be used to evaluate the representational alignment of large language models.
Keywords: large language models, semantic representations, free associations, individual differences
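As a rough illustration of what such an alignment evaluation might involve (a minimal sketch, not the authors' actual pipeline): one could collect a model's free-association responses to a cue word and compare the resulting response distribution against the human norms, for instance via cosine similarity over response frequencies. The cue, responses, and scores below are hypothetical.

    from collections import Counter

    def association_alignment(human_responses, model_responses):
        """Cosine similarity between two free-association
        frequency distributions for the same cue word."""
        h, m = Counter(human_responses), Counter(model_responses)
        vocab = set(h) | set(m)
        dot = sum(h[w] * m[w] for w in vocab)
        norm_h = sum(v * v for v in h.values()) ** 0.5
        norm_m = sum(v * v for v in m.values()) ** 0.5
        return dot / (norm_h * norm_m) if norm_h and norm_m else 0.0

    # Hypothetical responses to the cue "bread" (illustrative only)
    human = ["butter", "butter", "loaf", "toast", "wheat"]
    model = ["butter", "toast", "toast", "flour", "loaf"]
    print(association_alignment(human, model))  # 5 / 7 ≈ 0.71

Aggregating such per-cue scores within stimulus categories or demographic subgroups would then yield the kind of category- and group-level alignment comparisons described in the abstract.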