16:30 - 18:00
Parallel sessions 6
Room: HSZ - N2
Chair/s:
Hanna Woloszyn
Computational modeling has become an increasingly important approach for studying language generation across speaking, writing, and associative processes. Approaches range from distributional semantics and transformer-based architectures to learning-based production models, each providing different ways to formalize and investigate how meaning, structure, and behavior emerge in linguistic systems. This symposium presents five studies at the intersection of cognitive science, computational linguistics, and psycholinguistics that use these methods to better understand language production and related cognitive processes. 
The symposium starts with a study investigating whether LLM-generated corpora can simulate the longitudinal development of children's texts, using a range of psycholinguistic variables to compare the produced language. The second talk explores whether visual characteristics of pictures, beyond their conceptual or lexical representations, contribute to interference effects in picture–word interference tasks by integrating modern vision–language embeddings with behavioral data. The third project uses word embeddings to validate centroid analysis as a method for inferring concept representations from participants' open-ended verbal responses, such as free associations, word substitutions, and feature generation. The fourth talk proposes a computational model that accounts for semantic interference phenomena in language production by implementing an incremental learning mechanism within an interactive production network. Finally, we present a study that investigates the psychometric capacities of language models in the verbal fluency task, an experimental paradigm used to examine human knowledge retrieval, cognitive performance, and creative abilities. 
By bringing together different perspectives, the symposium encourages a discussion on what it means to "model" language production and how such modeling can contribute to our understanding of human cognition and language processing. Together, these projects will increase our understanding of language models' potential benefits and limitations.
Submission 191
When Averages Shine: Computing Group- and Individual-Level Concept Representations Using Centroid Analysis
SymposiumTalk-03
Presented by: Aliona Petrenco
Aliona Petrenco, Fritz Günther
Department of Psychology, Humboldt-University, Berlin, Germany
While large-scale vector space models can be used to construct general, population-level meaning representations, they are often not suited for measuring concepts in specific individuals or groups, or within particular situations and contexts. To address this gap, the present work introduces centroid analysis—a computational method for quantifying variability in meaning representations by mapping open-ended verbal responses onto a semantic vector space and representing concepts as geometric centres (centroids) of the responses they elicit.

We evaluate this method using two distributional semantic models across several calculation methods, reference lexicon sizes, response types, and datasets, spanning tasks from single-word substitutions to single and multiple free associations and multiple feature generation.

At the group level, results show that centroid analysis performs best with multiple free associations (about 70 unique and 245 total responses per cue), using fastText for meaning-to-vector mapping for responses and cue concepts, and considering each response in the centroid calculation as often as it occurred in the data. In this setting, the cue concept is identified as the closest neighbour in the semantic neighbourhood of its response centroid in 50% of cases and within the 20 closest neighbours in 85% of cases.
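The group-level procedure described above can be sketched in a few lines: map each response onto an embedding, average the response vectors (counting each response as often as it occurred) to obtain the centroid, and check the cue concept's rank among the centroid's nearest neighbours in a reference lexicon. The sketch below uses hypothetical toy vectors in place of pretrained fastText embeddings; all names and values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Toy stand-in for a fastText-style embedding lookup (hypothetical vectors;
# the actual study maps words to pretrained fastText embeddings).
embeddings = {
    "dog":    np.array([0.9, 0.1, 0.0]),
    "cat":    np.array([0.8, 0.2, 0.1]),
    "bark":   np.array([0.7, 0.0, 0.2]),
    "leash":  np.array([0.6, 0.3, 0.0]),
    "banana": np.array([0.0, 0.9, 0.4]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def centroid(responses):
    """Frequency-weighted centroid: each response counts as often as it occurred."""
    vecs = [embeddings[w] for w in responses]  # repeated responses stay repeated
    return np.mean(vecs, axis=0)

def neighbour_rank(cue, responses, lexicon):
    """Rank of the cue in the reference lexicon, sorted by similarity
    to the response centroid (rank 1 = closest neighbour)."""
    c = centroid(responses)
    ranked = sorted(lexicon, key=lambda w: cosine(embeddings[w], c), reverse=True)
    return ranked.index(cue) + 1

# Free associations elicited by the cue "dog", with repetitions kept.
responses = ["bark", "cat", "leash", "bark"]
rank = neighbour_rank("dog", responses, lexicon=list(embeddings))
print(rank)  # → 2: here "cat" edges out the cue itself
```

Aggregating over many cues, the share of cases with rank 1 (or rank ≤ 20) gives exactly the identification percentages reported above.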

At the individual level, the best results are obtained using fastText and including at least eight responses per item per participant in the centroid calculation. In this setting, the cue concept is the closest neighbour of its response centroid in 22% of cases and within the 20 closest neighbours in 60% of cases.