Human vs. Large Language Model-Based Sampling: Evidence from Large-Scale Replications

Submission 154

panel.5-224 - Floor 1-03

Presented by: Tsz Kwan Wong

Tsz Kwan Wong, Maximilian Maier

University of Warwick

As large language models (LLMs) are increasingly used in social science research, it remains contested to what extent they can be meaningfully employed as experimental participants. Some papers argue that they approximate human responses well (Zhang et al., 2024), while others caution against their use (Dillion et al., 2023; Sarstedt et al., 2024). We revisit studies in psychology, business, and management from two large replication projects, Many Labs 2 (ML2) and the Management Science Replication Project (Many Labs 2 & Klein, 2018; Davis et al., 2023), using LLMs as experimental participants by engaging them in conversational tasks that mirror the original experimental designs. We employ multiple commercial and open-source models and compare direct prompting with a silicon sampling approach. We extend existing silicon sampling approaches (Argyle et al., 2023), in which models are assigned distinct preset personas based mostly on demographic variables, by adding psychological variables such as the Big Five personality traits.
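As a minimal sketch of what persona-based (silicon sampling) prompting could look like, the Python below draws a persona with demographic and Big Five variables and turns it into a prompt prefix. Everything in it is an illustrative assumption rather than the authors' procedure: the variable set, the value ranges, the 1-5 trait scale, the prompt wording, and the function names `sample_persona` and `persona_prompt` are all hypothetical.

```python
import random

# Hypothetical trait list; the abstract names only "the Big Five".
BIG_FIVE = ("openness", "conscientiousness", "extraversion",
            "agreeableness", "neuroticism")


def sample_persona(rng):
    """Draw one persona: demographics plus Big Five scores (assumed 1-5 scale)."""
    persona = {
        "age": rng.randint(18, 80),  # assumed adult age range
        "gender": rng.choice(["female", "male", "non-binary"]),
    }
    for trait in BIG_FIVE:
        persona[trait] = rng.randint(1, 5)
    return persona


def persona_prompt(persona):
    """Render a persona as a prompt prefix (wording is illustrative only)."""
    traits = ", ".join(f"{t} {persona[t]}/5" for t in BIG_FIVE)
    return (
        f"You are a {persona['age']}-year-old {persona['gender']} participant. "
        f"Your Big Five scores are: {traits}. "
        "Answer the following survey question as this person would."
    )


# A seeded generator keeps the sampled personas reproducible.
rng = random.Random(0)
prompts = [persona_prompt(sample_persona(rng)) for _ in range(3)]
```

Direct prompting, by contrast, would send the survey question without any such prefix, so every simulated participant shares the same context.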

Our pilot results, which focus on Many Labs 2, show that modelling participant heterogeneity through silicon sampling substantially increases the variability of responses. The average variance across all question responses was 2.4 times larger with silicon sampling than without it. We classify a study as replicated when the replication matches the primary study in statistical significance and direction. Using this definition, for the selected studies, 52.9% of the primary findings replicated in the Many Labs 2 project. When comparing LLM outcomes to Many Labs 2, 37.5% of studies showed consistent signals under silicon sampling, compared with 7.7% without silicon sampling. Among the studies that successfully replicated in Many Labs 2, consistency rates were 55.6% under silicon sampling, compared with 16.7% without it. For studies that did not replicate in Many Labs 2, consistency rates were substantially lower: 14.3% of these studies showed signals consistent with Many Labs 2 under silicon sampling, compared with 0% without it.
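The matching criterion and the two summary statistics above can be sketched in Python as follows. This is one possible reading, not the authors' analysis code: the significance threshold `alpha=0.05`, the use of the effect's sign as "direction", the choice of population variance, and all function names are assumptions made for illustration.

```python
from statistics import mean, pvariance


def same_signal(effect_a, p_a, effect_b, p_b, alpha=0.05):
    """True when two results agree in significance status and in direction.

    Assumptions: alpha=0.05 and direction = sign of the effect estimate;
    an effect of exactly zero is treated like a negative one.
    """
    same_significance = (p_a < alpha) == (p_b < alpha)
    same_direction = (effect_a > 0) == (effect_b > 0)
    return same_significance and same_direction


def consistency_rate(pairs, alpha=0.05):
    """Share of (effect_a, p_a, effect_b, p_b) tuples that agree."""
    hits = [same_signal(*pair, alpha=alpha) for pair in pairs]
    return sum(hits) / len(hits)


def mean_variance_ratio(persona_responses, direct_responses):
    """Average per-question variance with personas over that without.

    Each argument is a list of per-question response lists; population
    variance is an arbitrary choice here.
    """
    with_personas = mean(pvariance(r) for r in persona_responses)
    without_personas = mean(pvariance(r) for r in direct_responses)
    return with_personas / without_personas
```

With made-up inputs, `consistency_rate([(0.4, 0.01, 0.3, 0.02), (0.4, 0.01, -0.2, 0.03)])` counts the first pair as consistent and the second as inconsistent because the directions differ. Note that `mean_variance_ratio` is undefined when the direct-prompting responses have zero variance on every question.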

These findings inform ongoing debates about whether, and under what conditions, LLMs can serve as substitutes for human participants. Although silicon sampling somewhat improved the similarity of LLM responses to those of human participants, overall we find low consistency between LLM results and the ML2 replications, especially for studies that failed to replicate in ML2.