Submission 101
Perplexity as a Measure of Cognitive Freedom: Defamiliarization in Modern Chinese
SP05-05
Presented by: Maciej Kurzynski
Lingnan University, Hong Kong
This study explores the stylistic tension between predictability and surprise by employing the predictive mechanism of large language models (LLMs) as a proxy for historically situated readerly expectations. This approach is informed by a growing body of research revealing a significant alignment between the internal representations of LLMs and neural activity in the human brain. Connecting to a century-old dialogue between probabilists and humanists—from Andrey Markov’s analysis of Pushkin to Claude Shannon’s information theory—this work uses perplexity to quantify a model's "surprise" when encountering a text. Such a metric allows for new, empirical insights into the tension between the familiar and the unexpected, a dynamic central to Viktor Shklovsky’s theory of defamiliarization (ostranenie). In this framework, style can be understood not as a static property of a text, but as a dynamic, processual engagement with a reader's cognitive and predictive faculties.
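Concretely, perplexity is the exponential of the average negative log-probability a model assigns to each token in a sequence. The minimal sketch below uses plain Python and invented token probabilities (not outputs from the study's actual model) to show the computation and its interpretation:

```python
import math

def perplexity(token_probs):
    """Exponential of the mean negative log-probability the model
    assigns to each token: low when the text is predictable for the
    model, high when the text is 'surprising'."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model spreading probability uniformly over 4 candidate tokens is
# maximally uncertain among them: perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0

# Confident predictions (e.g., a familiar slogan) yield low
# perplexity; a single surprising token raises it sharply.
print(perplexity([0.9, 0.9, 0.9, 0.9]))
print(perplexity([0.9, 0.9, 0.05, 0.9]))
```

Intuitively, a perplexity of *k* means the model is, on average, as uncertain as if it were choosing uniformly among *k* continuations.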

The core of the research is a two-stage simulation designed to model the shaping of a reader's linguistic worldview in post-1949 China. First, a GPT-style Transformer model (223M parameters) was pre-trained from scratch on a large, high-quality, and general-purpose corpus of modern Chinese (FineWeb Edu V2). This "base model" represents a neutral reader with a broad, statistical understanding of the language. In the second stage, this model was fine-tuned exclusively on the Selected Works of Mao Zedong 毛澤東選集 for five epochs. This process simulates the intense, repetitive exposure to a single, dominant idiolect, effectively creating a specialized reader of "Maospeak"—the militant, ideologically charged language style that saturated Chinese public life during the Mao era (1949-1976) and continues to influence the Chinese language to this day.
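The two-stage design can be illustrated, at a much smaller scale, with a toy analogue. The sketch below substitutes an add-one-smoothed bigram model for the paper's 223M-parameter Transformer and tiny hand-written word lists for its corpora; the corpora, slogan, and epoch count are illustrative assumptions, but the structure mirrors the pipeline: a base model estimated on a general corpus, then a copy given repeated exposure to a single idiolect.

```python
import math
from collections import Counter

def add_counts(counts, tokens):
    """Accumulate bigram counts in place; calling this again on a new
    corpus simulates continued training (here, 'fine-tuning')."""
    for a, b in zip(tokens, tokens[1:]):
        counts[(a, b)] += 1

def perplexity(counts, vocab_size, tokens):
    """Perplexity of a token sequence under add-one (Laplace)
    smoothed bigram probabilities estimated from `counts`."""
    firsts = Counter()
    for (a, _), c in counts.items():
        firsts[a] += c
    log_sum, n = 0.0, 0
    for a, b in zip(tokens, tokens[1:]):
        p = (counts[(a, b)] + 1) / (firsts[a] + vocab_size)
        log_sum += math.log(p)
        n += 1
    return math.exp(-log_sum / n)

# Stage 1: "pretrain" on a small general-purpose corpus (a stand-in
# for FineWeb Edu V2); tokens are English words for readability.
general = ("the farmer sold grain at the market and the weather "
           "was mild so the harvest came early").split()
slogan = "resolutely thoroughly wholly completely annihilate".split()

base = Counter()
add_counts(base, general)

# Stage 2: "fine-tune" a copy of the base model through repeated
# exposure to the slogan, mimicking five epochs on a single idiolect.
tuned = base.copy()
for _ in range(5):
    add_counts(tuned, slogan)

vocab_size = len(set(general) | set(slogan))
before = perplexity(base, vocab_size, slogan)
after = perplexity(tuned, vocab_size, slogan)
print(before, after)  # the slogan becomes far less 'surprising'
```

Even at this scale, the mechanism is visible: repeated exposure concentrates probability mass on the slogan's bigrams, so the specialized model's perplexity on the slogan collapses relative to the base model's.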

By tracking the decrease in the model's perplexity on the Mao corpus before and after fine-tuning, the study identifies the core phraseology that becomes "automatized" through repeated exposure. The phrases with the most significant drop in perplexity are central to the era's political machinery, including canonical lists of class adversaries ("landlords, rich peasants, counter-revolutionaries, bad elements, and Rightists"), specific labels for political targets ("unrepentant capitalist-roader"; "Soviet revisionist renegade clique"), and formulaic rhetorical structures ("resolutely, thoroughly, wholly, and completely annihilate"). Visualizing these sequences as "perplexity landscapes" reveals a characteristic pattern: low-perplexity "canyons" of predictable slogans and jargon, interspersed with occasional peaks of surprise. This process offers a computational analogue for the Shklovskian concept of familiarization, where a powerful discourse creates a predictable linguistic universe that pushes its core tenets into the cognitive background.
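The "landscape" metaphor can be made concrete by plotting per-token surprisal (negative log-probability, in bits). The sketch below uses invented probabilities rather than the study's actual model outputs, and the 2-bit threshold is an arbitrary illustrative choice:

```python
import math

def surprisal_landscape(tokens, probs, threshold=2.0):
    """Label each token as part of a low-surprisal 'canyon' or a
    high-surprisal 'peak', relative to a threshold in bits."""
    landscape = []
    for tok, p in zip(tokens, probs):
        bits = -math.log2(p)
        landscape.append((tok, round(bits, 2),
                          "peak" if bits > threshold else "canyon"))
    return landscape

# Toy sequence: a formulaic slogan run (highly probable under the
# fine-tuned model) interrupted by one surprising literary token.
tokens = ["resolutely", "thoroughly", "wholly", "annihilate", "stun"]
probs = [0.6, 0.7, 0.65, 0.6, 0.01]
for tok, bits, label in surprisal_landscape(tokens, probs):
    print(f"{tok:12s} {bits:5.2f} bits  {label}")
```

Plotting the `bits` values against token position yields exactly the profile described above: long, flat canyons of automatized phraseology punctuated by isolated peaks of surprise.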

This specialization, however, comes at a cognitive cost. The study demonstrates that as the model becomes an expert in Maospeak, its ability to predict the more varied and lexically diverse language of modern Chinese novels diminishes. The literary texts become comparatively more alien and surprising, a phenomenon termed "cognitive overfitting." This trade-off highlights the opposing principles at play in literary art. An analysis of excerpts from three major Chinese novelists illustrates how literature operates through defamiliarization:
  • In Zhang Wei’s 張煒 The Ancient Ship 古船 (1987), a character delivers a speech packed with authentic Maoist-era slogans. The fine-tuned model easily predicts these embedded phrases, demonstrating the method's ability to detect the intertextual presence of a dominant discourse within a literary work.
  • In Zhang Ling’s 張翎 Aftershock 餘震 (2010), a novel about the trauma of the Tangshan earthquake, the perplexity plot reveals high-surprise spikes on startlingly precise monosyllabic verbs (e.g., 愣, "to stun"; 扯, "to pull") and key nouns. These moments of high perplexity, which drive the narrative forward, are set against a low-perplexity scaffolding of conventional narrative phrasing, creating a dynamic interplay between a surprising figure and a familiar ground.
  • In Dung Kai-cheung’s 董啟章 As Vivid as Real 天工開物・栩栩如真 (2005), the author deliberately inserts Cantonese words (e.g., 嗰, "that"; 係, "to be") into standard written Chinese. These non-normative topolects, underrepresented in the model's training data, generate sharp perplexity spikes, disrupting the automatized perception of the standard Chinese language and forcing both human and machine readers to confront the text's cultural specificity.

Ultimately, this study suggests that style can be viewed as a "cognitive signature": a text's unique strategy for managing a reader's attention by orchestrating the dialectic between predictability and surprise. While engineered political language like Maospeak seeks to minimize perplexity and reinforce ideology through self-similar, low-entropy patterns, literary language thrives on generating "non-anomalous surprise." Literature uses a predictable backdrop of narrative and linguistic convention to make its high-perplexity focal points (a startling metaphor, a disruptive dialect, a novelistic event) more impactful. If engineered discourses seek to narrow the scope of thought, literature remains a vital training ground for cognitive freedom, teaching us to navigate, and even cherish, the unexpected.