LLMs and the Translation of Culturally Loaded Words in Dream of the Red Chamber

Submission 26

D1_TPoster-02

Presented by: Jiayuan Di

Jiayuan Di

East China University of Science and Technology, School of Foreign Languages, Shanghai, China

This study examines how large language models (LLMs) translate culturally loaded words in Dream of the Red Chamber (《红楼梦》) from Chinese into Japanese, a task that epitomizes the challenges of cross-cultural knowledge transmission in the age of AI. Using the authoritative bilingual edition translated by Itō Sōhei, we constructed a dataset of 500 representative expressions categorized into five cultural dimensions: linguistic, material, social, religious, and ecological. Eight state-of-the-art LLMs (including ChatGPT-4.5, DeepSeek-V3, Claude Sonnet 4, Gemini-2.5, and Qwen3) were systematically evaluated with a mixed-method approach that combines automatic metrics (BLEU, COMET, BERTScore) with human evaluation under the MQM framework.

Research Process

This study adopted a mixed-method design to evaluate how well large language models translate culturally loaded words from Dream of the Red Chamber into Japanese. First, we selected 500 representative sentences from the bilingual edition translated by Itō Sōhei and categorized them into five cultural dimensions: linguistic, material, social, religious, and ecological. Second, we prompted eight state-of-the-art LLMs, including ChatGPT-4.5, DeepSeek-V3, Claude Sonnet 4, Gemini-2.5, and Qwen3, to generate Japanese translations. Third, we assessed the outputs with automatic metrics (BLEU, COMET, BERTScore) and with human evaluation based on the MQM framework, covering fluency, accuracy, cultural appropriateness, and stylistic consistency. This multi-layered evaluation allowed us to compare model performance across cultural categories and to identify the models' translation strategies, strengths, and shortcomings.
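To make the automatic side of this evaluation more concrete, the following is a minimal sketch of a BLEU-style score computed over characters, which avoids the need for a Japanese word segmenter. It is a simplification for illustration and not the scoring setup used in the study: the function names `char_ngrams` and `char_bleu` are hypothetical, no smoothing is applied, and the sample strings are invented rather than taken from the dataset or from the Japanese edition. An actual evaluation would normally rely on an established metric toolkit.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Count the character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def char_bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU over characters (assumption: no smoothing).

    Returns 0.0 when any n-gram order has no match, including the case
    where the hypothesis is shorter than max_n characters.
    """
    if not hypothesis or not reference:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        total = sum(hyp.values())
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty for hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(hypothesis))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity * math.exp(sum(log_precisions) / max_n)


# Invented example strings, used only to exercise the function.
reference = "銀糸の麺を食べる"
candidate = "銀糸の麺を食べた"
print(round(char_bleu(candidate, reference), 3))
```

Surface overlap of this kind says little about cultural fidelity, which is one reason to pair such a score with embedding-based metrics and with human MQM annotation.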

Results

From the perspective of cultural categories, ecological and religious terms obtained the highest scores, indicating relatively stable cross-cultural semantics. Social and material terms ranked in the middle: translations of social terms often suffered from mismatches between honorific and etiquette systems, while translations of material terms lacked adequate cultural annotation, leading to semantic loss (e.g., literal renderings of terms such as silver-thread noodles). Linguistic terms received the lowest scores, as idioms, parallel expressions, and literary allusions frequently caused semantic drift and stylistic distortion, highlighting the models' limitations in handling highly condensed and culturally embedded language.

From the perspective of model performance, ChatGPT-4.5 and DeepSeek-V3 showed the strongest results, excelling in fluency and semantic alignment, though they still struggled with deep cultural fidelity. o4-mini and Gemini-2.5 performed at a moderate level, producing natural but often simplified translations. Claude Sonnet 4 remained steady but conservative, lacking cultural depth. By contrast, DeepSeek-R1 and the Qwen3 models ranked lowest, with frequent mistranslations in linguistically and socially dense contexts. Overall, the results indicate that while LLMs demonstrate strong surface fluency, their ability to faithfully convey cultural meanings remains limited.
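The two perspectives above, by cultural category and by model, amount to grouping the same per-item scores along different keys. The sketch below illustrates that aggregation under stated assumptions: the record layout, the helper names `mean_scores` and `ranking`, and the model labels are hypothetical, and the numbers are placeholders, not results from the study.

```python
from collections import defaultdict
from statistics import mean

# Placeholder records, NOT results from the study: (model, category, score).
RECORDS = [
    ("model_a", "linguistic", 60.0),
    ("model_a", "ecological", 80.0),
    ("model_b", "linguistic", 50.0),
    ("model_b", "ecological", 70.0),
]


def mean_scores(records, field):
    """Average the scores grouped by 'model' or by 'category'."""
    index = {"model": 0, "category": 1}[field]
    groups = defaultdict(list)
    for record in records:
        groups[record[index]].append(record[2])
    return {key: mean(values) for key, values in groups.items()}


def ranking(means):
    """Return the keys ordered from highest to lowest mean score."""
    return sorted(means, key=means.get, reverse=True)


print(ranking(mean_scores(RECORDS, "category")))
print(ranking(mean_scores(RECORDS, "model")))
```

Keeping one flat list of scored items and deriving both rankings from it makes it easy to add further breakdowns, such as model by category, without changing the underlying data.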

Discussion and Implications

By situating literary translation within the framework of distant reading and computational evaluation, this study highlights both the potential and the limitations of AI in mediating classical Chinese culture across languages. The findings suggest that while LLMs can serve as powerful tools for cultural analysis and distant translation, the faithful transmission of culturally loaded texts still requires human interpretive intervention.