Role of language comprehension on humanness perception in speech
Mon—HZ_13—Talks3—3303
Presented by: Janniek Wester
Although the quality of synthetic speech keeps improving, listeners are still able to identify whether speech was produced by a human or not. In this study, we examine how language comprehension contributes to judgments of the humanness of a speech signal. A total of 120 participants, comprising native German, Spanish and Turkish listeners, rated speech samples in terms of “how human the voice sounds to them”, either in an onsite experiment (German speakers) or in an online version (Spanish and Turkish speakers). The speech stimuli were four short German sentences, which were manipulated to obtain three further conditions: jabberwocky sentences (syntactic but no semantic information), wordlists (semantic but no syntactic information), and jabberwocky wordlists (neither syntactic nor semantic information). This material was generated by eight voices from text-to-speech (TTS) tools (Microsoft, ElevenLabs, Murf, and Listnr) and recorded by eight human speakers. Using a linear mixed model approach, we explored the effects of voice category (natural versus synthetic) and of syntactic and semantic information, as well as their interactions with native language, on humanness perception. The results for the German native speakers show that the humanness ratings of the speech samples were significantly predicted by voice type and by the presence of syntax and semantics, with normal sentences rated as sounding more human than the other sentence conditions. Responses of the Spanish- and Turkish-speaking participants (and the interactions with the other factors) will clarify whether this effect is related to comprehension of the language or to the prosodic properties of the manipulated conditions.
Keywords: TTS, text-to-speech, naturalness, prosody, voice
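
The stimulus design described in the abstract can be sketched as a small factorial table. This is only an illustrative sketch, not the authors' materials or analysis code: it assumes that every voice produced every sentence in every condition (the abstract does not state this), and the names `build_design`, `item`, and `voice_id` are hypothetical.

```python
from itertools import product

# The four sentence conditions named in the abstract, expressed as two
# binary factors: (syntactic information, semantic information).
CONDITIONS = {
    (True, True): "sentence",
    (True, False): "jabberwocky sentence",
    (False, True): "wordlist",
    (False, False): "jabberwocky wordlist",
}

N_BASE_SENTENCES = 4          # four short German sentences
N_VOICES_PER_CATEGORY = 8     # eight TTS voices, eight human speakers
VOICE_CATEGORIES = ("natural", "synthetic")


def build_design():
    """Enumerate one row per stimulus, ASSUMING a fully crossed design."""
    rows = []
    for item, (syntax, semantics), category, voice in product(
        range(1, N_BASE_SENTENCES + 1),
        CONDITIONS,                      # iterates over the (syntax, semantics) keys
        VOICE_CATEGORIES,
        range(1, N_VOICES_PER_CATEGORY + 1),
    ):
        rows.append(
            {
                "item": item,                        # hypothetical column name
                "condition": CONDITIONS[(syntax, semantics)],
                "syntax": syntax,
                "semantics": semantics,
                "voice_category": category,
                "voice_id": f"{category}_{voice}",   # hypothetical identifier
            }
        )
    return rows


design = build_design()
print(len(design), "stimuli under the fully crossed assumption")
```

Coding the conditions as two binary factors, rather than as one four-level label, is what would let a mixed model estimate separate effects of syntax and semantics; the random-effects structure used in the study is not given in the abstract, so none is assumed here.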