Characterizing Sustained Phonation in Text-To-Speech Models

Amelie Daum, Nina Goes, Andreas M. Kist

Abstract

Sustained phonation is a central task in clinical voice assessment and provides a controlled setting to quantify acoustic voice characteristics. In contrast, the evaluation of modern text-to-speech (TTS) systems still relies predominantly on perceptual ratings such as the Mean Opinion Score (MOS), leaving open whether these systems can reliably generate sustained phonation and how their acoustic properties compare to human voices. The capability of TTS models to reproduce clinically relevant voice features remains insufficiently characterized. Here, we systematically examine sustained phonation in contemporary TTS systems and compare synthetic and human voice samples using common acoustic measures. Multiple TTS models were screened for their ability to generate sustained vowels, such as /a/. One model, namely Eleven v3 by ElevenLabs, was subsequently analyzed in detail with respect to the distribution of phonation durations, the relationship between prompt length and generated duration, and differences between vowels and speaker types. Finally, TTS-generated sustained phonations were compared with human recordings from two independent cohorts using established clinical voice parameters. We found that TTS systems were able to produce sustained phonation, although reliability varied between models. For the selected Eleven v3 model, phonation durations showed non-normal distributions and were partially predicted by prompt length. Most acoustic measures of synthetic samples overlapped with the ranges observed in human voices, while selected parameters showed statistically significant but inconsistent differences across vowels. These findings indicate that current TTS models can approximate key acoustic characteristics of sustained phonation, while also exhibiting systematic deviations that should be considered in applications involving clinical voice metrics and in further development of realistic TTS systems.
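The relationship between prompt length and generated duration mentioned above can be illustrated with a simple least-squares fit. The following is a minimal sketch only, not the analysis used in the paper: the function name `fit_line`, the choice of a linear model, and the example values are all assumptions made for illustration, and the numbers are not measurements from the study.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared). Hypothetical helper for
    illustration; the paper's actual statistical method may differ.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x values must not all be identical")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear fit, R^2 equals the squared Pearson correlation.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


if __name__ == "__main__":
    # Made-up illustration values (NOT data from the paper):
    # prompt length in characters, generated phonation duration in seconds.
    prompt_lengths = [5, 10, 20, 40, 80]
    durations_s = [1.2, 1.9, 2.4, 4.1, 6.0]

    slope, intercept, r2 = fit_line(prompt_lengths, durations_s)
    print(f"slope={slope:.4f} s/char, intercept={intercept:.3f} s, R^2={r2:.3f}")
```

An R^2 well below 1 would correspond to duration being only partially predicted by prompt length. Because the abstract reports non-normal duration distributions, a rank-based measure such as Spearman correlation may be a more suitable choice than this linear fit.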

Paper

Audio Samples

The audio samples below demonstrate attempts to generate sustained phonation using contemporary text-to-speech (TTS) models. All audio files are available here.

Successful Generations
Sustained phonation of the vowel /a/ generated by Eleven v3 (ElevenLabs), female voice.
Sustained phonation of the vowel /a/ generated by Eleven v3 (ElevenLabs), male voice.
Failed Generations
Failed sustained phonation attempt of /a/: the phonation was chopped into multiple segments. Model: Speech-02-hd by MiniMax.
Failed sustained phonation attempt of /a/ due to incorrect phoneme production and abnormally elevated pitch. Model: P1 by Papla.
Failed sustained phonation attempt of /a/: the model spoke random text instead of the vowel. Model: Eleven v3 by ElevenLabs.