
Introducing a groundbreaking development in Text-to-Speech Synthesis: Meet NaturalSpeech 3, equipped with Factorized Diffusion Models.

Researchers from Microsoft Research Asia, the University of Science and Technology of China, The Chinese University of Hong Kong, Zhejiang University, The University of Tokyo, and Peking University have developed NaturalSpeech 3, a high-quality text-to-speech (TTS) system. The system addresses open problems in zero-shot TTS, the task of generating speech for speakers unseen during training, which earlier work has approached with a variety of data representations and modeling techniques.

Traditional TTS systems struggle to produce high-quality output because speech is complex: it entangles several attributes, including content, prosody, timbre, and acoustic details. To tackle this, the research team built a neural codec with factorized vector quantization (FVQ), which decomposes the speech waveform into a separate subspace for each of these attributes. A factorized diffusion model then generates the attribute in each subspace, conditioned on its corresponding prompt.
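To make the idea of factorized vector quantization more concrete, here is a minimal toy sketch in Python with NumPy. It is not the NaturalSpeech 3 codec: the subspace sizes, the codebook size, the random codebooks, and the function names (`make_codebooks`, `quantize`, `factorized_quantize`) are all assumptions made up for illustration. It also treats all four attributes identically, which is a simplification. The only point it shows is that each slice of a frame's latent vector gets its own codebook and its own discrete code.

```python
import numpy as np

# Hypothetical attribute subspaces and their dimensions (illustrative only;
# these sizes are NOT taken from NaturalSpeech 3).
SUBSPACES = {"content": 4, "prosody": 2, "timbre": 2, "acoustic_details": 2}
CODEBOOK_SIZE = 8  # assumed number of entries per codebook


def make_codebooks(rng):
    """Create one independent random codebook per attribute subspace."""
    return {
        name: rng.normal(size=(CODEBOOK_SIZE, dim))
        for name, dim in SUBSPACES.items()
    }


def quantize(x, codebook):
    """Nearest-neighbour vector quantization.

    x: array of shape (frames, dim); codebook: array of shape (entries, dim).
    Returns the chosen entry index per frame and the quantized vectors.
    """
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]


def factorized_quantize(latent, codebooks):
    """Split each frame into attribute slices and quantize each slice separately."""
    codes, parts, start = {}, [], 0
    for name, dim in SUBSPACES.items():
        chunk = latent[:, start:start + dim]
        idx, quantized = quantize(chunk, codebooks[name])
        codes[name] = idx
        parts.append(quantized)
        start += dim
    return codes, np.concatenate(parts, axis=1)


rng = np.random.default_rng(0)
codebooks = make_codebooks(rng)
# Five frames of a made-up latent representation.
latent = rng.normal(size=(5, sum(SUBSPACES.values())))
codes, reconstructed = factorized_quantize(latent, codebooks)
print({name: idx.tolist() for name, idx in codes.items()})
```

Because every attribute ends up with its own sequence of codes, a downstream generator can in principle model or swap one attribute without touching the others, which is the intuition behind the controllability the article describes.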

NaturalSpeech 3 is designed to offer higher synthesis quality and finer controllability than earlier systems. It uses large datasets so that it can synthesize diverse speech in a zero-shot manner across a range of scenarios. Its neural codec, FACodec, reduces the complexity of speech modeling by using factorized vector quantizers to represent each speech attribute separately.

The system shows improved speech quality, similarity, and robustness in an extensive evaluation on the LibriSpeech and RAVDESS datasets. The gains are most noticeable in generation quality, speaker similarity, and prosody similarity. The system also scales: performance improves with larger datasets and model sizes, which suggests room for further advances.

Although NaturalSpeech 3 shows promise, it is currently limited in voice diversity and multilingual capability because it relies on English data from LibriVox. The researchers are working on expanding data collection to overcome these limitations.
