Speech synthesis has been transformed in recent years by large-scale generative models, which have driven substantial advances in zero-shot tasks such as text-to-speech (TTS), voice conversion (VC), and speech editing. These systems aim to generate speech in the voice of a speaker not seen during training, using only a short reference audio prompt at inference time and requiring no additional training data for that speaker.
Recent systems train language models and diffusion models on large datasets for in-context speech generation. However, generation with these models is computationally expensive and slow: language models typically produce output token by token, and diffusion models require many iterative denoising steps. This highlights the need for a more efficient solution.
To address this issue, a team of researchers has introduced FlashSpeech, a novel approach to efficient zero-shot speech synthesis. FlashSpeech builds on the latent consistency model (LCM) and uses the encoder of a neural audio codec to convert speech waveforms into latent vectors, which serve as the training target. The model is trained with adversarial consistency training, a new technique that combines consistency training with adversarial training and uses pre-trained speech-language models as discriminators.
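To make the idea of combining a consistency loss with an adversarial loss more concrete, here is a minimal sketch in plain Python. It is not the FlashSpeech implementation. The skip/output parameterization (`c_skip`, `c_out`) is a common choice from the consistency-model literature and is assumed here. The constants, `toy_network`, `toy_discriminator`, and the weighting `adv_weight` are all hypothetical stand-ins; in particular, the real discriminator is built on a pre-trained speech-language model, not a sigmoid of the mean.

```python
import math
import random

SIGMA_DATA = 0.5  # assumed data scale (illustrative value)
T_MIN = 0.002     # smallest noise level (illustrative value)

def c_skip(t):
    # Equals 1 at t == T_MIN, so the output reduces to the input there.
    return SIGMA_DATA ** 2 / ((t - T_MIN) ** 2 + SIGMA_DATA ** 2)

def c_out(t):
    # Equals 0 at t == T_MIN, so the network is ignored there.
    return SIGMA_DATA * (t - T_MIN) / math.sqrt(SIGMA_DATA ** 2 + t ** 2)

def consistency_fn(network, x_t, t):
    """f(x_t, t) = c_skip(t) * x_t + c_out(t) * network(x_t, t)."""
    raw = network(x_t, t)
    return [c_skip(t) * x + c_out(t) * r for x, r in zip(x_t, raw)]

def toy_network(weight):
    """Hypothetical stand-in for the real model: one scalar weight."""
    return lambda x_t, t: [weight * x for x in x_t]

def toy_discriminator(latent):
    """Hypothetical stand-in: returns a 'realness' probability in (0, 1)."""
    mean = sum(latent) / len(latent)
    return 1.0 / (1.0 + math.exp(-mean))

def adversarial_consistency_loss(student, teacher, discriminator,
                                 latent, t_lo, t_hi, adv_weight, rng):
    """Return (total, consistency, adversarial) for one codec latent."""
    # The same noise sample is used at both noise levels.
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    x_hi = [z + t_hi * n for z, n in zip(latent, noise)]
    x_lo = [z + t_lo * n for z, n in zip(latent, noise)]
    student_out = consistency_fn(student, x_hi, t_hi)
    # In real training the target branch would receive no gradient.
    target_out = consistency_fn(teacher, x_lo, t_lo)
    consistency = sum((a - b) ** 2
                      for a, b in zip(student_out, target_out)) / len(latent)
    # Non-saturating generator loss: -log D(generated latent).
    adversarial = -math.log(discriminator(student_out))
    return consistency + adv_weight * adversarial, consistency, adversarial
```

A call such as `adversarial_consistency_loss(toy_network(0.5), toy_network(0.4), toy_discriminator, [0.3, -0.1, 0.8], 0.5, 1.0, 0.1, random.Random(0))` returns the weighted total alongside its two components. The consistency term pushes the model to map two noise levels of the same latent to the same output, while the adversarial term rewards outputs the discriminator judges as realistic.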
FlashSpeech further incorporates a prosody generator module designed to enhance the diversity of prosody while maintaining stability. By conditioning the LCM on prior vectors derived from a phoneme encoder, prompt encoder, and the prosody generator, FlashSpeech can achieve a wider range of expressions and prosody in generated speech.
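The conditioning described above, together with the few-step generation that makes a consistency model fast, can be sketched as follows. This is an illustrative toy, not the authors' code. How FlashSpeech actually merges the phoneme, prompt, and prosody features is not stated here, so the element-wise sum in `build_condition` is an assumption. `toy_consistency_model` is a hypothetical stand-in for the trained LCM, and the re-noising step is a simplified form of multistep consistency sampling.

```python
import random

def build_condition(phoneme_feats, prompt_feats, prosody_feats):
    """Assumption: the three prior vectors are summed element-wise."""
    return [p + q + r
            for p, q, r in zip(phoneme_feats, prompt_feats, prosody_feats)]

def toy_consistency_model(x_t, t, condition):
    """Hypothetical stand-in: pulls the noisy input toward the condition."""
    keep = 1.0 / (1.0 + t)  # keep less of the noisy input at high noise
    return [keep * x + (1.0 - keep) * c for x, c in zip(x_t, condition)]

def sample(model, condition, noise_levels, rng):
    """Few-step sampling: exactly one model call per noise level.

    noise_levels is ordered from largest to smallest, e.g. [80.0, 1.0].
    """
    t_max = noise_levels[0]
    x = [t_max * rng.gauss(0.0, 1.0) for _ in condition]
    latent = model(x, t_max, condition)  # one-step estimate
    for t in noise_levels[1:]:
        # Re-noise the current estimate to a lower level, then denoise again.
        x = [z + t * rng.gauss(0.0, 1.0) for z in latent]
        latent = model(x, t, condition)
    return latent
```

For example, `sample(toy_consistency_model, build_condition([1.0, 2.0], [0.5, 0.5], [0.25, 0.5]), [80.0, 1.0], random.Random(0))` produces a latent with two model calls. In a full system this latent would then be passed to the codec decoder to obtain a waveform. The small, fixed number of model calls is what separates this style of sampling from the many denoising steps of a conventional diffusion model.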
In terms of performance, FlashSpeech outperforms comparable systems in audio quality and matches them in speaker similarity, while running approximately 20 times faster. This level of efficiency in zero-shot speech synthesis could pave the way for real-world applications that require rapid, high-quality speech generation, such as virtual assistants, audio content creation tools, and accessibility tools.
The development of FlashSpeech signifies a major advancement in the field of zero-shot speech synthesis. It addresses key limitations of existing solutions and utilizes recent innovations in generative modeling. As the technology continues to evolve, FlashSpeech stands as a new benchmark for efficient and effective zero-shot speech synthesis systems.
The research behind FlashSpeech is described in a published paper, and further details of the project are also available.