In the domain of large language models (LLMs), text-to-speech (TTS) synthesis presents a unique challenge, and researchers are exploring the potential of these models for audio synthesis. Historically, TTS systems have used a range of methodologies, from concatenating pre-recorded audio segments to predicting acoustic parameters and, more recently, generating mel-spectrograms directly from text. However, these methods face limitations such as lower fidelity and robustness issues caused by random sampling strategies over discrete tokens.
Recent work has placed a strong emphasis on zero-shot TTS, which aims to synthesize speech for any text, speaker, or language without further training. Models like VALL-E, ELLA-V, RALL-E, and VALL-E R tackle this problem but still face limitations, particularly in maintaining audio quality and efficiently handling tasks across languages and speakers.
To overcome these issues, researchers from The Chinese University of Hong Kong and Microsoft Corporation have introduced MELLE, a novel approach to TTS synthesis that utilizes continuous-valued tokens based on mel-spectrograms. The MELLE model operates as a single-pass zero-shot TTS system: it autoregressively generates mel-spectrogram frames conditioned on the text prompt and previously generated frames, addressing the robustness issues associated with sampling discrete codec codes while improving fidelity and efficiency in speech synthesis.
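To make the single-pass idea concrete, here is a minimal sketch of what such an autoregressive decoding loop might look like. The names `model`, `decode_step`, and `n_mels` are hypothetical placeholders, not MELLE's actual API; the point is only that each step conditions on the text and all previously emitted mel frames, with no discrete codec codes involved.

```python
import torch

@torch.no_grad()
def generate_mel(model, text_tokens: torch.Tensor, max_frames: int = 2000):
    """Greedy autoregressive decoding sketch: each step conditions on the
    text prompt and all previously generated mel frames."""
    frames = []
    prev = torch.zeros(1, 1, model.n_mels)           # begin-of-spectrogram frame
    for _ in range(max_frames):
        # Hypothetical step function: returns the next mel frame plus a
        # stop probability from the model's stop prediction layer.
        next_frame, stop_prob = model.decode_step(text_tokens, prev)
        frames.append(next_frame)
        if stop_prob.item() > 0.5:                   # end-of-utterance signal
            break
        prev = torch.cat([prev, next_frame], dim=1)  # extend the context
    return torch.cat(frames, dim=1)                  # (1, T, n_mels)
```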
MELLE replaces the traditional cross-entropy loss with a regression loss augmented by a spectrogram flux loss, which makes modeling the probability distribution of continuous-valued tokens more effective. It also incorporates variational inference, which enhances the diversity and robustness of the output.
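The sketch below shows one plausible form of such an objective: an L1 + L2 regression term between predicted and ground-truth frames, plus a flux term that rewards frame-to-frame variation so the model does not collapse onto flat, repetitive spectrograms. The exact formulation and weighting in MELLE may differ; treat this as an illustration of the idea, not the paper's loss.

```python
import torch
import torch.nn.functional as F

def melle_style_loss(pred: torch.Tensor, target: torch.Tensor,
                     flux_weight: float = 0.5) -> torch.Tensor:
    """pred, target: (batch, T, n_mels) mel-spectrograms.

    Regression term: L1 + L2 between predicted and ground-truth frames.
    Flux term: negative distance between each predicted frame and the
    previous ground-truth frame; minimizing it pushes consecutive frames
    apart, encouraging dynamic variation (one plausible reading of the
    spectrogram flux loss).
    """
    reg = F.l1_loss(pred, target) + F.mse_loss(pred, target)
    flux = -(pred[:, 1:] - target[:, :-1]).abs().mean()
    return reg + flux_weight * flux
```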
MELLE’s architecture comprises several components: an autoregressive Transformer decoder, a latent sampling module, a stop prediction layer, and a convolutional post-net for spectrogram refinement. Because it functions without a separate non-autoregressive model or a two-pass procedure, it combines efficiency, fidelity, and strong performance in TTS synthesis.
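Two of these components can be sketched compactly. The latent sampling module below uses the standard VAE-style reparameterization trick, which fits the variational inference described above, and the post-net follows the familiar Tacotron 2-style residual refinement; both are assumptions for illustration, with module and dimension names invented here rather than taken from MELLE's code.

```python
import torch
import torch.nn as nn

class LatentSampling(nn.Module):
    """Maps a decoder hidden state to a Gaussian (mu, log_var) and samples
    a latent that downstream layers project to a mel-spectrogram frame."""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.to_stats = nn.Linear(d_model, 2 * d_latent)

    def forward(self, h: torch.Tensor):
        mu, log_var = self.to_stats(h).chunk(2, dim=-1)
        std = torch.exp(0.5 * log_var)
        z = mu + std * torch.randn_like(std)   # reparameterization trick
        return z, mu, log_var                  # mu/log_var feed a KL term

class PostNet(nn.Module):
    """Convolutional post-net producing a residual refinement of the coarse
    spectrogram (a Tacotron 2-style design, assumed here for illustration)."""
    def __init__(self, n_mels: int, channels: int = 512, layers: int = 5):
        super().__init__()
        convs = []
        for i in range(layers):
            convs += [nn.Conv1d(n_mels if i == 0 else channels,
                                n_mels if i == layers - 1 else channels,
                                kernel_size=5, padding=2),
                      nn.Tanh() if i < layers - 1 else nn.Identity()]
        self.net = nn.Sequential(*convs)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, T, n_mels); Conv1d expects channels first.
        residual = self.net(mel.transpose(1, 2)).transpose(1, 2)
        return mel + residual
```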
Compared to VALL-E and its variants, MELLE shows improved performance in zero-shot speech synthesis, demonstrating a notable reduction in word error rate (WER) on continuation and cross-sentence tasks. Its performance remains consistent even as the reduction factor is increased, showing high robustness across different settings.
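For readers unfamiliar with the term, a reduction factor r lets an autoregressive TTS decoder emit r mel frames per step, shortening the sequence it must model. The helper below illustrates the common frame-stacking interpretation of this idea (as popularized by Tacotron); it is an illustrative utility, not MELLE's code.

```python
import torch

def group_frames(mel: torch.Tensor, r: int) -> torch.Tensor:
    """Reshape (batch, T, n_mels) into (batch, T // r, r * n_mels) so each
    decoding step targets r stacked frames; any remainder frames beyond a
    multiple of r are trimmed."""
    b, t, n = mel.shape
    return mel[:, : t - t % r].reshape(b, t // r, r * n)
```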
In conclusion, MELLE represents a major step forward in zero-shot TTS synthesis and continuous acoustic representation-based language modeling. It offers more varied and robust predictions and has shown results comparable to human performance in subjective evaluations. This model showcases the potential for significant advancements in the audio synthesis field.