
Camb AI has launched MARS5 TTS, an open-source text-to-speech model with improved prosody control.

Camb AI has released MARS5 TTS, an open-source text-to-speech system that gives users fine-grained control over speech synthesis. The system can clone a voice and steer prosody from less than 5 seconds of reference audio.

MARS5 TTS uses a two-stage pipeline built from a 750M-parameter autoregressive (AR) model and a 450M-parameter non-autoregressive (NAR) model. First, the autoregressive transformer generates coarse encoded speech features from the input text and the reference audio. Next, a multinomial Denoising Diffusion Probabilistic Model (DDPM) refines these features by producing the remaining codebook values. Finally, a vocoder converts the DDPM output into the audio waveform.
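The data flow described above can be sketched as a small wrapper that chains three stages. This is only an illustration of the ordering, not MARS5's actual code: the class name `TwoStagePipeline`, the stage signatures, and the choice to pass the text and reference audio to the NAR stage are all assumptions, and the real models would be supplied as the callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical types for illustration: one list of integer codes per codebook level.
CodeLevel = List[int]


@dataclass
class TwoStagePipeline:
    """Chains an AR stage, a NAR (diffusion) stage, and a vocoder.

    All three callables are placeholders; this sketch only fixes the order
    in which they run and what each one receives.
    """
    # text + reference audio -> coarse codes (first codebook level)
    ar_stage: Callable[[str, Sequence[float]], CodeLevel]
    # text + reference audio + coarse codes -> remaining codebook levels (assumed inputs)
    nar_stage: Callable[[str, Sequence[float], CodeLevel], List[CodeLevel]]
    # all codebook levels -> waveform samples
    vocoder: Callable[[List[CodeLevel]], List[float]]

    def synthesize(self, text: str, ref_audio: Sequence[float]) -> List[float]:
        coarse = self.ar_stage(text, ref_audio)               # stage 1: autoregressive
        remaining = self.nar_stage(text, ref_audio, coarse)   # stage 2: non-autoregressive refinement
        return self.vocoder([coarse] + remaining)             # stage 3: codes -> audio
```

Keeping the stages as injected callables makes the ordering easy to test with toy stand-ins before any real model is loaded.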

This AR-NAR pipeline distinguishes MARS5 from general-purpose language models such as GPT and Gemini, which are designed primarily for text generation and understanding rather than speech synthesis. Two features stand out: the use of a DDPM in the NAR stage and the ability to control prosody through text formatting.

MARS5’s training on raw audio and byte-pair-encoded text allows it to adjust the prosody through punctuation and capitalization. Adding commas creates pauses, and capitalizing words provides emphasis, which guides the prosody of the generated output.
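As a rough illustration of the formatting tricks above, the helpers below prepare input text before it is passed to a TTS model. They are a minimal sketch and not part of MARS5: the function names `emphasize` and `pause_after` are hypothetical, and treating "capitalizing" as upper-casing the whole word is an assumption.

```python
import re
from typing import Iterable


def emphasize(text: str, words: Iterable[str]) -> str:
    """Upper-case every whole-word occurrence of the given words (case-insensitive)."""
    targets = {w.lower() for w in words}

    def repl(match: "re.Match[str]") -> str:
        word = match.group(0)
        return word.upper() if word.lower() in targets else word

    return re.sub(r"[A-Za-z']+", repl, text)


def pause_after(text: str, words: Iterable[str]) -> str:
    """Insert a comma after each given word unless punctuation already follows it."""
    targets = {w.lower() for w in words}

    def repl(match: "re.Match[str]") -> str:
        word = match.group(1)
        return word + "," if word.lower() in targets else word

    # The lookahead rejects partial words and words already followed by punctuation.
    return re.sub(r"([A-Za-z']+)(?![A-Za-z',.;:!?])", repl, text)
```

For example, `emphasize("I really want this", ["really"])` yields `"I REALLY want this"`, and `pause_after("Well I think so", ["well"])` yields `"Well, I think so"`.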

For voice cloning, MARS5 offers two modes: a fast “shallow clone” that does not require a transcript of the reference audio, and a slower, higher-quality “deep clone” that uses that transcript. Given just a short audio clip and a text snippet, MARS5 can generate speech for a wide range of applications, including sports commentary and anime voiceovers.

To use MARS5, provide a reference audio file between 2 and 12 seconds long; clips of about 6 seconds give the best results. Supplying the transcript of the reference audio enables a “deep clone”, which improves quality at the cost of longer generation time. The system’s ability to handle complex prosodic scenarios makes it well suited to applications in entertainment, education, and accessibility.
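The length guideline and the choice between clone modes can be checked before any synthesis is attempted. The sketch below uses only Python's standard `wave` module and assumes the reference clip is an uncompressed WAV file; the helper names and the idea of selecting the mode from the presence of a transcript are illustrative assumptions, not MARS5's API.

```python
import wave
from typing import Optional

# Bounds taken from the guidance above (2 to 12 seconds).
MIN_REF_SECONDS = 2.0
MAX_REF_SECONDS = 12.0


def wav_duration_seconds(source) -> float:
    """Return the duration of a WAV file given a path or a binary file object."""
    with wave.open(source, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_usable_reference(duration_seconds: float) -> bool:
    """True if the clip length falls inside the recommended 2-12 second range."""
    return MIN_REF_SECONDS <= duration_seconds <= MAX_REF_SECONDS


def choose_clone_mode(transcript: Optional[str]) -> str:
    """Pick 'deep' when a non-empty transcript is available, otherwise 'shallow'."""
    return "deep" if transcript and transcript.strip() else "shallow"
```

A caller might reject clips for which `is_usable_reference` returns `False` and fall back to the shallow clone whenever no transcript is at hand.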

More broadly, MARS5 represents a significant step forward for open-source text-to-speech technology. Its combination of an AR model with a DDPM-based NAR model offers fine-grained control over speech synthesis. Its ability to clone voices from minimal input and generate high-quality, prosodically rich speech makes it a valuable tool for developers and researchers in artificial intelligence and speech technology, and a useful reference point as more advanced systems continue to emerge.
