Researchers at the University of Texas at Austin and Rembrand have developed a new language model known as VOICECRAFT. The work relates to textless natural language processing (NLP), a line of research that aims to make NLP tasks applicable directly to spoken utterances.
VOICECRAFT is a Transformer-based neural codec language model (NCLM) that generates neural speech codec tokens for infilling, conditioning autoregressively on bidirectional context. This approach enables zero-shot text-to-speech and speech editing with high-quality results. The model relies on a two-step token rearrangement procedure: causal masking, which moves the span to be generated to the end of the sequence so that the audio on both sides of it is available as context, and delayed stacking, which offsets the codec codebooks in time. The method was inspired by the causal masked multimodal model used successfully in joint text-image modelling.
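The two rearrangement steps can be pictured with a toy sketch. This is only an illustration under simplifying assumptions, not the authors' implementation: tokens are plain strings, a single span is masked, and the names `causal_mask_rearrange`, `delayed_stack`, `MASK`, and `EMPTY` are hypothetical.

```python
from typing import List, Sequence

MASK = "<M1>"  # hypothetical name for the mask token
EMPTY = "_"    # hypothetical padding symbol for delayed positions


def causal_mask_rearrange(frames: Sequence[str], start: int, end: int) -> List[str]:
    """Move frames[start:end] to the end of the sequence.

    The span is replaced in place by a mask token, and the same mask token
    is repeated just before the moved span. A left-to-right model that
    generates the span therefore sees the context on both sides of it.
    """
    if not 0 <= start < end <= len(frames):
        raise ValueError("span must lie inside the sequence")
    prefix = list(frames[:start])
    suffix = list(frames[end:])
    span = list(frames[start:end])
    return prefix + [MASK] + suffix + [MASK] + span


def delayed_stack(frames: Sequence[Sequence[str]], num_codebooks: int) -> List[List[str]]:
    """Shift codebook k to the right by k time steps.

    frames[t][k] is the token of codebook k at frame t. The result has one
    row per codebook, each padded with EMPTY to a common length.
    """
    rows = []
    for k in range(num_codebooks):
        row = [EMPTY] * k + [frame[k] for frame in frames]
        row += [EMPTY] * (num_codebooks - 1 - k)
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Mask frames y1 and y2 of a five-frame sequence.
    print(causal_mask_rearrange(["y0", "y1", "y2", "y3", "y4"], 1, 3))
    # Delay the second codebook by one step relative to the first.
    for row in delayed_stack([("a0", "b0"), ("a1", "b1"), ("a2", "b2")], 2):
        print(row)
```

In this sketch the masked frames `y1` and `y2` end up after `y3` and `y4`, so they are predicted last with the rest of the utterance already in view. The delay step staggers the codebooks so that, at each position, a later codebook lags the first one by a fixed number of frames.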
To evaluate VOICECRAFT, the researchers built a challenging dataset, REALEDIT, which contains 310 real-world speech editing samples drawn from audiobooks, YouTube videos, and podcasts. REALEDIT covers a range of editing scenarios, including adding, removing, or substituting content and editing multiple spans at once. Its recordings vary in subject matter, accent, speaking style, recording environment, and background noise, which makes it more challenging than the datasets traditionally used to evaluate speech synthesis.
The researchers report that VOICECRAFT performed strongly on REALEDIT, especially in subjective human listening tests. The edited speech retained nearly the same sound quality as the original recordings. In zero-shot text-to-speech, VOICECRAFT outperformed strong baselines such as VALL-E and XTTS V2 without any fine-tuning.
Despite this progress, the team acknowledges some limitations of VOICECRAFT, including long silences and scratching sounds that can occur during generation. The model also raises questions about how to watermark and detect synthetic speech, both of which are critical issues for AI security.
In response to these challenges, the team has made their code and model weights available to the public, aiming to assist further research into AI safety and speech synthesis.
The development of VOICECRAFT by researchers at the University of Texas at Austin and Rembrand represents a significant advance in the field of textless natural language processing. It shows how well such models can perform on speech editing and zero-shot text-to-speech, and it raises new safety and security questions that must be addressed as these technologies develop.