French non-profit AI research lab Kyutai has launched Moshi, a real-time voice AI assistant, beating OpenAI to making such technology publicly available. Moshi is built on Helium 7B, a model developed by Kyutai, and was trained on a blend of synthetic text and audio data. The voice assistant can recognise and express 70 distinct emotions and speak in numerous accents and styles.
The voice assistant responds quickly, with an end-to-end latency of 200 milliseconds, which allows for seamless, uninterrupted interaction. Compared with Sky, the GPT-4o voice assistant from OpenAI, Moshi responds faster, albeit without the sultry voice, and it is available to the public. Moshi was trained on audio files recorded by a voice actor whom Kyutai refers to as “Alice”.
Because Helium 7B is smaller than GPT-4o, it can run on consumer-grade hardware or low-power GPUs in the cloud. A Kyutai engineer demonstrated this by running Moshi on-device on a MacBook Pro. This points to a future in which a low-latency AI voice assistant runs on personal devices without the need to upload private data to the cloud.
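For a rough sense of why a 7-billion-parameter model can fit on consumer hardware, the back-of-the-envelope sketch below estimates the memory needed for the weights alone at a few numeric precisions. The precisions are illustrative assumptions; Kyutai has not been quoted here on which one the demo used, and the estimate ignores activations and other runtime memory.

```python
# Rough weight-memory estimate for a 7-billion-parameter model.
# The bit widths below are illustrative assumptions, not Kyutai's stated settings.
PARAMS = 7_000_000_000


def weight_memory_gb(params: int, bits_per_param: int) -> float:
    """Return the memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(PARAMS, bits):.1f} GB")
```

Under these assumptions, lower-precision weights shrink the footprint proportionally, which is one common way models of this size are made to fit on laptops.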
Moshi uses an audio codec called Mimi for audio compression, which is crucial to keeping the assistant as compact as possible. Mimi compresses audio to a size 300 times smaller than the MP3 codec does and captures both the acoustic and semantic information in the audio.
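To put the 300-fold figure in concrete terms, the sketch below converts an MP3 bitrate into the bitrate that ratio would imply. The MP3 bitrates are common encoder settings chosen purely as assumptions; the comparison point behind the 300-fold claim is not specified, so the results are illustrative rather than Mimi's actual bitrate.

```python
# Implied bitrate if a codec's output is 300 times smaller than MP3's.
# The MP3 bitrates passed in below are assumed values for illustration only.
COMPRESSION_FACTOR = 300


def implied_bitrate_kbps(mp3_kbps: float, factor: float = COMPRESSION_FACTOR) -> float:
    """Divide an MP3 bitrate (kbps) by the claimed compression factor."""
    return mp3_kbps / factor


for mp3_kbps in (128, 320):
    print(f"MP3 at {mp3_kbps} kbps -> ~{implied_bitrate_kbps(mp3_kbps):.2f} kbps")
```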
Although it is still an experimental prototype with some functional glitches, Moshi raises exciting possibilities for the future of AI voice assistants and is a testament to what a small team of engineers can achieve. Kyutai's anticipated public release of the model, codec, code, and weights will hopefully bring real-world performance closer to what was demonstrated. Until then, Moshi stands as an example of innovation in the AI community, and leaves many asking why the wait for GPT-4o's voice conversations continues.