In a significant reveal that has shaken the technology world, Kyutai introduced Moshi, a pioneering real-time native multimodal foundation model. The new AI model matches, and in some respects exceeds, capabilities previously demonstrated by OpenAI's GPT-4o. Moshi can understand and express emotion, speak in various accents (including French), and handle two audio streams simultaneously, allowing it to listen and speak at the same time. Fine-tuned on 100,000 synthetic "oral-style" conversations, Moshi is also designed to be accessible to a wide range of users: Kyutai offers a smaller variant that can run on a MacBook or a consumer-grade GPU.
Moshi is built around the principle of responsible AI. To that end, Kyutai is incorporating safeguards such as watermarking to identify AI-generated audio, an effort still in progress. The decision to release Moshi as an open-source project underscores Kyutai's commitment to transparency and collaborative development within the AI community.
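Kyutai has not described how its watermarking will work, but one classic technique for marking generated audio is spread-spectrum embedding: a low-amplitude pseudorandom signature, derived from a secret key, is added to the waveform and later detected by correlation. The sketch below is a toy illustration of that general idea only; the function names, constants, and approach are assumptions, not Kyutai's method.

```python
import numpy as np

def embed_watermark(audio: np.ndarray, key: int, strength: float = 0.05) -> np.ndarray:
    """Hypothetical spread-spectrum embedder: add a low-amplitude
    pseudorandom signature derived from `key` to the waveform."""
    rng = np.random.default_rng(key)
    signature = rng.standard_normal(audio.shape)
    return audio + strength * signature

def detect_watermark(audio: np.ndarray, key: int, threshold: float = 0.025) -> bool:
    """Correlate the waveform against the keyed signature; marked audio
    scores near `strength`, unmarked audio scores near zero."""
    rng = np.random.default_rng(key)
    signature = rng.standard_normal(audio.shape)
    score = float(np.dot(audio, signature)) / audio.size
    return score > threshold

# Toy demo on one second of noise standing in for generated speech.
speech = np.random.default_rng(0).standard_normal(24_000)
marked = embed_watermark(speech, key=1234)
print(detect_watermark(marked, key=1234))   # True: signature present
print(detect_watermark(speech, key=1234))   # False: no signature
```

Real systems must also survive compression and re-recording, which is part of why Kyutai describes this feature as a work in progress.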
At its foundation, Moshi is powered by a 7-billion-parameter multimodal language model that processes speech as both input and output. The model works with dual-channel I/O, generating text tokens and audio codec tokens simultaneously. The base text language model, Helium 7B, was developed from scratch and then trained jointly on text and audio codec tokens. The speech codec, based on Kyutai's Mimi model, achieves a 300x compression factor while capturing both semantic and acoustic information.
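Kyutai has not yet published Moshi's code, but the dual-channel design can be pictured as a per-frame loop: at each timestep the model reads the user's incoming codec tokens and emits its own text token plus one codec token per codebook. The following is a minimal sketch of that interface under stated assumptions; the names and constants (decode_step, N_CODEBOOKS, the vocabulary sizes) are hypothetical, and a real implementation would replace the stand-in body with a transformer forward pass.

```python
import random
from dataclasses import dataclass

# Hypothetical constants; Kyutai has not published these exact values.
TEXT_VOCAB = 32_000    # Helium text vocabulary size (assumption)
AUDIO_VOCAB = 2_048    # codes per codec codebook (assumption)
N_CODEBOOKS = 8        # residual codebooks in the Mimi-style codec (assumption)

@dataclass
class Frame:
    """One timestep of dual-channel output: a text token plus
    one audio code per codec codebook (the model's own voice)."""
    text_token: int
    audio_tokens: list[int]

def decode_step(user_codes: list[int]) -> Frame:
    """Stand-in for the model: consumes the user's codec tokens for this
    frame and emits the model's text + audio tokens for the same frame.
    A real implementation would run a transformer forward pass here."""
    return Frame(
        text_token=random.randrange(TEXT_VOCAB),
        audio_tokens=[random.randrange(AUDIO_VOCAB) for _ in range(N_CODEBOOKS)],
    )

# Full-duplex loop: the user stream is read and the model stream is
# written on every frame, so Moshi listens and speaks concurrently.
for _ in range(5):
    incoming = [random.randrange(AUDIO_VOCAB) for _ in range(N_CODEBOOKS)]
    frame = decode_step(incoming)
    print(frame.text_token, frame.audio_tokens)
```

Because both streams advance frame by frame rather than turn by turn, there is no explicit "your turn/my turn" protocol, which is what enables the low-latency, interruptible conversations Kyutai demonstrated.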
Looking ahead, Kyutai has ambitious plans for Moshi. It intends to release a technical report and open model versions, including the inference codebase, the 7B model, the audio codec, and the full optimized stack. Future versions of Moshi (1.1, 1.2, and 2.0) will refine the model based on user feedback, and its permissive licensing is expected to encourage widespread adoption and innovation.
In conclusion, Moshi illustrates the extraordinary advances in AI a small, focused team can achieve. It demonstrates the power of AI to transform research assistance, brainstorming, language learning, and more. The open-source nature of the model invites collaboration and innovation, ensuring that the benefits of this revolutionary technology are accessible to all.
Lastly, Kyutai thanked its researchers for the breakthrough and directed followers to its Twitter and LinkedIn profiles. It also announced a forthcoming paper, code, and model release, and invited the audience to join its Telegram channel, subscribe to its newsletter, or follow its subreddit.