Recent advancements in language models reveal impressive capabilities for zero-shot voice conversion (VC). Nevertheless, conventional VC models built on language models usually perform offline conversion, meaning they require the entire source utterance before producing output. This limits their suitability for real-time applications.
Researchers from Northwestern Polytechnical University in China, in collaboration with ByteDance, have introduced StreamVoice, a new language model-based approach to zero-shot voice conversion that enables real-time conversion of any source speech. StreamVoice achieves this streaming capability through a fully causal context-aware language model combined with a temporal-independent acoustic predictor.
The model processes semantic and acoustic features alternately at each autoregressive time step, removing the need for the complete source speech. To avoid the performance decline caused by incomplete context in streaming processing, two methods are applied: 1) teacher-guided context foresight, where a teacher model summarizes the present and future semantic context during training and guides the model's predictions for the missing context, and 2) a semantic masking strategy that improves context-learning ability by encouraging acoustic prediction from previously corrupted semantic and acoustic input.
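The two mechanisms above can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the authors' implementation): `interleave` shows how per-step semantic and acoustic tokens could be arranged so a causal language model sees only past context plus the current semantic token, and `mask_semantics` shows one plausible form of the semantic masking strategy, where random semantic tokens are replaced by a mask id so the model must lean on earlier context. All names, the mask id, and the masking rate are assumptions for illustration.

```python
import random

MASK = -1  # hypothetical mask token id (assumption, not from the paper)

def interleave(semantic, acoustic):
    """Arrange per-step tokens as [s_1, a_1, s_2, a_2, ...] so a causal LM
    can predict a_t from the history plus the current semantic token,
    without seeing any future input (illustrative layout only)."""
    assert len(semantic) == len(acoustic)
    seq = []
    for s, a in zip(semantic, acoustic):
        seq.append(("sem", s))
        seq.append(("ac", a))
    return seq

def mask_semantics(semantic, p=0.15, rng=None):
    """Randomly corrupt semantic tokens during training so acoustic
    prediction must rely on surrounding context (semantic masking sketch;
    the rate p=0.15 is an assumed value)."""
    rng = rng or random.Random(0)
    return [MASK if rng.random() < p else s for s in semantic]
```

For example, `interleave([1, 2], [10, 20])` yields `[("sem", 1), ("ac", 10), ("sem", 2), ("ac", 20)]`; at the step that emits `("ac", 20)`, the model has seen everything earlier in the list but nothing later, which is what makes the process streamable.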
StreamVoice is the first language model-based streaming zero-shot VC model that requires no future look-ahead. Experimental results demonstrate its streaming conversion capability while preserving zero-shot performance comparable to that of non-streaming VC systems.
StreamVoice builds on the recognition-synthesis framework, a widely used paradigm in streaming zero-shot VC. Experiments show that it can convert speech while maintaining high speaker similarity, with performance on par with non-streaming voice conversion systems. The entire pipeline incurs only 124 ms of latency in the conversion process, running about 2.4 times faster than real time on a single A100 GPU.
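The "2.4 times faster than real time" figure can be made concrete with the standard real-time factor (RTF), the ratio of processing time to audio duration; RTF below 1 means faster than real time. A quick sketch with illustrative numbers (the 0.417 s processing time is assumed to make the arithmetic visible, not a reported measurement):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration.
    RTF < 1.0 means the system runs faster than real time."""
    return processing_seconds / audio_seconds

# Illustrative only: processing 1.0 s of audio in ~0.417 s
rtf = real_time_factor(0.417, 1.0)
speedup = 1.0 / rtf  # roughly 2.4x faster than real time
```

Note that RTF and latency are distinct: the 124 ms figure is the delay before converted audio emerges, while the 2.4x speedup describes sustained throughput.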
Potential future improvements include using more training data to strengthen StreamVoice's modeling abilities. The team also aims to enhance the streaming pipeline by integrating a high-fidelity, low-bitrate codec and a fully streaming model.
All credit for this research goes to the researchers of the project. The full paper detailing StreamVoice is linked from the MarkTechPost article.