Self-supervised learning (SSL) has broadened the reach of speech technology by reducing the need for labeled data. However, current models support only about 100-150 of the world's more than 7,000 languages. This is largely because transcribed speech is scarce: only about half of these languages have formal writing systems, and fewer still have the resources to produce the extensive annotated data needed for training. SSL models can learn from unlabeled audio, yet in practice they remain limited to a narrow range of languages.
XEUS (Cross-lingual Encoder for Universal Speech) pushes multilingual SSL much further, expanding coverage to 4,057 languages and surpassing models such as Meta's MMS. To handle noisy and diverse speech, XEUS incorporates a novel dereverberation objective during pre-training. Unlike many state-of-the-art models that lack transparency and train on closed datasets, XEUS is fully open: its data, training code, and thorough documentation are released to the public, encouraging further research into large-scale multilingual SSL.
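To make the dereverberation objective concrete, here is a minimal NumPy sketch of the underlying idea: reverberation is simulated by convolving clean audio with a room impulse response (RIR), the model sees the reverberant signal as input, and its prediction targets are derived from the clean signal. The function names and the synthetic RIR are illustrative assumptions, not code from the XEUS release.

```python
import numpy as np

def add_reverb(clean: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Simulate a reverberant recording by convolving clean audio with an RIR."""
    wet = np.convolve(clean, rir)[: len(clean)]
    # Rescale so the reverberant signal keeps roughly the clean signal's energy.
    wet *= np.linalg.norm(clean) / (np.linalg.norm(wet) + 1e-8)
    return wet

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)          # stand-in for 1 s of speech at 16 kHz
# Synthetic exponentially decaying impulse response (a crude room model).
rir = np.exp(-np.linspace(0.0, 8.0, 4000)) * rng.standard_normal(4000)

reverberant = add_reverb(clean, rir)
# During pre-training, the encoder would receive `reverberant` as input,
# while its learning targets come from `clean` -- so the model is pushed
# to produce representations that are invariant to room acoustics.
```

Conceptually, this is data augmentation with a denoising-style target: the network never needs transcripts, only paired (reverberant, clean) views of the same utterance.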
XEUS was pre-trained on a massive dataset of 1.081 million hours of speech spanning 4,057 languages, compiled from varied sources such as the Global Recordings Network, WikiTongues, Jesus Dramas, and 37 public speech datasets. Unusual data types, such as accented speech and code-switching, enhance its robustness. Training ran on 64 NVIDIA A100 GPUs and used advanced augmentation techniques, covering a far broader range of data than previous models.
In evaluations, XEUS has excelled at multilingual speech tasks, surpassing state-of-the-art models such as XLS-R, MMS, and w2v-BERT on benchmarks like ML-SUPERB and FLEURS. This is particularly impressive given that many of the languages it was trained on are low-resource. XEUS also shows task universality, performing strongly even on English-only tasks. In terms of acoustic representation, XEUS beats models like WavLM and w2v-BERT at producing high-quality speech, as indicated by higher MOS and lower WER scores.
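Since WER appears throughout these evaluations, a short self-contained implementation clarifies the metric: it is the word-level edit distance between a reference transcript and a hypothesis, divided by the reference length. This is the standard definition, not code from XEUS's evaluation scripts.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words (Levenshtein distance).
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)

# One deleted word out of six reference words -> WER of 1/6.
score = wer("the cat sat on the mat", "the cat sat on mat")
```

Lower is better, which is why a strong encoder yields both higher MOS (perceived quality) and lower WER (recognition accuracy).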
XEUS is a robust SSL speech encoder trained on over 1 million hours of data from 4,057 languages, demonstrating unprecedented performance across a broad range of multilingual and low-resource tasks. Its dereverberation objective enhances its robustness, and even with limited data for many languages, it still produces valuable results. By opening access to its data and model, XEUS advances multilingual research.
However, ethical considerations must be addressed, including the responsible handling of speech data from indigenous communities and the prevention of misuse such as audio deepfakes. The team behind XEUS is also working to integrate the model into accessible platforms, aiming to democratize speech model development.
Credit for this research goes to researchers from Carnegie Mellon University, Shanghai Jiao Tong University, and the Toyota Technological Institute at Chicago. The model, training configurations, checkpoints, and training logs will be released for further research.