Speech recognition technology has advanced steadily, yet latency (the time delay in processing spoken language) has remained a persistent hurdle. Latency is particularly noticeable in autoregressive models, which generate output one token at a time, so each step must wait for the previous one. These delays are problematic for real-time applications such as live captioning or virtual assistants, where immediate responses are crucial. The challenge is to reduce this latency without sacrificing accuracy.
Researchers at Google have proposed a non-autoregressive model, a departure from traditional methods, to address the latency issues in current systems. The model combines large language models with parallel processing: it handles speech segments simultaneously rather than sequentially, which reduces latency and improves the user experience.
The model combines the Universal Speech Model (USM), designed for accurate speech recognition, with the PaLM 2 language model, which excels in natural language processing. The USM has 2 billion parameters and a vocabulary of 16,384 wordpieces. It uses a Connectionist Temporal Classification (CTC) decoder for parallel processing and is trained on over 12 million hours of unlabeled audio and 28 billion sentences of text data. PaLM 2, meanwhile, has a 256,000-wordpiece vocabulary and scores Automatic Speech Recognition (ASR) hypotheses using a prefix language model scoring mode.
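To illustrate why a CTC decoder lends itself to parallel processing, here is a minimal sketch of greedy CTC decoding in Python. It is not the USM implementation: the blank index, the toy scores, and the function name `ctc_greedy_decode` are assumptions made for illustration. The point is that each frame's best token is chosen independently of the others, and the output is then formed by merging consecutive repeats and dropping blanks.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Greedy CTC decoding over a list of per-frame score lists.

    frame_scores[t][k] is the score of token id k at frame t.
    """
    # Each frame's argmax depends only on that frame, so this step
    # could be computed for all frames at once.
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    # CTC collapse rule: merge consecutive repeats, then drop blanks.
    return [tok for tok, _ in groupby(best) if tok != blank]


# Toy example with three token ids: 0 (blank), 1, and 2.
frames = [
    [0.1, 0.8, 0.1],    # token 1
    [0.1, 0.7, 0.2],    # token 1 again (merged with the previous frame)
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.1, 0.7],    # token 2
    [0.6, 0.1, 0.3],    # blank separates two genuine occurrences of token 2
    [0.1, 0.1, 0.8],    # token 2
]
decoded = ctc_greedy_decode(frames)
print(decoded)
```

The blank between the last two frames is what lets the decoder emit the same token twice in a row; without it, the repeats would be merged into one.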
In operation, the system processes audio in 8-second chunks. The USM encodes each chunk, the CTC decoder produces a lattice of candidate wordpieces, and the PaLM 2 model scores the candidates. Because the output refreshes every 8 seconds, the system provides near real-time responses.
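The chunk-and-rescore flow described above can be sketched roughly as follows. This is a simplified illustration, not the system itself: it rescores a flat list of candidate transcripts rather than a lattice, and the helper names (`split_into_chunks`, `rescore`), the `lm_weight` value, and the toy language-model scores are all hypothetical.

```python
def split_into_chunks(samples, sample_rate, chunk_seconds=8.0):
    """Split a sequence of audio samples into fixed-length chunks.

    The final chunk may be shorter than chunk_seconds.
    """
    size = int(sample_rate * chunk_seconds)
    if size <= 0:
        raise ValueError("chunk size must be at least one sample")
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def rescore(candidates, lm_score, lm_weight=0.3):
    """Pick the candidate with the best combined ASR + LM score.

    candidates: list of (text, asr_log_score) pairs.
    lm_score:   callable returning a log score for a text (assumed interface).
    lm_weight:  hypothetical interpolation weight, not a value from the source.
    """
    return max(candidates, key=lambda c: c[1] + lm_weight * lm_score(c[0]))[0]


# Toy audio: 20 "samples" at 1 sample per second gives chunks of 8, 8, and 4.
chunks = split_into_chunks(list(range(20)), sample_rate=1)

# Toy language model: a lookup table standing in for a real LM.
toy_lm = {"recognize speech": -2.0, "wreck a nice beach": -6.0}
candidates = [("recognize speech", -1.2), ("wreck a nice beach", -1.0)]
best = rescore(candidates, lm_score=toy_lm.__getitem__)
print([len(c) for c in chunks], best)
```

In this toy case the acoustic score alone slightly prefers the wrong transcript, and the language-model score tips the decision toward the more plausible one.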
The model was evaluated across multiple languages and data sets, including YouTube captioning and the FLEURS test set. On the multilingual FLEURS test set, it achieved an average relative improvement of 10.8% in word error rate (WER). On the more challenging YouTube captioning data set, it achieved an average improvement of 3.6% across all languages.
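For readers unfamiliar with the metric, the sketch below shows one common way to compute WER (word-level edit distance divided by reference length) and a relative improvement between two WER values. The function names and the example numbers are hypothetical and chosen only to make the arithmetic concrete; they are not results from the research.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Standard dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer, new_wer):
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Hypothetical numbers: a baseline WER of 10.0% falling to 8.92%
# corresponds to a 10.8% relative improvement.
gain = relative_improvement(10.0, 8.92)
print(round(wer, 4), round(gain, 3))
```

A relative improvement is measured against the baseline error rate, so a 10.8% relative gain is a much smaller change in absolute WER points.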
The research also explored factors affecting the model’s performance, including the size of the language model, which ranged from 128 million to 340 billion parameters. The findings pointed to a trade-off between model complexity and computational efficiency.
In conclusion, this non-autoregressive model, which integrates the USM and PaLM 2, delivers accuracy and speed suitable for real-time applications and represents a significant step forward in speech recognition technology. By processing speech in parallel and handling multilingual input efficiently, it offers a promising solution for several real-world applications. The research also provides insights into system parameters and their effects on ASR efficacy, paving the way for future advances in the field.