Researchers from CMU Launch a Superior, Speedier Open Whisper-Style Speech Model, OWSM v3.1, Enhanced with E-Branchformer Technology

Speech recognition technology has become essential across many applications, enabling machines to recognize and process human speech. Achieving accurate recognition across different languages and dialects remains challenging because of accents, intonation, and background noise.

Various methods have been tried to enhance speech recognition systems, including complex architectures such as the Transformer, which are effective but constrained by inference speed and by their ability to capture the full range of speech nuances.

A research team from Carnegie Mellon University and Honda Research Institute Japan has introduced a new model, OWSM v3.1, built on the E-Branchformer architecture, which overcomes these hurdles. OWSM v3.1 is an Open Whisper-style Speech Model that is faster and more accurate than its predecessor.

The previous OWSM v3, like Whisper, used a Transformer encoder-decoder architecture. However, advances in encoders such as Conformer and Branchformer have yielded better performance on speech processing tasks, so E-Branchformer was chosen as the encoder in OWSM v3.1. The new model also excludes the WSJ training data used in the previous version, whose transcripts were fully uppercased. This exclusion contributed to a significantly lower Word Error Rate (WER), and OWSM v3.1 achieves up to 25% faster inference.
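Word Error Rate, the metric cited above, is the word-level edit distance (substitutions, insertions, deletions) between a reference transcript and the model's hypothesis, divided by the reference length. The minimal sketch below is illustrative only, not the evaluation code used in the paper; note that under case-sensitive scoring, a fully uppercased reference like the WSJ transcripts would count every word as an error against lowercase output:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # match or substitution
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat", "the cat")` is 1/3 (one deletion over three reference words), while a case mismatch such as `wer("HELLO WORLD", "hello world")` scores 1.0 even though the words are "right".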

OWSM v3.1 delivers significant improvements over its predecessor, with higher speech recognition accuracy across different languages. Gains also appear in English-to-X translation, where 9 of 15 directions improve and the average BLEU score rises slightly from 13.0 to 13.3, despite minor degradations in some directions.
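The BLEU scores quoted for the translation directions combine modified n-gram precision (up to 4-grams) with a brevity penalty. The following is a minimal single-sentence sketch of that formula for intuition; published numbers like the 13.0 → 13.3 average come from standardized corpus-level tooling, not code like this:

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = ngram_counts(hypothesis, n)
        ref_ngrams = ngram_counts(reference, n)
        total = sum(hyp_ngrams.values())
        if total == 0:
            return 0.0  # hypothesis too short to form n-grams
        # Clip each hypothesis n-gram count by its count in the reference
        clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        if clipped == 0:
            return 0.0  # no overlap at this order
        log_prec += math.log(clipped / total) / max_n
    # Brevity penalty: punish hypotheses shorter than the reference
    if len(hypothesis) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(len(hypothesis), 1))
    return bp * math.exp(log_prec)
```

A perfect match scores 1.0 (reported as 100 on the usual 0-100 scale), and a hypothesis sharing no words with the reference scores 0.0.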

In sum, this research takes significant steps toward improving speech recognition technology. OWSM v3.1, built on the E-Branchformer architecture, improves both accuracy and efficiency, setting a new standard for open-source speech recognition. By sharing the model and training details publicly, the researchers demonstrate a commitment to transparency and open science, laying groundwork for future advances in speech recognition technology.

Credit for this work goes to the researchers behind the project; the research paper and demo are publicly available for the community to explore.
