Automatic Speech Recognition (ASR) systems have improved considerably in recent years, and a new approach from Apple, known as Acoustic Model Fusion (AMF), shows particularly promising results. AMF integrates an external Acoustic Model (AM) into an End-to-End (E2E) ASR system to address a common problem in speech recognition: domain mismatch, where the speech a system encounters in use differs from the data it was trained on.
E2E ASR systems consolidate all the components of speech recognition into a single neural network, which simplifies the pipeline and lets the system predict character or word sequences directly from audio input. However, these systems struggle with rare or complex words that are underrepresented in their training data. Previous research has mainly addressed this by incorporating external Language Models (LMs) to expand the system's vocabulary.
AMF takes a different route: it integrates an external AM with the E2E system, bringing in broader acoustic knowledge. The method combines scores from the external AM with those of the E2E system, which reduces Word Error Rate (WER) and improves recognition of rare words and named entities in particular.
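The score combination described above can be illustrated with a minimal sketch. The article does not give the fusion formula, so this example assumes a simple log-linear interpolation applied to an n-best list; the `Hypothesis` class, the `am_weight` parameter, and all score values are hypothetical and are not taken from Apple's implementation.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    e2e_score: float  # log-score from the E2E model (hypothetical value)
    am_score: float   # log-score from the external AM (hypothetical value)


def fuse_score(hyp: Hypothesis, am_weight: float) -> float:
    """Assumed fusion rule: E2E score plus a weighted external AM score."""
    return hyp.e2e_score + am_weight * hyp.am_score


def rescore(hyps: list[Hypothesis], am_weight: float) -> Hypothesis:
    """Return the hypothesis with the highest fused score."""
    return max(hyps, key=lambda h: fuse_score(h, am_weight))


# Two made-up candidates: the E2E model alone prefers the common spelling,
# while the external AM scores the rarer name higher.
nbest = [
    Hypothesis("call jon smith", e2e_score=-4.0, am_score=-12.0),
    Hypothesis("call joan smyth", e2e_score=-4.5, am_score=-8.0),
]

print(rescore(nbest, am_weight=0.0).text)  # E2E scores only
print(rescore(nbest, am_weight=0.3).text)  # with the external AM fused in
```

With a weight of zero the ranking depends on the E2E scores alone; raising the weight lets the external AM change which candidate wins. In practice such a weight would normally be tuned on held-out data.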
In experiments on diverse datasets, AMF reduced WER by up to 14.3% across different test sets. The gains were most visible on rare words and named entities, which points to the method's potential to broaden the vocabulary an ASR system handles reliably and to make it more adaptable. Notably, AMF outperforms traditional LM integration methods.
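For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below shows that standard calculation along with a relative-reduction helper. The article does not say whether the 14.3% figure is relative or absolute, so `relative_reduction` is included only as one common way such improvements are reported; the function names and example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Made-up example: two of the four reference words are misrecognized.
print(word_error_rate("call joan smyth now", "call jon smith now"))
```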
These results mark a meaningful advance in speech recognition, pointing toward more accurate, efficient, and adaptable ASR systems. By reducing domain mismatch and improving word recognition, AMF opens up possibilities for applying ASR across a wider range of domains and supports better human-computer interaction through speech.