Apple researchers are working to make interactions with virtual assistants more natural. A key challenge is accurately recognizing when an utterance is directed at the device rather than at another person, especially amid background noise and overlapping speech. To address this, Apple is introducing a multimodal approach to device-directed speech detection.
This method uses a large language model (LLM) to combine several kinds of information, including acoustic signals, linguistic content, and outputs from an automatic speech recognition (ASR) system, to decide whether speech is directed at the device. It addresses the weaknesses of earlier methods, which faltered in noisy environments and on ambiguous utterances.
Technically, the researchers first train classifiers on acoustic features derived from audio waveforms. Decoder outputs of the ASR system, including hypotheses and other lexical features, are then used as inputs to the LLM. The final stage fuses these acoustic and lexical features with ASR decoder signals into a single system for detecting device-directed speech.
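The paper does not ship code, but a minimal sketch can make the fusion step concrete. The snippet below (PyTorch, with GPT-2 standing in for the LLM) shows one way acoustic embeddings and ASR decoder signals could be projected into the LLM's embedding space as prefix tokens alongside the ASR hypothesis text. The feature dimensions, projection layers, and model choice are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' implementation) of fusing acoustic features,
# ASR decoder signals, and the ASR text hypothesis in an LLM-based
# device-directedness classifier. GPT-2 stands in for the LLM.
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer


class DeviceDirectedSpeechClassifier(nn.Module):
    def __init__(self, audio_dim=512, decoder_signal_dim=8, llm_name="gpt2"):
        super().__init__()
        self.llm = GPT2Model.from_pretrained(llm_name)
        hidden = self.llm.config.n_embd
        # Project acoustic embeddings and ASR decoder signals (e.g. hypothesis
        # confidences) into the LLM embedding space as two prefix tokens.
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.signal_proj = nn.Linear(decoder_signal_dim, hidden)
        self.head = nn.Linear(hidden, 1)  # binary: device-directed or not

    def forward(self, audio_emb, decoder_signals, hypothesis_ids):
        text_emb = self.llm.wte(hypothesis_ids)                  # (B, T, H)
        prefix = torch.stack(
            [self.audio_proj(audio_emb), self.signal_proj(decoder_signals)], dim=1
        )                                                        # (B, 2, H)
        fused = torch.cat([prefix, text_emb], dim=1)
        out = self.llm(inputs_embeds=fused).last_hidden_state
        return self.head(out[:, -1])                             # one score per utterance


tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = DeviceDirectedSpeechClassifier()
hyp = tokenizer("set a timer for ten minutes", return_tensors="pt").input_ids
score = model(torch.randn(1, 512), torch.randn(1, 8), hyp)  # dummy features
print(torch.sigmoid(score))  # probability the utterance is device-directed
```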
The success of the multimodal system shows up in its performance metrics: it reduces the equal error rate (EER) by up to 39% and 61% relative to text-only and audio-only models, respectively. Increasing the LLM's size and applying low-rank adaptation brought further EER reductions of up to 18% on their dataset.
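For reference, the equal error rate is the operating point where the false-acceptance rate equals the false-rejection rate. Here is a short sketch of how EER is commonly computed from classifier scores, using scikit-learn's ROC curve; the labels and scores are made-up examples.

```python
# Compute equal error rate (EER): the threshold where false-acceptance
# and false-rejection rates are equal. Illustrative data only.
import numpy as np
from sklearn.metrics import roc_curve


def equal_error_rate(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))      # point where the two error rates cross
    return (fpr[idx] + fnr[idx]) / 2


labels = np.array([1, 1, 0, 0, 1, 0, 1, 0])    # 1 = device-directed utterance
scores = np.array([0.9, 0.8, 0.4, 0.35, 0.6, 0.55, 0.7, 0.2])  # classifier scores
print(f"EER = {equal_error_rate(labels, scores):.2%}")
```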
With this research, Apple is setting a new bar for interactions with virtual assistants. The headline results are an EER of 7.95% with the Whisper audio encoder and 7.45% with the CLAP backbone. The work showcases the potential of merging text, audio, and decoder signals from an ASR system, edging closer to a future where virtual assistants can understand and respond to user commands without explicit trigger phrases.
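As a rough illustration of how a Whisper encoder can supply utterance-level audio embeddings for such a classifier, the sketch below uses Hugging Face Transformers; the paper's actual feature-extraction and pooling setup may differ.

```python
# A minimal sketch, assuming Hugging Face Transformers, of extracting an
# utterance-level audio embedding from a Whisper encoder.
import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base")

waveform = np.random.randn(16000).astype(np.float32)   # 1 s of dummy 16 kHz audio
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    encoder_out = model.encoder(inputs.input_features)
audio_embedding = encoder_out.last_hidden_state.mean(dim=1)  # (1, hidden_dim) pooled vector
print(audio_embedding.shape)
```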
By combining multimodal information with the processing capabilities of large language models, this research is a stepping stone toward the next generation of virtual assistants. In essence, Apple's work aims to make interactions with devices as intuitive as human-to-human communication, which could fundamentally change our relationship with technology. All credit for this research goes to the project's investigators, and their paper is available for readers who want to learn more.