Large language models (LLMs) have proven successful at natural language tasks and instruction following, yet they struggle with non-textual data such as images and audio. Integrating textual LLMs with speech encoders in a single training setup offers a way past this limitation: multimodal audio-language models are attractive because they can generalize across a wide variety of speech tasks.
To optimize these capabilities, multi-task learning is employed: it exploits representations shared across tasks to improve both efficiency and generalization. Although models like T5 and SpeechNet have applied this approach to text and speech tasks with strong results, audio-integrated large language models have received comparatively little attention.
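To make the multi-task idea concrete, here is a minimal sketch of a shared encoder trained with per-task heads, where gradients from every task shape one common representation. All module names, sizes, and tasks below are illustrative assumptions, not details from SpeechVerse.

```python
import torch
import torch.nn as nn

# Minimal multi-task setup: one shared encoder, one lightweight head per task.
# Names, dimensions, and tasks are illustrative, not taken from SpeechVerse.
class SharedEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(80, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):                     # x: (batch, time, 80) audio features
        return self.net(x)

encoder = SharedEncoder()
heads = nn.ModuleDict({
    "asr": nn.Linear(256, 1000),              # e.g. a token vocabulary
    "intent": nn.Linear(256, 20),             # e.g. intent classes
})
opt = torch.optim.AdamW(list(encoder.parameters()) + list(heads.parameters()), lr=1e-4)

def training_step(batches):
    # batches: {task_name: (features, labels)}; the shared encoder sees every
    # task, so all task losses backpropagate into the same representation.
    loss = 0.0
    for task, (x, y) in batches.items():
        logits = heads[task](encoder(x).mean(dim=1))   # pool over time, classify
        loss = loss + nn.functional.cross_entropy(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```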
The SpeechVerse framework developed by Amazon researchers addresses this gap. It is a multi-task framework built on supervised instruction finetuning and designed for diverse speech tasks. Unlike prior models, SpeechVerse uses continuous representations from pre-trained speech models to perform tasks with text-only output, and it combines multi-task learning with instruction finetuning so that no task-specific tagging is required.
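The sketch below illustrates what dropping task-specific tags can look like in practice: each training sample pairs audio with a natural-language instruction rather than a fixed tag such as `<ASR>`. The instruction wording and prompt format are invented for illustration, not quoted from the paper.

```python
# Each sample describes its task in plain language, so the same audio can serve
# multiple tasks and unseen instructions can still be followed at test time.
samples = [
    {"audio": "utt_001.wav",
     "instruction": "Transcribe the speech into text.",
     "target": "play some jazz music"},
    {"audio": "utt_001.wav",
     "instruction": "What is the speaker's intent?",
     "target": "play_music"},
]

def build_prompt(sample):
    # The model conditions on the instruction plus the encoded audio; <audio>
    # marks where audio features are spliced into the input sequence.
    return f"Instruction: {sample['instruction']}\nAudio: <audio>\nAnswer:"
```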
The SpeechVerse architecture has three components: an audio encoder, a convolutional downsampling module, and a large language model (LLM). The pre-trained encoder extracts features from the audio and produces a unified representation. The downsampling module then shortens the audio feature sequence so its length is compatible with LLM token sequences. By processing text and audio jointly, combining the downsampled audio features with token embeddings, and training with curriculum learning and parameter-efficient finetuning, the model handles a wide array of speech tasks.
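A minimal sketch of this pipeline follows: frozen audio encoder outputs pass through a strided convolutional downsampler, get projected into the LLM's embedding space, and are concatenated with text token embeddings. The dimensions, stride, and layer choices are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AudioToLLMAdapter(nn.Module):
    def __init__(self, audio_dim=1024, llm_dim=4096, stride=4):
        super().__init__()
        # A strided 1D convolution shortens the audio sequence by `stride`x so
        # its length becomes comparable to LLM token sequences.
        self.downsample = nn.Conv1d(audio_dim, audio_dim, kernel_size=stride, stride=stride)
        self.project = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats):                       # (batch, time, audio_dim)
        x = self.downsample(audio_feats.transpose(1, 2))  # convolve over time
        return self.project(x.transpose(1, 2))            # (batch, time/stride, llm_dim)

adapter = AudioToLLMAdapter()
audio_feats = torch.randn(2, 400, 1024)   # stand-in for frozen encoder outputs
text_embeds = torch.randn(2, 32, 4096)    # stand-in for LLM token embeddings
# The LLM consumes audio and text embeddings as a single sequence.
inputs = torch.cat([adapter(audio_feats), text_embeds], dim=1)  # (2, 132, 4096)
```

Consistent with the parameter-efficient finetuning the article mentions, one would typically train only the adapter (and, say, low-rank layers inside the LLM) while keeping the speech encoder and LLM weights frozen.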
End-to-end trained joint speech and language models (E2E-SLM) built with the SpeechVerse framework were evaluated on 11 tasks spanning various domains and datasets. The results were promising, demonstrating the efficacy of SpeechVerse's core speech understanding; on spoken language understanding (SLU) tasks, the end-to-end trained models outperformed cascaded pipelines in most cases.
In summary, SpeechVerse is a framework from Amazon researchers that enables LLMs to handle diverse speech processing tasks through natural language instructions. Using supervised instruction finetuning and combining representations from pre-trained speech and text models, SpeechVerse generalizes well to unseen tasks. It outperformed conventional baselines on 9 of 11 tasks, demonstrating robust instruction following. With strong results on out-of-domain datasets, unseen prompts, and novel tasks, SpeechVerse shows that its training approach effectively boosts generalizability.