Deep learning has reshaped audio classification. Convolutional Neural Networks (CNNs) initially dominated the field, but the focus has since shifted to transformer-based architectures, which offer better performance and a unified way of handling diverse tasks. However, the computational complexity of self-attention makes transformers inefficient at processing long audio sequences.
Audio Spectrogram Transformers (ASTs) have become the dominant approach to audio classification, using self-attention to capture global context in audio data. That self-attention, however, comes at a high computational cost, which has motivated the exploration of state space models (SSMs) as more efficient alternatives. SSMs such as Mamba have shown promise in language and vision tasks but have not been widely adopted for audio classification.
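To make the cost argument concrete, the back-of-the-envelope sketch below compares the memory self-attention must devote to its score matrices with the fixed-size recurrent state an SSM carries. The head count, channel width, and state dimension are illustrative assumptions, not figures from the paper.

```python
# Illustrative comparison (assumed sizes, not taken from the Audio Mamba paper):
# self-attention materializes an L x L score matrix per head, while an SSM layer
# only carries a small fixed-size state, so its footprint does not grow with L^2.

def attention_score_floats(seq_len: int, num_heads: int = 12) -> int:
    """Floats needed for one layer's attention score matrices."""
    return num_heads * seq_len * seq_len

def ssm_state_floats(channels: int = 768, state_dim: int = 16) -> int:
    """Floats needed for the recurrent state of a (simplified) SSM layer."""
    return channels * state_dim

for seq_len in (512, 1024, 2048, 4096):
    print(f"tokens={seq_len:5d}  "
          f"attention scores: {attention_score_floats(seq_len):>13,} floats  "
          f"ssm state: {ssm_state_floats():,} floats")
```

Doubling the token count quadruples the attention-score storage, while the SSM state stays constant, which is the intuition behind looking past self-attention for long audio.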
Researchers from the Korea Advanced Institute of Science and Technology (KAIST) have introduced Audio Mamba (AuM), a self-attention-free model for audio classification. AuM processes audio spectrograms efficiently, removing the computational burden of self-attention while maintaining high performance, and it can handle long sequences without the quadratic scaling of transformers.
AuM converts input audio waveforms into spectrograms, which are divided into patches. These patches are turned into embedding tokens and processed with bidirectional state space models, capturing global context while improving processing speed and reducing memory usage. The architecture also incorporates design choices that help the model capture the spatial structure of the spectrogram input.
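As a rough picture of this pipeline, here is a minimal PyTorch-style sketch. It is not the authors' implementation: the patch size, embedding width, depth, pooling strategy, and the toy linear recurrence standing in for Mamba's selective state space blocks are all assumptions made for illustration. Only the overall flow (spectrogram → patches → tokens → bidirectional scans → classifier) mirrors the description above.

```python
# Simplified AuM-style pipeline (illustrative sketch, NOT the official model).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a (freq x time) spectrogram into patches and project them to tokens."""
    def __init__(self, patch=16, dim=192):
        super().__init__()
        self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)

    def forward(self, spec):                  # spec: (B, 1, F, T)
        x = self.proj(spec)                   # (B, dim, F/patch, T/patch)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, dim)

class BiSSMBlock(nn.Module):
    """Toy bidirectional state-space block: one linear-time scan per direction."""
    def __init__(self, dim=192, state=16):
        super().__init__()
        self.in_proj = nn.Linear(dim, state)
        self.out_proj = nn.Linear(2 * state, dim)
        self.decay = nn.Parameter(torch.full((state,), 0.9))

    def scan(self, u):                        # u: (B, L, state)
        h = torch.zeros(u.size(0), u.size(-1), device=u.device)
        outs = []
        for t in range(u.size(1)):            # cost grows linearly with length
            h = self.decay * h + u[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

    def forward(self, x):
        u = self.in_proj(x)
        fwd = self.scan(u)                    # left-to-right context
        bwd = self.scan(u.flip(1)).flip(1)    # right-to-left context
        return x + self.out_proj(torch.cat([fwd, bwd], dim=-1))

class AudioMambaSketch(nn.Module):
    def __init__(self, num_classes=527, dim=192, depth=4):
        super().__init__()
        self.embed = PatchEmbed(dim=dim)
        self.blocks = nn.Sequential(*[BiSSMBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, spec):
        tokens = self.embed(spec)             # patches -> embedding tokens
        tokens = self.blocks(tokens)          # bidirectional SSM processing
        return self.head(tokens.mean(dim=1))  # pool tokens, then classify

model = AudioMambaSketch()
logits = model(torch.randn(2, 1, 128, 1024))  # batch of 128-mel spectrograms
print(logits.shape)                           # torch.Size([2, 527])
```

The key point the sketch is meant to convey is that each block touches every token with a forward and a backward pass whose cost is linear in the number of patches, rather than building a token-by-token attention matrix.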
Audio Mamba demonstrated competitive performance across various benchmarks and significantly outperformed AST on tasks involving long audio sequences. Tests showed that AuM requires considerably less memory and processing time, consuming about as much memory as AST's smaller variant while delivering better performance. Its inference was 1.6 times faster than AST's at a token count of 4096, underscoring its efficiency on long sequences.
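Readers who want to see this kind of scaling trend on their own hardware can use a small timing harness like the one below. The layer sizes and sequence lengths are assumptions, the scan-like pass is only a stand-in for a real Mamba block, and the absolute numbers will differ from the paper's benchmark; the point is how the two curves diverge as the token count grows.

```python
# Toy timing harness (assumed sizes; shows the trend, not the paper's numbers).
import time
import torch
import torch.nn as nn

dim = 192
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True).eval()
lin = nn.Linear(dim, dim).eval()   # stand-in for the per-token work of a scan

def time_fn(fn, x, repeats=3):
    with torch.no_grad():
        fn(x)                       # warm-up run
        start = time.perf_counter()
        for _ in range(repeats):
            fn(x)
        return (time.perf_counter() - start) / repeats

for length in (512, 1024, 2048, 4096):
    x = torch.randn(1, length, dim)
    t_attn = time_fn(lambda t: attn(t, t, t)[0], x)                  # quadratic in length
    t_scan = time_fn(lambda t: torch.cumsum(lin(t), dim=1), x)       # linear in length
    print(f"tokens={length:5d}  attention: {t_attn*1e3:7.1f} ms  "
          f"scan-like: {t_scan*1e3:7.1f} ms")
```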
The introduction of Audio Mamba marks a significant advance in audio classification, providing a viable alternative to transformers for processing long audio sequences. The researchers believe AuM's approach may pave the way for future developments in audio and multimodal learning. The model's ability to handle lengthy audio will become increasingly important with the rise of self-supervised multimodal learning and generation and of automatic speech recognition. AuM could also be employed in self-supervised setups such as Audio Masked Auto Encoders or in multimodal learning tasks, further contributing to the field.