Automated Audio Captioning (AAC) is a growing field of study that focuses on translating audio streams into clear, concise text. Building AAC systems typically requires large amounts of accurately annotated audio-text data. However, the traditional method of manually aligning audio segments with text annotations is not only laborious and costly but also prone to inconsistency and annotator bias.
Prior AAC research has largely relied on encoder-decoder architectures in which audio encoders such as PANN, AST, and HTSAT extract acoustic features, and language generation components like BART and GPT-2 decode those features into captions. The CLAP model improves on this by using contrastive learning to align audio and text in a shared multimodal embedding space. Various techniques, such as adversarial training and contrastive losses, are used to further refine AAC systems, improving the diversity and accuracy of the generated captions while also addressing the vocabulary limitations found in earlier models.
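To make the contrastive-alignment idea concrete, here is a minimal PyTorch sketch of a CLAP-style symmetric contrastive loss over a batch of paired audio and text embeddings. The function name, embedding shapes, and temperature value are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive (InfoNCE-style) loss over paired audio/text embeddings."""
    audio_emb = F.normalize(audio_emb, dim=-1)       # (B, D) unit-norm audio embeddings
    text_emb = F.normalize(text_emb, dim=-1)         # (B, D) unit-norm text embeddings
    logits = audio_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching audio/text pairs sit on the diagonal: pull them together,
    # push mismatched pairs apart in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```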
Researchers from Microsoft and Carnegie Mellon University propose a new methodology for training AAC systems with the CLAP model that uses only text data during training, fundamentally changing the traditional AAC training process. The resulting system can generate audio captions without ever learning directly from audio inputs, a significant advancement in AAC technology.
In the proposed methodology, AAC systems are trained on text alone with the help of the CLAP model. During training, a decoder is conditioned on embeddings from the CLAP text encoder and learns to generate captions. At inference time, the text encoder is swapped for the CLAP audio encoder so the system can operate on real audio inputs. Because audio and text embeddings do not coincide exactly in CLAP's shared space, the research team bridged this modality gap with a combination of Gaussian noise injection and a learnable adapter, which kept the system's performance consistent when the encoders were switched. The model was evaluated on two datasets, AudioCaps and Clotho.
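The following PyTorch sketch illustrates the training and inference flow described above under stated assumptions: `clap_text_encode`, `clap_audio_encode`, the `decoder` interface, and all hyperparameters are hypothetical placeholders, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small learnable mapping intended to help close the audio-text modality gap."""
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)  # residual refinement of the CLAP embedding


def training_step(captions, clap_text_encode, adapter, decoder, noise_std=0.1):
    """Text-only training: condition the decoder on noisy CLAP text embeddings."""
    with torch.no_grad():
        z = clap_text_encode(captions)          # (B, D) frozen CLAP text embeddings
    z = z + noise_std * torch.randn_like(z)     # Gaussian noise injection (std is a tunable assumption)
    z = adapter(z)                              # learnable adapter
    return decoder(prefix=z, targets=captions)  # standard captioning (cross-entropy) loss


@torch.no_grad()
def caption_audio(waveforms, clap_audio_encode, adapter, decoder):
    """Inference: swap in the CLAP audio encoder, reuse the adapter and decoder."""
    z = adapter(clap_audio_encode(waveforms))
    return decoder.generate(prefix=z)
```

The key design point the sketch tries to capture is that the decoder and adapter never see audio during training; only the choice of CLAP encoder changes between training and inference.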
Evaluation of the text-only AAC methodology showed strong results. The model achieved SPIDEr scores of 0.456 on the AudioCaps dataset and 0.255 on the Clotho dataset, competitive with state-of-the-art AAC systems trained on paired audio-text data. The Gaussian noise injection and learnable adapter also proved effective at bridging the modality gap, reducing the variance in the embeddings to roughly 0.015. These quantitative results support the effectiveness of text-only training for generating accurate and relevant audio captions.
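For reference, SPIDEr is simply the arithmetic mean of the SPICE and CIDEr scores, so each reported number reflects both semantic-graph agreement and consensus n-gram overlap with the reference captions:

```python
def spider(spice: float, cider: float) -> float:
    """SPIDEr is the arithmetic mean of the SPICE and CIDEr scores."""
    return 0.5 * (spice + cider)
```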
In conclusion, text-only training with the CLAP model eliminates the need for paired audio-text data in AAC. The approach simplifies AAC system development, improves scalability, and reduces dependence on expensive data annotation, and it shows significant promise for broadening the use and accessibility of audio captioning technologies. Future work could see this method adopted widely in the development and training of other AAC systems.