
Tango 2: Pioneering the Future of Text-to-Audio Conversion and Its Outstanding Performance Indicators

Rising demand for AI-generated content, driven by generative AI models such as ChatGPT, Gemini, and Bard, has amplified the need for high-quality text-to-audio, text-to-image, and text-to-video models. Recently, supervised fine-tuning-based direct preference optimisation (DPO) has become a prevalent alternative to traditional reinforcement learning methods for aligning Large Language Model (LLM) responses with human preferences.

A group of researchers applied the Diffusion-DPO approach to better align the output audio of a text-to-audio model with its input prompts. To do so, they used the Diffusion-DPO loss to fine-tune Tango, a publicly available text-to-audio latent diffusion model, on a synthesised preference dataset called Audio-Alpaca. In this dataset, each text prompt is paired with preferred and undesired audio outputs. The undesired outputs lack certain concepts from the prompt, present events in the wrong temporal order, or contain excessive noise, whereas the preferred outputs faithfully capture their written descriptions.
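The preference objective described above can be illustrated with a minimal sketch of a Diffusion-DPO-style loss for a single preference pair. This is a simplified illustration, not the Tango 2 training code: the function names, the toy vectors, and the `beta` value are assumptions, and the timestep weighting and the averaging over timesteps and noise samples used in practice are omitted. The idea is that the fine-tuned model is rewarded when it denoises the preferred audio better than a frozen reference model does, and penalised when it does the same for the undesired audio.

```python
import math


def sq_err(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def diffusion_dpo_loss(noise_w, pred_w, ref_pred_w,
                       noise_l, pred_l, ref_pred_l, beta=1.0):
    """Simplified Diffusion-DPO-style loss for one (winner, loser) pair.

    noise_*    : true noise added to the winner (w) / loser (l) latent
    pred_*     : noise predicted by the model being fine-tuned
    ref_pred_* : noise predicted by the frozen reference model
    beta       : hypothetical scale factor (folds in constant weights)
    """
    # Negative when the fine-tuned model beats the reference on the winner.
    winner_diff = sq_err(noise_w, pred_w) - sq_err(noise_w, ref_pred_w)
    # Positive when the fine-tuned model is worse than the reference on the loser.
    loser_diff = sq_err(noise_l, pred_l) - sq_err(noise_l, ref_pred_l)
    logit = -beta * (winner_diff - loser_diff)
    # -log(sigmoid(logit)) equals softplus(-logit).
    return softplus(-logit)


# Toy example: the model matches the winner's noise exactly and misses the loser's.
improved = diffusion_dpo_loss(
    noise_w=[1.0, 0.0], pred_w=[1.0, 0.0], ref_pred_w=[0.0, 0.0],
    noise_l=[0.0, 1.0], pred_l=[0.0, 0.0], ref_pred_l=[0.0, 1.0],
)
# Baseline: the model is identical to the reference, so the loss is log(2).
baseline = diffusion_dpo_loss(
    noise_w=[1.0, 0.0], pred_w=[0.0, 0.0], ref_pred_w=[0.0, 0.0],
    noise_l=[0.0, 1.0], pred_l=[0.0, 1.0], ref_pred_l=[0.0, 1.0],
)
print(improved < baseline)
```

When the fine-tuned model and the reference agree, the loss sits at log(2); it falls as the model fits the preferred audio better and the undesired audio worse than the reference does.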

Because automatically synthesised preference pairs can be noisy, the team kept only a subset of Audio-Alpaca, selected using thresholds on CLAP-score differentials. Fine-tuning Tango on this data produced Tango 2, a model that significantly outperforms both its predecessor and AudioLDM2 in human and objective evaluations.
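The subset selection described above can be sketched as a simple filter over preference pairs. This is an illustrative sketch under stated assumptions: the `PreferencePair` class, the `filter_pairs` function, the score values, and both thresholds are hypothetical placeholders, not values or code taken from the article. It assumes each pair already carries a CLAP score (text-audio similarity) for its preferred and undesired audio.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    winner_clap: float  # CLAP similarity between prompt and preferred audio
    loser_clap: float   # CLAP similarity between prompt and undesired audio


def filter_pairs(pairs, min_winner_score=0.45, min_gap=0.08):
    """Keep pairs whose winner matches the prompt well and clearly beats the loser.

    Both thresholds are illustrative placeholders.
    """
    kept = []
    for pair in pairs:
        gap = pair.winner_clap - pair.loser_clap
        if pair.winner_clap >= min_winner_score and gap >= min_gap:
            kept.append(pair)
    return kept


# Made-up scores for demonstration only.
pairs = [
    PreferencePair("a dog barks, then a car passes", 0.62, 0.40),  # clear gap: kept
    PreferencePair("rain on a tin roof", 0.50, 0.48),              # gap too small: dropped
    PreferencePair("a bell rings twice", 0.30, 0.10),              # weak winner: dropped
]
selected = filter_pairs(pairs)
print([pair.prompt for pair in selected])
```

Requiring both a minimum winner score and a minimum gap removes pairs where the "preferred" audio is itself a poor match, as well as pairs where the two outputs are too close in quality to give a useful preference signal.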

Tango 2’s performance also highlights the potential of Diffusion-DPO for improving text-to-audio models. The researchers’ primary contributions include a cost-effective method for semi-automatically producing a preference dataset for text-to-audio generation, which can aid model training. They also made the Audio-Alpaca dataset available to the research community for future benchmarking and development of text-to-audio generation methods. Most notably, Tango 2 improved on its predecessor while relying only on the same dataset, which supports the efficacy of the methodology.
