
Tango 2: The Emerging Frontier in Text-to-Audio Synthesis and Its Outstanding Performance Indicators

As demand for AI-generated content continues to grow, particularly in the multimedia realm, the need for high-quality, fast models for text-to-audio, text-to-image, and text-to-video generation has never been greater. Particular emphasis is placed on making the outputs of these models more faithful to their input prompts.

Direct preference optimization (DPO), a supervised fine-tuning-based technique, is a recent approach for aligning Large Language Model (LLM) responses with human preferences. The method has since been adapted to diffusion models, where it aligns denoised outputs with human judgements.
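
As a rough illustration of how the DPO objective carries over to diffusion models, the sketch below computes a pairwise loss over denoising errors in the spirit of the Diffusion-DPO formulation. The tensor names, the `beta` value, and the reduction over latent dimensions are assumptions made for illustration, not details taken from Tango 2's implementation.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_pred_w, eps_pred_l, eps_ref_w, eps_ref_l,
                       noise_w, noise_l, beta=2000.0):
    """Pairwise DPO-style loss on denoising (epsilon-prediction) errors.

    eps_pred_*: noise predicted by the model being fine-tuned for the
                preferred (w) and dispreferred (l) noised latents.
    eps_ref_*:  noise predicted by the frozen reference model.
    noise_*:    the true noise added in the forward diffusion step.
    beta:       strength of the pull toward the reference model (illustrative).
    """
    # Squared denoising errors, averaged over all latent dimensions per sample.
    dims = tuple(range(1, noise_w.dim()))
    err_w = (eps_pred_w - noise_w).pow(2).mean(dim=dims)
    err_l = (eps_pred_l - noise_l).pow(2).mean(dim=dims)
    err_ref_w = (eps_ref_w - noise_w).pow(2).mean(dim=dims)
    err_ref_l = (eps_ref_l - noise_l).pow(2).mean(dim=dims)

    # The model is rewarded for lowering its error on the preferred sample
    # (relative to the reference) more than on the dispreferred one.
    margin = (err_w - err_ref_w) - (err_l - err_ref_l)
    return -F.logsigmoid(-beta * margin).mean()
```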

Researchers have applied this DPO-diffusion technique to improve the semantic match between a text-to-audio model's output audio and its input prompts. They fine-tuned Tango, a publicly available text-to-audio latent diffusion model, with a DPO-diffusion loss on a synthesized preference dataset called Audio-Alpaca. The dataset pairs text prompts with both preferred and dispreferred audio samples: the dispreferred audio exhibits issues such as excessive noise, missing concepts, or incorrect temporal ordering, while the preferred audio accurately reflects its written description.
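
To give a flavor of how dispreferred examples might be synthesized, the sketch below perturbs a prompt by dropping an event or reversing event order before audio is regenerated from it. The split on " followed by " and the random 50/50 choice are purely illustrative assumptions, not the actual procedure used to build Audio-Alpaca.

```python
import random

def perturb_prompt(prompt):
    """Illustrative prompt perturbations yielding 'dispreferred' audio when fed
    to the generator: drop a concept or swap the temporal order of events."""
    events = prompt.split(" followed by ")
    if len(events) >= 2 and random.random() < 0.5:
        events.reverse()                            # incorrect temporal ordering
    elif len(events) >= 2:
        events.pop(random.randrange(len(events)))   # missing concept
    return " followed by ".join(events)
```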

To cope with noisy preference pairs arising from automated synthesis, the researchers selected a subset of the data for DPO fine-tuning using criteria based on CLAP-score differentials: the preferred audio must be sufficiently close to the input prompt, and the gap between the preferred and dispreferred audio must exceed a minimum margin, as sketched below.
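
A minimal sketch of this filtering step, assuming CLAP similarities have already been computed for each pair; the field names and threshold values are hypothetical, not the exact ones used by the authors.

```python
def filter_preference_pairs(pairs, min_winner_score=0.45, min_gap=0.05):
    """Keep pairs whose preferred audio is close enough to the prompt
    (min_winner_score) and clearly better than the rejected audio (min_gap).

    Each pair is a dict with precomputed CLAP similarities, e.g.
      {"prompt": str, "winner_audio": str, "loser_audio": str,
       "winner_clap": float, "loser_clap": float}
    """
    kept = []
    for p in pairs:
        close_to_prompt = p["winner_clap"] >= min_winner_score
        clear_margin = (p["winner_clap"] - p["loser_clap"]) >= min_gap
        if close_to_prompt and clear_margin:
            kept.append(p)
    return kept
```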

The resulting model, Tango 2, delivered superior results, outperforming both Tango and AudioLDM2 in objective and subjective evaluations. It showed a stronger ability to map the semantics of the input prompt into the audio domain, demonstrating the efficacy of DPO fine-tuning.

The research team’s key contributions are a cost-effective methodology for semi-automatically building a preference dataset for text-to-audio generation; the resulting Audio-Alpaca dataset, released to the research community; evidence of Diffusion-DPO’s applicability to audio through Tango 2’s performance; and Tango 2’s superior results on both objective and subjective measures. Together, these point to the technique’s significant potential for improving text-to-audio models and its utility in audio-generation tasks.
