Text-to-music generation with diffusion models faces a significant hurdle: controlling pre-trained models at inference time. These models, while efficient, are complex and often require careful tuning to produce stylized musical outputs, a problem that is most pronounced in demanding audio tasks.
Music generation research has advanced rapidly in recent years. While language-model-based methods generate audio token by token, diffusion models synthesize frequency-domain audio representations. Text prompts, the most common control signal for diffusion models, offer only coarse control; finer control typically requires fine-tuning existing models or introducing external rewards, and pre-trained classifiers used for guidance are limited in both expressiveness and efficiency. These challenges motivate a method for precise yet efficient controllable music generation.
Researchers from the University of California, San Diego, and Adobe Research have proposed the Diffusion Inference-Time T-Optimization (DITTO) framework. The method steers pre-trained text-to-music diffusion models toward specific, stylized outputs by optimizing the initial noise latents at inference time, using gradient checkpointing for memory efficiency. DITTO can be applied across a range of music generation tasks.
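To make the idea concrete, below is a minimal PyTorch sketch of inference-time optimization of the initial noise latent with gradient checkpointing, in the spirit of DITTO. The model interface (`model.denoise_step`), latent shape, and target loss are hypothetical placeholders, not the authors' actual implementation.

```python
import torch
from torch.utils.checkpoint import checkpoint

def optimize_initial_latent(model, target_loss_fn, num_steps=50, lr=0.1,
                            latent_shape=(1, 8, 256, 16), sampler_steps=20):
    """Optimize the initial noise latent so the sampled output minimizes a
    differentiable control objective (e.g., matching an intensity curve).
    `model.denoise_step` is an assumed interface for one reverse-diffusion step."""
    x_T = torch.randn(latent_shape, requires_grad=True)   # initial noise latent
    optimizer = torch.optim.Adam([x_T], lr=lr)

    for _ in range(num_steps):
        optimizer.zero_grad()
        x = x_T
        # Run the frozen diffusion sampler; gradient checkpointing trades
        # compute for memory so we can backpropagate through every step.
        for t in reversed(range(sampler_steps)):
            t_tensor = torch.full((x.shape[0],), t, dtype=torch.long)
            x = checkpoint(model.denoise_step, x, t_tensor, use_reentrant=False)
        loss = target_loss_fn(x)   # control objective on the final sample
        loss.backward()            # gradients flow back only to x_T
        optimizer.step()

    return x_T.detach()
```

The key design choice mirrored here is that only the noise latent is optimized while the pre-trained model stays frozen, so no retraining or fine-tuning of the diffusion weights is required.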
To build and evaluate DITTO, the researchers used a dataset of 1,800 hours of licensed instrumental music tagged by genre, mood, and tempo. The study also drew on handcrafted intensity curves, musical structure matrices, and the Wikifonia Lead-Sheet Dataset for melody control.
DITTO's performance was measured on the MusicCaps dataset, which consists of 5K clips paired with text descriptions. The Fréchet Audio Distance (FAD) with a VGGish backbone and the CLAP score were used to compare the generated music against reference recordings and text captions. The results show DITTO outperforming methods such as MultiDiffusion, FreeDoM, and Music ControlNet in controllability, audio quality, and computational efficiency.
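For reference, here is a hedged sketch of how FAD is typically computed: fit a Gaussian to the VGGish embeddings of the reference and generated sets and take the Fréchet distance between the two Gaussians. Embedding extraction is assumed to have happened elsewhere, and this is not the authors' exact evaluation code.

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """real_emb, gen_emb: arrays of shape (num_clips, embedding_dim)."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)

    # Fréchet distance between two Gaussians:
    # ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 * (cov_r @ cov_g)^(1/2))
    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # drop tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Lower FAD indicates generated audio whose embedding statistics are closer to those of the reference recordings.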
DITTO advances text-to-music generation by introducing a flexible and efficient way to control pre-trained diffusion models for creating stylized music. Because it can steer outputs without full retraining or large datasets, it marks a notable step forward in music generation technology.