Instruct-MusicGen, a new method for text-to-music editing, has been introduced by researchers from C4DM, Queen Mary University of London, Sony AI, and Music X Lab, MBZUAI. The approach targets the shortcomings of existing models, which demand significant resources yet fail to deliver precise results. Instruct-MusicGen leverages a pre-trained model and an efficient fine-tuning scheme to achieve high-quality music editing driven by textual instructions.
Current text-to-music editing methods typically require building and training task-specific models from scratch, an inefficient and resource-heavy process, and they often yield imprecise audio reconstructions. To address these shortcomings, the researchers introduced Instruct-MusicGen, which fine-tunes a pre-trained MusicGen model to follow editing instructions efficiently.
The approach incorporates a text fusion module and an audio fusion module into the original MusicGen architecture, enabling the model to process instruction text and audio input simultaneously. With these modifications, Instruct-MusicGen avoids training from scratch and keeps the number of additional parameters small while maintaining strong performance across various music editing tasks.
Instruct-MusicGen extends the original MusicGen model with two new modules. The audio fusion module lets the model accept and process an external audio input, so an existing track can be edited accurately. The text fusion module changes how the text encoder's output is used, so the model can interpret editing instructions rather than only descriptive captions. Together, these modules equip Instruct-MusicGen to add, remove, and separate stems from music audio based on textual instructions (see the sketch below).
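To make the two modules concrete, here is a minimal PyTorch sketch of how such fusion modules could be wired into a decoder. The module names, dimensions, and the cross-attention/projection design are illustrative assumptions for clarity, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AudioFusionModule(nn.Module):
    """Fuses a conditioning audio stream into the decoder's hidden states
    via cross-attention, so the model can edit an existing track.
    (Illustrative sketch; dimensions and wiring are assumptions.)"""

    def __init__(self, d_model: int = 1024, n_heads: int = 16):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor, cond_audio: torch.Tensor) -> torch.Tensor:
        # hidden:     (batch, seq, d_model) decoder states for the audio being generated
        # cond_audio: (batch, cond_seq, d_model) embedded tokens of the input audio
        attended, _ = self.cross_attn(hidden, cond_audio, cond_audio)
        return self.norm(hidden + attended)  # residual fusion


class TextFusionModule(nn.Module):
    """Projects instruction-text embeddings into the decoder's space so that
    short editing commands, rather than full captions, condition generation."""

    def __init__(self, d_text: int = 768, d_model: int = 1024):
        super().__init__()
        self.proj = nn.Linear(d_text, d_model)

    def forward(self, instruction_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(instruction_emb)
```

In a setup like this, the pre-trained MusicGen weights can stay largely frozen while only the lightweight fusion modules are fine-tuned, which is what keeps the added parameter count small.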
The model was trained on a dataset synthesized from Slakh2100, a collection that pairs high-quality synthesized audio tracks with their corresponding MIDI files. Fine-tuning introduced only 8% additional parameters compared to the original MusicGen model, significantly reducing the resources required. Instruct-MusicGen was evaluated on both the Slakh test set and the MoisesDB dataset, where it outperformed existing baselines on several tasks, demonstrating its efficiency and effectiveness in text-to-music editing.
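Since the paper's exact data pipeline is not reproduced here, the following sketch shows one plausible way to synthesize (input audio, instruction, target audio) triplets from multi-stem material such as Slakh2100. The stem names, instruction strings, and task-sampling scheme are all assumptions made for illustration.

```python
import random
import torch

STEMS = ["bass", "drums", "guitar", "piano"]  # illustrative stem names

def make_edit_triplet(stems: dict[str, torch.Tensor]):
    """Build one (input_audio, instruction, target_audio) training example
    from a dict of equal-length stem waveforms. The three edit types mirror
    the add/remove/separate tasks described above; wording is assumed."""
    assert len(stems) >= 2, "need at least two stems to form an edit pair"
    target = random.choice(list(stems))
    mix_without = torch.stack(
        [wav for name, wav in stems.items() if name != target]
    ).sum(dim=0)
    full_mix = mix_without + stems[target]

    task = random.choice(["add", "remove", "separate"])
    if task == "add":
        return mix_without, f"Add {target}", full_mix
    if task == "remove":
        return full_mix, f"Remove {target}", mix_without
    return full_mix, f"Separate {target}", stems[target]
```

Triplets of this form let the frozen MusicGen backbone learn, through the small fusion modules alone, to map an input track plus an instruction to the edited output.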
In summary, Instruct-MusicGen addresses the issues associated with existing text-to-music editing techniques by harnessing a pre-trained model and an efficient fine-tuning strategy. Although some limitations remain, such as the reliance on synthetic training data and limited signal-level precision, Instruct-MusicGen marks a significant advance in AI-assisted music creation, combining efficiency with high performance.