Multi-modal generative models combine data formats such as text, images, and video to power artificial intelligence (AI) applications across many fields. Optimizing them remains difficult, however, largely because data development and model development tend to proceed separately. Current work typically either refines model architectures and algorithms or advances data processing techniques, which leaves little room for optimizing the two together.
To address these issues, researchers from Alibaba Group have introduced the Data-Juicer Sandbox, an open-source suite. This platform bridges the gap between data processing and model training and facilitates their co-development by providing customizable components, thereby enabling systematic exploration and optimization.
The sandbox implements a “Probe-Analyze-Refine” workflow for testing and refining data processing operators (OPs) and model configurations. The researchers organize data pools into a hierarchical data pyramid, ranking each pool by the metric scores of its corresponding model. This stratification helps identify effective OPs, which are then combined into data recipes and scaled up. Keeping hyperparameters consistent across experiments and using cost-effective strategies keeps the process efficient and resource-conscious.
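To make the workflow concrete, here is a minimal Python sketch of a probe, analyze, refine loop. It is an illustration under stated assumptions, not the Data-Juicer Sandbox API: the function names (`probe`, `build_pyramid`, `refine`), the three tier labels, the toy operators, and the `toy_evaluate` scorer are all hypothetical. In a real setting the scorer would train a model on each data pool and report a benchmark metric; here it simply averages a per-sample `quality` field so the example runs on its own.

```python
from typing import Callable, Dict, List

# Hypothetical types: a sample is a dict of numeric stats, an OP filters a pool.
Sample = Dict[str, float]
Operator = Callable[[List[Sample]], List[Sample]]
Evaluator = Callable[[List[Sample]], float]


def probe(pool: List[Sample], ops: Dict[str, Operator],
          evaluate: Evaluator) -> Dict[str, float]:
    """Probe: apply each OP on its own and score the resulting data pool."""
    return {name: evaluate(op(pool)) for name, op in ops.items()}


def build_pyramid(scores: Dict[str, float],
                  baseline: float) -> Dict[str, List[str]]:
    """Analyze: rank OPs by score and stratify them into tiers.

    The tier names are an assumption made for this sketch.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    helpful = [name for name in ranked if scores[name] > baseline]
    return {
        "top": helpful[:1],      # best single OP
        "middle": helpful[1:],   # other OPs that beat the baseline
        "bottom": [n for n in ranked if scores[n] <= baseline],
    }


def refine(pool: List[Sample], ops: Dict[str, Operator],
           recipe: List[str]) -> List[Sample]:
    """Refine: chain the selected OPs into one data recipe."""
    for name in recipe:
        pool = ops[name](pool)
    return pool


def toy_evaluate(pool: List[Sample]) -> float:
    """Stand-in for 'train a model and measure it': mean sample quality."""
    if not pool:
        return 0.0
    return sum(s["quality"] for s in pool) / len(pool)


# Toy data pool and toy OPs (all values are made up for illustration).
pool = [
    {"quality": 0.9, "length": 40.0},
    {"quality": 0.2, "length": 5.0},
    {"quality": 0.7, "length": 30.0},
    {"quality": 0.4, "length": 80.0},
]
ops = {
    "min_length": lambda p: [s for s in p if s["length"] >= 10.0],
    "max_length": lambda p: [s for s in p if s["length"] <= 50.0],
    "identity": lambda p: list(p),
}

baseline = toy_evaluate(pool)
scores = probe(pool, ops, toy_evaluate)
pyramid = build_pyramid(scores, baseline)
recipe = pyramid["top"] + pyramid["middle"]
refined = refine(pool, ops, recipe)

print(pyramid)
print(recipe, len(refined))
```

The sketch evaluates every OP with the same scorer and settings, which mirrors the idea of holding hyperparameters fixed so that score differences can be attributed to the data. OPs that do not beat the baseline land in the bottom tier and are left out of the combined recipe.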
The Data-Juicer Sandbox has demonstrated its efficacy on several tasks. In image-to-text generation, average performance across TextVQA, MMBench, and MME increased by 7.13%. In text-to-video generation, the sandbox, used with the EasyAnimate model, held the top spot on the VBench leaderboard, outperforming strong competitors.
The sandbox was applied in two practical scenarios: image-to-text and text-to-video generation. For the image-to-text task, the Mini-Gemini model in the sandbox demonstrated solid performance in understanding image content. For the text-to-video task, the EasyAnimate model showed its capacity to generate high-quality videos from text descriptions.
In conclusion, the Data-Juicer Sandbox from Alibaba is a significant step forward for multi-modal AI. It offers a comprehensive approach to optimizing multi-modal generative models by systematically integrating data processing with model training. The open-source platform has shown versatility and efficiency in the co-development of multi-modal data and generative models, and it paves the way for further gains in AI performance.