Visual Language Models (VLMs) are powerful tools for jointly processing visual and textual data, but their performance is often constrained by limited data availability. Recent research has shown that pre-training these models on larger image-text datasets improves their performance on downstream tasks. However, building such datasets is challenging because of the scarcity of paired data, high curation costs, low diversity, and the noisiness of internet-sourced data.
To overcome these obstacles, researchers from Google DeepMind have proposed a new method called Synth2. The approach uses pre-trained generative text and image models to create synthetic paired data for VLMs, directly addressing the hurdles of data scarcity, curation cost, and noise. It synthesizes both the captions and the images, reducing reliance on real-world paired data. Crucially, it operates at the image embedding level, which sidesteps resource-heavy pixel-space rendering and improves efficiency without sacrificing performance. The text-to-image model is pre-trained on the same dataset used for VLM training to ensure fair evaluation and prevent unintended knowledge transfer.
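The pipeline can be pictured as a two-step generation loop: an LLM produces diverse captions, and a text-to-image generator turns each caption into a latent image embedding that is fed straight to the VLM. The sketch below is purely illustrative; the object interfaces (`llm.generate`, `text_to_image.encode_to_embedding`) are hypothetical stand-ins, not Synth2's actual API.

```python
# Illustrative sketch of embedding-level synthetic pair generation.
# All interfaces here are assumptions for exposition, not the released code.

from dataclasses import dataclass
from typing import List


@dataclass
class SyntheticPair:
    caption: str
    image_embedding: list  # latent image embedding, never rendered to pixels


def generate_captions(llm, class_names: List[str], per_class: int) -> List[str]:
    """Prompt the LLM with class names to obtain diverse synthetic captions."""
    captions = []
    for name in class_names:
        prompt = f"Write a short, descriptive caption for a photo of a {name}."
        captions.extend(llm.generate(prompt, num_samples=per_class))
    return captions


def synthesize_pairs(llm, text_to_image, class_names, per_class=4):
    """Create caption/image-embedding pairs without decoding to pixel space."""
    pairs = []
    for caption in generate_captions(llm, class_names, per_class):
        # The generator's latent representation is consumed directly by the VLM,
        # skipping the costly decode-to-pixels and re-encode round trip.
        emb = text_to_image.encode_to_embedding(caption)
        pairs.append(SyntheticPair(caption=caption, image_embedding=emb))
    return pairs
```

Working at the embedding level is the key efficiency choice: the expensive pixel-space rendering step is skipped entirely, and the VLM trains on the compact latent representation instead.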
Synth2’s VLM architecture is designed to consume synthetic image embeddings directly, bypassing pixel-space processing and allowing seamless training. It employs large language models (LLMs) with class-based prompting to produce diverse captions, and a controlled text-to-image generator trained on the same dataset as the VLM. Additionally, a Perceiver Resampler component performs cross-attention between image tokens and language tokens within the VLM, enabling effective multimodal representations.
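To make the Perceiver Resampler idea concrete, the following PyTorch snippet shows a deliberately simplified, single-layer version: learned latent queries cross-attend to a variable-length set of image embeddings and compress them into a fixed number of visual tokens. The dimensions and layer structure are illustrative assumptions, not the exact configuration used in Synth2.

```python
import torch
import torch.nn as nn


class PerceiverResampler(nn.Module):
    """Compress a variable number of image embeddings into a fixed set of
    latent visual tokens via cross-attention (simplified, single layer)."""

    def __init__(self, dim: int = 512, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        # Learned latent queries; their count fixes the number of visual tokens.
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, image_embeddings: torch.Tensor) -> torch.Tensor:
        # image_embeddings: (batch, num_image_tokens, dim)
        batch = image_embeddings.shape[0]
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        # Latent queries attend over the image embeddings.
        attended, _ = self.cross_attn(queries, image_embeddings, image_embeddings)
        return attended + self.ffn(attended)  # (batch, num_latents, dim)
```

The resulting fixed-size visual tokens are what the language model's layers attend to alongside its text tokens, which is how the synthetic image embeddings are folded into multimodal training.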
When synthetic images were used for VLM training, Synth2 significantly outperformed the baselines, even while using fewer human-annotated images. It also outperformed state-of-the-art methods such as ITIT and DC, achieving competitive results with reduced data usage and computational resources. These results point to Synth2 as both an effective and an efficient way to improve VLM training.
In summary, Synth2 uses synthetic image-text pairs to boost VLM training and overall performance. It offers greater data efficiency and scalability than traditional methods, along with the ability to customize data generation for specific domains. Its success underlines the potential of synthetic data generation to advance visual-language understanding and suggests promising directions for future work. All credit for this research goes to the Google DeepMind team behind Synth2.