Visual Language Models (VLMs) have proven instrumental in tasks such as image captioning and visual question answering. However, training these models is often hampered by data scarcity, high curation costs, limited diversity, and noisy internet-sourced data. To address these challenges, researchers from Google DeepMind have introduced Synth2, a method that generates synthetic paired image-text data to improve VLM training.
Synth2 leverages pre-trained generative text and image models to produce synthetic image-text pairs, reducing reliance on manually curated real-world data. Because the method operates at the embedding level, it sidesteps time- and resource-intensive pixel-space rendering without compromising performance or efficiency. By integrating these generative models into VLM training, Synth2 aims to deliver better results than existing approaches.
One of the key features of Synth2 is its use of pre-trained generative text and image models, which supply a diverse and ample stream of training data. By training the text-to-image model on the same dataset used for VLM training, the researchers ensure a fair evaluation and prevent unintended knowledge transfer. For efficiency, the generated image embeddings are fed directly into the VLM, eliminating the need for additional processing stages.
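As a rough illustration of this embedding-level shortcut, the sketch below contrasts the conventional route (render pixels, then re-encode them for the VLM) with feeding the generator's image-token embeddings straight into the VLM. All module names and dimensions here are illustrative placeholders, not the actual Synth2 components.

```python
import torch
from torch import nn

# Toy stand-ins for the real components; names and sizes are illustrative only.
text_to_image = nn.Linear(512, 256 * 512)   # caption embedding -> 256 image-token embeddings
image_decoder = nn.Identity()                # would render pixels in the conventional route
image_encoder = nn.Identity()                # would re-encode pixels back into embeddings
vlm_visual_in = nn.Linear(512, 512)          # the VLM's visual input projection

caption_emb = torch.randn(4, 512)            # a batch of synthetic caption embeddings

# Conventional route (skipped here): decode tokens to pixels, then re-encode the pixels.
# pixels = image_decoder(generated_image); visual_tokens = image_encoder(pixels)

# Embedding-level route: pass the generator's token embeddings directly to the VLM.
image_tokens = text_to_image(caption_emb).view(4, 256, 512)
visual_tokens = vlm_visual_in(image_tokens)  # ready for cross-attention inside the VLM
print(visual_tokens.shape)                   # torch.Size([4, 256, 512])
```

Skipping the decode-to-pixels and re-encode steps is what removes the most expensive part of synthetic data generation from the training loop.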
Synth2 includes two pivotal components for data creation: caption generation and image generation. The former uses large language models with class-based prompting to produce diverse captions covering a wide range of visual concepts. The latter employs a text-to-image generator trained on the same dataset as the VLM to keep the evaluation fair.
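The following is a minimal sketch of class-based caption prompting, assuming a generic Hugging Face text-generation pipeline; the GPT-2 checkpoint, the prompt template, and the `captions_for_class` helper are stand-ins for illustration, not the prompting setup used in the paper.

```python
from transformers import pipeline

# Illustrative stand-in model; any instruction-following LLM could fill this role.
generator = pipeline("text-generation", model="gpt2")

def captions_for_class(class_name: str, n: int = 3) -> list[str]:
    """Prompt a language model to produce several diverse captions mentioning `class_name`."""
    prompt = f"A photo caption describing a {class_name}:"
    outputs = generator(prompt, max_new_tokens=20, num_return_sequences=n, do_sample=True)
    # Strip the prompt so only the generated caption text remains.
    return [o["generated_text"][len(prompt):].strip() for o in outputs]

for caption in captions_for_class("golden retriever"):
    print(caption)
```

Sampling several completions per class name is one simple way to get caption diversity before the captions are handed to the text-to-image generator.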
Built on VQ-GAN backbones, the Synth2 VLM architecture consumes synthetically generated image embeddings directly, bypassing pixel-space processing and enabling efficient training. A Perceiver Resampler component then mediates cross-attention between the VQ tokens and language tokens in the VLM, producing effective multimodal representations.
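Below is a minimal PyTorch sketch of how a Perceiver-style resampler might compress VQ-token embeddings into a fixed set of visual tokens that language tokens then cross-attend to; the layer sizes, latent count, and module structure are assumptions for illustration, not the paper's exact architecture.

```python
import torch
from torch import nn

class PerceiverResampler(nn.Module):
    """Learned latent queries cross-attend to VQ-token embeddings, compressing a
    variable number of image tokens into a fixed set of visual tokens."""
    def __init__(self, dim: int = 512, n_latents: int = 64, n_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, vq_tokens: torch.Tensor) -> torch.Tensor:  # (B, N, dim)
        queries = self.latents.unsqueeze(0).expand(vq_tokens.size(0), -1, -1)
        attended, _ = self.cross_attn(queries, vq_tokens, vq_tokens)
        return attended + self.ffn(attended)                      # (B, n_latents, dim)

# Synthetic VQ-token embeddings standing in for the text-to-image generator's output.
vq_tokens = torch.randn(2, 256, 512)
visual_tokens = PerceiverResampler()(vq_tokens)

# Inside the VLM, language tokens would then cross-attend to these visual tokens.
language_tokens = torch.randn(2, 32, 512)
lang_to_vision = nn.MultiheadAttention(512, 8, batch_first=True)
fused, _ = lang_to_vision(language_tokens, visual_tokens, visual_tokens)
print(fused.shape)  # torch.Size([2, 32, 512])
```

Compressing a variable-length grid of image tokens into a small, fixed number of latents keeps the cost of language-to-vision cross-attention constant regardless of how many VQ tokens the generator produces.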
In comparison to alternative methods such as ITIT and DC, Synth2 delivers stronger results. Performance improved even with far fewer human-annotated images, suggesting that synthetic images can serve as a viable substitute. This finding is especially significant given the method's reduced demands on data and computational resources.
In conclusion, researchers from Google DeepMind have presented Synth2 as a promising way to boost VLM performance. By offering improved data efficiency and scalability, along with the possibility of customization for specific domains, the method tackles the challenges of resource-intensive data acquisition. These findings highlight the potential of synthetic data in visual language understanding and open new directions for research.