Google DeepMind researchers have unveiled PaliGemma, a new model that pushes forward the evolution of vision-language models. It combines the strengths of the PaLI vision-language model series and the Gemma family of language models: a roughly 400M-parameter SigLIP vision encoder is paired with a 2B-parameter Gemma language model, yielding a sub-3B vision-language model whose performance is comparable to much larger models.
The architecture of PaliGemma consists of a SigLIP So400m (shape-optimized ViT) image encoder, a Gemma-2B v1.0 decoder-only language model, and a linear projection layer connecting the two. This design lets the model handle a variety of tasks, such as image classification, captioning, and visual question answering, through a single image-plus-text in, text out interface.
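To make that wiring concrete, here is a minimal sketch of how such a design could be put together. The encoder and decoder below are stand-in modules, and the feature widths (1152 for the vision features, 2048 for the language-model embeddings) are illustrative assumptions rather than details taken from this article:

```python
import torch
import torch.nn as nn

class PaliGemmaSketch(nn.Module):
    """Sketch of PaliGemma-style wiring: image tokens from a ViT encoder are
    linearly projected into the language model's embedding space and prepended
    to the text tokens before decoding."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int = 1152, lm_dim: int = 2048):
        super().__init__()
        self.vision_encoder = vision_encoder   # stand-in for the SigLIP encoder
        self.language_model = language_model   # stand-in for the Gemma decoder
        # Single linear layer mapping vision features to the LM embedding width.
        self.projection = nn.Linear(vision_dim, lm_dim)

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # images -> (batch, num_patches, vision_dim) soft tokens
        image_feats = self.vision_encoder(images)
        image_tokens = self.projection(image_feats)
        # Concatenate image tokens in front of the text tokens ("image + text in").
        prefix = torch.cat([image_tokens, text_embeds], dim=1)
        # The decoder then produces the output text ("text out").
        return self.language_model(prefix)

if __name__ == "__main__":
    # Quick shape check with fake modules (not real SigLIP / Gemma weights).
    class FakeEncoder(nn.Module):
        def forward(self, images):
            return torch.randn(images.shape[0], 256, 1152)  # 256 dummy patch tokens

    model = PaliGemmaSketch(FakeEncoder(), nn.Identity())
    out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 10, 2048))
    print(out.shape)  # (2, 266, 2048): 256 image tokens + 10 text tokens
```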
PaliGemma is trained in several stages, starting with unimodal pretraining of each component. Multimodal pretraining then follows on a diverse mixture of tasks, with the image encoder kept unfrozen to improve spatial and relational understanding. The model next goes through a resolution-increase stage so it can better handle high-resolution images, and finally a transfer stage that adapts it to specific downstream tasks.
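The staged recipe can be summarized as a simple schedule. The resolutions and freezing choices below are illustrative assumptions meant to mirror the stages described above, not exact values from this article:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    resolution: int              # input image resolution used in this stage (assumed)
    image_encoder_frozen: bool
    description: str

# Illustrative training schedule following the stages described above.
SCHEDULE = [
    Stage("unimodal_pretraining", 224, True,
          "Start from separately pretrained vision and language checkpoints."),
    Stage("multimodal_pretraining", 224, False,
          "Train on a broad task mixture with the image encoder unfrozen."),
    Stage("resolution_increase", 448, False,
          "Continue training at higher resolution for detail-heavy inputs."),
    Stage("transfer", 224, False,
          "Fine-tune the resulting base model on each downstream task."),
]

if __name__ == "__main__":
    for stage in SCHEDULE:
        print(f"{stage.name}: {stage.resolution}px, "
              f"encoder frozen={stage.image_encoder_frozen} - {stage.description}")
```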
PaliGemma records impressive performance across a wide range of vision-language tasks: excellent results on image-captioning benchmarks such as COCO-Captions and TextCaps, and solid performance on visual question answering datasets such as VQAv2, GQA, and ScienceQA. It also performs well on specialized tasks, including chart understanding and OCR-related tasks.
With PaliGemma, the researchers show that simple square resizing performs on par with more complex aspect-ratio-preserving techniques, even for segmentation tasks (see the sketch below). They also introduce CountBenchQA, a new dataset that addresses TallyQA's limitations for assessing VLMs' counting abilities. Interestingly, PaliGemma demonstrates zero-shot generalization to 3D renders from Objaverse without any previous training on such data, showcasing the model's versatility.
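As a rough illustration of how uncomplicated the square-resizing baseline is, here is a sketch; the 224-pixel target size and bilinear interpolation are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def square_resize(image: torch.Tensor, size: int = 224) -> torch.Tensor:
    """Resize a (C, H, W) image tensor to a fixed square, ignoring aspect ratio."""
    resized = F.interpolate(image.unsqueeze(0), size=(size, size),
                            mode="bilinear", align_corners=False)
    return resized.squeeze(0)

if __name__ == "__main__":
    img = torch.rand(3, 480, 640)        # a non-square dummy image
    print(square_resize(img).shape)      # torch.Size([3, 224, 224])
```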
This research makes an important contribution to vision-language understanding by demonstrating that smaller models like PaliGemma can achieve state-of-the-art performance across a wide range of tasks. The researchers plan to release the base model to enable further study of instruction tuning and specific applications, challenging the prevailing belief that larger models are always superior. The hope is that PaliGemma will serve as a stepping stone toward more efficient and versatile AI systems.