DeepMind researchers have developed an open vision-language model called PaliGemma, blending the strengths of the PaLI series of vision-language models with the Gemma family of language models. The model pairs a 400 million-parameter SigLIP vision encoder with a 2 billion-parameter Gemma language model, producing a compact vision-language model that competes with much larger predecessors such as PaLI-X, PaLM-E, and PaLI-3. The Gemma component, built from the same research and technology behind the Gemini models, serves as PaliGemma’s autoregressive, decoder-only text decoder.
The architecture of PaliGemma consists of a SigLIP-So400m image encoder, a Gemma-2B v1.0 decoder-only language model, and a linear projection layer. The image encoder converts an input image into a sequence of tokens, while the language model embeds the input text using its SentencePiece tokenizer. The projection layer maps the image tokens to the same embedding dimension as the text tokens so the two sequences can be concatenated. This simple yet effective design lets PaliGemma perform image classification, captioning, and visual question answering through a flexible image+text-in, text-out API.
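As a rough illustration of that wiring, the following PyTorch-style sketch shows image tokens being projected to the text embedding width and prefixed to the text embeddings before the decoder runs. The class name, component interfaces, and dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PaliGemmaSketch(nn.Module):
    """Illustrative wiring only: image tokens are projected to the text
    embedding width and concatenated in front of the text embeddings."""

    def __init__(self, vision_encoder, language_model,
                 vision_dim=1152, text_dim=2048):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a SigLIP ViT (assumed interface)
        self.language_model = language_model   # e.g. a Gemma decoder (assumed interface)
        # Linear projection aligning image-token width with text-token width.
        self.projector = nn.Linear(vision_dim, text_dim)

    def forward(self, pixel_values, text_embeddings):
        # (batch, num_image_tokens, vision_dim)
        image_tokens = self.vision_encoder(pixel_values)
        # (batch, num_image_tokens, text_dim)
        image_embeddings = self.projector(image_tokens)
        # Prefix the projected image tokens to the text embeddings; the
        # decoder-only language model then generates the output text.
        inputs = torch.cat([image_embeddings, text_embeddings], dim=1)
        return self.language_model(inputs_embeds=inputs)
```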
PaliGemma is trained in several stages to build thorough vision-language understanding. The process begins with unimodal pretraining of the individual components, followed by multimodal pretraining on a varied mixture of tasks. Notably, the image encoder is not frozen during this stage, which allows the model to develop stronger spatial and relational understanding. Training continues with a resolution-increase stage that improves PaliGemma’s ability to handle high-resolution images and more complex tasks. Finally, a transfer stage adapts the base model to specific tasks or use cases, demonstrating PaliGemma’s versatility and effectiveness across a range of applications.
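One compact way to summarize that recipe is as plain data, as in the sketch below. The stage names and descriptions are paraphrased from the outline above; apart from the published resolutions, any values are illustrative rather than the authors' actual configuration.

```python
# Hypothetical summary of the staged training recipe as plain data.
TRAINING_STAGES = [
    {
        "stage": "unimodal_pretraining",
        "note": "start from publicly pretrained SigLIP and Gemma checkpoints",
    },
    {
        "stage": "multimodal_pretraining",
        "resolution": 224,
        "freeze_image_encoder": False,  # the image encoder stays trainable
        "data": "broad mixture of vision-language tasks",
    },
    {
        "stage": "resolution_increase",
        "resolutions": [448, 896],
        "data": "similar mixture, emphasizing detail- and text-heavy inputs",
    },
    {
        "stage": "transfer",
        "data": "a single downstream task, e.g. captioning, VQA, or segmentation",
    },
]

for stage in TRAINING_STAGES:
    print(stage["stage"], {k: v for k, v in stage.items() if k != "stage"})
```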
PaliGemma excels at image captioning and performs strongly on visual question answering. It can handle complex tasks such as chart understanding and OCR-related tasks, and it improves markedly when the image resolution is increased from 224px to 448px and 896px, particularly for tasks involving fine detail or text recognition. It can also handle video inputs and image segmentation.
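For readers who want to try the image+text-in, text-out interface, a minimal inference sketch using the Hugging Face transformers integration might look like the following. The checkpoint name, prompt convention (some library versions expect an explicit `<image>` placeholder in the prompt), and image URL are assumptions, not details from the paper.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-448"  # assumed checkpoint name
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image URL and prompt; "answer en ..." is an assumed task prefix.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)
prompt = "answer en What is the highest value shown in the chart?"

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=20)

# Decode only the newly generated tokens, skipping the prompt.
answer = processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                          skip_special_tokens=True)
print(answer)
```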
Some notable findings from the PaliGemma research include the observation that simple square resizing performs as well as more complicated aspect-ratio-preserving techniques for segmentation tasks, and that image annotations, such as red boxes drawn on the image, are as effective as textual cues for indicating which widget should be captioned. Additionally, PaliGemma shows unexpected zero-shot generalization to 3D renders from Objaverse without specific training, and achieves state-of-the-art performance on MMVP, significantly outperforming larger models such as GPT-4V and Gemini.
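Both preprocessing observations are easy to picture with a few lines of Pillow; the file names, target resolution, and box coordinates below are placeholders for illustration.

```python
from PIL import Image, ImageDraw

image = Image.open("screenshot.png")  # placeholder input file

# Plain square resize: stretch to the model's input resolution (here 448x448)
# without preserving the aspect ratio.
resized = image.resize((448, 448))

# Mark a widget by drawing a red box directly on the pixels instead of
# describing its coordinates in the text prompt.
draw = ImageDraw.Draw(resized)
draw.rectangle([(100, 100), (220, 160)], outline="red", width=3)
resized.save("widget_marked.png")
```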
PaliGemma is an advanced yet compact open base VLM that excels at transfer learning across a range of tasks. The research demonstrates that smaller VLMs can achieve state-of-the-art performance on diverse benchmarks, challenging the idea that larger models are inherently superior. The base model’s release aims to provide a platform for further studies in instruction tuning and specific applications, potentially opening new pathways toward more efficient and flexible AI systems for vision-language understanding.