Google has unveiled PaliGemma, a new family of vision-language models. These models take both an image and a text prompt as input and generate text as output. The architecture comprises two components: an image encoder, SigLIP-So400m, and a text decoder, Gemma-2B. SigLIP, like CLIP, is a jointly trained image and text encoder and can understand both modalities, while Gemma is a decoder-only model for text generation. PaliGemma connects SigLIP’s image encoder to Gemma through a linear adapter, and the combined model functions as a capable vision-language model.
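To make the image-plus-text-in, text-out flow concrete, here is a minimal sketch using the Hugging Face Transformers integration. The checkpoint name, image URL, and prompt are illustrative assumptions, not part of the announcement.

```python
# Minimal sketch: image + text in, text out (checkpoint id and inputs are assumptions)
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # assumed mix checkpoint at 224x224
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
prompt = "caption en"  # single-turn task prefix: caption the image in English

# The processor runs SigLIP's image preprocessing and tokenizes the text;
# the model projects the image embeddings through the linear adapter into Gemma.
inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0], skip_special_tokens=True))
```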
The PaliGemma models were trained with big_vision, the codebase previously used to develop a range of models including CapPa, SigLIP, LiT, BiT, and the original ViT. This release offers three types of checkpoints: PT, mix, and FT. PT checkpoints are pretrained models that can be fine-tuned for a wide variety of downstream tasks. Mix checkpoints are PT models fine-tuned on a mixture of tasks; they are suited to general-purpose inference with free-text prompts and are intended for research use. Finally, FT checkpoints are models fine-tuned on individual academic benchmarks and are available in several resolutions.
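The three checkpoint families map onto differently named repositories on the Hub. The ids below are illustrative assumptions of that naming scheme; the exact repositories vary by task and resolution.

```python
# Assumed repository ids illustrating the three checkpoint families
CHECKPOINTS = {
    "pt":  "google/paligemma-3b-pt-224",        # pretrained, meant to be fine-tuned
    "mix": "google/paligemma-3b-mix-224",       # fine-tuned on a mixture of tasks
    "ft":  "google/paligemma-3b-ft-vqav2-224",  # fine-tuned on one benchmark (here VQAv2)
}
```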
Each PaliGemma model is published in its own repository, named after the task and resolution it targets. The models are available at three precision levels (bfloat16, float16, and float32) and three resolutions (224×224, 448×448, and 896×896), so users can pick whichever variant matches their needs and hardware. Note that the higher-resolution versions produce much longer input sequences and therefore consume considerably more memory, which can make them unsuitable for users with limited resources.
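Choosing a resolution and precision amounts to choosing a repository and a dtype when loading the model. The snippet below is a sketch; the repository id, the per-precision revision names, and the device placement are assumptions based on common Hub conventions.

```python
# Sketch of selecting resolution and precision (repo id and revision are assumptions)
import torch
from transformers import PaliGemmaForConditionalGeneration

model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-mix-448",  # 448x448 variant: more detail, longer image sequence
    torch_dtype=torch.bfloat16,     # half precision roughly halves memory vs. float32
    revision="bfloat16",            # assumed per-precision branch on the repository
    device_map="auto",
)
```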
Despite its rich capabilities, PaliGemma is not a jack-of-all-trades. It is a single-turn vision-language model that performs best when fine-tuned for a specific use case, and it is not designed for conversational use. Users can, however, steer the model toward a particular task by prefixing the prompt with a task keyword.
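The prompts below illustrate this prefixing style. The exact strings are assumptions based on the documented single-turn prompting conventions and may vary by checkpoint.

```python
# Illustrative task prefixes (phrasing is an assumption, not an exhaustive list)
prompts = [
    "caption en",                       # image captioning in English
    "answer en what is on the table?",  # visual question answering
    "detect cat ; dog",                 # object detection for the listed classes
    "segment cat",                      # entity segmentation
    "ocr",                              # read text from the image
]
```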
PaliGemma’s capabilities include captioning photos, answering questions about images, detecting objects in images, segmenting entities within images, and in-depth document understanding. It handles many tasks out of the box but needs fine-tuning to excel at a specific one, which is why the release includes the ‘mix’ family of models, fine-tuned on a mixture of tasks.
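For structured tasks such as detection, the model emits its answer as text containing location tokens that must be decoded back into coordinates. The sketch below assumes the commonly described format of four <locNNNN> tokens per box (y_min, x_min, y_max, x_max on a 0-1023 grid) followed by a label; the regex and scaling are illustrative, not an official parser.

```python
# Rough sketch of decoding detection output into pixel-space boxes (format assumed)
import re

def parse_detection(text, width, height):
    boxes = []
    for match in re.finditer(
        r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;]+)", text
    ):
        y0, x0, y1, x1 = (int(match.group(i)) for i in range(1, 5))
        label = match.group(5).strip()
        boxes.append({
            "label": label,
            # scale the 0-1023 grid coordinates to the original image size
            "box": (x0 / 1024 * width, y0 / 1024 * height,
                    x1 / 1024 * width, y1 / 1024 * height),
        })
    return boxes

print(parse_detection("<loc0256><loc0128><loc0768><loc0896> cat", 640, 480))
```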
As it stands, Google plans on developing its capabilities further to serve more diverse needs in the future.