Google has unveiled PaliGemma, a latest family of vision language models. These innovative models work by receiving both an image and text inputs, and generating text as output. The architecture of PaliGemma comprises of two components: an image encoder named SigLIP-So400m, and a text decoder dubbed Gemma-2B. SigLIP, which has the ability to understand both…
