Recent advancements in large language models (LLMs) and multimodal foundation models (MMFMs) have sparked a surge of interest in large multimodal models (LMMs). Models such as GPT-4 and LLaVA have demonstrated exceptional performance on vision-language tasks such as visual question answering (VQA) and image captioning. However, these models demand substantial computational resources, prompting a shift of focus toward smaller, more efficient LMMs.
In response to this need, researchers from Cognitive AI and Intel Labs have introduced LLaVA-Gemma, a suite of vision-language assistants built on the Gemma LLM variants Gemma-2B and Gemma-7B. These assistants are inspired by progress in smaller yet capable vision-language models (VLMs) such as LLaVA-Phi. LLaVA-Gemma gives researchers a more controlled setting in which to explore the balance between computational efficiency and the depth of visual-linguistic understanding.
LLaVA-Gemma follows the LLaVA framework’s blueprint: a pretrained vision encoder (such as CLIP) is connected to a pretrained language model (such as Gemma) via a multilayer perceptron (MLP) connector. The model undergoes two training stages: first the MLP connector is pretrained on a custom dataset, then the language model and the connector are finetuned jointly on multimodal instruction-tuning examples. Departures from the original LLaVA recipe include using Gemma models as the language backbone, swapping in the larger DINOv2 image encoder for vision, and assessing whether omitting the initial pretraining stage affects performance.
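To make the connector design concrete, the sketch below shows a minimal LLaVA-style MLP connector in PyTorch. The hidden sizes, module names, and two-layer GELU design are illustrative assumptions rather than the exact LLaVA-Gemma configuration.

```python
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Projects vision-encoder patch features into the language model's
    embedding space, in the spirit of the LLaVA-style connector.
    Dimensions below are assumptions for illustration, not the paper's config."""

    def __init__(self, vision_dim=1024, lm_dim=2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, vision_dim) from e.g. CLIP or DINOv2
        return self.proj(patch_features)  # (batch, num_patches, lm_dim)

# Example: project a batch of hypothetical patch embeddings.
connector = MLPConnector()
image_feats = torch.randn(1, 576, 1024)
image_tokens = connector(image_feats)  # shape: (1, 576, 2048)
```

The projected image tokens are then concatenated with the text token embeddings and fed to the language model; stage 1 trains only the connector, while stage 2 finetunes the connector and the language model jointly.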
In terms of training time, the Gemma-2B model trained on eight Intel® Gaudi® 2 AI accelerators in four hours, while the larger Gemma-7B model took 16 hours under the same conditions. In other words, Gemma-7B, with its larger parameter count, takes roughly four times as long to train as Gemma-2B, highlighting the trade-off between model size and training efficiency: larger models require significantly more computational resources and time.
This research not only introduces LLaVA-Gemma, leveraging compact yet capable Gemma language models for efficient multimodal interaction, but also provides a comprehensive evaluation of the Gemma-2B and Gemma-7B variants, offering valuable insights into the trade-offs between computational efficiency and the richness of visual and linguistic understanding in LMMs. The researchers also examine alternative design choices and visualize attention with relevancy maps to better understand model performance and attention distribution.
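As a rough illustration of this kind of attention visualization, the sketch below implements a simple attention-rollout-style aggregation of per-layer attention maps. This is not the gradient-based relevancy-map method the authors use; the function name, tensor shapes, and aggregation scheme are assumptions chosen only to convey the general idea of tracing attention back to input tokens.

```python
import torch

def attention_rollout(attentions):
    """Aggregate per-layer attention maps into a single relevance estimate.

    A simple attention-rollout approximation, not the relevancy-map method
    used in the paper. attentions: list of per-layer tensors, each with shape
    (num_heads, seq_len, seq_len). Returns a (seq_len, seq_len) matrix.
    """
    seq_len = attentions[0].shape[-1]
    rollout = torch.eye(seq_len)
    for layer_attn in attentions:
        # Average over heads, add identity for the residual connection,
        # then re-normalize each row to keep it a valid distribution.
        attn = layer_attn.mean(dim=0) + torch.eye(seq_len)
        attn = attn / attn.sum(dim=-1, keepdim=True)
        rollout = attn @ rollout
    return rollout

# Example with random attention maps: 4 layers, 8 heads, 16 tokens.
attns = [torch.rand(8, 16, 16).softmax(dim=-1) for _ in range(4)]
relevance = attention_rollout(attns)
print(relevance.shape)  # torch.Size([16, 16])
```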
In summary, the research introduces LLaVA-Gemma, a compact vision-language model that harnesses the Gemma LLM in two variants, Gemma-2B and Gemma-7B. It presents a unique opportunity to explore the balance between computational efficiency and multimodal understanding in small-scale models. The evaluations demonstrate the versatility and effectiveness of LLaVA-Gemma across a range of datasets, highlighting its potential to serve as a benchmark for future research on small-scale vision-language models.