Recent advancements in large language models (LLMs) and multimodal foundation models (MMFMs) have sparked a surge of interest in large multimodal models (LMMs). Models such as GPT-4 and LLaVA have demonstrated exceptional performance on vision-language tasks such as visual question answering (VQA) and image captioning. However, these models demand substantial computational resources, prompting a shift of focus toward smaller, more efficient LMMs.
In response to this need, researchers from Cognitive AI and Intel Labs have introduced LLaVA-Gemma, a suite of vision-language assistants built on the Gemma LLM variants Gemma-2B and Gemma-7B. These assistants are inspired by progress in smaller yet capable vision-language models (VLMs) such as LLaVA-Phi. LLaVA-Gemma gives researchers a more controlled setting in which to explore the balance between computational efficiency and the depth of visual-linguistic understanding.
LLaVA-Gemma follows the LLaVA framework’s blueprint: a pretrained vision encoder (such as CLIP) is connected to a pretrained language model (such as Gemma) via a multilayer perceptron (MLP) connector. The model undergoes two training stages: first the MLP connector is pretrained on a custom dataset, then the language model and the connector are finetuned jointly on multimodal instruction-tuning examples. Departures from the original LLaVA recipe include using Gemma models as the language backbone, swapping in the larger DINOv2 image encoder for vision, and assessing whether omitting the initial pretraining stage affects performance.
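To make the connector design concrete, the sketch below shows a minimal LLaVA-style MLP connector in PyTorch. The hidden sizes, module names, and two-layer GELU design are illustrative assumptions rather than the exact LLaVA-Gemma configuration.

```python
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Projects vision-encoder patch features into the language model's
    embedding space, in the spirit of the LLaVA-style connector.
    Dimensions below are assumptions for illustration, not the paper's config."""

    def __init__(self, vision_dim=1024, lm_dim=2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, vision_dim) from e.g. CLIP or DINOv2
        return self.proj(patch_features)  # (batch, num_patches, lm_dim)

# Example: project a batch of hypothetical patch embeddings.
connector = MLPConnector()
image_feats = torch.randn(1, 576, 1024)
image_tokens = connector(image_feats)  # shape: (1, 576, 2048)
```

The projected image tokens are then concatenated with the text token embeddings and fed to the language model; stage 1 trains only the connector, while stage 2 finetunes the connector and the language model jointly.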
In terms of training time, the Gemma-2B model trained on eight Intel® Gaudi® 2 AI accelerators in four hours, while the larger Gemma-7B model took 16 hours under the same conditions. In other words, Gemma-7B, with its larger parameter count, takes roughly four times as long to train as Gemma-2B, highlighting the trade-off between model size and training efficiency: larger models require significantly more computational resources and time.
This research not only introduces LLaVA-Gemma, leveraging compact yet capable Gemma language models for efficient multimodal interaction, but also provides a comprehensive evaluation of the Gemma-2B and Gemma-7B variants, offering valuable insights into the trade-offs between computational efficiency and the richness of visual and linguistic understanding in LMMs. The researchers also examine alternative design choices and visualize attention with relevancy maps to better understand model performance and attention distribution.
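As a rough illustration of this kind of attention visualization, the sketch below implements a simple attention-rollout-style aggregation of per-layer attention maps. This is not the gradient-based relevancy-map method the authors use; the function name, tensor shapes, and aggregation scheme are assumptions chosen only to convey the general idea of tracing attention back to input tokens.

```python
import torch

def attention_rollout(attentions):
    """Aggregate per-layer attention maps into a single relevance estimate.

    A simple attention-rollout approximation, not the relevancy-map method
    used in the paper. attentions: list of per-layer tensors, each with shape
    (num_heads, seq_len, seq_len). Returns a (seq_len, seq_len) matrix.
    """
    seq_len = attentions[0].shape[-1]
    rollout = torch.eye(seq_len)
    for layer_attn in attentions:
        # Average over heads, add identity for the residual connection,
        # then re-normalize each row to keep it a valid distribution.
        attn = layer_attn.mean(dim=0) + torch.eye(seq_len)
        attn = attn / attn.sum(dim=-1, keepdim=True)
        rollout = attn @ rollout
    return rollout

# Example with random attention maps: 4 layers, 8 heads, 16 tokens.
attns = [torch.rand(8, 16, 16).softmax(dim=-1) for _ in range(4)]
relevance = attention_rollout(attns)
print(relevance.shape)  # torch.Size([16, 16])
```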
In summary, the research introduces LLaVA-Gemma, a compact vision-language model that harnesses the Gemma LLM in two variants, Gemma-2B and Gemma-7B. It presents a unique opportunity to explore the balance between computational efficiency and multimodal understanding in small-scale models. The evaluations demonstrate the versatility and effectiveness of LLaVA-Gemma across a range of datasets, highlighting its potential to serve as a benchmark for future research on small-scale vision-language models.