Advances in multimodal large language models (MLLMs) such as ChatGPT have proved revolutionary in several fields. However, these models are built primarily on Transformer networks, whose attention mechanism scales quadratically with sequence length, which reduces efficiency. Large language models (LLMs), on the other hand, are limited in their adaptability because they rely solely on language interactions.
To address this, researchers from Westlake University and Zhejiang University have developed Cobra, an MLLM that incorporates the Mamba language model. With its linear computational complexity, Cobra improves on current models such as LLaVA-Phi and TinyLLaVA, offering faster inference and competitive performance. The researchers plan to release Cobra's code as open source to aid future research on MLLMs' complexity issues.
LLMs have already revolutionized natural language processing, with large models such as GLM and LLaMA, and smaller alternatives such as Stable LM and TinyLLaMA, showing comparable efficacy. Vision-language models (VLMs) such as GPT-4V and Flamingo extend LLMs to process visual data, but the quadratic complexity of their Transformer backbones limits scalability. Visual inputs are typically encoded with Vision Transformers such as ViT, while state space models such as Mamba offer a competitive alternative backbone with linear scalability in sequence length.
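To make the scalability contrast concrete, the per-layer costs can be sketched as follows (a simplified comparison in sequence length n and hidden size d, ignoring constant factors):

```latex
% Per-layer cost of processing a sequence of length n with hidden size d.
\begin{align}
  \text{Self-attention (Transformer):} \quad & \mathcal{O}(n^2 d)
    && \text{every token attends to every other token} \\
  \text{Selective SSM (Mamba):} \quad & \mathcal{O}(n d)
    && \text{a linear recurrent scan over the sequence}
\end{align}
```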
Cobra integrates Mamba’s selective state space model (SSM) with visual comprehension capabilities. It has a vision encoder merging DINOv2 and SigLIP representations, a projector that aligns visual and textual features, and the Mamba backbone processing visual and textual embeddings. These elements help generate target token sequences.
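The pieces fit together roughly as in the sketch below. This is illustrative PyTorch pseudocode, not the released implementation: the class name `CobraSketch`, the fusion-by-concatenation step, and the dimension defaults are assumptions made for illustration.

```python
# A minimal sketch of a Cobra-style forward pass, assuming the three-part design
# described above: two vision encoders (DINOv2 + SigLIP) whose patch features
# are fused, an MLP projector into the language model's embedding space, and a
# Mamba backbone over the concatenated visual and textual embeddings.
# The encoder and backbone arguments are stand-in modules, not the released code.
import torch
import torch.nn as nn


class CobraSketch(nn.Module):
    def __init__(self, dino_encoder, siglip_encoder, mamba_backbone,
                 dino_dim=1024, siglip_dim=1152, lm_dim=2560):
        super().__init__()
        self.dino = dino_encoder          # (B, N_patches, dino_dim)
        self.siglip = siglip_encoder      # (B, N_patches, siglip_dim)
        # Projector: maps fused visual features into the LM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )
        self.backbone = mamba_backbone    # selective SSM language model

    def forward(self, image, text_embeds):
        # 1) Encode the image with both encoders and merge their representations.
        visual = torch.cat([self.dino(image), self.siglip(image)], dim=-1)
        # 2) Align the visual features with the textual embedding space.
        visual = self.projector(visual)
        # 3) Prepend visual tokens to the text embeddings and let the Mamba
        #    backbone generate the target token sequence autoregressively.
        fused = torch.cat([visual, text_embeds], dim=1)
        return self.backbone(fused)
```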
Cobra was evaluated across six benchmarks, demonstrating its efficacy in visual question answering and spatial reasoning tasks. Its inference speed was significantly faster than that of comparable Transformer-based models, and it showed a strong understanding of spatial relationships and scene descriptions.
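Inference-speed comparisons of this kind are usually reported as generated tokens per second. A minimal way to measure that figure is sketched below, assuming a Hugging Face-style `generate` API; `model` and `input_ids` are placeholders, not Cobra's actual evaluation harness.

```python
# Minimal throughput measurement sketch: time a single generation call and
# report new tokens per second. Assumes a model exposing a HF-style generate().
import time
import torch


@torch.inference_mode()
def tokens_per_second(model, input_ids, max_new_tokens=256):
    if torch.cuda.is_available():
        torch.cuda.synchronize()          # make GPU timing accurate
    start = time.perf_counter()
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[1] - input_ids.shape[1]
    return new_tokens / elapsed
```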
In conclusion, Cobra offers a solution to the efficiency challenges posed by MLLMs built on Transformer networks. It effectively fuses visual and linguistic information within the Mamba language model while improving computational efficiency, and it achieves competitive performance, excelling in particular at mitigating visual hallucination and judging spatial relationships. This paves the way for high-performance models in scenarios that require real-time processing of visual information.