Advances in multimodal large language models (MLLMs) such as ChatGPT have proved revolutionary in several fields. However, these models are built primarily on Transformer networks, whose attention mechanism scales quadratically with sequence length, which reduces efficiency. Large language models (LLMs), on the other hand, are limited in their adaptability because they rely solely on language interactions.
To address this, researchers from Westlake University and Zhejiang University have developed Cobra, an MLLM that incorporates the Mamba language model. With its linear computational complexity, Cobra improves on current models such as LLaVA-Phi and TinyLLaVA, offering faster inference and competitive performance. The researchers plan to release Cobra's code as open source to aid future research on MLLMs' complexity issues.
LLMs have already revolutionized natural language processing, with large models such as GLM and LLaMA, and smaller alternatives such as Stable LM and TinyLLaMA, showing comparable efficacy. Vision-language models (VLMs) such as GPT-4V and Flamingo extend LLMs to process visual data, but the quadratic complexity of their Transformer backbones limits scalability. Visual inputs are typically encoded with Vision Transformers such as ViT, while state space models such as Mamba offer a competitive alternative backbone with linear scalability in sequence length.
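To make the scalability contrast concrete, the per-layer costs can be sketched as follows (a simplified comparison in sequence length n and hidden size d, ignoring constant factors):

```latex
% Per-layer cost of processing a sequence of length n with hidden size d.
\begin{align}
  \text{Self-attention (Transformer):} \quad & \mathcal{O}(n^2 d)
    && \text{every token attends to every other token} \\
  \text{Selective SSM (Mamba):} \quad & \mathcal{O}(n d)
    && \text{a linear recurrent scan over the sequence}
\end{align}
```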
Cobra integrates Mamba’s selective state space model (SSM) with visual comprehension capabilities. It has a vision encoder merging DINOv2 and SigLIP representations, a projector that aligns visual and textual features, and the Mamba backbone processing visual and textual embeddings. These elements help generate target token sequences.
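The pieces fit together roughly as in the sketch below. This is illustrative PyTorch pseudocode, not the released implementation: the class name `CobraSketch`, the fusion-by-concatenation step, and the dimension defaults are assumptions made for illustration.

```python
# A minimal sketch of a Cobra-style forward pass, assuming the three-part design
# described above: two vision encoders (DINOv2 + SigLIP) whose patch features
# are fused, an MLP projector into the language model's embedding space, and a
# Mamba backbone over the concatenated visual and textual embeddings.
# The encoder and backbone arguments are stand-in modules, not the released code.
import torch
import torch.nn as nn


class CobraSketch(nn.Module):
    def __init__(self, dino_encoder, siglip_encoder, mamba_backbone,
                 dino_dim=1024, siglip_dim=1152, lm_dim=2560):
        super().__init__()
        self.dino = dino_encoder          # (B, N_patches, dino_dim)
        self.siglip = siglip_encoder      # (B, N_patches, siglip_dim)
        # Projector: maps fused visual features into the LM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )
        self.backbone = mamba_backbone    # selective SSM language model

    def forward(self, image, text_embeds):
        # 1) Encode the image with both encoders and merge their representations.
        visual = torch.cat([self.dino(image), self.siglip(image)], dim=-1)
        # 2) Align the visual features with the textual embedding space.
        visual = self.projector(visual)
        # 3) Prepend visual tokens to the text embeddings and let the Mamba
        #    backbone generate the target token sequence autoregressively.
        fused = torch.cat([visual, text_embeds], dim=1)
        return self.backbone(fused)
```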
Cobra was evaluated across six benchmarks, demonstrating its efficacy in visual question answering and spatial reasoning tasks. Its inference speed was significantly faster than that of comparable Transformer-based models, and it showed a strong understanding of spatial relationships and scene descriptions.
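Inference-speed comparisons of this kind are usually reported as generated tokens per second. A minimal way to measure that figure is sketched below, assuming a Hugging Face-style `generate` API; `model` and `input_ids` are placeholders, not Cobra's actual evaluation harness.

```python
# Minimal throughput measurement sketch: time a single generation call and
# report new tokens per second. Assumes a model exposing a HF-style generate().
import time
import torch


@torch.inference_mode()
def tokens_per_second(model, input_ids, max_new_tokens=256):
    if torch.cuda.is_available():
        torch.cuda.synchronize()          # make GPU timing accurate
    start = time.perf_counter()
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[1] - input_ids.shape[1]
    return new_tokens / elapsed
```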
In conclusion, Cobra offers a solution to the efficiency challenges posed by MLLMs built on Transformer networks. It effectively fuses visual and linguistic information within the Mamba language model while improving computational efficiency, and it achieves competitive performance, excelling in particular at mitigating visual hallucination and judging spatial relationships. This paves the way for high-performance models in scenarios that require real-time processing of visual information.