IXC-2.5, also known as InternLM-XComposer-2.5, is a versatile large vision-language model that handles long-contextual input and output.

Large Language Models (LLMs) have seen substantial progress, leading researchers to focus on Large Vision Language Models (LVLMs), which aim to unify visual and textual data processing. However, open-source LVLMs struggle to match the versatility of proprietary models like GPT-4, Gemini Pro, and Claude 3, primarily because of less diverse training data and difficulties in handling long-context inputs and outputs.

To address these issues, researchers have explored several approaches, including text-image conversation models, high-resolution image analysis techniques, and video understanding methods, along with strategies to improve webpage generation and align models more closely with human preferences.

A significant breakthrough is the development of InternLM-XComposer-2.5 (IXC-2.5), a joint research effort by Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong, SenseTime Group, and Tsinghua University. IXC-2.5 stands out for its versatility and extended long-context capabilities. It excels in a range of comprehension and composition tasks, such as free-form text-image conversations, OCR, video understanding, and crafting webpages and articles.

IXC-2.5 incorporates three comprehension upgrades: ultra-high resolution understanding, fine-grained video analysis, and multi-turn multi-image dialogue support. It supports a 24K-token interleaved image-text context window, extendable to 96K, which facilitates long-term human-AI interaction and content creation.
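To make the long-context figures concrete, here is a minimal sketch of how an interleaved image-text conversation history might be tracked against a 24K-token window (extendable to 96K). The message structure, per-word token heuristic, and trimming rule are illustrative assumptions, not IXC-2.5's actual API; only the 400-tokens-per-sub-image figure comes from the description below.

```python
from dataclasses import dataclass

TEXT_TOKENS_PER_WORD = 1.3   # rough heuristic (assumption)
TOKENS_PER_IMAGE = 400       # per 560x560 sub-image, as reported for IXC-2.5

@dataclass
class Turn:
    role: str                # "user" or "assistant"
    text: str = ""
    num_image_tiles: int = 0 # sub-images attached to this turn

    def token_cost(self) -> int:
        text_cost = int(len(self.text.split()) * TEXT_TOKENS_PER_WORD)
        return text_cost + self.num_image_tiles * TOKENS_PER_IMAGE

def trim_to_window(history: list[Turn], window: int = 24_000) -> list[Turn]:
    """Drop the oldest turns until the interleaved history fits the window."""
    kept, used = [], 0
    for turn in reversed(history):   # keep the most recent turns first
        cost = turn.token_cost()
        if used + cost > window:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))

# Example: a multi-image, multi-turn exchange.
history = [
    Turn("user", "Compare these two dashboards.", num_image_tiles=8),
    Turn("assistant", "The second dashboard surfaces latency metrics more prominently."),
    Turn("user", "Now summarise the attached 30-frame screen recording.", num_image_tiles=30),
]
print(sum(t.token_cost() for t in history))  # well under the 24K budget
print(len(trim_to_window(history)))          # all 3 turns are kept
```

The point of the sketch is simply that image-heavy turns dominate the budget: a 30-frame recording alone costs roughly 12K visual tokens, which is why a window extendable to 96K matters for video-length interactions.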

The architecture of IXC-2.5 combines a ViT-L/14 Vision Encoder, the InternLM2-7B Language Model, and Partial LoRA. It processes diverse inputs with a Unified Dynamic Image Partition strategy at 560 x 560 resolution, allotting 400 tokens per sub-image. The model also employs a scaled identity strategy for high-resolution images and treats videos as concatenated frames. It further offers audio input and output, using Whisper for transcription and MeloTTS for speech synthesis.
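The following sketch illustrates the general idea behind such a dynamic partition scheme: cover the image with 560 x 560 tiles, downscale if the tile count would exceed a budget, and charge 400 visual tokens per sub-image. The tile budget, the downscaling rule, and the extra global-thumbnail tile are assumptions for illustration, not IXC-2.5's exact algorithm.

```python
import math

TILE = 560              # sub-image resolution reported for IXC-2.5
TOKENS_PER_TILE = 400   # visual tokens per sub-image reported for IXC-2.5

def choose_grid(width: int, height: int, max_tiles: int = 24) -> tuple[int, int]:
    """Cover the image with 560x560 tiles; if that exceeds the tile budget,
    downscale uniformly first (the budget value is a hypothetical choice)."""
    cols = math.ceil(width / TILE)
    rows = math.ceil(height / TILE)
    if cols * rows > max_tiles:
        scale = math.sqrt(max_tiles / (cols * rows))
        cols = max(1, math.floor(cols * scale))
        rows = max(1, math.floor(rows * scale))
    return cols, rows

def visual_token_count(width: int, height: int) -> int:
    cols, rows = choose_grid(width, height)
    # +1 for a global thumbnail view, a common design in tiling LVLMs
    # (whether IXC-2.5 adds one is an assumption here).
    num_tiles = cols * rows + 1
    return num_tiles * TOKENS_PER_TILE

if __name__ == "__main__":
    # A 4K screenshot: 7x4 tiles exceed the budget, so the grid shrinks to 6x3.
    print(choose_grid(3840, 2160))        # (6, 3)
    print(visual_token_count(3840, 2160)) # (6*3 + 1) * 400 = 7600 tokens
```

Treating a video as concatenated frames fits the same accounting: each sampled frame contributes its own tiles, which is why the long context window and the partition strategy work together.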

Performance-wise, IXC-2.5 delivers strong results across benchmarks. In video understanding, it outperforms most open-source models, and on high-resolution tasks it competes well with larger models. It shows significant gains in multi-image multi-turn comprehension, surpassing previous models by 13.8% on the MMDU benchmark. In visual QA tasks, IXC-2.5 matches or even surpasses both open- and closed-source models.

In conclusion, IXC-2.5 marks a substantial advance in LVLMs, displaying exceptional capabilities in handling long-contextual input and output. Despite relying on a modest 7B LLM backend, it delivers competitive performance across a wide range of tests. Its architecture and achievements pave the way for future research into richer contextual multi-modal environments, marking a significant milestone in multimodal AI technology.
