Large Language Models (LLMs) have seen substantial progress, leading researchers to focus on developing Large Vision Language Models (LVLMs), which aim to unify visual and textual data processing. However, open-source LVLMs struggle to match the versatility of proprietary models such as GPT-4, Gemini Pro, and Claude 3, primarily because of limited diversity in training data and difficulties in handling long-context inputs and outputs.
To address these issues, researchers have explored several directions, including text-image conversation models, high-resolution image analysis techniques, and video understanding methods, as well as strategies for improving webpage generation and aligning models more closely with human preferences.
A significant breakthrough is InternLM-XComposer-2.5 (IXC-2.5), developed through a collective research effort by Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong, SenseTime Group, and Tsinghua University. IXC-2.5 stands out for its versatility and extended long-context capabilities, excelling in a range of comprehension and composition tasks such as free-form text-image conversations, OCR, video understanding, and webpage and article composition.
IXC-2.5 incorporates three comprehension upgrades: ultra-high-resolution understanding, fine-grained video analysis, and multi-turn multi-image dialogue support. It supports a 24K-token interleaved image-text context window, extendable to 96K, enabling long-term human-AI interaction and content creation.
Architecturally, IXC-2.5 combines a ViT-L/14 vision encoder, the InternLM2-7B language model, and Partial LoRA. It processes diverse inputs with a Unified Dynamic Image Partition strategy at 560×560 resolution, allocating 400 tokens per sub-image. The model also employs a scaled identity strategy for high-resolution images and treats videos as concatenated frames. In addition, it offers audio input/output support, using Whisper for transcription and MeloTTS for speech synthesis.
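To make the token budget concrete, the sketch below shows how a dynamic-partition scheme of this kind might tile an image into 560×560 sub-images and estimate the resulting visual-token count. It is a minimal illustration, not the official implementation: the grid-selection heuristic, the `MAX_TILES` cap, and the global-thumbnail handling are assumptions added for clarity; only the 560×560 tile size and the 400 tokens per sub-image come from the reported configuration.

```python
# Illustrative sketch of a dynamic image-partition scheme: tile an input image
# into 560x560 sub-images and budget 400 visual tokens per sub-image.
# Grid selection, the tile cap, and the global thumbnail are assumptions.

from math import ceil

PATCH = 560            # sub-image resolution reported for IXC-2.5
TOKENS_PER_TILE = 400  # visual tokens per sub-image reported for IXC-2.5
MAX_TILES = 24         # assumed cap on sub-images; the real limit may differ


def partition(width: int, height: int):
    """Pick a (rows, cols) grid of PATCH x PATCH tiles covering the image."""
    cols = max(1, ceil(width / PATCH))
    rows = max(1, ceil(height / PATCH))
    # Shrink the grid proportionally if it exceeds the assumed tile budget.
    while rows * cols > MAX_TILES:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return rows, cols


def visual_token_count(width: int, height: int, global_thumbnail: bool = True):
    """Estimate visual tokens: one 400-token block per tile (+ thumbnail)."""
    rows, cols = partition(width, height)
    tiles = rows * cols + (1 if global_thumbnail else 0)
    return tiles * TOKENS_PER_TILE


if __name__ == "__main__":
    # Example: a 4K screenshot versus a single 560x560 frame.
    for w, h in [(3840, 2160), (560, 560)]:
        r, c = partition(w, h)
        print(f"{w}x{h}: grid {r}x{c}, ~{visual_token_count(w, h)} visual tokens")
```

Under these assumptions, a single 4K screenshot already consumes roughly 10K visual tokens, which illustrates why a 24K interleaved window, extendable to 96K, matters for multi-image dialogue and video inputs.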
Performance-wise, IXC-2.5 delivers strong results across benchmarks. In video understanding, it outperforms most open-source models, and on high-resolution tasks it competes well with larger models. It shows significant gains in multi-image multi-turn comprehension, surpassing previous models by 13.8% on the MMDU benchmark, and in visual QA tasks it matches or even surpasses both open- and closed-source models.
In conclusion, IXC-2.5 represents a groundbreaking improvement in LVLMs, with exceptional capabilities for handling long-context inputs and outputs. Despite running on a moderately sized 7B language model backbone, it delivers competitive performance across a wide range of benchmarks. Its architecture and results pave the way for future research into richer, more contextual multimodal environments, marking a significant milestone in multimodal AI.