Multimodal large language models (MLLMs), which combine text and visual data processing, enhance the ability of artificial intelligence to understand and interact with the world. However, most open-source MLLMs remain limited in their ability to process complex visual inputs and to support multiple languages, which hinders their practical application.
A research collaboration from several Chinese institutions, including Shanghai AI Laboratory, Tsinghua University, and The Chinese University of Hong Kong, has introduced InternVL 1.5. This open-source MLLM is designed to substantially improve multimodal understanding in open-source systems through three major advancements aimed at closing the performance gap between open-source and commercial models.
The first advancement is a strong vision encoder, InternViT-6B, refined through a continuous learning strategy that boosts its visual understanding capabilities.
As its second advancement, the model adopts a dynamic high-resolution strategy that handles images up to 4K resolution. Depending on the aspect ratio and resolution of the input, an image is divided into between 1 and 40 tiles of 448×448 pixels, as sketched below.
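To make the tiling step concrete, here is a minimal sketch of how such a grid could be selected. The 448-pixel tile size and the 40-tile cap come from the paper; the helper names and the tie-breaking rule are illustrative assumptions, not the authors' actual implementation.

```python
from itertools import product

TILE = 448       # tile side in pixels, per the paper
MAX_TILES = 40   # tile cap, per the paper; enough to cover ~4K input

def candidate_grids(max_tiles=MAX_TILES):
    """All (cols, rows) grids whose total tile count stays within the cap."""
    return [(c, r) for c, r in product(range(1, max_tiles + 1), repeat=2)
            if c * r <= max_tiles]

def pick_grid(width, height, max_tiles=MAX_TILES):
    """Choose the grid whose aspect ratio best matches the input image.

    Breaking ties toward larger grids (more tiles, more detail) is an
    assumption made for this sketch, not the authors' exact rule.
    """
    target = width / height
    return min(candidate_grids(max_tiles),
               key=lambda g: (abs(g[0] / g[1] - target), -(g[0] * g[1])))

def tile_boxes(width, height, max_tiles=MAX_TILES):
    """Return the resized canvas size and a crop box for each tile."""
    cols, rows = pick_grid(width, height, max_tiles)
    canvas = (cols * TILE, rows * TILE)  # the image is resized to this first
    boxes = [(c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
             for r in range(rows) for c in range(cols)]
    return canvas, boxes

# A 4K frame (3840x2160, 16:9) lands on a 7x4 grid: 28 tiles of 448x448.
canvas, boxes = tile_boxes(3840, 2160)
print(canvas, len(boxes))  # (3136, 1792) 28
```

Because the grid tracks the image's native aspect ratio, tall documents, wide panoramas, and square photos all receive tilings that avoid heavy distortion while keeping the tile count, and therefore the compute cost, bounded.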
Lastly, the researchers carefully assembled a high-quality bilingual dataset covering common scenes and document images, annotated with question-answer pairs in both English and Chinese. This addition boosts the model's performance in Optical Character Recognition (OCR) and Chinese-language tasks.
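For illustration, a single record in such a dataset might pair one image with aligned English and Chinese annotations. Every field name and value below is hypothetical; the paper does not publish this schema.

```python
# Hypothetical annotation record for one document image; the field names
# and values are illustrative, not the dataset's actual schema.
sample = {
    "image": "documents/invoice_0001.png",
    "qa_pairs": [
        {
            "lang": "en",
            "question": "What is the total amount due on this invoice?",
            "answer": "1,280.00",
        },
        {
            "lang": "zh",
            "question": "这张发票的应付总额是多少？",
            "answer": "1,280.00",
        },
    ],
}
```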
InternVL 1.5 has been evaluated on a range of benchmarks, where it shows particular strength on OCR-related datasets and in bilingual scene understanding. For example, it scored 80.6% on text-based visual question answering (TextVQA) and 90.9% on document-based question answering (DocVQA). It often surpasses other open-source models and is competitive with commercial ones.
In summary, InternVL 1.5 delivers substantial improvements in processing high-resolution images and supporting multiple languages. Through a strong vision encoder, a dynamic high-resolution strategy, and a comprehensive bilingual dataset, it greatly narrows the performance gap with proprietary models. Its strong results on OCR-related tasks and bilingual scene understanding make it a serious contender among advanced artificial intelligence systems.