Researchers from Shanghai Jiaotong University, Shanghai AI Laboratory, and Nanyang Technological University’s S-Lab have developed an advanced multi-modal large language model (MLLM) called MG-LLaVA. The new model aims to overcome the limitations that current MLLMs face because of their reliance on low-resolution image inputs.
The main challenge with existing MLLMs has been their reliance on low-resolution inputs, which compromises their ability to process fine-grained details and recognize smaller objects in complex images. Several enhancements have been proposed, including training on diverse datasets and using higher-resolution images, but comprehensive visual understanding still requires integrating object-level features with inputs of multiple granularities.
To tackle this, the team designed MG-LLaVA, an MLLM that improves visual processing by incorporating a multi-granularity vision flow. The model processes low-resolution, high-resolution, and object-centric features together, strengthening its ability to perceive fine details and recognize objects.
The architecture of MG-LLaVA is based on two main components: the Multi-Granularity Vision Flow framework and a large language model. In the Vision Flow framework, images are processed at multiple resolutions: a CLIP-pretrained Vision Transformer (ViT) extracts low-resolution features, while a CLIP-pretrained ConvNeXt extracts high-resolution features. A Conv-Gate fusion network aligns the channel widths of the two feature streams and fuses their semantic information efficiently.
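To make the fusion step concrete, here is a minimal PyTorch-style sketch of one way a gated convolutional fusion of low- and high-resolution features could look. The module name, channel widths, and gating formulation below are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code): gated fusion of low-resolution ViT
# features with high-resolution ConvNeXt features. All names, shapes, and the
# exact gating formulation here are illustrative assumptions.
import torch
import torch.nn as nn


class ConvGateFusion(nn.Module):
    def __init__(self, low_dim: int, high_dim: int):
        super().__init__()
        # 1x1 convolution to align the high-resolution channel width
        # with the low-resolution (ViT) channel width.
        self.align = nn.Conv2d(high_dim, low_dim, kernel_size=1)
        # Gate that decides, per channel and location, how much
        # high-resolution detail to inject into the low-resolution stream.
        self.gate = nn.Sequential(
            nn.Conv2d(low_dim * 2, low_dim, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, low_feat: torch.Tensor, high_feat: torch.Tensor) -> torch.Tensor:
        # low_feat:  (B, low_dim,  H,  W)   -- from the CLIP ViT branch
        # high_feat: (B, high_dim, H', W')  -- from the CLIP ConvNeXt branch
        high_feat = self.align(high_feat)
        # Resample the high-resolution map onto the low-resolution grid
        # so the two streams can be fused element-wise.
        high_feat = nn.functional.interpolate(
            high_feat, size=low_feat.shape[-2:], mode="bilinear", align_corners=False
        )
        gate = self.gate(torch.cat([low_feat, high_feat], dim=1))
        return low_feat + gate * high_feat


# Example: fuse a 24x24 ViT feature map with a 96x96 ConvNeXt feature map.
fusion = ConvGateFusion(low_dim=1024, high_dim=1536)
fused = fusion(torch.randn(1, 1024, 24, 24), torch.randn(1, 1536, 96, 96))
print(fused.shape)  # torch.Size([1, 1024, 24, 24])
```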
Furthermore, MG-LLaVA incorporates object-level features by applying Region of Interest (RoI) alignment to extract features from detected bounding boxes. These object features are then combined with the other visual tokens, enhancing the model’s ability to capture comprehensive visual details and integrate them with textual embeddings. MG-LLaVA is trained on publicly available multimodal data and fine-tuned with visual instruction tuning data.
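As a rough illustration of the object-level branch, the sketch below uses torchvision's `roi_align` to pool features from bounding boxes and projects them into the same token space as the other visual tokens before concatenation. The channel widths, box coordinates, and projection layer are assumptions made for the example, not values taken from the paper.

```python
# Minimal sketch (illustrative assumptions, not the released implementation):
# pool per-box features from a high-resolution feature map with RoI alignment,
# project them to the LLM embedding width, and append them to the visual tokens.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

feat_dim, llm_dim = 1536, 4096                   # assumed channel widths
feature_map = torch.randn(1, feat_dim, 96, 96)   # high-resolution ConvNeXt features
# Boxes in (batch_index, x1, y1, x2, y2) format, in feature-map coordinates.
boxes = torch.tensor([[0, 10.0, 12.0, 40.0, 44.0],
                      [0, 50.0, 20.0, 90.0, 70.0]])

# Pool a fixed-size grid per box, then average it into one vector per object.
object_feats = roi_align(feature_map, boxes, output_size=(7, 7), spatial_scale=1.0)
object_feats = object_feats.mean(dim=(2, 3))               # (num_boxes, feat_dim)

# Project object features into the LLM token space and join them with the
# fused image tokens produced by the multi-granularity vision flow.
project = nn.Linear(feat_dim, llm_dim)
object_tokens = project(object_feats).unsqueeze(0)         # (1, num_boxes, llm_dim)
image_tokens = torch.randn(1, 576, llm_dim)                # placeholder visual tokens
visual_input = torch.cat([image_tokens, object_tokens], dim=1)
print(visual_input.shape)  # torch.Size([1, 578, 4096])
```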
When tested across several benchmarks, including MMBench and SEEDBench, MG-LLaVA outperformed existing MLLMs of comparable parameter size, showing markedly better perception and visual comprehension and surpassing GPT-4V and GeminiPro-V. Extensive ablation experiments further confirm the effectiveness of the object-level features and the Conv-Gate fusion network.
In essence, MG-LLaVA addresses the current limitations of MLLMs by integrating visual features at multiple granularities: low-resolution images, high-resolution images, and object-centric features. Its design delivers superior performance across many multimodal benchmarks, and the research provides a more capable tool for processing visual inputs that span these granularities.