Large Multimodal Models (LMMs) have shown great potential in advancing artificial general intelligence. These models gain visual abilities by training on vast amounts of vision-language data and aligning vision encoders with language models. Despite this, most open-source LMMs focus primarily on single-image scenarios, leaving complex multi-image scenarios largely unexplored. This gap is significant because multi-image capabilities are required in many real-world applications, highlighting the need for a general LMM framework that can effectively handle multi-image, video, and 3D data.
Researchers from ByteDance, HKUST, CUHK, and NTU have addressed this issue by proposing LLaVA-NeXT-Interleave, a versatile LMM capable of handling real-world scenarios spanning multi-image, multi-frame (video), and multi-view (3D) data. The model performs strongly in these settings while also maintaining strong performance on single-image tasks. These distinct modes of operation are collectively referred to as M4.
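The key idea behind such an interleaved design is that multi-image, video, and 3D inputs can all be expressed as a sequence of text segments mixed with image placeholders. The sketch below illustrates this with a plain Python helper; the "<image>" placeholder convention mirrors the general LLaVA style, and the function name and exact tokens are illustrative assumptions rather than the authors' actual code.

```python
# Minimal sketch: one interleaved text/image format covering multi-image,
# multi-frame (video), and multi-view (3D) inputs. Placeholder tokens and
# helper names are illustrative only.

from typing import List


def build_interleaved_prompt(texts: List[str], image_counts: List[int]) -> str:
    """Interleave text segments with image placeholders.

    texts[i] is followed by image_counts[i] "<image>" placeholders, so the
    same template handles multi-image, video, and multi-view inputs -- they
    differ only in how many visual slots are inserted.
    """
    parts = []
    for text, n_images in zip(texts, image_counts):
        parts.append(text)
        parts.extend(["<image>"] * n_images)
    return "\n".join(parts)


# Multi-image: compare two photos.
print(build_interleaved_prompt(["What changed between these two photos?"], [2]))

# Multi-frame (video): eight sampled frames treated as an image sequence.
print(build_interleaved_prompt(["Describe what happens in this clip."], [8]))

# Multi-view (3D): four views of the same scene from different cameras.
print(build_interleaved_prompt(["Estimate the layout of this room."], [4]))
```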
In support of this model, the team built a high-quality training dataset called M4-Instruct, comprising 1,177.6K samples. This dataset strengthens the M4 capabilities of LMMs by spanning 14 tasks and 41 datasets across the four domains mentioned above. With a single model, LLaVA-NeXT-Interleave achieved impressive results across a variety of multi-image tasks, surpassing previous state-of-the-art models while still performing well on single-image tasks.
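To make the dataset description concrete, here is a hypothetical example of what a single M4-Instruct-style training record could look like. The field names follow the common LLaVA instruction-tuning convention ("conversations" with "<image>" placeholders); the actual M4-Instruct schema, identifiers, and file paths may differ, so treat this purely as an illustration.

```python
# Hypothetical M4-Instruct-style record in the general LLaVA JSON convention.
# Identifiers and file paths below are made up for illustration.

import json

sample = {
    "id": "multi_image_000001",           # hypothetical identifier
    "images": [                            # one entry per visual input
        "spot_the_diff/left_0001.jpg",     # hypothetical file paths
        "spot_the_diff/right_0001.jpg",
    ],
    "conversations": [
        {
            "from": "human",
            "value": "<image>\n<image>\nList the differences between the two images.",
        },
        {
            "from": "gpt",
            "value": "The second image has an extra car parked near the entrance ...",
        },
    ],
}

print(json.dumps(sample, indent=2))
```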
Rigorous testing was carried out on LLaVA-NeXT-Interleave using a variety of datasets and benchmarks. The results showed that it surpassed earlier open-source models in both in-domain and out-of-domain evaluations. Additionally, after adding DPO (Direct Preference Optimization), the model achieved the highest scores on the Video Detailed Description (VDD) and VideoChatGPT benchmarks.
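For readers unfamiliar with DPO, the core of the technique is a simple preference loss that pushes the policy to rank a preferred response above a rejected one relative to a frozen reference model. The sketch below shows the standard DPO objective over summed sequence log-probabilities; it is not the authors' training code, and the tensor names and beta value are illustrative.

```python
# Standard DPO loss over per-sequence log-probabilities (illustrative sketch).

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Encourage the policy to prefer chosen over rejected responses,
    measured relative to a frozen reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()


# Toy usage with random log-probabilities for a batch of 4 preference pairs.
pc, pr = torch.randn(4), torch.randn(4)
rc, rr = torch.randn(4), torch.randn(4)
print(dpo_loss(pc, pr, rc, rr))
```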
In the 3D evaluation, the model demonstrated a strong understanding of the 3D world from multi-view images, scoring well even in challenging cases. For single-image tasks, integrating additional single-image data allowed the model to retain its ability to handle them as well.
In conclusion, LLaVA-NeXT-Interleave shows great promise as a flexible LMM capable of handling multiple real-world scenarios such as multi-image, multi-frame, and multi-view data. It not only performs well on these complex tasks but also sets new standards in the field. The model's demonstrated potential points to a promising direction for the future development of multimodal AI and complex visual understanding.
The research was conducted by teams at ByteDance, HKUST, CUHK, and NTU, and all credit goes to the researchers involved in the project. The detailed paper and GitHub repository are available for further exploration.