Multimodal large language models (MLLMs), which jointly process text and visual data, extend the ability of artificial intelligence systems to understand and interact with the world. However, most open-source MLLMs remain limited in their ability to handle complex visual inputs and to support multiple languages, which can hinder their practical application.
A research collaboration from several Chinese institutions…
