Multimodal large language models (MLLMs) are AI systems that combine the capabilities of language and vision models, improving their effectiveness across a wide range of tasks. Their ability to handle highly varied data types marks a significant milestone in AI. However, their extensive resource requirements present substantial barriers to widespread adoption.
Models like MiniGPT-v2 demand considerable computational resources, often available only to major corporations with substantial budgets. This high computational overhead makes deploying these models in resource-constrained environments, such as edge computing, a challenge.
A recent survey by researchers from Tencent, SJTU, BAAI, and ECNU finds that improving the efficiency of MLLMs is central to mitigating these challenges. It categorizes notable advances in the field into the following areas: architecture, vision processing, language model efficiency, training techniques, data usage, and practical applications. The survey makes clear that lightweight architectures and specialized components designed for efficiency can significantly reduce resource consumption.
The survey reports substantial advances in the performance of efficient MLLMs. By adopting token compression and lightweight model structures, these models have improved their computational efficiency and broadened their application scope. For instance, LLaVA-UHD supports images with resolutions six times larger while using only 94% of the computation of previous models. These efficiency gains have not come at the expense of performance: models such as MobileVLM achieve competitive results on high-resolution image and video understanding tasks.
The key findings from this survey include:
– Intense Resource Requirements: MLLMs such as MiniGPT-v2 require over 800 GPU hours of training on NVIDIA A100 GPUs, a cost many smaller organizations cannot afford.
– Optimization Strategies: The research emphasizes creating efficient MLLMs by reducing model size and leveraging pre-trained modality knowledge.
– Vision Token Compression: Techniques such as vision token compression can considerably reduce the computational load (see the sketch after this list).
– Training Efficiency: Certain models can be trained in academic settings in as little as 23 hours on eight A100 GPUs.
– Performance Gains: Models like LLaVA-UHD demonstrated a substantial improvement in computational efficiency.
– Efficient Architectures: Creating lighter architectures and adopting novel training methods can lead to significant performance improvements.
– Document and Video Understanding: Efficient MLLMs find applications in document understanding and video comprehension tasks.
– Knowledge Distillation and Quantization: Knowledge distillation allows smaller models to learn from larger ones while maintaining accuracy, and quantization shrinks memory usage and complexity by storing weights at lower precision (both are sketched after this list).
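To make the vision-token-compression idea concrete, here is a minimal sketch of one simple variant: average-pooling groups of adjacent vision tokens before they are handed to the language model. This is an illustrative assumption, not the specific method of any model in the survey; the function name, tensor shapes, and pooling ratio are invented for the example.

```python
import torch

def compress_vision_tokens(tokens: torch.Tensor, ratio: int = 4) -> torch.Tensor:
    """Merge every `ratio` adjacent vision tokens by averaging.

    tokens: (batch, num_tokens, dim) patch embeddings from a vision encoder.
    Returns a (batch, num_tokens // ratio, dim) tensor for the language model.
    """
    b, n, d = tokens.shape
    n_keep = (n // ratio) * ratio  # drop any remainder tokens
    return tokens[:, :n_keep].reshape(b, n_keep // ratio, ratio, d).mean(dim=2)

# Example: 576 patch tokens (a 24x24 grid) shrink to 144 tokens.
vision_tokens = torch.randn(1, 576, 1024)
print(compress_vision_tokens(vision_tokens).shape)  # torch.Size([1, 144, 1024])
```

The savings come from the language model attending over a much shorter sequence: since self-attention cost grows quadratically with sequence length, a 4x reduction in vision tokens cuts that portion of the compute by far more than 4x.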
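Likewise, here is a hedged sketch of the two compression techniques from the last bullet, written as standard textbook formulations rather than any survey-specific recipe: a soft-target distillation loss and symmetric per-tensor int8 weight quantization. The temperature, blend weight, and helper names are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Classic soft-target distillation: the student matches the teacher's
    temperature-softened distribution while still fitting the true labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitude is comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization: 8-bit weights plus one fp scale,
    cutting weight memory roughly 4x versus fp32."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover approximate fp32 weights from the int8 representation."""
    return q.to(torch.float32) * scale
```

The two techniques are complementary: distillation shrinks the number of parameters a deployed model needs, while quantization shrinks the bits each remaining parameter occupies.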
In conclusion, research on efficient MLLMs is a step toward resolving the critical barriers to their broader use, proposing methods that decrease resource consumption and improve accessibility. The developments highlighted in the survey chart a promising path for future research, suggesting that efficient MLLMs could democratize advanced AI capabilities and expand their use in practical scenarios.