A team of researchers from Peking University and Alibaba Group has introduced FastV, a method designed to mitigate computational inefficiencies in Large Vision-Language Models (LVLMs). In particular, FastV addresses the bias exhibited by the attention mechanism in LVLMs, which tends to favour textual tokens over visual tokens. Existing models – including LLaVA-1.5 and Video-LLaVA – have made substantial advances yet fail to handle visual tokens efficiently, which hampers LVLMs’ overall performance and computational efficiency.
LVLMs typically process multimodal inputs by converting images into tokens and combining them with textual tokens in a transformer-based decoder. However, visual tokens often receive lower attention scores than textual tokens, particularly in the deeper layers of LVLMs. This leads to suboptimal use of visual information and creates an efficiency bottleneck.
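The imbalance can be made concrete by measuring how much attention text-token queries assign to visual versus textual keys at each layer. The following is a minimal sketch of that measurement using synthetic attention maps; the tensor shapes, the assumption that visual tokens occupy the first positions, and the random data are all placeholders rather than details from the FastV paper.

```python
# Sketch: compare the attention mass that text-token queries assign to visual
# tokens versus textual tokens, layer by layer. All sizes and the synthetic
# attention maps below are illustrative assumptions.
import torch

num_layers, seq_len, num_visual = 4, 80, 60   # hypothetical sizes
visual_slice = slice(0, num_visual)            # visual tokens assumed to come first
text_slice = slice(num_visual, seq_len)

# Stand-in for real attention maps: (layers, heads, query, key), rows sum to 1.
attn = torch.rand(num_layers, 8, seq_len, seq_len).softmax(dim=-1)

for layer in range(num_layers):
    # Average over heads, then restrict to text-token queries.
    per_key = attn[layer].mean(dim=0)[text_slice]            # (text queries, seq_len)
    to_visual = per_key[:, visual_slice].sum(dim=-1).mean()
    to_text = per_key[:, text_slice].sum(dim=-1).mean()
    print(f"layer {layer}: attention to visual={to_visual:.3f}, to text={to_text:.3f}")
```

On real LVLM attention maps, the observation motivating FastV is that the visual share of attention drops markedly after the early layers.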
FastV seeks to remedy these issues with a dynamic pruning method that improves the computational efficiency of LVLMs. This method prunes superfluous visual tokens based on their attention scores, thereby reducing computational costs without degrading performance on various vision-language tasks.
FastV applies this dynamic pruning during the inference phase of LVLMs: it ranks visual tokens by the attention scores they receive and drops the less relevant ones beyond a chosen layer. This significantly lessens the computational load, especially in the deeper layers where visual tokens contribute little attention anyway. FastV thereby achieves a substantial reduction in floating-point operations (FLOPs) while maintaining strong performance across various vision-language tasks.
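The core idea can be sketched as a single pruning step applied to the hidden states flowing out of the chosen filtering layer. The function below is a hedged illustration of that step, not FastV's actual implementation: the tensor layout, the position of the visual tokens in the sequence, and the scoring rule (average attention each visual token receives) are assumptions made for the example.

```python
# Hedged sketch of attention-score-based visual-token pruning at one layer.
import torch

def prune_visual_tokens(hidden, attn, visual_start, visual_end, keep_ratio=0.5):
    """hidden: (seq, dim); attn: (heads, seq, seq) from the filtering layer."""
    # Importance = average attention each visual token receives from all queries.
    scores = attn.mean(dim=0).mean(dim=0)[visual_start:visual_end]
    num_keep = max(1, int(keep_ratio * (visual_end - visual_start)))
    keep_local = torch.topk(scores, num_keep).indices + visual_start

    # Keep all non-visual tokens plus the selected visual tokens, preserving order.
    keep = torch.cat([
        torch.arange(0, visual_start),
        keep_local.sort().values,
        torch.arange(visual_end, hidden.size(0)),
    ])
    return hidden[keep], keep

# Toy usage with random tensors.
hidden = torch.randn(80, 64)
attn = torch.rand(8, 80, 80).softmax(dim=-1)
pruned, kept = prune_visual_tokens(hidden, attn, visual_start=5, visual_end=65)
print(pruned.shape)  # the shortened sequence flows into all subsequent layers
```

Because every layer after the filtering point processes a shorter sequence, the savings compound through the remaining depth of the decoder.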
Furthermore, FastV offers a flexible way to customise the trade-off between computational efficiency and performance according to specific user requirements. This versatility makes it well suited to resource-limited environments, reducing the number of image tokens without compromising the model's overall functionality. A rough estimate of the savings for a given configuration is sketched below.
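The sketch below illustrates how the two knobs (the filtering layer and the pruning ratio) translate into relative FLOPs. The per-layer cost model, with its attention and feed-forward terms, and all of the sizes used here are assumptions for illustration only; they are not figures taken from the FastV paper.

```python
# Back-of-the-envelope estimate of relative FLOPs when visual tokens are
# pruned after layer k with a given ratio. The cost model and all sizes are
# illustrative assumptions.
def layer_flops(n, d=4096, m=11008):
    # Rough decoder-layer cost: QKV/output projections, attention matrix, FFN.
    return 4 * n * d**2 + 2 * n**2 * d + 2 * n * d * m

def relative_flops(total_layers, k, n_text, n_visual, prune_ratio):
    full = total_layers * layer_flops(n_text + n_visual)
    n_kept = n_text + int((1 - prune_ratio) * n_visual)
    pruned = k * layer_flops(n_text + n_visual) + (total_layers - k) * layer_flops(n_kept)
    return pruned / full

# Example: drop half of 576 visual tokens after layer 2 of a 32-layer decoder.
print(f"{relative_flops(32, k=2, n_text=64, n_visual=576, prune_ratio=0.5):.2%}")
```

Pruning earlier or more aggressively lowers the estimate further, at the cost of discarding more visual information; the choice of both parameters is left to the user.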
In conclusion, FastV offers an efficient computational solution for LVLMs, particularly in the handling of visual tokens. It successfully decreases computational costs without undermining output quality across a wide range of vision-language tasks. Consequently, FastV signifies an important step towards enhancing the computational efficiency and practical application of LVLMs, demonstrating a promising solution to the challenges posed by resource constraints in real-world utilisation.