Researchers from Fudan University and Microsoft have developed a novel architecture for large multimodal models (LMMs) called “DeepStack.” The DeepStack model takes a different approach to processing visual data, improving both computational efficiency and performance.
Traditional LMMs typically integrate visual and textual data by converting images into visual tokens, which are then processed as a linear sequence by a transformer language model backbone such as BERT, T5, or GPT, the architectures that have reshaped natural language processing (NLP). While this method has proven effective for multimodal understanding, it requires significant computational resources, particularly for high-resolution imagery or video, because higher resolution means more visual tokens and therefore a longer input sequence for the language model to attend over.
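To make the conventional setup concrete, the sketch below shows the usual pipeline in which all visual tokens are simply prepended to the text tokens before the first transformer layer. This is a minimal illustration, not any model’s actual code: the module names (`vision_encoder`, `projector`, `language_model`) and all shapes are illustrative assumptions.

```python
import torch

def conventional_forward(image, text_embeds, vision_encoder, projector, language_model):
    """Sketch of the conventional LMM pipeline (hypothetical interfaces)."""
    # image: (B, 3, H, W); text_embeds: (B, T_text, D) text token embeddings.
    visual_feats = vision_encoder(image)       # (B, N_vis, D_vis) patch features
    visual_tokens = projector(visual_feats)    # (B, N_vis, D) projected to LLM width
    # All visual tokens are prepended to the text tokens as one flat sequence,
    # so the sequence length (and attention cost) grows with N_vis,
    # which explodes for high-resolution images or video.
    inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # (B, N_vis + T_text, D)
    return language_model(inputs)

# Toy usage with stand-in modules (shapes only; real models differ).
B, D = 2, 512
fake_encoder = lambda img: torch.randn(B, 576, 1024)   # e.g. a 24x24 patch grid
fake_projector = torch.nn.Linear(1024, D)
fake_llm = lambda x: x                                   # stand-in for the LLM
out = conventional_forward(torch.randn(B, 3, 336, 336),
                           torch.randn(B, 32, D),
                           fake_encoder, fake_projector, fake_llm)
```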
DeepStack was developed to overcome this challenge. Instead of feeding all visual tokens into the first layer of the language model as one long sequence, DeepStack distributes them across multiple layers, injecting groups of tokens from the bottom layers upward. This bottom-to-top approach enhances the model’s ability to handle complex visual inputs while, importantly, keeping the input sequence length, and therefore the computational cost, essentially unchanged.
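A minimal sketch of this idea follows, using a toy decoder built from standard transformer blocks. Injecting each group of extra visual tokens by residual addition at the visual positions of the hidden states is a simplification chosen for illustration; the class name, layer count, and shapes are assumptions, not the paper’s implementation.

```python
import torch
import torch.nn as nn

class DeepStackStyleDecoder(nn.Module):
    """Toy decoder illustrating layer-wise injection of visual tokens.

    Assumptions: generic transformer blocks stand in for the LLM's decoder
    layers, and visual_token_groups[i] holds the extra visual tokens destined
    for layer i. Later layers receive no extra tokens.
    """

    def __init__(self, num_layers=8, d_model=512, n_heads=8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, hidden, visual_token_groups, num_visual_positions):
        # hidden: (B, L, D) -- global visual tokens followed by text tokens.
        # visual_token_groups: list of (B, num_visual_positions, D) tensors.
        for layer_idx, layer in enumerate(self.layers):
            if layer_idx < len(visual_token_groups):
                extra = visual_token_groups[layer_idx]
                # Add the extra tokens onto the hidden states at the visual
                # positions; sequence length and attention cost stay unchanged.
                prefix = hidden[:, :num_visual_positions] + extra
                hidden = torch.cat([prefix, hidden[:, num_visual_positions:]], dim=1)
            hidden = layer(hidden)
        return hidden

# Toy usage: 576 global visual tokens + 32 text tokens, with four groups of
# high-resolution tokens injected into the first four layers.
model = DeepStackStyleDecoder()
hidden = torch.randn(2, 576 + 32, 512)
groups = [torch.randn(2, 576, 512) for _ in range(4)]
out = model(hidden, groups, num_visual_positions=576)
```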
DeepStack was rigorously tested with the LLaVA-1.5 and LLaVA-Next models and demonstrated notable performance improvements across multiple benchmarks. The architecture showed particular strength on high-resolution tasks and handled larger numbers of visual tokens more efficiently than traditional methods.
To achieve this, DeepStack leverages a dual-stream approach. It splits image processing into a global view stream that captures high-level information and a high-resolution stream that preserves detailed image features. The high-resolution tokens are upsampled and sampled with a dilated pattern before being fed into different layers of the language model. Whereas traditional approaches concatenate all visual tokens into a single input sequence, DeepStack integrates them across layers, which allows complex visual inputs to be managed more effectively.
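The dilated-sampling step can be sketched as follows, under the assumption that the high-resolution stream produces a large token grid that is carved into spatially interleaved groups, one per injected layer. The grid size, the stride-2 pattern, and the function name are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch

def split_highres_tokens(highres_tokens, grid_h, grid_w, stride=2):
    """Split a high-resolution token grid into dilated (strided) groups.

    highres_tokens: (B, grid_h * grid_w, D) tokens from the high-res stream.
    Returns stride*stride groups, each of shape
    (B, (grid_h // stride) * (grid_w // stride), D); every group covers the
    whole image via a spatially dilated sampling pattern.
    """
    B, N, D = highres_tokens.shape
    grid = highres_tokens.view(B, grid_h, grid_w, D)
    groups = []
    for dy in range(stride):
        for dx in range(stride):
            sub = grid[:, dy::stride, dx::stride, :]   # dilated sub-grid
            groups.append(sub.reshape(B, -1, D))
    return groups  # groups[i] is injected into the i-th early layer

# Example (illustrative sizes): a 48x48 grid of high-res tokens yields four
# 24x24 groups, each with the same token count as a 24x24 global-view grid.
tokens = torch.randn(2, 48 * 48, 512)
groups = split_highres_tokens(tokens, grid_h=48, grid_w=48, stride=2)
assert len(groups) == 4 and groups[0].shape == (2, 24 * 24, 512)
```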
Across experiments, DeepStack proved effective at boosting multimodal language models’ performance by integrating high-resolution visual tokens. It consistently outperformed baseline models such as LLaVA on VQA and other multimodal benchmarks. Its capacity to handle complex visual information highlights its potential to significantly enhance model efficiency and performance without additional computational cost.
In summary, the DeepStack model presents a novel way of enhancing LMMs that reduces computational and memory demands while significantly improving performance on high-resolution tasks. By distributing visual tokens across different layers of the transformer, DeepStack allows more effective interactions between these tokens, leading to substantial performance gains. The technique is particularly useful for tasks requiring detailed visual comprehension, marking a significant step toward more efficient and powerful multimodal models.