Recent advances in Large Language Models (LLMs), paired with visual encoders for multimodal tasks, have produced impressive results on question answering, captioning, and segmentation. Despite this progress, these models struggle with video inputs because of limited context length and GPU memory constraints. Existing models such as LLaMA, LLaVA, and BLIP-2 can handle only a limited number of tokens per image, which makes them impractical for extended videos like movies or TV shows.
Simple workarounds, such as the average pooling across the temporal axis used in VideoChatGPT, perform poorly because they lack explicit temporal modeling. Adding a dedicated video-modeling component, as in Video-LLaMA, captures temporal dynamics and yields a better video-level representation, but it also increases model complexity and is not suitable for real-time video analysis.
Hence, researchers from the University of Maryland, Meta, and the University of Central Florida have proposed a new model called the Memory-Augmented Large Multimodal Model (MA-LMM) to efficiently model long-term video inputs. MA-LMM not only reduces GPU memory usage for extended videos but also addresses the context-length limits of LLMs. In contrast to prior approaches, which consume substantial GPU memory and require a large number of input text tokens, MA-LMM processes video frames online, one at a time, and stores the extracted features in a long-term memory bank.
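To make the online processing idea concrete, here is a minimal sketch in PyTorch. The function name, arguments, and memory-bank representation are hypothetical and not taken from the paper's code; the sketch only illustrates the general pattern of encoding frames sequentially and keeping compact features instead of raw frames.

```python
import torch

def process_video_online(frames, visual_encoder, memory_bank):
    """Hypothetical sketch of sequential (online) frame processing.

    frames: iterable of (C, H, W) frame tensors
    visual_encoder: a frozen image encoder returning (num_tokens, dim) features
    memory_bank: list of per-frame feature tensors accumulated so far
    """
    for frame in frames:
        with torch.no_grad():
            # Encode only the current frame; earlier frames are already
            # summarized in the memory bank.
            features = visual_encoder(frame.unsqueeze(0)).squeeze(0)
        # Keep the compact features and discard the raw frame, so GPU memory
        # stays roughly constant regardless of video length.
        memory_bank.append(features)
    return memory_bank
```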
MA-LMM has three key components: visual feature extraction with a visual encoder, long-term temporal modeling with a trainable Querying Transformer (Q-Former) that aligns visual and text embeddings, and text decoding with a large language model. Frames are processed sequentially, and each new input is associated with the historical data stored in a long-term memory bank. The Q-Former integrates the visual and textual information, while a compression technique keeps the memory bank small without discarding relevant features. The Q-Former output is then passed to the LLM for text decoding, which sidesteps the context-length limit and reduces GPU memory requirements during training.
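The sketch below shows one plausible reading of the compression step, assuming the bank has a fixed capacity and that the most redundant temporally adjacent entries are merged; the function name, the `max_size` parameter, and the averaging rule are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def compress_memory_bank(memory_bank, max_size):
    """Hypothetical memory-bank compression sketch.

    memory_bank: list of (num_tokens, dim) feature tensors, one per frame
    max_size: maximum number of entries to retain
    """
    while len(memory_bank) > max_size:
        # Cosine similarity between each pair of temporally adjacent entries,
        # averaged over their tokens.
        sims = torch.stack([
            F.cosine_similarity(memory_bank[i], memory_bank[i + 1], dim=-1).mean()
            for i in range(len(memory_bank) - 1)
        ])
        # Merge the most similar (most redundant) adjacent pair by averaging,
        # shrinking the bank by one entry while preserving temporal order.
        i = int(sims.argmax())
        merged = (memory_bank[i] + memory_bank[i + 1]) / 2
        memory_bank = memory_bank[:i] + [merged] + memory_bank[i + 2:]
    return memory_bank
```

Because each compression step removes only one entry and keeps the rest in order, the bank retains a coarse summary of the whole video rather than just the most recent frames.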
Compared with previous state-of-the-art methods, MA-LMM demonstrates superior performance across a range of tasks, outperforming other models in long-term video understanding, video question answering, video captioning, and online action prediction. Thanks to its long-term memory bank and sequential processing design, the model handles extended video sequences efficiently and delivers remarkable results.
In conclusion, the research introduces a long-term memory bank into existing large multimodal models, yielding MA-LMM, a model designed for effective modeling of extended videos. The approach circumvents the context-length limitations and GPU memory constraints of LLMs by processing video frames sequentially and storing historical data in the memory bank. Experiments show that the long-term memory bank can be readily integrated into existing models and achieves superior results across various tasks.