Research on enabling large multimodal models (LMMs) to interpret long video sequences faces a challenge rooted in the sheer number of visual tokens that vision encoders generate. These tokens accumulate quickly: the LLaVA-1.6 model, for example, produces between 576 and 2,880 visual tokens for a single image, and the count multiplies as more frames are added, creating a bottleneck for long-video comprehension.
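A quick back-of-the-envelope calculation makes the bottleneck concrete. The sketch below is not from the authors' code; it simply multiplies the per-image token range quoted above by a few frame counts to show how fast the sequence length grows.

```python
# Rough arithmetic (not the paper's code): how visual tokens accumulate when an
# encoder that emits 576-2,880 tokens per image is applied frame by frame.
TOKENS_PER_FRAME_MIN = 576    # lower bound quoted for LLaVA-1.6 on one image
TOKENS_PER_FRAME_MAX = 2880   # upper bound with high-resolution tiling

for num_frames in (8, 64, 256, 1000):
    low = num_frames * TOKENS_PER_FRAME_MIN
    high = num_frames * TOKENS_PER_FRAME_MAX
    print(f"{num_frames:>5} frames -> {low:,} to {high:,} visual tokens")
```

Even a few hundred frames already exceed the context window of a standard language model backbone, which is why naively stacking frames does not scale.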
Earlier approaches tried to address this issue with visual resamplers that reduce the number of tokens and with heuristics for pruning or merging visual features, but these methods do not scale to very large numbers of frames. Notably, the visual resamplers used by the MPLUG-Owl-video model and by MovieChat fail to compress visual features effectively when dealing with extensive video data.
To overcome these challenges, researchers from Singapore's NTU, SUTD, and the LMMs-Lab Team proposed a novel solution called Long Context Transfer. The method extends the context length of the language model backbone so that it can process a far larger number of visual tokens. What makes the approach distinctive is that it relies solely on this extended language context and requires no additional training on long video data.
The proposed Long Video Assistant (LongVA) model extends the language model's context length by training it on longer text data. Aligning the context-extended language model with visual inputs then allows the model to interpret long videos without added architectural complexity. The UniRes encoding scheme, which provides a unified representation of images and videos, supports this process: by treating a video as an extended image, LongVA can process long video sequences effectively.
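The "video as an extended image" idea can be illustrated with a minimal sketch. The callables `encode_frame` and `projector` below are hypothetical placeholders, not the authors' actual API: the point is only that each frame is encoded exactly like an image tile and the resulting tokens are concatenated into one long sequence for the context-extended language model.

```python
# Illustrative sketch of a UniRes-style unified encoding, under the assumption
# that a vision encoder and a projector into the LM embedding space are given.
import torch

def encode_video_as_extended_image(frames, encode_frame, projector):
    # frames: list of frame tensors
    # encode_frame: vision encoder returning (num_patches, hidden) per frame
    # projector: maps vision features into the language model's token space
    per_frame_tokens = [projector(encode_frame(f)) for f in frames]
    # Concatenate frame tokens as if they were tiles of one very large image.
    return torch.cat(per_frame_tokens, dim=0)  # (num_frames * num_patches, hidden)
```

Because the video is handled with the same machinery as a single image, no extra video-specific modules or long-video training data are required; the burden falls entirely on the language model's extended context.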
Testing on the Video-MME dataset showed that LongVA can handle up to 2,000 frames, or over 200,000 visual tokens, setting a new standard. The Visual Needle-In-A-Haystack (V-NIAH) benchmark was developed to measure how well LMMs locate and retrieve visual information over long contexts, and LongVA performed strongly in these evaluations, retrieving visual information accurately from inputs of up to 3,000 frames.
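Conceptually, a V-NIAH sample hides one "needle" frame, whose content the question asks about, inside a long haystack of ordinary frames. The helper below is a minimal sketch of that construction; the function name, the depth parameter, and the choice of needle images are illustrative assumptions, not the benchmark's released code.

```python
# Hedged sketch of assembling a Visual Needle-In-A-Haystack sample, assuming
# haystack_frames (distractor video frames) and a needle_frame are provided.
import random

def build_vniah_sample(haystack_frames, needle_frame, depth=None):
    frames = list(haystack_frames)
    if depth is None:
        depth = random.random()              # relative insertion depth in [0, 1]
    insert_at = int(depth * len(frames))
    frames.insert(insert_at, needle_frame)   # hide the needle in the haystack
    return frames, insert_at                 # the model is then asked about the needle
```

Varying both the total number of frames and the insertion depth makes it possible to chart retrieval accuracy across context lengths, which is how long-context visual retrieval is typically evaluated.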
In further experiments, LongVA outperformed other 7B-scale models at processing and understanding long videos. These results confirm the core idea of long context transfer: extending the language model's context length directly improves the LMM's ability to handle long visual inputs.
Detailed experiments used Qwen2-7B-Instruct as the backbone language model, with continued pretraining performed at a context length of 224K over 900 million tokens. This long-context training was completed in roughly two days on eight A100 GPUs, demonstrating that the method is feasible within an academic budget.
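The kind of configuration change involved in such continued pretraining typically amounts to enlarging the positional-encoding range before further training on long text. The snippet below is a hedged sketch using the Hugging Face Transformers config for Qwen2; the specific RoPE base value is an assumption for illustration, not a figure reported here.

```python
# Sketch of preparing a long-context continued-pretraining config, assuming the
# standard Hugging Face Qwen2 configuration fields. The rope_theta value is an
# illustrative assumption, not the authors' reported setting.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2-7B-Instruct")
config.max_position_embeddings = 224_000   # target context length (~224K tokens)
config.rope_theta = 1_000_000_000          # enlarged RoPE base (assumed value)
# A model loaded with this config would then be further pretrained on ~900M
# tokens of long text data, as described above.
```

The appeal of this recipe is that the expensive step happens purely in the text domain, after which the usual image-alignment stage is enough to transfer the longer context to video inputs.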
In summary, this research tackles the significant problem of processing and understanding long video sequences with large multimodal models. By extending the language model's context length and aligning it with visual inputs, the proposed LongVA model delivers a substantial improvement in long-video processing and sets a new standard in the field. The work underlines the potential of long context transfer to enhance LMMs' capabilities in long video understanding.