Artificial general intelligence has advanced significantly, thanks in part to the capabilities of Large Language Models (LLMs) such as GPT, PaLM, and LLaMA. These models have demonstrated impressive knowledge and natural-language generation, pointing towards the direction of future AI. However, while LLMs excel at processing text, video, with its complex temporal information, remains a challenge for them.
Existing methods for facilitating video understanding in LLMs have significant limitations. Some rely on average pooling of video frames, which fails to capture dynamic temporal sequences. Others add extra structures for temporal sampling and modelling, which demand significant computational resources and multi-stage pretraining.
To address this issue, researchers from Peking University and Tencent have proposed a novel approach known as ST-LLM. The fundamental concept is to use the robust sequence modelling capabilities of LLMs to process the raw spatial-temporal video tokens directly.
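To make the core idea concrete, the sketch below shows one plausible way raw spatial-temporal video tokens could be fed straight into an LLM. The module names (`vision_encoder`, `projector`, `llm`) and the assumption that the LLM accepts pre-computed embeddings via an `inputs_embeds` argument are illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class NaiveSTInput(nn.Module):
    """Minimal sketch: flatten per-frame patch tokens into one long
    spatial-temporal sequence and hand it to the LLM directly."""

    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a frozen image encoder (assumed)
        self.projector = projector            # maps visual features to the LLM embedding dim
        self.llm = llm                        # language-model backbone (assumed HF-style API)

    def forward(self, video: torch.Tensor, text_embeds: torch.Tensor):
        # video: (batch, num_frames, channels, height, width)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)                 # (b*t, C, H, W)
        patch_tokens = self.vision_encoder(frames)   # (b*t, patches, dim)
        patch_tokens = self.projector(patch_tokens)  # align with LLM embedding space
        # Restore the temporal axis and flatten all frames into one token sequence.
        video_tokens = patch_tokens.view(b, t * patch_tokens.shape[1], -1)
        # Concatenate video tokens with text embeddings so the LLM models the
        # joint spatial-temporal sequence with its own attention layers.
        inputs = torch.cat([video_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```

The point of this design is that no separate temporal module is introduced; the LLM's own sequence modelling does the temporal reasoning.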
ST-LLM involves feeding all video frames into the LLM, enabling effective spatial-temporal sequence modelling. To handle the increased context length of long videos, the team introduced a dynamic video token masking strategy and masked video modelling during training. This not only reduces sequence length but also increases the model's robustness, allowing it to handle varying video lengths during inference.
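The following is a minimal sketch of what dynamic token masking could look like in practice: a varying fraction of video tokens is dropped each training step, so the LLM sees different effective sequence lengths. The function name and the uniform sampling of the ratio are assumptions for illustration; the paper's exact strategy (and the masked video modelling objective that predicts the dropped tokens) is not reproduced here.

```python
import torch


def dynamic_token_masking(video_tokens: torch.Tensor, mask_ratio: float):
    """Randomly keep a subset of video tokens (illustrative sketch).

    video_tokens: (batch, num_tokens, dim)
    mask_ratio:   fraction of tokens to drop; varying it across training steps
                  exposes the model to different sequence lengths.
    """
    b, n, d = video_tokens.shape
    num_keep = max(1, int(n * (1.0 - mask_ratio)))
    # Draw a random permutation per sample and keep the first num_keep indices.
    noise = torch.rand(b, n, device=video_tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]                  # (b, num_keep)
    kept = torch.gather(
        video_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d)  # (b, num_keep, d)
    )
    return kept, keep_idx


# Usage: sample a different ratio each step (range here is a hypothetical choice).
tokens = torch.randn(2, 1024, 4096)
ratio = float(torch.empty(1).uniform_(0.3, 0.7))
kept, idx = dynamic_token_masking(tokens, mask_ratio=ratio)
```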
For exceptionally long videos, ST-LLM uses a global-local input mechanism. This combines a global representation, obtained by average pooling over many frames, with a local representation drawn from a smaller subset of frames. This asymmetric design enables the processing of many video frames while preserving token-level modelling within the LLM.
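Below is one way such an asymmetric global-local input could be assembled; the uniform frame sampling for the local branch and the simple concatenation are assumptions made for the sake of a self-contained example.

```python
import torch


def global_local_input(frame_tokens: torch.Tensor, num_local: int) -> torch.Tensor:
    """Illustrative global-local input for long videos.

    frame_tokens: (batch, num_frames, tokens_per_frame, dim)
    num_local:    number of frames kept at full token resolution.
    """
    b, t, p, d = frame_tokens.shape
    # Global branch: average-pool across all frames into one coarse summary.
    global_tokens = frame_tokens.mean(dim=1)                       # (b, p, d)
    # Local branch: uniformly sample a few frames and keep all of their tokens.
    idx = torch.linspace(0, t - 1, num_local).long()
    local_tokens = frame_tokens[:, idx].reshape(b, num_local * p, d)
    # Cheap global context plus detailed local tokens in a single sequence.
    return torch.cat([global_tokens, local_tokens], dim=1)
```

The asymmetry is the point: the global branch covers the whole video at low cost, while the local branch preserves the fine-grained tokens that the LLM models directly.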
Experiments on video benchmarks such as MVBench, VideoChatGPT-Bench, and zero-shot video QA demonstrate the effectiveness of ST-LLM. The model shows superior temporal understanding relative to other video LLMs and a remarkable ability to capture complex movement and scene transitions. Notably, ST-LLM scores highly on metrics related to time-sensitive motion.
Although ST-LLM may struggle with fine-grained tasks such as pose estimation, its ability to exploit the LLM's sequence modelling capabilities without extra modules or expensive pretraining is a significant advantage. The research team has successfully applied LLMs to video understanding, opening new possibilities in this field.
Find out more in the ST-LLM research paper and on the project's GitHub page.