Multi-modal language models (LMs) currently struggle with sophisticated visual reasoning tasks. Such tasks demand both low-level analysis of object motion and interactions and higher-order causal and compositional spatiotemporal reasoning. The capabilities of these models therefore warrant closer examination, particularly on tasks that require attention to fine-grained visual detail combined with high-level reasoning.
Prior work on multi-modal LMs includes models such as Pix2seq, ViperGPT, VisProg, Chameleon, PaLM-E, LLaMA-Adapter, FROMAGe, InstructBLIP, Qwen-VL, and Kosmos-2 for image-based tasks, and Video-ChatGPT, VideoChat, Valley, and Flamingo for video-based tasks. Spatiotemporal video grounding, which uses linguistic cues to localize objects in video, is an emerging area of focus. Attention-based models are central to this line of research and employ strategies such as multi-hop feature modulation and cascaded networks to improve visual reasoning.
To ground reasoning in low-level visual skills, Qualcomm AI Research has developed a multi-modal LM trained end-to-end with low-level surrogate tasks such as object detection and tracking. The model uses a two-stream video encoder with spatiotemporal attention to capture static and motion cues, following a “Look, Remember, Reason” approach.
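As a concrete illustration, below is a minimal PyTorch sketch of a two-stream video encoder with divided spatiotemporal attention (spatial attention within each frame, then temporal attention across frames). The module names, dimensions, and the frame-difference motion stream are illustrative assumptions, not the exact LRR architecture.

```python
import torch
import torch.nn as nn


class TwoStreamVideoEncoder(nn.Module):
    """Sketch: spatial + motion streams with divided spatiotemporal attention."""

    def __init__(self, patch=16, dim=256, heads=4, frames=8, img=128):
        super().__init__()
        n_patches = (img // patch) ** 2
        # Spatial stream embeds raw frames (static appearance cues);
        # motion stream embeds frame differences (motion cues).
        self.spatial_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.motion_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_space = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.pos_time = nn.Parameter(torch.zeros(1, frames, dim))
        # Divided attention: over patches within a frame, then over time.
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def _encode(self, frames, embed):
        b, t = frames.shape[:2]
        x = embed(frames.flatten(0, 1))                     # (b*t, dim, H', W')
        x = x.flatten(2).transpose(1, 2) + self.pos_space   # (b*t, P, dim)
        x, _ = self.spatial_attn(x, x, x)                   # spatial attention per frame
        p, d = x.shape[1:]
        x = x.reshape(b, t, p, d).transpose(1, 2).reshape(b * p, t, d) + self.pos_time
        x, _ = self.temporal_attn(x, x, x)                  # temporal attention per patch location
        return x.reshape(b, p, t, d).transpose(1, 2)        # (b, t, P, dim)

    def forward(self, frames):
        # frames: (batch, time, channels, height, width)
        motion = frames - frames.roll(1, dims=1)            # crude frame-difference proxy
        static_tokens = self._encode(frames, self.spatial_embed)
        motion_tokens = self._encode(motion, self.motion_embed)
        return self.fuse(torch.cat([static_tokens, motion_tokens], dim=-1))


# Usage: 8-frame 128x128 RGB clips -> per-patch visual tokens.
tokens = TwoStreamVideoEncoder()(torch.randn(2, 8, 3, 128, 128))
print(tokens.shape)  # torch.Size([2, 8, 64, 256])
```

Divided attention is a common design choice here: attending over patches within a frame and then over time per patch location scales far better than joint attention over all frames and patches at once.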
The research uses the ACRE, CATER, and STAR datasets and introduces surrogate tasks during training, such as object recognition, object re-identification, and identifying the state of the blicket machine in ACRE. Despite its comparatively small parameter count, the model is trained effectively to convergence on OPT-125M and OPT-1.3B backbones using the AdamW optimizer.
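The following sketch outlines how such training could be set up, assuming an OPT-125M backbone from Hugging Face, a linear projection of visual tokens into the LM embedding space, and a shared next-token loss for both surrogate and reasoning tasks. The prompt wording, task-mixing scheme, and hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
lm = OPTForCausalLM.from_pretrained("facebook/opt-125m")
visual_proj = nn.Linear(256, lm.config.hidden_size)  # visual tokens -> LM embedding space
optimizer = torch.optim.AdamW(
    list(lm.parameters()) + list(visual_proj.parameters()),
    lr=1e-4, weight_decay=0.01,
)


def training_step(visual_tokens, prompt, target):
    """One step on either a surrogate task or the reasoning question;
    both reuse the same next-token prediction loss."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, add_special_tokens=False,
                           return_tensors="pt").input_ids
    text_embeds = lm.get_input_embeddings()(torch.cat([prompt_ids, target_ids], dim=1))
    vis_embeds = visual_proj(visual_tokens)               # (1, n_vis, hidden)
    inputs_embeds = torch.cat([vis_embeds, text_embeds], dim=1)

    # Supervise only the answer tokens; visual and prompt positions get -100.
    ignore = torch.full((1, vis_embeds.size(1) + prompt_ids.size(1)), -100)
    labels = torch.cat([ignore, target_ids], dim=1)

    loss = lm(inputs_embeds=inputs_embeds, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Surrogate (e.g. re-identification) and reasoning steps are interleaved.
vis = torch.randn(1, 64, 256)  # placeholder output of the video encoder
training_step(vis, "Re-identify the highlighted object in frame 4:", " object 2")
training_step(vis, "Does the cone contain the ball?", " yes")
```

Because surrogate and reasoning supervision share the same language-modeling loss, interleaving them amounts to alternating which (prompt, target) pairs are fed into the same step.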
The multi-modal LM, referred to as the LRR framework, performed strongly on the STAR challenge as of January 2024. Experiments on datasets such as ACRE, CATER, and Something-Else demonstrate the model’s effectiveness and its adaptability in processing low-level visual cues. LRR’s performance, which surpasses task-specific methods, underscores its potential to advance video reasoning.
In summary, the LRR model follows a three-step process: “Look, Remember, Reason”. It extracts visual information using low-level visual skills and aggregates this evidence to generate a final answer, mining both static and motion cues from video through its two-stream encoder with spatiotemporal attention. Future work may extend the framework to additional datasets, potentially improving its performance and applicability further. The credit for this research goes to the Qualcomm AI Research team.
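For completeness, here is a minimal sketch of how the three steps could fit together at inference time, reusing the hypothetical TwoStreamVideoEncoder and visual_proj from the sketches above. The prompt format and generation settings are assumptions, and generating from inputs_embeds requires a reasonably recent version of transformers.

```python
import torch


@torch.no_grad()
def answer_question(frames, question, encoder, visual_proj, lm, tokenizer):
    # Look: extract static and motion tokens from the video frames.
    visual_tokens = encoder(frames).flatten(1, 2)         # (1, T*P, dim)
    # Remember: keep the grounded visual evidence in the LM context window.
    vis_embeds = visual_proj(visual_tokens)
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    prompt_embeds = lm.get_input_embeddings()(prompt_ids)
    context = torch.cat([vis_embeds, prompt_embeds], dim=1)
    # Reason: autoregressively generate the answer over the combined context.
    out_ids = lm.generate(inputs_embeds=context, max_new_tokens=16)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)


# Example call with an 8-frame 128x128 clip and the components defined above:
# answer_question(torch.randn(1, 8, 3, 128, 128), "Does the cone contain the ball?",
#                 encoder, visual_proj, lm, tokenizer)
```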