Pegasus-1 is a state-of-the-art multimodal Large Language Model (LLM) developed by Twelve Labs to interact with and comprehend video content through natural language. The model is designed to address the inherent complexities of video data: it must reason over multiple modalities within a single format and follow the sequence and timeline of visual information. Pegasus-1 handles a wide range of video genres and lengths, enabling a more comprehensive understanding of this type of data.
The model consists of three main elements: the Video Encoder Model, the Video-language Alignment Model, and the Large Language Model (or Decoder Model). The architecture of Pegasus-1 can handle extended video lengths and integrates both visual and audio information.
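To make the three-stage design concrete, the sketch below shows how such a pipeline might be wired together. All class names, dimensions, and interfaces are illustrative assumptions, not Twelve Labs' actual implementation of Pegasus-1.

```python
# Illustrative sketch of an encoder -> alignment -> decoder pipeline.
# Names, dimensions, and interfaces are assumptions, not the real Pegasus-1 code.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Encodes per-frame (and, in principle, audio) features into video embeddings."""
    def __init__(self, dim=1024):
        super().__init__()
        self.proj = nn.Linear(768, dim)  # assumes 768-d per-frame input features

    def forward(self, frame_features):            # (batch, frames, 768)
        return self.proj(frame_features)           # (batch, frames, dim)

class VideoLanguageAlignment(nn.Module):
    """Maps video embeddings into the language model's embedding space."""
    def __init__(self, video_dim=1024, text_dim=4096):
        super().__init__()
        self.adapter = nn.Linear(video_dim, text_dim)

    def forward(self, video_embeddings):
        return self.adapter(video_embeddings)      # (batch, frames, text_dim)

class VideoLLMPipelineSketch(nn.Module):
    """Chains encoder -> alignment -> decoder (any causal LM accepting inputs_embeds)."""
    def __init__(self, decoder):
        super().__init__()
        self.encoder = VideoEncoder()
        self.alignment = VideoLanguageAlignment()
        self.decoder = decoder

    def forward(self, frame_features, text_embeds):
        video_tokens = self.alignment(self.encoder(frame_features))
        # Prepend aligned video tokens to the text prompt embeddings.
        inputs = torch.cat([video_tokens, text_embeds], dim=1)
        return self.decoder(inputs_embeds=inputs)
```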
The model’s performance was evaluated on video-language benchmarks covering zero-shot video question answering, video conversation, and video summarization. Open-source models such as VideoChat, Video-ChatGPT, Video-LLaMA, BT-Adapter, LLaMA-VID, and VideoChat2 served as points of comparison, and Google’s proprietary Gemini models were also used as reference points.
Pegasus-1 performed strongly on the video conversation benchmark, scoring 4.29 in Context and 3.79 in Correctness. The model showed strengths in Correctness, Detail, Contextual Awareness, Temporal Comprehension, and Consistency, all key qualities for video chat.
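For context on how per-axis numbers like these are typically produced, the following sketch shows one common aggregation scheme: a judge model rates each answer on each evaluation axis (often on a 0-5 scale) and the ratings are averaged per axis. The ratings below are invented for illustration and are not drawn from the Pegasus-1 evaluation.

```python
# Sketch: averaging per-axis judge ratings into benchmark scores.
# The ratings are invented placeholders, not real evaluation data.
from statistics import mean
from collections import defaultdict

# Each record holds a judge's ratings for one model answer, keyed by axis.
ratings = [
    {"Correctness": 4, "Detail": 4, "Context": 5, "Temporal": 3, "Consistency": 4},
    {"Correctness": 3, "Detail": 4, "Context": 4, "Temporal": 4, "Consistency": 5},
    {"Correctness": 4, "Detail": 3, "Context": 4, "Temporal": 4, "Consistency": 4},
]

per_axis = defaultdict(list)
for record in ratings:
    for axis, score in record.items():
        per_axis[axis].append(score)

benchmark_scores = {axis: round(mean(scores), 2) for axis, scores in per_axis.items()}
print(benchmark_scores)
# {'Correctness': 3.67, 'Detail': 3.67, 'Context': 4.33, 'Temporal': 3.67, 'Consistency': 4.33}
```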
The model demonstrated notable zero-shot capabilities, outperforming open-source models and the Gemini series on zero-shot video question answering over the ActivityNet-QA and NExT-QA datasets. Compared with baseline models on the ActivityNet detailed caption dataset, Pegasus-1 scored higher on Correctness of Information, Detailed Orientation, and Contextual Understanding, underscoring its strength in video summarization.
For temporal comprehension, the model outperformed open-source baselines, notably VideoChat2, on the TempCompass benchmark. This evaluation also included tests with altered video speeds, such as reversed and slowed-down clips, to assess the model’s grasp of different temporal dynamics.
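As an illustration of what such altered-speed probes can look like, here is a minimal sketch that builds reversed and slowed copies of a clip from its frame list. The function names and frame representation are assumptions for illustration; this is not the TempCompass evaluation code.

```python
# Sketch: constructing reversed and slowed probe clips from a list of frames.
# Frame representation and function names are illustrative assumptions.
from typing import List, TypeVar

Frame = TypeVar("Frame")  # any per-frame representation, e.g. an image array

def reverse_clip(frames: List[Frame]) -> List[Frame]:
    """Play the clip backwards, inverting the order of events."""
    return frames[::-1]

def slow_down_clip(frames: List[Frame], factor: int = 2) -> List[Frame]:
    """Slow the clip by repeating each frame `factor` times."""
    return [frame for frame in frames for _ in range(factor)]

# A model with genuine temporal understanding should describe the original and
# reversed clips differently, and should not mistake a slowed clip for a new event.
original = list(range(8))                       # placeholder frames 0..7
print(reverse_clip(original))                   # [7, 6, 5, 4, 3, 2, 1, 0]
print(len(slow_down_clip(original, factor=2)))  # 16
```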
The Pegasus-1 report provides a detailed analysis of the model’s strengths, limitations, and potential areas of improvement, with a view toward the continued enhancement of its capabilities.