Large Vision-Language Models (LVLMs) have shown strong performance on tasks that require joint understanding of text and images, and progress in image-text reasoning has been especially visible in region-level tasks such as Referring Expression Comprehension (REC). Notably, models like Griffon have delivered competitive results on tasks such as object detection, signaling significant advances in LVLM perception.
Despite these advances, LVLMs still struggle to match specialized models in complex scenarios, largely because of image-resolution constraints. This limitation hampers their ability to refer to objects effectively through both textual and visual cues, a shortcoming that is especially pronounced in domains such as GUI agents and object counting.
To address these limitations, researchers have developed Griffon v2, a unified high-resolution model designed to support flexible object referring through textual and visual cues. At its core is a lightweight downsampling projector that compresses high-resolution visual features, sidestepping the input-token limits that constrain LVLMs and substantially improving the model's multimodal perception abilities.
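To make the idea concrete, below is a minimal PyTorch sketch of what a downsampling projector of this kind could look like. The module name, dimensions, and stride are illustrative assumptions rather than the authors' exact design: a strided convolution shrinks the grid of high-resolution visual tokens before a linear layer maps them into the language model's embedding space.

```python
import torch
import torch.nn as nn

class DownsamplingProjector(nn.Module):
    """Illustrative sketch: compress high-resolution visual tokens with a
    strided convolution before projecting them into the LLM embedding space.
    Dimensions and stride are assumptions, not the paper's exact settings."""

    def __init__(self, vis_dim=1024, llm_dim=4096, stride=2):
        super().__init__()
        # The strided conv reduces the token grid (e.g., 64x64 -> 32x32),
        # cutting the number of visual tokens passed to the language model by stride^2.
        self.down = nn.Conv2d(vis_dim, vis_dim, kernel_size=stride, stride=stride)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, vis_dim) from a high-resolution vision encoder
        b, n, c = patch_tokens.shape
        h = w = int(n ** 0.5)                    # assume a square patch grid
        x = patch_tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.down(x)                          # spatial downsampling
        x = x.flatten(2).transpose(1, 2)          # back to (batch, tokens, vis_dim)
        return self.proj(x)                       # (batch, tokens / stride^2, llm_dim)
```

The key design point this sketch tries to capture is that resolution is raised at the vision encoder while the token count handed to the language model stays within budget.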
One of Griffon v2's key strengths is its capacity to preserve fine details and complete context, even for small objects that lower-resolution models miss. This ability is reinforced by the integration of a visual tokenizer, which enables Griffon v2 to work with a variety of referring inputs, such as coordinates and flexible target images.
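A hedged sketch of how such a visual co-referring tokenizer might be wired is shown below; the encoder interface, pooling step, and dimensions are assumptions for illustration, not the published architecture. The idea is to turn a user-supplied target image into one or more tokens that can sit in the prompt next to the text query, while coordinates can simply be written into the text itself.

```python
import torch
import torch.nn as nn

class VisualReferTokenizer(nn.Module):
    """Illustrative sketch of visual co-referring: encode a user-supplied
    target image (e.g., a crop of the object of interest) into prompt tokens
    that accompany the text query. The encoder interface and pooling are
    assumptions made for this example."""

    def __init__(self, region_encoder, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.region_encoder = region_encoder      # any image encoder returning (B, N, vis_dim)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, target_image):
        feats = self.region_encoder(target_image)  # (B, N, vis_dim)
        pooled = feats.mean(dim=1, keepdim=True)   # pool region features into one referent token
        return self.proj(pooled)                   # (B, 1, llm_dim) token injected into the prompt

# Example with a dummy encoder standing in for a real vision backbone:
# tokenizer = VisualReferTokenizer(lambda img: torch.randn(img.shape[0], 49, 1024))
# refer_tokens = tokenizer(torch.randn(1, 3, 224, 224))  # -> (1, 1, 4096)
```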
Experiments show Griffon v2 to be effective across multiple tasks, including Referring Expression Generation (REG), phrase grounding, and Referring Expression Comprehension (REC). The model also outperforms expert models in object detection and object counting.
The research team highlights two primary contributions. First, the high-resolution multimodal perception model improves local understanding by eliminating the need to split images. Second, the visual-language co-referring structure combines visual and language inputs, allowing more adaptive and natural communication between users and the model.
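As a rough illustration of how co-referring could feel in practice, the hypothetical snippet below poses a counting query that mixes a text prompt with a visual reference. The `model.generate` interface and the `<region>` placeholder are invented for this example and are not the project's actual API.

```python
# Hypothetical usage sketch: the interface below is assumed, not Griffon v2's real API.
query_text = "Count every instance of the object shown in <region> within the image."
response = model.generate(
    image=full_image,              # high-resolution scene image
    visual_reference=target_crop,  # image of the object the user is referring to
    prompt=query_text,
)
print(response)  # e.g., a count or grounded boxes, depending on the task
```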
Extensive experiments confirm the model's effectiveness across a range of localization tasks, showing superior performance relative to expert models on tasks such as phrase grounding, REG, and REC.
More details about Griffon v2 and the research behind it are available in the team's published paper and GitHub repository. The team encourages interested readers to follow its research progress through its newsletter and social media channels.