
‘Vary-toy’: A Compact Large Vision Language Model with an Enhanced Vision Vocabulary for Standard GPUs, Introduced in a New Chinese AI Paper

Over the last year, large vision language models (LVLMs) have gained significant attention in artificial intelligence research. These models demonstrate remarkable results in various tasks, yet there are still substantial opportunities for improving their visual perception abilities. Progress in this direction faces two primary hurdles: deficiencies in existing vision vocabulary networks and considerable computational costs during optimization.

LVLMs, which typically rely on a CLIP-style network as their vision vocabulary, have shown remarkable results in tasks at the intersection of Computer Vision (CV) and Natural Language Processing (NLP), such as image captioning, Visual Question Answering (VQA), and meme understanding. However, a model’s capabilities can be limited by how efficiently its vision vocabulary network encodes visual signals. To address this, the Vary method proposed enlarging the vision vocabulary of LVLMs by training a new visual vocabulary network with a small auto-regressive model and integrating it with the existing vocabulary.
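To make the idea concrete, the following is a minimal, hypothetical PyTorch sketch of this vocabulary-expansion stage: a fresh vision encoder is paired with a small auto-regressive decoder so that the encoder’s output tokens learn to carry dense visual information. All module names, dimensions, and the training objective shown here are illustrative assumptions, not the authors’ released implementation.

```python
import torch
import torch.nn as nn

class NewVisionVocabulary(nn.Module):
    """Hypothetical new vision vocabulary network (placeholder backbone)."""
    def __init__(self, embed_dim=768):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=16, stride=16),  # patchify the image
            nn.Flatten(2),                                       # (B, C, N)
        )
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, images):
        feats = self.backbone(images).transpose(1, 2)  # (B, N, C)
        return self.proj(feats)                        # visual tokens

class TinyAutoregressiveDecoder(nn.Module):
    """Small decoder used only to supervise the new vocabulary."""
    def __init__(self, vocab_size=32000, embed_dim=768, num_layers=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerDecoderLayer(embed_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, text_ids, visual_tokens):
        tgt = self.token_emb(text_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(text_ids.size(1))
        hidden = self.decoder(tgt, memory=visual_tokens, tgt_mask=causal)
        return self.lm_head(hidden)  # next-token logits

# Training sketch: the decoder predicts text from the visual tokens,
# forcing the new encoder to capture that information.
encoder, decoder = NewVisionVocabulary(), TinyAutoregressiveDecoder()
images = torch.randn(2, 3, 224, 224)
text_ids = torch.randint(0, 32000, (2, 32))
logits = decoder(text_ids[:, :-1], encoder(images))
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), text_ids[:, 1:].reshape(-1)
)
loss.backward()
```

In the Vary-style setup the small decoder serves only to supervise the new vocabulary; once training is done, the encoder is kept and merged with the existing vocabulary.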

Despite its efficacy, the Vary approach has limitations, such as wasted capacity in the vocabulary network and the high cost of iterating on a large language model. To address these issues, researchers at MEGVII Technology introduced Vary-toy, a smaller model that refines how the vision vocabulary is created. Vary-toy incorporates object detection tasks when training the vocabulary network, merging dense textual data with natural object location data and thereby making the vocabulary more universal. The strengthened vocabulary is then combined with CLIP and integrated into a 1.8B-parameter language model (see the sketch below).
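As a rough illustration of this second stage, the sketch below shows one plausible way the strengthened vocabulary could be combined with CLIP features before being handed to a compact language model. The concatenation scheme, projection layers, and dimensions are assumptions made for clarity rather than the released Vary-toy code.

```python
import torch
import torch.nn as nn

class FusedVisionInput(nn.Module):
    """Projects CLIP tokens and new-vocabulary tokens into the LM embedding space."""
    def __init__(self, clip_dim=1024, new_vocab_dim=768, lm_dim=2048):
        super().__init__()
        self.clip_proj = nn.Linear(clip_dim, lm_dim)
        self.new_proj = nn.Linear(new_vocab_dim, lm_dim)

    def forward(self, clip_tokens, new_vocab_tokens):
        # Concatenate both token streams along the sequence dimension;
        # the result is prepended to the text embeddings fed to the LM.
        return torch.cat(
            [self.clip_proj(clip_tokens), self.new_proj(new_vocab_tokens)], dim=1
        )

fuser = FusedVisionInput()
clip_tokens = torch.randn(1, 256, 1024)       # e.g. CLIP patch features
new_vocab_tokens = torch.randn(1, 256, 768)   # features from the new vocabulary
visual_prefix = fuser(clip_tokens, new_vocab_tokens)
print(visual_prefix.shape)  # torch.Size([1, 512, 2048])
```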

Vary-toy has displayed impressive performance on challenging benchmarks such as DocVQA, ChartQA, MMVet, and RefCOCO, and its compact size makes it a practical baseline for researchers with limited resources. The code for Vary-toy is set to be released publicly soon for further exploration and use by the research community.

All credit for this research goes to the researchers at MEGVII Technology; further details can be found in their paper.
