Multimodal large language models like GPT-4V, while powerful, often struggle with basic visual perception tasks such as counting objects in an image. One culprit is the way these models ingest high-resolution images: most current AI systems perceive images only at a fixed low resolution, so resizing or cropping the input introduces distortion, blurriness, and loss of fine detail.
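To make the problem concrete, here is a minimal Python sketch of what a fixed-resolution pipeline does to a wide image. It uses Pillow, and the 1088×672 input and 336×336 target are illustrative numbers rather than any specific model's configuration:

```python
from PIL import Image

# A wide 1088x672 image squeezed into a fixed 336x336 encoder input:
# the horizontal axis shrinks ~3.2x while the vertical shrinks only 2x,
# so shapes are warped and small details (text, distant objects) blur away.
img = Image.new("RGB", (1088, 672))
fixed = img.resize((336, 336))      # naive fixed-resolution resize
print(img.size, "->", fixed.size)   # (1088, 672) -> (336, 336)
```

Cropping instead of resizing avoids the warping but discards everything outside the crop window; either way, detail is lost.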
A team of researchers from Tsinghua University, the National University of Singapore, and the University of Chinese Academy of Sciences set out to address this issue. They developed LLaVA-UHD, a method for building large multimodal models that can handle high-resolution images without losing fine-grained visual information.
At the heart of LLaVA-UHD is an image modularization strategy that divides large images into smaller, variable-sized slices whose shapes stay close to the vision encoder's pretraining configuration. Each slice is resized proportionally to fit the encoder, preserving the original image's aspect ratio. A shared 'compression layer' then condenses the visual tokens of each slice, lessening the computational load on the language model. To give the language model spatial context for the parsed slices, LLaVA-UHD employs a simple positional encoding scheme that marks where each slice sits in the full image. The method can parse high-resolution images up to 672×1088 pixels using only 94% of the computation that prior models spend on fixed low-resolution inputs.
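As a rough illustration of the slicing idea, the sketch below picks a grid of variable-sized slices whose aspect ratios stay close to square, on the assumption of a square 336×336 encoder input. The scoring rule, the 336-pixel slice size, and the six-slice cap are all assumptions for illustration, not the paper's exact algorithm:

```python
import math

def choose_grid(width, height, slice_size=336, max_slices=6):
    """Pick a rows x cols grid whose slices stay near the encoder's
    (assumed square) pretraining resolution and aspect ratio."""
    # Roughly how many encoder-sized tiles the image contains.
    ideal = (width * height) / (slice_size * slice_size)
    n = max(1, min(max_slices, round(ideal)))
    best, best_score = (1, n), float("inf")
    for rows in range(1, n + 1):
        if n % rows:
            continue
        cols = n // rows
        # Penalize grids whose slice aspect ratio drifts far from square.
        slice_ar = (width / cols) / (height / rows)
        score = abs(math.log(slice_ar))
        if score < best_score:
            best, best_score = (rows, cols), score
    return best

def slice_boxes(width, height, slice_size=336):
    """Return (left, top, right, bottom) pixel boxes for each slice."""
    rows, cols = choose_grid(width, height, slice_size)
    return [
        (c * width // cols, r * height // rows,
         (c + 1) * width // cols, (r + 1) * height // rows)
        for r in range(rows) for c in range(cols)
    ]

# A landscape 1088x672 image splits into a 2x3 grid of ~363x336 slices,
# each close to the encoder's square input shape.
print(slice_boxes(1088, 672))
```

In the full system, each slice would then be resized to the encoder's input size, encoded, condensed into fewer tokens by the shared compression layer, and tagged with its grid position before being passed to the language model.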
The developers tested LLaVA-UHD on a series of challenging multimodal benchmarks covering visual question answering, optical character recognition, and other tasks. Across all tests, LLaVA-UHD outperformed standard models and other specialized high-resolution systems while consuming less computing power during training. On the TextVQA benchmark, which tests OCR capabilities, it achieved a 6.4-point accuracy improvement over the previous best result.
These performance gains stem from LLaVA-UHD's ability to preserve fine visual detail in high-resolution images. Models that work with low-resolution, blurred inputs can only make educated guesses; LLaVA-UHD works from a more complete picture.
Despite these improvements, the work is not over. The researchers have their sights set on even higher resolutions and more complex tasks like object detection. However, LLaVA-UHD represents a critical step towards enabling AI models to perceive images with the same vivid detail as humans.
The researchers have made their project's paper and GitHub repository available for anyone interested to check out.