The Transformer architecture has been highly successful in natural language processing (NLP), sparking increased interest in its application within the computer vision (CV) community. Vision Transformers (ViTs), which adapt the Transformer architecture to vision tasks, have shown great promise across a variety of applications, including image classification, object detection, and video recognition. However, ViTs have struggled in real-world settings, particularly with variable input resolutions, which often lead to significant performance degradation.
To combat this, previous attempts have incorporated multi-resolution images during training and refined positional encodings. However, further improvement is needed to maintain high performance across a wide range of resolutions and to integrate with common self-supervised frameworks.
Responding to this challenge, a Chinese research team has proposed a groundbreaking solution called Vision Transformer with Any Resolution (ViTAR). This architecture is designed to process high-resolution images with minimal computational cost while maintaining strong resolution generalization. A key component of ViTAR is the Adaptive Token Merger (ATM) module, which processes tokens after patch embedding and merges them into a grid of fixed size, improving resolution adaptability and reducing computational complexity.
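To make the idea of token merging concrete, the following is a minimal PyTorch sketch, not the paper's actual ATM (which merges tokens with attention-based operations): it simply collapses a variable patch grid into a fixed grid of tokens via adaptive average pooling. The function name, grid size, and pooling choice are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def merge_tokens_to_fixed_grid(tokens, h, w, grid_size=14):
    """Merge a variable (h x w) grid of patch tokens into a fixed
    (grid_size x grid_size) grid by adaptive average pooling.

    tokens: (B, h*w, C) patch embeddings; h, w: patch-grid dimensions.
    Returns: (B, grid_size*grid_size, C) merged tokens.
    """
    B, N, C = tokens.shape
    assert N == h * w, "token count must match the h x w patch grid"
    # Reshape the token sequence back into a 2-D feature map.
    x = tokens.transpose(1, 2).reshape(B, C, h, w)
    # Adaptive pooling collapses any input grid to the target grid,
    # so downstream Transformer blocks always see the same token count.
    x = F.adaptive_avg_pool2d(x, (grid_size, grid_size))
    return x.flatten(2).transpose(1, 2)  # (B, grid_size**2, C)

# Example: 224x224 and 448x448 inputs (patch size 16) both reduce
# to the same 14x14 = 196 tokens after merging.
small = merge_tokens_to_fixed_grid(torch.randn(2, 14 * 14, 192), 14, 14)
large = merge_tokens_to_fixed_grid(torch.randn(2, 28 * 28, 192), 28, 28)
print(small.shape, large.shape)  # both torch.Size([2, 196, 192])
```

Because the merged grid has a fixed size regardless of the input resolution, the cost of the subsequent attention layers stays constant, which is where the computational savings come from.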
In addition, the team introduced a new concept, Fuzzy Positional Encoding (FPE), which enables generalization to any resolution. FPE perturbs token positions with random noise, converting precise positional perception into a fuzzy one. This helps prevent the model from overfitting to exact positions and improves adaptability.
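The sketch below illustrates the general idea under stated assumptions; it is not the authors' exact FPE. During training, each token's grid coordinate is jittered with uniform noise before a learnable positional map is sampled by bilinear interpolation, so the model only ever sees approximate positions. The class name, noise range, and grid size are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuzzyPositionalEncoding(nn.Module):
    """Sketch of fuzzy positional encoding: at training time, token
    coordinates are jittered before sampling a learnable positional map,
    so exact positions are never observed; at inference, exact
    coordinates are used."""

    def __init__(self, dim, grid_size=14):
        super().__init__()
        self.grid_size = grid_size
        # Learnable positional map of shape (1, dim, G, G).
        self.pos_map = nn.Parameter(torch.zeros(1, dim, grid_size, grid_size))

    def forward(self, tokens):
        # Assumes tokens are already merged to the fixed G x G grid
        # (e.g., by the token-merging step), so tokens is (B, G*G, C).
        B, N, C = tokens.shape
        g = self.grid_size
        ys, xs = torch.meshgrid(
            torch.arange(g, dtype=torch.float32),
            torch.arange(g, dtype=torch.float32),
            indexing="ij",
        )
        coords = torch.stack([xs, ys], dim=-1)  # (G, G, 2), (x, y) order
        if self.training:
            # Fuzzy step: jitter each coordinate by U(-0.5, 0.5).
            coords = coords + torch.empty_like(coords).uniform_(-0.5, 0.5)
        # Normalize to [-1, 1] and sample the positional map bilinearly.
        coords = coords / (g - 1) * 2 - 1
        grid = coords.unsqueeze(0).expand(B, -1, -1, -1).to(tokens.device)
        pos = F.grid_sample(
            self.pos_map.expand(B, -1, -1, -1), grid,
            mode="bilinear", padding_mode="border", align_corners=True,
        )  # (B, C, G, G)
        return tokens + pos.flatten(2).transpose(1, 2)
```

Since the noise is only a fraction of one grid cell, each token still receives roughly correct positional information, but the slight randomness discourages the model from memorizing exact positions tied to a single training resolution.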
The researchers found that ViTAR not only runs efficiently across different input resolutions but also outperforms existing ViT-based models. Its performance on downstream tasks such as instance segmentation and semantic segmentation further highlights its versatility across a range of visual tasks.
ViTAR's advancements, including the effective ATM and FPE modules, mark key progress in the resolution generalization and adaptability of Vision Transformers. The research offers significant potential for future development within the computer vision community. The original research paper can be consulted for more detailed information.