Researchers from the University of Sydney have introduced EfficientVMamba, a new model that optimizes efficiency in computer vision tasks. This architecture blends the strengths of Convolutional Neural Networks (CNNs) and Transformer-based models, known for their prowess in local feature extraction and global information processing, respectively. EfficientVMamba incorporates an atrous-based selective scanning strategy built on the principle of efficient skip sampling, capturing both global and local visual features without overburdening computational resources.
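To make the skip-sampling idea concrete, here is a minimal PyTorch sketch of an atrous (dilated) selective scan. It partitions a feature map into strided subsets so each scan pass covers only a fraction of the tokens while the subsets together cover the whole image. The helper names (`skip_sample`, `merge_subsets`) and the identity stand-in for the selective scan are illustrative assumptions, not the authors' implementation.

```python
import torch

def skip_sample(x: torch.Tensor, stride: int = 2):
    """Partition a (B, C, H, W) feature map into stride**2 dilated subsets.

    Each subset keeps every `stride`-th pixel starting from a different
    offset, so one scan pass touches only 1/stride**2 of the tokens.
    """
    subsets = []
    for i in range(stride):
        for j in range(stride):
            subsets.append(x[:, :, i::stride, j::stride])
    return subsets

def merge_subsets(subsets, stride: int = 2):
    """Scatter the scanned subsets back to their original pixel positions."""
    B, C, h, w = subsets[0].shape
    out = subsets[0].new_zeros(B, C, h * stride, w * stride)
    k = 0
    for i in range(stride):
        for j in range(stride):
            out[:, :, i::stride, j::stride] = subsets[k]
            k += 1
    return out

def placeholder_scan(x: torch.Tensor) -> torch.Tensor:
    """Stand-in for a selective (Mamba-style) scan over flattened tokens.

    A real implementation would run an input-dependent state space
    recurrence along the token sequence; identity is used here only to
    keep the sketch self-contained and runnable.
    """
    B, C, h, w = x.shape
    seq = x.flatten(2)              # (B, C, L): tokens in scan order
    return seq.view(B, C, h, w)     # <- selective SSM scan would go here

def atrous_selective_scan(x: torch.Tensor, stride: int = 2) -> torch.Tensor:
    scanned = [placeholder_scan(s) for s in skip_sample(x, stride)]
    return merge_subsets(scanned, stride)

feats = torch.randn(1, 64, 32, 32)
print(atrous_selective_scan(feats).shape)  # torch.Size([1, 64, 32, 32])
```

With a stride of 2, each scan processes a sequence a quarter the original length, which is where the computational savings come from.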
EfficientVMamba achieves this by combining state space models (SSMs) with conventional convolutional layers. The result is an efficient visual state space block that integrates seamlessly with an additional convolution branch, complemented by a channel attention module that enhances feature integration. The innovation lies in this dual-pathway design, which tackles the intricate task of global and local feature extraction, improving model performance while dramatically reducing computational complexity.
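The sketch below illustrates one plausible reading of this dual-pathway design: a global SSM branch and a local convolution branch, each gated by squeeze-and-excitation-style channel attention before fusion. The module names, the identity stand-in for the SSM scan (which the atrous scan sketched above would replace), and the additive residual fusion are assumptions for illustration, not the paper's exact block.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: global pool -> bottleneck MLP -> sigmoid gate."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

class DualPathwayBlock(nn.Module):
    """Global (SSM) and local (conv) pathways fused through a residual sum."""
    def __init__(self, channels: int):
        super().__init__()
        # Placeholder for the global branch; the atrous selective scan
        # sketched earlier would replace this identity in practice.
        self.global_scan = nn.Identity()
        # Local branch: depthwise 3x3 followed by pointwise 1x1 convolution.
        self.local_conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
        )
        self.attn_global = ChannelAttention(channels)
        self.attn_local = ChannelAttention(channels)

    def forward(self, x):
        g = self.global_scan(x)   # global context pathway
        l = self.local_conv(x)    # local feature pathway
        return x + self.attn_global(g) + self.attn_local(l)

block = DualPathwayBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```

Gating each pathway with channel attention lets the block reweight global and local features per channel before they are merged, which is one way to reconcile the two streams cheaply.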
EfficientVMamba demonstrates its effectiveness across a range of vision tasks, including image classification, object detection, and semantic segmentation, setting a fresh benchmark for efficiency. Importantly, it outperforms comparable models while using fewer resources, making it well suited to resource-constrained environments.
Data from its application demonstrate the model’s efficiency. Its variant EfficientVMamba-S achieves a 5.6% accuracy improvement on ImageNet over Vim-Ti, at a lower computational cost of 1.3 GFLOPs against Vim-Ti’s 1.5 GFLOPs. In object detection on the MS COCO 2017 dataset, EfficientVMamba-T, with only 13M parameters, achieved 37.5% average precision (AP), surpassing larger models such as ResNet-18, which has 21.3M parameters.
In semantic segmentation, the EfficientVMamba-T and EfficientVMamba-S variants achieved mean intersection over union (mIoU) scores of 38.9% and 41.5%, respectively. These results outperform benchmark models such as ResNet-50, again with fewer parameters.
In summary, EfficientVMamba addresses the long-standing dilemma in computer vision of balancing model accuracy with computational efficiency. By combining atrous-based selective scanning, efficient skip sampling, and dual-pathway feature integration, it sets a new standard for high-performance lightweight models. EfficientVMamba’s ability to significantly cut computational load while matching or even surpassing the accuracy of more resource-intensive architectures suggests a promising direction for future research.