Recent advances in neural networks such as Transformers and Convolutional Neural Networks (CNNs) have been instrumental in improving computer vision performance in applications like autonomous driving and medical imaging. A major challenge, however, lies in the quadratic complexity of the attention mechanism in Transformers, which makes them inefficient at handling long sequences. This problem is particularly significant in vision tasks, where sequence length is determined by the number of image patches: a 224×224 image split into 16×16 patches already yields 196 tokens, and attention cost grows with the square of that count, straining computational resources and increasing processing time.
Various research efforts have addressed this, including token mixers with linear complexity such as dynamic convolution, Linformer, Longformer, and Performer, as well as RNN-like models such as RWKV and Mamba, developed to handle lengthy sequences efficiently. Vision models incorporating Mamba, such as Vision Mamba, VMamba, LocalMamba, and PlainMamba, use structured state space models (SSMs) to improve visual recognition and offer a way around the quadratic complexity of traditional attention mechanisms.
Recently, researchers from the National University of Singapore developed MambaOut, an architecture built from Gated CNN blocks that removes the SSM component of Mamba, simplifying the design while retaining efficient performance. The approach is meant to test whether the complexity introduced by Mamba is genuinely necessary for strong performance on image classification.
MambaOut employs Gated CNN blocks that perform token mixing through depthwise convolution, giving it lower computational complexity than Mamba models that include the SSM. By stacking these blocks, MambaOut builds a hierarchical model, akin to ResNet, that handles visual recognition tasks efficiently.
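As a rough illustration of the idea, the sketch below shows a Gated CNN block of this kind in PyTorch: a pointwise expansion split into a gate branch and a value branch, depthwise convolution as the token mixer on the value branch, gating, and a projection back to the input width with a residual connection. The expansion ratio, kernel size, and layer names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedCNNBlock(nn.Module):
    """Sketch of a Gated CNN block: depthwise-conv token mixing plus gating.

    Hyperparameters (expansion ratio, kernel size) are illustrative and not
    taken from the MambaOut paper or its repository.
    """
    def __init__(self, dim: int, expansion: float = 3.0, kernel_size: int = 7):
        super().__init__()
        hidden = int(expansion * dim)
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, 2 * hidden)   # produces gate and value branches
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size,
                                padding=kernel_size // 2, groups=hidden)  # depthwise token mixer
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x is channels-last: (batch, height, width, channels)
        shortcut = x
        x = self.norm(x)
        gate, value = self.fc1(x).chunk(2, dim=-1)
        # the depthwise conv expects channels-first, so permute around it
        value = self.dwconv(value.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        x = self.fc2(self.act(gate) * value)     # gate ⊙ mixed value, then project back
        return x + shortcut

# Example usage: one block applied to a 14x14 grid of 96-dim tokens
if __name__ == "__main__":
    block = GatedCNNBlock(dim=96)
    tokens = torch.randn(1, 14, 14, 96)
    print(block(tokens).shape)  # torch.Size([1, 14, 14, 96])
```

Stacking such blocks in several stages, with downsampling between stages, yields the hierarchical, ResNet-like layout described above.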
MambaOut's performance on ImageNet image classification is impressive, surpassing all visual Mamba models with a top-1 accuracy of 84.1%. On object detection and segmentation, however, it still lags behind top-tier visual Mamba models such as VMamba and LocalVMamba, indicating that Mamba is more valuable for tasks with long-sequence characteristics.
In essence, the study shows that while MambaOut simplifies the architecture and is sufficient for tasks like image classification, Mamba's SSM pays off in long-sequence tasks such as object detection and segmentation. This reinforces the potential of Mamba for specific visual tasks and guides future research toward vision model optimization, encouraging further exploration of Mamba in long-sequence visual tasks with promising gains in performance and efficiency.
For more detailed information, the research paper and its GitHub repository are available for review. All credit for this work goes to the researchers behind the project.