Deep learning models such as Convolutional Neural Networks (CNNs) and Vision Transformers have achieved broad success in visual tasks such as image classification, object detection, and semantic segmentation. However, their robustness to changes in the data, particularly in security-critical applications, remains a significant concern. Many studies have assessed the robustness of CNNs and Transformers against common corruptions, domain shifts, information drops, and adversarial attacks, finding that a model's design influences its capacity to handle these challenges and that robustness varies across architectures. Transformers carry the added difficulty of quadratic computational scaling with input size, which increases costs for complex tasks.
Researchers have therefore turned their attention to the Robustness of Deep Learning Models (RDLM) and to State Space Models (SSMs). RDLM examines how effectively a conventionally trained model maintains high performance when it encounters natural and adversarial shifts in the data distribution. Deep learning models often face data corruption (noise, blur, compression artifacts, intentional disruptions) that can significantly degrade their performance, so evaluating performance under such demanding conditions is crucial to ensuring model reliability and robustness. SSMs, meanwhile, offer a promising approach to modelling sequential data in deep learning: they map a one-dimensional input sequence to an output through an implicit latent state.
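For reference, the standard (S4-style) state-space formulation behind such models, and its discretised recurrence over a one-dimensional sequence, can be written as below; this is the commonly used notation, not necessarily the exact parameterisation of the VSSM variants evaluated in the study.

```latex
% Continuous-time state-space model: input x(t), latent state h(t), output y(t)
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)

% Discretised recurrence applied to a 1-D sequence x_1, ..., x_L
h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k
```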
A group of researchers from MBZUAI (UAE), Linköping University, and ANU (Australia) has undertaken a comprehensive analysis of the performance of Vision State-Space Models (VSSMs), Vision Transformers, and CNNs. The evaluations are divided into three parts, each focused on a critical area of model robustness: occlusions and information loss, common corruptions, and adversarial attacks.
The results showed that ConvNeXt and VSSM models handle sequential information loss along the scanning direction more effectively than ViT and Swin models. Under patch drops, VSSMs demonstrated superior robustness, whilst Swin models performed better under extreme information loss. For global corruptions, VSSM models experienced the smallest average performance drop compared with Swin and ConvNeXt models, and for fine-grained corruptions they surpassed all transformer-based variants. Finally, smaller VSSM models were robust against white-box adversarial attacks, maintaining over 90% robustness under strong low-frequency perturbations, although their performance declined rapidly under high-frequency attacks.
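To make the occlusion setting concrete, a patch-drop evaluation can be sketched as below. This is a minimal illustration in PyTorch; the patch size, drop ratio, masking value, and evaluation loop are assumptions for the sketch, not the authors' exact protocol.

```python
# Minimal patch-drop robustness sketch (illustrative assumptions throughout).
import torch


def drop_patches(images, patch_size=16, drop_ratio=0.25):
    """Zero out a random subset of non-overlapping patches in a batch of images."""
    b, c, h, w = images.shape
    ph, pw = h // patch_size, w // patch_size
    n_patches = ph * pw
    n_drop = int(n_patches * drop_ratio)
    out = images.clone()
    for i in range(b):
        # Pick which patches to mask for this image.
        for idx in torch.randperm(n_patches)[:n_drop].tolist():
            r, col = divmod(idx, pw)
            out[i, :, r * patch_size:(r + 1) * patch_size,
                      col * patch_size:(col + 1) * patch_size] = 0.0
    return out


@torch.no_grad()
def accuracy_under_patch_drop(model, loader, drop_ratio, device="cpu"):
    """Top-1 accuracy when a fixed fraction of patches is masked out."""
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(drop_patches(images, drop_ratio=drop_ratio)).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)
```

Sweeping the drop ratio from zero towards one and comparing the resulting accuracy curves across VSSM, ViT, Swin, and ConvNeXt backbones is the kind of comparison the occlusion experiments report.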
In conclusion, the researchers comprehensively assessed the robustness of Vision State-Space Models under a range of natural and adversarial disturbances, exposing their strengths and weaknesses relative to Transformers and CNNs. The study establishes the capabilities and limitations of VSSMs in handling occlusions, common corruptions, and adversarial attacks, as well as their adaptability to changes in object-background composition in complex visual scenes. It should therefore be valuable in directing future research towards more reliable and effective visual perception systems in real-world settings.