Robustness plays a significant role in deploying deep learning models in real-world use cases. Vision Transformers (ViTs), introduced in the early 2020s, have proven robust and deliver strong performance across a range of visual tasks, surpassing traditional Convolutional Neural Networks (CNNs). Recent work has shown that large kernel convolutions can match or even exceed ViTs in performance, reviving research interest in CNNs. However, the robustness of these large kernel networks still needs verification, which is the primary focus of a study conducted by researchers from Shanghai Jiao Tong University and Meituan.
The researchers undertook a comprehensive evaluation of the robustness of large kernel convolutional networks (ConvNets) against traditional CNNs and ViTs across six benchmark datasets. The experiments showed that large kernel ConvNets display outstanding robustness, often outperforming ViTs. The study also ran nine additional experiments, identifying properties such as occlusion invariance, kernel attention patterns, and frequency characteristics that contribute to this robustness. The results contradict the prevailing belief that self-attention is essential for achieving high robustness, indicating that purely convolutional networks can reach similar robustness levels.
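As a rough illustration of what such a robustness evaluation involves, the sketch below compares a classifier's top-1 accuracy on clean versus noise-corrupted inputs, loosely in the spirit of common-corruption benchmarks. This is a minimal sketch assuming PyTorch and torchvision; the model choice, noise level, and stand-in data are illustrative assumptions, not the study's actual benchmarks or protocol.

```python
# Minimal clean-vs-corrupted accuracy comparison (illustrative only).
import torch
import torchvision


def accuracy(model, images, labels):
    """Top-1 accuracy of `model` on a batch of images."""
    with torch.no_grad():
        preds = model(images).argmax(dim=1)
    return (preds == labels).float().mean().item()


def add_gaussian_noise(images, std=0.1):
    """One simple corruption: additive Gaussian noise, clipped to [0, 1]."""
    return (images + std * torch.randn_like(images)).clamp(0.0, 1.0)


if __name__ == "__main__":
    # Downloads pretrained ImageNet weights; any classifier could be substituted here.
    model = torchvision.models.resnet50(weights="IMAGENET1K_V2").eval()
    images = torch.rand(8, 3, 224, 224)    # stand-in batch; replace with real data
    labels = torch.randint(0, 1000, (8,))  # stand-in labels

    clean_acc = accuracy(model, images, labels)
    corrupt_acc = accuracy(model, add_gaussian_noise(images), labels)
    print(f"clean: {clean_acc:.3f}  corrupted: {corrupt_acc:.3f}")
```

A full benchmark would sweep many corruption types and severities and average the resulting accuracy drops, but the basic comparison has this shape.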
Large kernel convolutional networks date back to the early stages of deep learning but remained insufficiently explored, sidelined by small kernel networks such as VGG-Net and ResNet. Recent models like ConvNeXt and RepLKNet have rekindled interest in large kernels, showing improved performance, particularly on downstream tasks. Using RepLKNet as the primary model, the study evaluated large kernel networks' robustness across six robustness benchmarks, comparing it with models such as ResNet-50, BiT, and ViT. The results indicated that RepLKNet excelled across these tests, including benchmarks covering natural adversarial examples, common corruptions, and domain adaptation.
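To make the "large kernel" idea concrete, here is a minimal PyTorch sketch of a depthwise convolution with a very large spatial kernel (31x31) followed by a 1x1 pointwise convolution. It illustrates the building block in spirit only; it is not RepLKNet's actual implementation, which among other things re-parameterizes parallel small kernel branches.

```python
# Sketch of a large-kernel depthwise block (illustrative, not RepLKNet's code).
import torch
import torch.nn as nn


class LargeKernelBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 31):
        super().__init__()
        # Depthwise conv: each channel gets its own large spatial filter,
        # keeping parameters manageable despite the big kernel.
        self.depthwise = nn.Conv2d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=channels, bias=False,
        )
        self.norm = nn.BatchNorm2d(channels)
        # Pointwise conv mixes information across channels.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x):
        # Residual connection around the large-kernel transformation.
        return x + self.act(self.pointwise(self.norm(self.depthwise(x))))


if __name__ == "__main__":
    block = LargeKernelBlock(channels=64, kernel_size=31)
    out = block(torch.randn(1, 64, 56, 56))
    print(out.shape)  # torch.Size([1, 64, 56, 56])
```

The depthwise grouping is what makes such large kernels affordable: the 31x31 filter sees a ViT-like effective receptive field while the 1x1 convolution handles channel mixing.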
Large kernel ConvNets perform robustly primarily because of distinctive characteristics such as occlusion invariance and their kernel attention patterns. These networks handle heavy occlusion, adversarial attacks, model perturbations, and noise at different frequencies better than conventional models like ResNet and ViT. Despite this empirical analysis, the study acknowledged the need for stronger theoretical grounding, given the complexity of deep learning. Computational constraints limited kernel size ablations to ImageNet-1K rather than ImageNet-21K. Nevertheless, the study established the significant robustness of large kernel ConvNets across six standard benchmark datasets.
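The occlusion-invariance property can be probed with a simple experiment: mask out random patches of the input and check how often the model's top-1 prediction survives. The sketch below, assuming PyTorch, uses a hypothetical patch size and occlusion ratio and a trivial stand-in model; it is an illustrative probe, not the study's exact procedure.

```python
# Illustrative occlusion-robustness probe (not the paper's exact protocol).
import torch


def occlude(images, patch=32, ratio=0.3):
    """Zero out roughly `ratio` of patch-aligned square regions."""
    x = images.clone()
    _, _, h, w = x.shape
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            if torch.rand(1).item() < ratio:
                x[:, :, i:i + patch, j:j + patch] = 0.0
    return x


def prediction_consistency(model, images):
    """Fraction of images whose top-1 prediction survives occlusion."""
    with torch.no_grad():
        clean = model(images).argmax(dim=1)
        masked = model(occlude(images)).argmax(dim=1)
    return (clean == masked).float().mean().item()


if __name__ == "__main__":
    # Stand-in classifier; substitute any trained large kernel or baseline model.
    model = torch.nn.Sequential(
        torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 1000)
    ).eval()
    print(prediction_consistency(model, torch.rand(4, 3, 224, 224)))
```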
In summary, the research provides critical insights into the remarkable robustness of large kernel ConvNets, the factors contributing to their resilience, and their potential for future research and practical applications.