Medical image segmentation is a key component of diagnosis and treatment, and UNet’s symmetrical encoder-decoder architecture is widely used to outline organs and lesions accurately. However, because convolution operates on local neighborhoods, UNet struggles to capture global semantic information, which limits its effectiveness in complex medical tasks. Attempts to address this by integrating Transformer architectures are computationally costly and poorly suited to resource-limited healthcare settings.
Other efforts to boost UNet’s global awareness have introduced atrous (dilated) convolutional layers, self-attention mechanisms, and image pyramids. However, these still fail to model long-range dependencies effectively. Recent studies have proposed enhancing UNet with State Space Models (SSMs), but some of these solutions, such as U-Mamba, add a heavy computational load and are therefore unsuitable for mobile healthcare settings.
Researchers from several institutions, including Peking University and Beihang University, have now proposed LightM-UNet, a lightweight fusion of UNet and Mamba with a parameter count of just 1M. Rather than relying on convolution or self-attention to extract deep features, LightM-UNet introduces the Residual Vision Mamba Layer (RVM Layer), which extracts them in a pure Mamba manner and strengthens the model’s ability to capture long-range spatial dependencies.
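To make the long-range claim concrete, here is a minimal sketch of the discrete linear recurrence that state space models are built on: h_t = a * h_{t-1} + b * x_t, with output y_t = c * h_t. It is not the RVM Layer or Mamba itself, since Mamba makes its parameters depend on the input and uses a parallel scan. The function name `ssm_scan` and the scalar values of `a`, `b`, and `c` are illustrative assumptions. The sketch shows that a single pass over the sequence takes linear time, yet an input at the first position still influences every later output.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run a scalar discrete state space model over a sequence.

    Recurrence (illustrative, fixed parameters):
        h_t = a * h_{t-1} + b * x_t
        y_t = c * h_t
    """
    h = 0.0          # hidden state carried across the whole sequence
    ys = []
    for x in xs:     # one pass: cost grows linearly with sequence length
        h = a * h + b * x
        ys.append(c * h)
    return ys


# Feed a single impulse at position 0 and watch it persist in later outputs.
impulse = [1.0] + [0.0] * 9
response = ssm_scan(impulse)
for t, y in enumerate(response):
    print(t, round(y, 4))
```

With these assumed parameters the response at step t is 0.9 to the power t, so the first input decays gradually instead of vanishing once it leaves a fixed local window.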
LightM-UNet uses a U-shaped architecture that integrates Mamba. It begins with shallow feature extraction, followed by Encoder Blocks, a Bottleneck Block, and Decoder Blocks. Its core building block is the RVM Layer, which is built on the Vision State-Space (VSS) Module. This design keeps computational demand low while improving segmentation performance, and the authors present it as the first attempt to use Mamba as a lightweight optimization strategy for UNet.
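The U-shaped data flow can be sketched by tracing feature-map shapes through the stages. The sketch below is only an illustration under stated assumptions: the function name `trace_u_shape`, the base width of 32 channels, the three stages per side, and the rule that each encoder stage doubles channels and halves resolution are all hypothetical choices, not confirmed details of LightM-UNet.

```python
def trace_u_shape(height, width, base_channels=32, num_stages=3):
    """Return (stage, channels, height, width) tuples for a U-shaped network.

    Assumption for illustration only: every encoder stage doubles the channel
    count and halves the spatial resolution; every decoder stage reverses that.
    """
    c, h, w = base_channels, height, width
    shapes = [("shallow", c, h, w)]

    # Contracting path: fewer spatial positions, more channels.
    for i in range(num_stages):
        c, h, w = c * 2, h // 2, w // 2
        shapes.append((f"encoder{i + 1}", c, h, w))

    # The bottleneck keeps the most compressed shape.
    shapes.append(("bottleneck", c, h, w))

    # Expanding path: restore resolution, reduce channels.
    for i in range(num_stages):
        c, h, w = c // 2, h * 2, w * 2
        shapes.append((f"decoder{i + 1}", c, h, w))

    return shapes


for stage in trace_u_shape(256, 256):
    print(stage)
```

Under these assumptions the output of the last decoder stage has the same shape as the shallow features, which is what allows a per-pixel segmentation map to be predicted at the input resolution.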
In experiments, LightM-UNet outperforms several other models across different datasets while using far fewer parameters and much less computation. It improves on U-Mamba and consistently outperforms Transformer-based and Mamba-based methods, with parameter reductions of up to 99.55%.
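To put the percentage in perspective, the short sketch below converts between a percentage reduction and a "times smaller" factor. The helper names are hypothetical, and the 200M baseline in the example is a made-up round number, not a figure from the paper.

```python
def reduction_percent(baseline, reduced):
    """Percentage by which `reduced` is smaller than `baseline`."""
    return 100.0 * (1.0 - reduced / baseline)


def shrink_factor(percent):
    """How many times smaller something is after a `percent` reduction."""
    return 1.0 / (1.0 - percent / 100.0)


# A 99.55% reduction means the smaller model is roughly 222 times smaller.
print(round(shrink_factor(99.55), 1))

# Hypothetical example: shrinking a 200M-parameter model to 1M parameters.
print(round(reduction_percent(200e6, 1e6), 2))
```

Because the factor is the reciprocal of the remaining fraction, small differences near 100% matter: a 99% reduction is a 100-fold shrink, while 99.55% is more than twice that.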
In conclusion, LightM-UNet represents a substantial step towards feasible deployment in resource-constrained healthcare settings. It achieves state-of-the-art performance on 2D and 3D segmentation tasks with only 1M parameters, offering over 99% fewer parameters and significantly lower GFLOPs than Transformer-based architectures. Ablation studies confirm the effectiveness of the approach, which the authors describe as the first use of Mamba as a lightweight strategy for UNet.