In recent years, diffusion models have emerged as powerful tools across a range of fields, including image and 3D object generation. Built around iterative denoising, they transform random noise into samples from the target data distribution. Their deployment, however, incurs high computational costs, largely because the underlying networks are dense: every parameter is used for every example.
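To make that cost concrete, here is a minimal sketch of DDPM-style reverse diffusion in PyTorch. The function name, noise schedule, and model interface are illustrative assumptions, not the paper's code; the point is that the full dense network runs on the sample at every denoising step.

```python
import torch

def ddpm_sample(model, shape, betas, device="cpu"):
    """Minimal DDPM-style ancestral sampling sketch (illustrative only)."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=device)                 # start from pure noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)                            # dense net: every parameter runs each step
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise            # variance choice sigma_t^2 = beta_t
    return x
```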
Conditional computation was devised to offset these costs. It increases model capacity while keeping training and inference costs roughly constant by activating only a subset of the parameters for each example. The Mixture-of-Experts (MoE) framework, which has proven successful across many domains, combines the outputs of several sub-models, or 'experts', through an input-dependent router.
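As a concrete illustration, the sketch below shows the basic MoE idea under simple assumptions (class name, sizes, and the soft-mixture variant are mine, not from the paper): a linear router produces input-dependent weights that blend the outputs of several expert MLPs; sparse variants keep only the top-k weights per input.

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Toy mixture-of-experts layer: router weights blend expert outputs."""
    def __init__(self, dim, num_experts=4, hidden=256):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                    # x: (batch, dim)
        weights = self.router(x).softmax(dim=-1)             # input-dependent routing weights
        outs = torch.stack([e(x) for e in self.experts], 1)  # (batch, num_experts, dim)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)     # weighted blend of expert outputs
```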
The DiT-MoE model, proposed by researchers from Kunlun Inc. in Beijing, is a sparse variant of the Diffusion Transformer (DiT) architecture for image generation. It replaces a subset of DiT's dense feedforward layers with sparse MoE layers, routing each image token to a small set of experts, each implemented as an MLP.
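The sketch below is one plausible way to wire such a layer; it is a hedged reconstruction, and the class name, top-k value, and sizes are assumptions rather than the paper's implementation. Each token's router logits select its top-k experts, only those expert MLPs run on that token, and their outputs are combined with renormalized gate weights. The router logits are returned as well, for use with the balance loss described next.

```python
import torch
import torch.nn as nn

class SparseMoEFFN(nn.Module):
    """Sparse MoE feed-forward layer: each token runs only its top-k experts."""
    def __init__(self, dim, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, tokens):                              # tokens: (num_tokens, dim)
        logits = self.router(tokens)                        # (num_tokens, num_experts)
        gates, idx = logits.softmax(dim=-1).topk(self.top_k, dim=-1)
        gates = gates / gates.sum(dim=-1, keepdim=True)     # renormalize over chosen experts
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:                      # no token routed to this expert
                continue
            out[token_ids] += gates[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out, logits                                  # logits reused for the balance loss
```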
The design includes two key elements: a set of shared experts that capture common knowledge, and an expert-level balance loss that reduces redundancy among the routed experts. The paper shows that these adaptations make it practical to train an efficient MoE diffusion model, and it reveals interesting patterns in how tokens are routed to experts. Training uses the AdamW optimizer across all datasets with a constant learning rate and maintains an exponential moving average (EMA) of the model weights.
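Below is a hedged sketch of how those ingredients might look in code, a reconstruction under simple assumptions rather than the paper's implementation: a Switch-Transformer-style load-balancing loss that penalizes the router for concentrating tokens on a few experts, plus commented usage showing an always-active shared expert and a one-line EMA update.

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_logits, top1_idx, num_experts):
    """Switch-style balance loss: num_experts * sum_i f_i * P_i, where f_i is the
    fraction of tokens routed to expert i and P_i is the mean routing probability
    the router assigns to expert i."""
    probs = router_logits.softmax(dim=-1)                     # (num_tokens, num_experts)
    p_mean = probs.mean(dim=0)                                # P_i
    f = F.one_hot(top1_idx, num_experts).float().mean(dim=0)  # f_i
    return num_experts * (f * p_mean).sum()

# Hypothetical wiring with the SparseMoEFFN sketch above:
#   shared expert output is added for every token, regardless of routing,
#   and the balance loss is added to the diffusion objective with a small weight.
# routed_out, router_logits = sparse_moe_ffn(tokens)
# y = shared_expert(tokens) + routed_out
# loss = diffusion_loss + aux_weight * load_balance_loss(
#     router_logits, router_logits.argmax(dim=-1), num_experts)
#
# EMA of the weights, updated after each optimizer step (decay value is assumed):
# for p_ema, p in zip(ema_model.parameters(), model.parameters()):
#     p_ema.mul_(0.9999).add_(p.detach(), alpha=1 - 0.9999)
```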
Evaluations of DiT-MoE on the ImageNet benchmark using Nvidia A100 GPUs yielded promising results. With only 1.5 billion parameters, the model achieved an FID score of 1.72, outperforming previous models, including Transformer-based competitors.
In conclusion, the research presents DiT-MoE as a significant advance in image generation. By incorporating sparse MoE layers, DiT-MoE uses conditional computation to train large diffusion transformer models while keeping inference efficient and improving image quality. As an early exploration of large-scale conditional computation for diffusion models, the study paves the way for future work on more sophisticated expert architectures and improved knowledge distillation.