Diffusion models are widely used in image, video, and audio generation. However, their sampling process is computationally expensive and far less efficient than their training. Consistency Models, trained either from scratch (Consistency Training) or from a pretrained diffusion model (Consistency Distillation), offer much faster sampling but compromise on image quality. TRACT is another distillation-based method that splits the diffusion trajectory into stages to improve performance. Still, neither Consistency Models nor TRACT match the performance of standard diffusion models.
These approaches have been the key focus of prior work. Consistency Models learn a function that maps any point on the diffusion trajectory directly back to clean data, while TRACT concentrates on distillation, dividing the trajectory into stages and progressively reducing them to just one or two for sampling. DDIM, another technique, demonstrates that deterministic samplers outperform stochastic ones when the number of sampling steps is limited. Other methods include second-order Heun samplers, various SDE integrators, specialized architectures, and Progressive Distillation, all aimed at minimizing model evaluations and sampling steps.
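To make the deterministic-sampler idea concrete, the sketch below shows a DDIM-style update written in PyTorch. It assumes, purely for illustration, a cosine noise schedule and a model that predicts the clean sample; these are not necessarily the parameterizations used in the works cited above.

```python
import torch

def ddim_step(x_t, x0_hat, t, t_next):
    """Deterministic DDIM update from time t to t_next (no noise is added).

    Illustrative assumptions: a cosine schedule with alpha_t = cos(pi*t/2),
    sigma_t = sin(pi*t/2), and `x0_hat` as the model's clean-sample estimate.
    """
    alpha_t, sigma_t = torch.cos(t * torch.pi / 2), torch.sin(t * torch.pi / 2)
    alpha_n, sigma_n = torch.cos(t_next * torch.pi / 2), torch.sin(t_next * torch.pi / 2)
    eps_hat = (x_t - alpha_t * x0_hat) / sigma_t   # noise implied by the x0 prediction
    return alpha_n * x0_hat + sigma_n * eps_hat    # deterministic move to the next time
```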
Researchers from Google DeepMind have proposed a machine learning method, Multistep Consistency Models, that merges the principles of Consistency Models and TRACT. It is designed to close the performance gap between standard diffusion models and their few-step counterparts. The approach relaxes the single-step constraint, allowing 4, 8, or 16 function evaluations instead. It divides the diffusion trajectory into equal segments, which simplifies the modeling task and improves performance at low step counts. The authors argue this should also yield faster convergence when fine-tuning from a pretrained diffusion model, and it provides a direct trade-off between sample quality and sampling time as the number of steps increases.
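A minimal sketch of how such multistep sampling might look is given below, reusing the `ddim_step` helper above. The `model(x_t, t)` interface, the schedule, and the function names are hypothetical illustrations of the segment-by-segment idea, not the authors' actual implementation.

```python
def multistep_consistency_sample(model, x_T, num_steps=8):
    """Split the trajectory t in [1, 0] into `num_steps` equal segments,
    apply the consistency model once per segment, and move deterministically
    to the next segment boundary in between."""
    ts = torch.linspace(1.0, 0.0, num_steps + 1)   # equally spaced segment boundaries
    x = x_T                                        # start from pure noise at t = 1
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        x0_hat = model(x, t)                       # consistency prediction of clean data
        x = x0_hat if t_next == 0 else ddim_step(x, x0_hat, t, t_next)
    return x
```

With `num_steps=1` this reduces to ordinary one-step consistency sampling, while larger values of `num_steps` trade extra function evaluations for higher sample quality.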
Experimental results indicate that Multistep Consistency Models achieve better FID scores than Progressive Distillation on ImageNet64 across various step counts, and they likewise outperform Progressive Distillation on ImageNet128. Even in text-to-image tasks, samples from Multistep Consistency Models differ only in minor details from those of standard diffusion models, underscoring their ability to improve both sample quality and efficiency relative to existing techniques.
In sum, Google DeepMind's researchers introduce Multistep Consistency Models, a machine learning method that combines aspects of Consistency Models and TRACT with the goal of bridging the performance gap between standard diffusion and few-step sampling. The method offers a direct trade-off between sample quality and speed, reaching strong performance in as few as eight steps. This integration significantly improves the efficiency and quality of generative modeling, marking a notable milestone in machine learning research.