Recent advances in generative models for text-to-image (T2I) tasks have yielded impressive results, producing high-resolution, realistic images from text prompts. Extending these advances to text-to-video (T2V) models is harder, however, because motion must be modeled as well. Current T2V models remain limited in video duration, visual quality, and the realism of the motion they generate. These limitations stem from the difficulty of modeling natural motion, from heavy memory and compute requirements, and from the need for extensive training data.
Although T2I diffusion models excel at synthesizing high-resolution, photorealistic images from intricate text prompts, scaling them up into large T2V models is difficult precisely because of the added complexity of motion. A team of researchers from Google Research, the Weizmann Institute, Tel Aviv University, and the Technion has introduced Lumiere, a T2V diffusion model designed to synthesize realistic, diverse, and coherent motion.
Lumiere is built on a Space-Time U-Net architecture and departs from existing models, which typically generate a sparse set of distant keyframes and then fill in the intermediate frames with temporal super-resolution, an approach that makes global temporal consistency hard to achieve. Instead, Lumiere generates the entire temporal span of the video in a single pass, building on a pre-trained T2I diffusion model. This design lets it handle a range of content-creation and video-editing tasks and achieve state-of-the-art text-to-video results.
The architecture downsamples the video in both space and time, so most of the computation is performed on a compact space-time representation; this keeps memory requirements manageable and allows the model to produce full-length clips at a coarse resolution. Temporal blocks based on factorized space-time convolutions, with temporal attention at the coarsest level, are interleaved into the pre-trained T2I U-Net, and the subsequent spatial super-resolution stage uses MultiDiffusion over overlapping temporal windows to ensure smooth transitions between temporal segments.
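As a rough illustration of what such a temporal block can look like, the PyTorch sketch below factorizes a space-time convolution into a per-frame 2D spatial convolution followed by a per-pixel 1D temporal convolution, with a temporal pooling step standing in for the down-sampling described above. The class name, channel sizes, kernel shapes, and activation choice are illustrative assumptions, not the actual Lumiere implementation.

```python
# Minimal sketch of a factorized space-time convolution block, as used in
# "inflated" video U-Nets. Names and hyperparameters are illustrative only.
import torch
import torch.nn as nn


class FactorizedSpaceTimeConv(nn.Module):
    """2D spatial convolution followed by a 1D temporal convolution.

    Input/output shape: (batch, channels, time, height, width).
    """

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Spatial 3x3 convolution applied to every frame independently
        # (kernel size 1 along the time axis).
        self.spatial = nn.Conv3d(
            in_channels, out_channels, kernel_size=(1, 3, 3), padding=(0, 1, 1)
        )
        # Temporal convolution applied to every spatial location independently
        # (kernel size 1 along the spatial axes).
        self.temporal = nn.Conv3d(
            out_channels, out_channels, kernel_size=(3, 1, 1), padding=(1, 0, 0)
        )
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.spatial(x))
        return self.act(self.temporal(x))


if __name__ == "__main__":
    # A toy 16-frame, 64x64 latent video with 8 channels.
    video = torch.randn(1, 8, 16, 64, 64)
    block = FactorizedSpaceTimeConv(8, 8)
    out = block(video)
    print(out.shape)  # torch.Size([1, 8, 16, 64, 64])

    # Temporal downsampling (stride 2 along time) shrinks the clip to a
    # more compact space-time representation for the deeper layers.
    pool = nn.AvgPool3d(kernel_size=(2, 1, 1))
    print(pool(out).shape)  # torch.Size([1, 8, 8, 64, 64])
```

Factorizing the 3D convolution this way keeps the parameter count and memory footprint close to that of the underlying 2D layers while still allowing information to mix across frames.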
Lumiere compares favorably with competing video synthesis systems. Trained on a dataset of 30 million videos of 80 frames each (5 seconds), it outperformed Imagen Video, AnimateDiff, and ZeroScope in both qualitative and quantitative evaluations, showing better motion coherence and higher visual quality in the generated 5-second clips. User studies likewise preferred Lumiere's outputs for visual quality and alignment with the text prompts.
In conclusion, the researchers have introduced Lumiere, a T2V generation framework built on a pre-trained T2I diffusion model. Its Space-Time U-Net architecture addresses the difficulty existing models have in producing globally coherent motion. The model has also shown versatility across downstream applications such as image-to-video generation, video inpainting, and stylized generation, highlighting its potential for further development and broader use.