Google Research recently unveiled Lumiere, a text-to-video diffusion model that produces highly realistic videos from text or image prompts. Compared with earlier text-to-video (TTV) models such as Pika Labs' generator and Stable Video Diffusion, Lumiere marks a significant step forward in TTV generation, particularly in spatial and temporal consistency.
Lumiere offers a broad range of video generation capabilities. It can create a five-second, 80-frame video from a simple text prompt or animate a still image into a video. It also supports stylized generation, producing videos in the style of a reference image, and video stylization, restyling an existing video to match a text prompt. In addition, Lumiere can animate selected regions of a still image, remove or replace elements in a scene, and fill in missing portions of a video.
Existing TTV models typically rely on a cascaded design: a base model generates a sparse set of keyframes, and separate temporal super-resolution stages then fill in the frames between them. Because no single model sees the whole clip at once, the resulting videos often suffer from temporal inconsistency and 'glitchy' motion. Lumiere instead employs a Space-Time U-Net (STUNet) architecture that downsamples and upsamples the video in both space and time, processing all frames in a single pass and producing globally coherent motion.
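To make the architectural idea concrete, the following is a minimal, illustrative sketch (not Google's released code) of a space-time U-Net: 3D convolutions reduce the clip's height, width, and frame count together, process a compact space-time representation, and upsample back, so every output frame is produced jointly. Module names, channel sizes, and the toy resolution are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn


class SpaceTimeDownBlock(nn.Module):
    """Halves height, width, and frame count with a strided 3D convolution."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3,
                              stride=(2, 2, 2), padding=1)
        self.act = nn.SiLU()

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        return self.act(self.conv(x))


class SpaceTimeUpBlock(nn.Module):
    """Doubles height, width, and frame count back up."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=4,
                                       stride=(2, 2, 2), padding=1)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.conv(x))


class TinySpaceTimeUNet(nn.Module):
    """Two levels of joint space-time down/upsampling with skip connections."""

    def __init__(self, ch: int = 16):
        super().__init__()
        self.stem = nn.Conv3d(3, ch, kernel_size=3, padding=1)
        self.down1 = SpaceTimeDownBlock(ch, ch * 2)
        self.down2 = SpaceTimeDownBlock(ch * 2, ch * 4)
        self.up1 = SpaceTimeUpBlock(ch * 4, ch * 2)
        self.up2 = SpaceTimeUpBlock(ch * 2, ch)
        self.head = nn.Conv3d(ch, 3, kernel_size=3, padding=1)

    def forward(self, x):
        h0 = self.stem(x)
        h1 = self.down1(h0)
        h2 = self.down2(h1)
        u1 = self.up1(h2) + h1  # skip connection at the coarser level
        u2 = self.up2(u1) + h0  # skip connection at full resolution
        return self.head(u2)


if __name__ == "__main__":
    # An 80-frame, 128x128 RGB clip processed in one pass; every frame flows
    # through the same network activations, which is what encourages
    # temporally coherent motion compared with keyframe-then-interpolate
    # cascades.
    video = torch.randn(1, 3, 80, 128, 128)
    out = TinySpaceTimeUNet()(video)
    print(out.shape)  # torch.Size([1, 3, 80, 128, 128])
```

The sketch omits the diffusion process, text conditioning, and attention layers of the real model; it only illustrates the key design choice of downsampling along the temporal axis as well as the spatial axes, rather than generating keyframes and interpolating between them.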
In a user study conducted by Google Research, participants preferred Lumiere over other TTV models, citing its better coherence, realism, and longer clips. However, Lumiere is currently limited: it cannot handle scene transitions or multi-shot videos, although work on these capabilities is likely underway.
At the same time, Google has raised concerns about the potential misuse of Lumiere to fabricate harmful or deceptive content, and is weighing safeguards such as watermarking generated videos to support copyright management. How it resolves these concerns will determine whether, and how broadly, Lumiere is eventually released to the public.