Alibaba Researchers Propose I2VGen-XL: A Cascaded Video Synthesis AI Model which is Capable of Generating High-Quality Videos from a Single Static Image

Researchers from Alibaba, Zhejiang University, and Huazhong University of Science and Technology have introduced I2VGen-XL, a video synthesis model that addresses key challenges in semantic accuracy, clarity, and spatio-temporal continuity. Video generation is typically hindered by the scarcity of well-aligned text-video data and the complex structure of videos; I2VGen-XL tackles these obstacles with a two-stage cascaded approach.

In the base stage, two hierarchical encoders ensure coherent semantics and preserve the input image's content: a fixed CLIP encoder extracts high-level semantics, while a learnable content encoder captures low-level details. The two feature streams are then combined in a video diffusion model that generates semantically accurate videos at a lower resolution.
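To make the two-branch conditioning concrete, here is a minimal sketch in PyTorch. The module names (SemanticEncoder, ContentEncoder), the placeholder layers inside them, and the feature dimensions are hypothetical stand-ins for illustration, not the authors' implementation; only the overall structure (a frozen semantic branch plus a trainable content branch fused into one conditioning signal) follows the paper's description.

```python
# A minimal sketch of the base stage's two-branch conditioning, assuming
# PyTorch. Module names and dimensions are hypothetical, not the paper's code.
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """Stands in for the fixed CLIP image encoder (high-level semantics)."""
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Linear(3 * 224 * 224, dim)  # placeholder for CLIP
        for p in self.parameters():
            p.requires_grad = False  # frozen, as described in the paper

    def forward(self, image):
        return self.proj(image.flatten(1))

class ContentEncoder(nn.Module):
    """Learnable encoder capturing low-level detail from the input image."""
    def __init__(self, dim=768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1),
        )

    def forward(self, image):
        feat = self.conv(image)          # (B, dim, H/4, W/4)
        return feat.flatten(2).mean(-1)  # pooled to (B, dim)

def condition_base_stage(image):
    """Fuse both feature streams into one conditioning vector for the
    low-resolution video diffusion model."""
    semantic = SemanticEncoder()(image)
    content = ContentEncoder()(image)
    return torch.cat([semantic, content], dim=-1)  # (B, 2*dim)

cond = condition_base_stage(torch.randn(1, 3, 224, 224))
print(cond.shape)  # torch.Size([1, 1536])
```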

The refinement stage then enhances video detail and raises the resolution to 1280×720, using a distinct video diffusion model conditioned on brief text guidance. To improve the diversity and robustness of I2VGen-XL, the researchers collected a large dataset of around 35 million single-shot text-video pairs and 6 billion text-image pairs, covering a wide range of everyday categories.
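The control flow of such a refinement pass might look like the following sketch, again in PyTorch. Here `refine_unet` and `encode_text` are hypothetical placeholders for the separate high-resolution video diffusion model and its text encoder; the step count, the noise level, and the simplified denoising update are illustrative assumptions, not the paper's exact schedule.

```python
# A minimal sketch of a refinement-stage pass, assuming PyTorch.
# refine_unet and encode_text are hypothetical placeholder callables.
import torch
import torch.nn.functional as F

def refine(low_res_video, brief_text, refine_unet, encode_text, steps=50):
    """Upsample the base-stage output to 1280x720 and denoise it with a
    second diffusion model guided by a short text prompt."""
    b, c, t, h, w = low_res_video.shape
    # Spatially upsample each frame to the target 1280x720 resolution.
    frames = low_res_video.transpose(1, 2).reshape(b * t, c, h, w)
    frames = F.interpolate(frames, size=(720, 1280), mode="bilinear",
                           align_corners=False)
    video = frames.reshape(b, t, c, 720, 1280).transpose(1, 2)

    text_emb = encode_text(brief_text)  # brief text guidance, per the paper
    # Start from a lightly noised version of the upsampled video rather than
    # pure noise, so base-stage content is retained (illustrative choice).
    x = video + 0.1 * torch.randn_like(video)
    for step in reversed(range(steps)):
        t_step = torch.full((b,), step, device=x.device)
        x = refine_unet(x, t_step, text_emb)  # one denoising update
    return x
```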

Through extensive experiments, the researchers compare I2VGen-XL with leading existing methods, demonstrating its effectiveness in improving semantic accuracy, continuity of details, and clarity in generated videos. The model builds on Latent Diffusion Models (LDMs), which learn a diffusion process that generates samples from a target probability distribution; I2VGen-XL adopts a 3D UNet architecture for its LDM, referred to as VLDM, for effective and efficient video synthesis. The refinement stage is also analyzed in the frequency domain, highlighting its effectiveness in preserving low-frequency data and improving the continuity of high-definition videos.
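The kind of frequency-domain check described above can be sketched in a few lines of NumPy: compare the fraction of spectral energy in the low-frequency band of a frame before and after refinement. The function name, the radius threshold, and the random stand-in frames are illustrative assumptions, not the authors' analysis code.

```python
# A small sketch of a low-frequency preservation check, assuming NumPy.
# The radius threshold and stand-in frames are arbitrary illustrative choices.
import numpy as np

def low_freq_energy(frame, radius=8):
    """Fraction of a frame's spectral energy inside a low-frequency disc."""
    spectrum = np.fft.fftshift(np.fft.fft2(frame))
    power = np.abs(spectrum) ** 2
    h, w = frame.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    return power[mask].sum() / power.sum()

base_frame = np.random.rand(64, 64)                    # stand-in base frame
refined_frame = base_frame + 0.05 * np.random.rand(64, 64)
print(low_freq_energy(base_frame), low_freq_energy(refined_frame))
```

If refinement preserves low-frequency data as the paper reports, the two values should stay close while high-frequency detail is added.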

I2VGen-XL also shows richer and more diverse motion than leading methods such as Gen-2 and Pika, and qualitative analyses on a diverse range of images demonstrate its generalization ability. In conclusion, I2VGen-XL is a notable step forward in video synthesis, addressing key challenges in semantic accuracy and spatio-temporal continuity. The cascaded approach, combined with large-scale data collection and the use of Latent Diffusion Models, positions I2VGen-XL as a promising model for high-quality video generation from static images.

