
Harvard and Meta’s AI Research Study Examines the Difficulties and Solutions of Creating Multi-Modal Text-to-Image and Text-to-Video Generative AI Models

Large Language Models (LLMs) have advanced at a remarkable pace. Building on them, researchers have developed chatbots such as ChatGPT, email assistants, and coding tools; ChatGPT alone is reported to have around 100 million weekly active users.

But the capabilities of generative AI don't end with text. Text-To-Image (TTI) and Text-To-Video (TTV) models have opened up an entirely new class of workloads, and researchers at Harvard University and Meta have recently undertaken a study to examine the current landscape of TTI/TTV models from a systems perspective.

Taking a quantitative approach, the researchers assembled a suite of eight representative text-to-image and text-to-video generation tasks. Comparing these workloads against widely used language models such as LLaMA, they uncovered notable differences in where system performance bottlenecks lie. For example, Convolution accounts for up to 44% of execution time in Diffusion-based TTI models, while linear layers consume as much as 49% of execution time in Transformer-based TTI models.
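
To make this kind of per-operator breakdown concrete, here is a minimal sketch (not the authors' code) that profiles a toy block mixing convolution, attention, and linear layers with PyTorch's built-in profiler. The `TinyDiffusionBlock` module and its dimensions are illustrative stand-ins, not the models benchmarked in the study.

```python
# Minimal sketch: measuring an operator-level time breakdown with torch.profiler.
# TinyDiffusionBlock is a hypothetical stand-in for a real diffusion backbone.
import torch
import torch.nn as nn

class TinyDiffusionBlock(nn.Module):
    """Toy block mixing convolution, self-attention, and linear layers."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=4, batch_first=True)
        self.proj = nn.Linear(channels, channels)

    def forward(self, x):
        h = self.conv(x)                        # convolution
        b, c, hh, ww = h.shape
        seq = h.flatten(2).transpose(1, 2)      # (B, H*W, C) tokens for attention
        seq, _ = self.attn(seq, seq, seq)       # self-attention
        seq = self.proj(seq)                    # linear projection
        return seq.transpose(1, 2).reshape(b, c, hh, ww)

model = TinyDiffusionBlock()
x = torch.randn(1, 64, 32, 32)

with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(x)

# Sort by total CPU time to see which operators (conv, matmul/linear, attention)
# dominate execution, analogous to the per-operator breakdown in the study.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```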

Furthermore, the researchers highlighted the Temporal Attention bottleneck, whose cost grows quadratically as the number of frames increases, underscoring the need for future system optimizations targeted at this stage. To that end, they developed an analytical framework that models how memory and FLOP requirements change throughout the forward pass of a Diffusion model.
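
As a rough illustration of such an analytical model, the sketch below estimates temporal-attention FLOPs and attention-matrix memory as a function of frame count. The `temporal_attention_cost` helper and the chosen latent dimensions are hypothetical, not taken from the paper.

```python
# Back-of-the-envelope analytical model (illustrative dimensions, not measured data):
# how temporal-attention FLOPs and activation memory grow with the number of frames.

def temporal_attention_cost(num_frames, height, width, channels, bytes_per_elem=2):
    """Rough FLOP/memory estimate for temporal self-attention over frames.

    Each spatial location attends across the frame axis, so the attention
    matrix has shape (num_frames x num_frames) per location: cost grows with
    the square of the frame count.
    """
    spatial_tokens = height * width
    # QK^T plus attention-weighted V: ~2 matmuls of F x C by C x F per location
    flops = 2 * 2 * spatial_tokens * num_frames * num_frames * channels
    # Storing the attention matrices dominates activation memory at long frame counts
    attn_matrix_bytes = spatial_tokens * num_frames * num_frames * bytes_per_elem
    return flops, attn_matrix_bytes

for frames in (8, 16, 32, 64):
    flops, mem = temporal_attention_cost(frames, height=32, width=32, channels=320)
    print(f"{frames:3d} frames: ~{flops / 1e9:8.1f} GFLOPs, ~{mem / 1e6:7.1f} MB attention matrices")
```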

Additionally, they conducted a case study on the Stable Diffusion model to understand the impact of scaling image size, finding that once techniques such as Flash Attention are applied, Convolution exhibits a stronger scaling dependence on image size than Attention.
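
One simple way to explore this kind of scaling behaviour is to time a convolution and a flash-capable attention call at several latent resolutions. The snippet below is a minimal sketch of that idea with illustrative sizes, not the study's actual configuration; it uses PyTorch 2.0's `scaled_dot_product_attention`, which dispatches to memory-efficient/Flash-Attention kernels where available.

```python
# Minimal timing sketch (not the paper's methodology): how convolution and
# attention execution time change as the latent image size grows.
import time
import torch
import torch.nn.functional as F

channels = 320  # illustrative channel count
conv = torch.nn.Conv2d(channels, channels, kernel_size=3, padding=1)

def time_op(fn, repeats=5):
    fn()  # warm-up call before timing
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

with torch.no_grad():
    for size in (16, 32, 64):
        x = torch.randn(1, channels, size, size)
        tokens = x.flatten(2).transpose(1, 2)   # (1, H*W, C) tokens for attention
        conv_t = time_op(lambda: conv(x))
        # scaled_dot_product_attention uses memory-efficient / flash kernels where available,
        # matching the "after Flash Attention" setting described in the case study
        attn_t = time_op(lambda: F.scaled_dot_product_attention(tokens, tokens, tokens))
        print(f"latent {size}x{size}: conv {conv_t * 1e3:7.2f} ms, attention {attn_t * 1e3:7.2f} ms")
```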

These findings map out the systems challenges involved in building multi-modal Text-to-Image and Text-to-Video generative AI models, and point to where future optimizations are most needed. For the full details, be sure to check out the research paper.
