
This AI Paper from China Introduces Emu2: A 37 Billion Parameter Multimodal Model Redefining Task Solving and Adaptive Reasoning

Researchers from the Beijing Academy of Artificial Intelligence, Tsinghua University, and Peking University introduce Emu2, a 37-billion-parameter model that is rewriting the rules of task solving and adaptive reasoning for multimodal tasks. A multimodal task is any activity that requires comprehension and generation across one or more modalities; such tasks can be extremely varied and open-ended.

Scaling previous multimodal systems has traditionally been challenging because they rely on gathering large supervised training sets and developing task-specific architectures, an effort that must be repeated for every new task. Moreover, existing multimodal models have not matched people's ability to learn new tasks in context, that is, from only a few demonstrations or instructions. Generative pretrained language models, by contrast, have recently shown impressive in-context learning skills.

Emu2 is a 37-billion-parameter model trained and evaluated on several multimodal tasks. The findings show that, when scaled up, a multimodal generative pretrained model can likewise learn in context and generalize well to new multimodal tasks. Emu2 is trained with a single objective: predict the next multimodal element, whether a text token or a visual embedding. This unified generative pretraining technique trains the model on large-scale multimodal sequences such as text, image-text pairs, interleaved image-text documents, and video.
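The unified objective can be pictured as one loss over an interleaved sequence: text positions are scored with cross-entropy over a vocabulary, while visual-embedding positions are scored with a regression loss. The sketch below is a deliberately simplified, hypothetical illustration of this idea, not Emu2's actual training code; the element encoding and loss weighting are assumptions.

```python
import math

def next_element_loss(predictions, targets):
    """Toy version of a unified next-element objective over an
    interleaved multimodal sequence.

    Each target is either ("text", token_id), scored with cross-entropy
    against a predicted probability distribution, or ("image", vector),
    scored with L2 regression against the predicted embedding.
    """
    total = 0.0
    for pred, (kind, tgt) in zip(predictions, targets):
        if kind == "text":
            # pred is a probability distribution over the vocabulary
            total += -math.log(pred[tgt])
        else:  # "image": pred and tgt are continuous embedding vectors
            total += sum((p - t) ** 2 for p, t in zip(pred, tgt))
    return total / len(targets)

# Toy interleaved sequence: one text token, then one visual embedding.
preds = [[0.1, 0.7, 0.2], [0.5, 0.5]]
targets = [("text", 1), ("image", [0.5, 0.5])]
print(round(next_element_loss(preds, targets), 4))  # 0.1783
```

The point of the single objective is that the same autoregressive model handles both modalities, so no task-specific heads or losses need to be added per task.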

The Emu2 model is generative and multimodal: it learns to predict the next element in a multimodal sequence. Its design has three main parts: a Visual Encoder, a Multimodal Modeling module, and a Visual Decoder. The Visual Encoder tokenizes each input image into continuous embeddings, which are then interleaved with text tokens for autoregressive multimodal modeling. The Visual Decoder turns the regressed visual embeddings back into an image or video.
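The interleaving step can be sketched as a small pipeline that flattens mixed text/image segments into one sequence for the autoregressive model. The function names (`encode_image`, `tokenize`) are hypothetical stand-ins for the real Visual Encoder and text tokenizer, and the patch-embedding shapes are made up for illustration.

```python
def build_multimodal_sequence(segments, encode_image, tokenize):
    """Interleave text tokens and continuous image embeddings into a
    single sequence for autoregressive multimodal modeling
    (simplified sketch, not Emu2's actual pipeline)."""
    seq = []
    for kind, content in segments:
        if kind == "text":
            seq.extend(("text", t) for t in tokenize(content))
        else:  # "image"
            seq.extend(("image", e) for e in encode_image(content))
    return seq

# Toy stand-ins: whitespace tokenizer, fixed two-patch image encoder.
tokenize = lambda s: s.split()
encode_image = lambda img: [[0.0, 1.0], [1.0, 0.0]]

seq = build_multimodal_sequence(
    [("text", "a photo of"), ("image", "cat.png"), ("text", "a cat")],
    encode_image, tokenize)
print(len(seq))  # 3 text tokens + 2 image embeddings + 2 text tokens = 7
```

Because images become continuous embeddings in the same stream as text tokens, the model can attend across modalities with a single transformer, and the Visual Decoder only has to invert the image side of the mapping.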

For Emu2’s pretraining, the researchers used several publicly available datasets: image-text pairs from LAION-2B and CapsFusion-120M, video-text pairs from WebVid-10M, interleaved image-text data from Multimodal-C4 (MMC4), interleaved video-text data from YT-Storyboard-1B, and grounded image-text pairs from GRIT-20M (introduced by Kosmos-2) and CapsFusion-grounded-100M (curated from CapsFusion-120M). Language-only data from Pile is also included to help preserve textual reasoning capacity.

Using conventional multimodal datasets as well as tasks not observed during training, the researchers evaluate Emu2’s ability to learn from a few examples or instructions. Two settings are used: fitting a large number of examples within the model’s context window (few-shot evaluation) and fine-tuning the model to obey particular commands (instruction tuning). Across a variety of vision-language tasks, Emu2 shows exceptional few-shot performance on several visual question-answering datasets, and performance improves as the number of in-context examples grows. For real-world tasks such as recognition and counting, Emu2 excels at multimodal reasoning. It also learns to follow contextual visual cues, despite struggling at smaller scales or in zero-shot settings.
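The few-shot setting described above amounts to packing k worked demonstrations into the context before the query. A hypothetical sketch of how such a prompt might be assembled for a visual question-answering task (the prompt layout and field names are assumptions, not the paper's exact format):

```python
def build_few_shot_prompt(examples, query, k):
    """Assemble a k-shot in-context VQA prompt: k (image, question,
    answer) demonstrations followed by the query image and question,
    leaving the final answer for the model to complete."""
    parts = []
    for img, q, a in examples[:k]:
        parts.append(("image", img))
        parts.append(("text", f"Question: {q} Answer: {a}"))
    parts.append(("image", query[0]))
    parts.append(("text", f"Question: {query[1]} Answer:"))
    return parts

demos = [
    ("img1.png", "How many dogs?", "2"),
    ("img2.png", "What color is the car?", "red"),
    ("img3.png", "Is it raining?", "no"),
]
prompt = build_few_shot_prompt(demos, ("img4.png", "How many cats?"), k=2)
print(len(prompt))  # 2 demos x 2 parts + query image + query text = 6
```

Scaling k up (until the context window is full) is exactly the knob the evaluation turns: more in-context examples, better performance.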

What’s more, Emu2 is a robust and flexible base model for many multimodal tasks, since it can handle interleaved text, images, and video on both the input and output sides, and it follows task-specific instructions. For instance, after further fine-tuning on conversational data, Emu2 outperforms earlier models with more complicated designs and achieves state-of-the-art results on visual question-answering tasks. Furthermore, Emu2 can be adapted into a high-quality, controllable visual generation model: images, locations, and text can all serve as input conditions, and the model generates images that follow the given parameters.

This research is redefining the standards for task solving and adaptive reasoning with a new generative multimodal model that learns impressively from context. We can’t wait to see the real-world applications that Emu2 enables with its incredible potential!
