Open-source large multimodal models (LMMs) such as LLaVA, CogVLM, and DreamLLM currently face significant limitations, as most primarily handle multimodal understanding without native generation capabilities. They often lack the native integration needed to align visual representations with a pre-trained language model, which adds complexity and inefficiency to both training and inference. Moreover, many are either restricted to single-modality outputs or depend on separate diffusion models for visual modeling and generation.
Addressing these concerns, researchers from the Generative AI Research Lab have developed ANOLE, an open, autoregressive, native LMM designed for interleaved image-text generation. The model builds on Meta AI's Chameleon and uses a data-efficient, parameter-efficient fine-tuning strategy. The aim is to extend Chameleon's capacity for vision and multimodal generation without compromising its text generation and comprehension capabilities.
ANOLE employs an early-fusion, token-based autoregressive approach to model multimodal sequences, relying solely on transformers and eliminating the need for separate diffusion models. Embodying the notion of "less is more", fine-tuning is confined to the logits corresponding to image token IDs in the transformer's output head layer. ANOLE-7b-v0.1, in particular, was trained on a small amount of image data (5,859 images) and fine-tuned on fewer than 40M parameters in roughly 30 minutes on 8 A100 GPUs.
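To make the selective-logit idea concrete, here is a minimal PyTorch sketch of fine-tuning only the image-token rows of an output head. The hidden size, vocabulary size, and image-token ID range are illustrative assumptions, and the gradient-mask hook is one simple way to realize the restriction rather than ANOLE's actual implementation:

```python
import torch
import torch.nn as nn

# Toy stand-in for the output head of a Chameleon-style model, which maps hidden
# states to logits over a shared text+image token vocabulary. All sizes and the
# image-token ID range below are illustrative, not ANOLE's real configuration.
HIDDEN_SIZE = 512
VOCAB_SIZE = 10_000
IMAGE_TOKEN_IDS = torch.arange(6_000, 10_000)  # hypothetical image-token range

lm_head = nn.Linear(HIDDEN_SIZE, VOCAB_SIZE, bias=False)

# In a full setup the transformer backbone would be frozen (requires_grad=False)
# and only the output head left trainable; only the head is shown here.

# Zero out gradients for every row of the head that produces a text-token logit,
# so optimizer steps only move the image-token logits.
grad_mask = torch.zeros(VOCAB_SIZE, 1)
grad_mask[IMAGE_TOKEN_IDS] = 1.0
lm_head.weight.register_hook(lambda grad: grad * grad_mask)

# Dummy forward/backward pass with fake transformer outputs and a placeholder loss.
hidden_states = torch.randn(2, 16, HIDDEN_SIZE)
logits = lm_head(hidden_states)
loss = logits.mean()
loss.backward()

# Rows for text tokens received zero gradient, so text behavior is preserved.
assert lm_head.weight.grad[:6_000].abs().sum().item() == 0.0
```

Because gradients into the text-token rows are zeroed, the model's original text generation is left untouched while the image-token logits are adapted, which is what keeps the trainable parameter count so small.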
Despite these data and parameter constraints, ANOLE has demonstrated remarkable image and multimodal generation capabilities, producing high-quality, coherent interleaved image-text sequences. A qualitative analysis showed that ANOLE can generate diverse and accurate visuals from textual descriptions and organically integrate text and images in interleaved sequences.
In conclusion, ANOLE is a game-changing solution that overcomes previous limitations of open-source LMMs. Not only is it data- and parameter-efficient, it also delivers high-quality multimodal generation. By building on the framework provided by Chameleon, ANOLE makes advanced multimodal AI technologies more accessible while encouraging inclusive and collaborative research in this field.