Recent multimodal foundation models are often limited in how deeply they can fuse modalities because they rely on separate, modality-specific encoders or decoders. That separation makes it harder to integrate information across content types and to generate multimodal documents with interleaved sequences of images and text.
In response, Meta researchers have introduced Chameleon, a mixed-modal foundation model that generates and reasons over interleaved sequences of images and text, enabling the creation of complete multimodal documents. Unlike conventional models, Chameleon uses a unified architecture that tokenizes images the same way it tokenizes text, placing both modalities on an equal footing. This early-fusion approach enables cohesive reasoning across modalities, but it also introduces optimization challenges.
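Conceptually, early fusion means image tokens and text tokens share a single vocabulary, so one transformer can model an interleaved stream. The sketch below illustrates the idea only; the text vocabulary size and the begin/end-of-image markers are made up for illustration, while the 8,192-entry image codebook figure comes from the paper.

```python
# Minimal sketch of early-fusion tokenization: image codes are shifted into the
# same id space as text tokens, producing one flat sequence for the transformer.
TEXT_VOCAB_SIZE = 65_536                      # hypothetical text vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192                   # image codebook size reported for Chameleon
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE   # hypothetical begin-of-image marker
EOI = BOI + 1                                 # hypothetical end-of-image marker

def image_token_id(code: int) -> int:
    """Shift an image codebook index into the shared vocabulary, after the text ids."""
    return TEXT_VOCAB_SIZE + code

def interleave(text_tokens: list[int], image_codes: list[int]) -> list[int]:
    """Build one flat token sequence: text, then an image segment wrapped in markers."""
    return text_tokens + [BOI] + [image_token_id(c) for c in image_codes] + [EOI]

# Example: a short caption followed by a (tiny, fake) image of four codes.
sequence = interleave([42, 7, 305], [12, 4096, 77, 8191])
print(sequence)  # one stream the model treats uniformly across modalities
```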
To address these challenges, the research team incorporated architectural improvements and devised new training methods, including adaptations of the transformer architecture. The team also developed a new image tokenizer that encodes a 512 × 512 image into 1,024 tokens drawn from a codebook of 8,192 entries, though the tokenizer is weaker at reconstructing text-heavy images.
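To make the arithmetic concrete: an encoder that downsamples by a factor of 16 maps a 512 × 512 image to a 32 × 32 grid of latents (1,024 positions), and each latent is replaced by the index of its nearest codebook entry. The snippet below is a toy stand-in, not the actual tokenizer; only the token count and codebook size match the description above.

```python
# Toy VQ-style image tokenizer: 512x512 image -> 32x32 grid of latents -> 1024 codes.
import torch
import torch.nn as nn

class ToyImageTokenizer(nn.Module):
    def __init__(self, codebook_size=8192, latent_dim=256):
        super().__init__()
        # Stand-in encoder: stride-16 downsampling turns 512x512 pixels into 32x32 latents.
        self.encoder = nn.Conv2d(3, latent_dim, kernel_size=16, stride=16)
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        latents = self.encoder(images)                      # (B, D, 32, 32)
        b, d, h, w = latents.shape
        flat = latents.permute(0, 2, 3, 1).reshape(-1, d)   # (B*1024, D)
        # Nearest-neighbour lookup against the 8192-entry codebook.
        dists = torch.cdist(flat, self.codebook.weight)     # (B*1024, 8192)
        codes = dists.argmin(dim=-1)                        # discrete token ids
        return codes.view(b, h * w)                         # (B, 1024)

tokens = ToyImageTokenizer()(torch.randn(1, 3, 512, 512))
print(tokens.shape)  # torch.Size([1, 1024])
```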
To tackle training instabilities, the team applied dropout, QK-Norm, and z-loss regularization, which ultimately enabled effective training on Meta's Research SuperCluster (RSC). At inference time, they streamlined mixed-modal generation with a pipeline built on PyTorch and xformers, and they used token masking to enforce modality-specific constraints during generation.
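For readers curious what two of these stability tricks look like in practice, here is a hedged sketch: QK-Norm normalizes queries and keys before the attention dot product, and z-loss penalizes the squared log-partition of the output softmax. The layer shapes, the choice of LayerNorm, and the loss coefficient below are assumptions, not details taken from the paper.

```python
# Sketch of QK-Norm attention and a z-loss penalty (illustrative shapes and coefficient).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.q_norm = nn.LayerNorm(self.head_dim)   # normalize queries per head
        self.k_norm = nn.LayerNorm(self.head_dim)   # normalize keys per head
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, t, self.num_heads, self.head_dim)
        q, k, v = [z.view(shape).transpose(1, 2) for z in (q, k, v)]
        q, k = self.q_norm(q), self.k_norm(k)        # QK-Norm keeps attention logits bounded
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(b, t, -1))

def z_loss(logits: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    """Penalize the softmax log-normalizer drifting away from zero (assumed coefficient)."""
    log_z = torch.logsumexp(logits, dim=-1)
    return coeff * (log_z ** 2).mean()
```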
The fine-tuning phase involved curating high-quality images with an aesthetic classifier and tuning the model on datasets spanning text, code, visual chat, and safety. Fine-tuning used data balancing across modalities, a cosine learning rate schedule, and a weight decay of 0.1. Each training instance paired a prompt with a corresponding response, and the loss was computed only on the response tokens, as sketched below.
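As a sketch of the response-only optimization together with the stated hyperparameters (cosine schedule, weight decay 0.1), the snippet below masks prompt positions with an ignore index so they contribute nothing to the loss. The model interface, learning rate, and label convention are placeholders rather than details from the paper.

```python
# Response-only supervised fine-tuning loss: prompt tokens are masked out of the objective.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label are excluded from the cross-entropy loss

def build_labels(prompt_ids: torch.Tensor, response_ids: torch.Tensor):
    """Concatenate prompt and response; mask the prompt so only responses are optimized."""
    input_ids = torch.cat([prompt_ids, response_ids])
    labels = input_ids.clone()
    labels[: prompt_ids.numel()] = IGNORE_INDEX
    return input_ids, labels

def sft_loss(model, input_ids: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    logits = model(input_ids.unsqueeze(0))         # placeholder model: (1, T, vocab_size)
    return F.cross_entropy(
        logits[:, :-1].flatten(0, 1),              # predict token t+1 from position t
        labels[1:],                                # shifted targets, prompt part ignored
        ignore_index=IGNORE_INDEX,
    )

# Optimizer and schedule mirroring the stated weight decay and cosine schedule
# (the learning rate and step count here are placeholders):
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)
```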
When assessed on text-only abilities, Chameleon performed competitively with state-of-the-art models across tasks such as commonsense reasoning and math, outperforming Llama-2 on many of them thanks to improved pre-training and the inclusion of code data. Chameleon also excelled at image-to-text tasks, particularly image captioning, outshining larger models such as Flamingo-80B and IDEFICS-80B. In visual question answering (VQA), its performance approached that of the top models, though LLaVA-1.5 slightly surpassed it on VQA-v2.
In conclusion, the study attributes Chameleon's strong vision-language performance to its unified representation of image and text tokens and its early-fusion design, together with the training techniques that overcame the scaling obstacles that design introduces. The model's versatility and strength across tasks are notable, often achieved with fewer in-context training examples and smaller model sizes. It marks a significant advance in multimodal interaction, with strong performance on mixed-modal open-ended QA benchmarks that surpasses late-fusion models such as Flamingo and IDEFICS.
The research authors deserve full credit for this pioneering project.