Open-source pre-training datasets play a critical role in investigating data engineering and fostering transparent, accessible modeling. Recently, frontier labs have moved toward building large multimodal models (LMMs), which require sizable datasets composed of both visual and textual data. These models advance faster than multimodal training data becomes available for unrestricted, open-source models, widening the performance gap between proprietary frontier models and their open-source counterparts.
The paper situates its contribution among related work on multimodal interleaved data, large open-source pre-training datasets, and LMMs. Multimodal interleaved datasets were first used in Flamingo and CM3 and were later released as open-source resources in the form of Multimodal-C4 and OBELICS. More recent efforts such as Chameleon and MM1 have scaled interleaved data well beyond OBELICS to train cutting-edge multimodal models. For LMMs, the goal is to pre-train language models on large-scale multimodal interleaved and image-text datasets, a methodology first introduced by Flamingo and subsequently adopted by open-source models such as OpenFlamingo, Idefics, and Emu.
Contributions from researchers at institutions including the University of Washington, Salesforce Research, Stanford University, and the University of California, Berkeley have led to the creation of the Multimodal INTerleaved (MINT-1T) dataset. Currently the largest and most diverse open-source multimodal interleaved dataset, MINT-1T contains one trillion text tokens and three billion images, sourced from HTML documents, PDFs, and ArXiv papers. This represents roughly a 10x scale-up over prior open-source alternatives, and models trained on MINT-1T are reported to match or surpass models trained on the leading existing open-source dataset, OBELICS.
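To make the notion of "interleaved" concrete, the sketch below shows one plausible way to represent a document as an ordered sequence of text spans and image references that preserves the original layout. The class and field names here are hypothetical for illustration and are not the exact schema of the MINT-1T release.

```python
# Illustrative sketch of a multimodal interleaved document: an ordered
# mix of text blocks and image references, keeping the source order.
# Names (Block, InterleavedDocument, image_url, ...) are hypothetical.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    kind: str                        # "text" or "image"
    text: Optional[str] = None       # populated when kind == "text"
    image_url: Optional[str] = None  # populated when kind == "image"


@dataclass
class InterleavedDocument:
    source: str                      # e.g. "html", "pdf", or "arxiv"
    blocks: List[Block]              # blocks in original document order


doc = InterleavedDocument(
    source="html",
    blocks=[
        Block(kind="text", text="Figure 1 shows the training curve."),
        Block(kind="image", image_url="https://example.com/fig1.png"),
        Block(kind="text", text="Loss decreases steadily over training."),
    ],
)
```

Keeping text and images in document order, rather than as detached image-caption pairs, is what lets models trained on such data learn from multi-image, multi-turn context.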
The MINT-1T dataset surpasses the volume of prior open-source datasets by incorporating a wider mix of document types, such as PDFs and ArXiv papers. Text quality control measures included using fastText's language identification model to exclude non-English documents and removing documents with explicit or otherwise undesirable content. To evaluate in-context learning, models were prompted with 1 to 15 demonstrations, running a single trial per shot count for each evaluation benchmark.
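The language filtering step can be illustrated with a short sketch using fastText's publicly released language identification model (lid.176.bin). The confidence threshold and helper function below are illustrative assumptions, not necessarily the exact settings used for MINT-1T.

```python
# Minimal sketch of English-only filtering with fastText language ID.
# Assumes the public lid.176.bin model; the 0.65 threshold is an
# illustrative choice, not the MINT-1T configuration.
import fasttext

# Model available at:
# https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
lang_model = fasttext.load_model("lid.176.bin")


def is_english(text: str, threshold: float = 0.65) -> bool:
    """Return True if the predicted language is English with at least
    `threshold` confidence."""
    # fastText expects a single line of input text.
    labels, probs = lang_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold


documents = ["An example English document.", "Un document en français."]
english_documents = [d for d in documents if is_english(d)]
```

A confidence threshold trades recall for precision: raising it discards more borderline documents, while lowering it keeps more multilingual or noisy text in the corpus.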
The researchers conclude by outlining future plans, including training models on larger subsets of MINT-1T and developing multimodal document filtering methods to further improve data quality. With its trillion-token multimodal interleaved corpus, MINT-1T serves as a crucial resource for training LMMs. By offering a broader and more diverse dataset, it better equips the research community to conduct open science on multimodal interleaved data.