Open-source pre-training datasets play a critical role in investigating data engineering and fostering transparent, accessible modeling. Recently, frontier labs have moved toward building large multimodal models (LMMs), which require sizable datasets composed of both visual and textual data. These models advance faster than multimodal training data becomes available for unrestricted, open-source models, widening the performance gap between proprietary frontier models and their open-source counterparts.
The paper situates its contribution among related work on multimodal interleaved data, large open-source pre-training datasets, and LMMs. Multimodal interleaved datasets were first used in Flamingo and CM3 and were later released as open-source resources in the form of Multimodal-C4 and OBELICS. More recent efforts such as Chameleon and MM1 have scaled interleaved data well beyond OBELICS to train cutting-edge multimodal models. For LMMs, the goal is to pre-train language models on large-scale multimodal interleaved and image-text datasets, a methodology first introduced by Flamingo and subsequently adopted by open-source models such as OpenFlamingo, Idefics, and Emu.
Contributions from researchers at institutions including the University of Washington, Salesforce Research, Stanford University, and the University of California, Berkeley have led to the creation of the Multimodal INTerleaved (MINT-1T) dataset. Currently the largest and most diverse open-source multimodal interleaved dataset, MINT-1T contains one trillion text tokens and three billion images, sourced from HTML documents, PDFs, and ArXiv papers. This represents roughly a 10x scale-up over prior open-source alternatives, and models trained on MINT-1T are reported to match or surpass models trained on the leading existing open-source dataset, OBELICS.
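To make the notion of "interleaved" concrete, the sketch below shows one plausible way to represent a document as an ordered sequence of text spans and image references that preserves the original layout. The class and field names here are hypothetical for illustration and are not the exact schema of the MINT-1T release.

```python
# Illustrative sketch of a multimodal interleaved document: an ordered
# mix of text blocks and image references, keeping the source order.
# Names (Block, InterleavedDocument, image_url, ...) are hypothetical.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    kind: str                        # "text" or "image"
    text: Optional[str] = None       # populated when kind == "text"
    image_url: Optional[str] = None  # populated when kind == "image"


@dataclass
class InterleavedDocument:
    source: str                      # e.g. "html", "pdf", or "arxiv"
    blocks: List[Block]              # blocks in original document order


doc = InterleavedDocument(
    source="html",
    blocks=[
        Block(kind="text", text="Figure 1 shows the training curve."),
        Block(kind="image", image_url="https://example.com/fig1.png"),
        Block(kind="text", text="Loss decreases steadily over training."),
    ],
)
```

Keeping text and images in document order, rather than as detached image-caption pairs, is what lets models trained on such data learn from multi-image, multi-turn context.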
The MINT-1T dataset surpasses the volume of prior open-source datasets by incorporating a wider mix of document types, such as PDFs and ArXiv papers. Text quality control measures included using fastText's language identification model to exclude non-English documents and removing documents with explicit or otherwise undesirable content. To evaluate in-context learning, models were prompted with 1 to 15 demonstrations, running a single trial per shot count for each evaluation benchmark.
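The language filtering step can be illustrated with a short sketch using fastText's publicly released language identification model (lid.176.bin). The confidence threshold and helper function below are illustrative assumptions, not necessarily the exact settings used for MINT-1T.

```python
# Minimal sketch of English-only filtering with fastText language ID.
# Assumes the public lid.176.bin model; the 0.65 threshold is an
# illustrative choice, not the MINT-1T configuration.
import fasttext

# Model available at:
# https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
lang_model = fasttext.load_model("lid.176.bin")


def is_english(text: str, threshold: float = 0.65) -> bool:
    """Return True if the predicted language is English with at least
    `threshold` confidence."""
    # fastText expects a single line of input text.
    labels, probs = lang_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold


documents = ["An example English document.", "Un document en français."]
english_documents = [d for d in documents if is_english(d)]
```

A confidence threshold trades recall for precision: raising it discards more borderline documents, while lowering it keeps more multilingual or noisy text in the corpus.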
The researchers conclude by outlining future plans, including training models on larger subsets of MINT-1T and developing multimodal document filtering methods to further improve data quality. With its trillion-token multimodal interleaved corpus, MINT-1T serves as a crucial resource for training LMMs. By offering a broader and more diverse dataset, it better equips the research community to conduct open science on multimodal interleaved data.