Large Language Models (LLMs) have become critical tools for Natural Language Processing (NLP) tasks, including question answering, text summarization, and few-shot learning. Despite their prevalence, the development process behind the most capable models, particularly the composition of their pretraining data, often remains undisclosed. This opacity makes it harder to understand how the pretraining corpus shapes a model's abilities and shortcomings, hinders scientific progress, and ultimately affects the end users of these models.
In a bid to promote transparency in language model pretraining, a recent study has introduced Dolma, an enormous English corpus comprising three trillion tokens. Dolma is compiled from a variety of sources, including encyclopedias, scientific publications, code repositories, public-domain literature, and web content. The researchers have also made their data curation toolkit available for public use to encourage additional experimentation and replication of their results.
The primary objective of the study is to make language model research and development more accessible. The authors emphasize the need for data transparency and openness, arguing that it enables application developers and users to make more informed decisions and supports better task performance. They also argue that studying how data composition influences model behavior requires open pretraining data: only then can the modeling community inspect and refine state-of-the-art data curation techniques and address major concerns such as training data attribution, adversarial attacks, deduplication, memorization, and benchmark contamination.
Furthermore, the availability of diverse, large-scale pretraining data is crucial for building capable open language models. Open datasets are also a prerequisite for research that traces a model's generated output back to the pretraining data it was exposed to.
The research paper documents Dolma in detail, including descriptions of its contents, assembly procedures, and design principles. It also presents analyses and experimental results from training language models on intermediate versions of Dolma. These findings offer insights into effective data curation practices, such as the impact of content and quality filters, the choice of deduplication method, and the benefits of training on a multi-source blend of data.
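To make those curation steps concrete, below is a minimal sketch in plain Python. It is not the Dolma toolkit's actual API, and the file names and JSONL fields (`text`, `source`) are assumptions for illustration; it simply chains the kinds of steps such ablations compare: a heuristic quality filter, exact deduplication via content hashing, and a simple multi-source blend.

```python
# Illustrative sketch of a document curation pipeline (not the Dolma toolkit's API).
# Assumes per-source JSONL files with a "text" field; all paths and field names are hypothetical.

import hashlib
import json
from pathlib import Path
from typing import Iterable, Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one document (a dict) per line from a JSONL file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def mix_sources(source_paths: list[Path]) -> Iterator[dict]:
    """Concatenate documents from several sources into one multi-source blend."""
    for path in source_paths:
        for doc in read_jsonl(path):
            doc.setdefault("source", path.stem)  # tag each document with its origin
            yield doc


def quality_filter(docs: Iterable[dict], min_words: int = 50) -> Iterator[dict]:
    """Drop very short documents -- a stand-in for content/quality filtering."""
    for doc in docs:
        if len(doc.get("text", "").split()) >= min_words:
            yield doc


def exact_dedupe(docs: Iterable[dict]) -> Iterator[dict]:
    """Remove exact duplicates by hashing each document's text."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


if __name__ == "__main__":
    # Hypothetical per-source shards; replace with real paths.
    sources = [Path("web.jsonl"), Path("code.jsonl"), Path("papers.jsonl")]
    for doc in exact_dedupe(quality_filter(mix_sources(sources))):
        print(json.dumps(doc))
```

Real pipelines layer far more elaborate filters (language identification, quality and toxicity classifiers) and fuzzy as well as exact deduplication, but the overall shape is the same: per-document filters followed by corpus-level deduplication over a blend of sources.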
Using Dolma, the team trained OLMo, a state-of-the-art open language model and framework intended to push the field of language modeling forward and showcase the potential of the Dolma corpus. The researchers' main contributions are the release of the Dolma corpus, consisting of three trillion tokens drawn from seven distinct sources, and the open-sourcing of the Dolma Toolkit, a high-performance, user-friendly tool for curating large datasets for language model pretraining. The toolkit helps practitioners build their own data curation pipelines and reproduce the curation process behind Dolma.
The complete research is available in the team's paper and on GitHub.