Language models have become an integral part of natural language processing, assisting in tasks like text generation, translation, and sentiment analysis. Their efficiency and accuracy, however, depend heavily on the quality of their training data. Creating high-quality datasets is a complex process that involves filtering out irrelevant or harmful content, removing duplicates, and selecting valuable data sources.
Traditional methods of dataset curation usually depend on heuristic-based filtering, deduplication, and sourcing data from large web crawls. These methods, while somewhat successful, often lack standardized benchmarks, so language model performance cannot be evaluated consistently across curation choices. Without consistent evaluation, it is difficult to determine which data curation strategies work best, which slows progress in the field.
To address these gaps, researchers from Apple, the University of Washington, and several other institutions have introduced DataComp for Language Models (DCLM). They also recently open-sourced the DCLM models and datasets on the Hugging Face platform, including DCLM-7B, DCLM-1B, dclm-7b-it, DCLM-7B-8k, dclm-baseline-1.0, and dclm-baseline-1.0-parquet. The DCLM framework provides a standardized approach to dataset curation, enabling controlled, comparable experiments evaluated against a common suite of 53 downstream tasks.
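For readers who want to try the released checkpoints, the sketch below shows how one of them might be loaded from the Hugging Face Hub. It assumes a standard transformers-compatible causal-LM interface; the repository ID and exact loading procedure (for example, whether the open_lm package or trust_remote_code is needed) should be confirmed on the relevant model card.

```python
# Hedged sketch: loading a released DCLM checkpoint via the transformers AutoModel API.
# The repo ID and loading details are assumptions; consult the model card for specifics.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "apple/DCLM-7B"  # assumed repo ID; other releases include DCLM-1B, DCLM-7B-8k, dclm-7b-it

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Simple generation to sanity-check the checkpoint.
inputs = tokenizer("Dataset curation matters because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```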
DCLM's structured workflow spans model scales from 412M to 7B parameters and lets researchers experiment with deduplication, filtering, and data-mixing strategies. Models are trained with a standardized training recipe and fixed hyperparameters at each scale, and their performance on the downstream task suite provides a precise measure of dataset quality.
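To make the idea concrete, here is a minimal, self-contained sketch of such a curate-then-rank loop. The deduplication and scoring functions are toy stand-ins for illustration only, not DCLM's actual tooling (which uses near-duplicate detection and a trained quality classifier).

```python
# Illustrative, self-contained sketch of a DCLM-style curation loop.
# The hash-based dedup and heuristic score below are toy stand-ins, not DCLM's tooling.
import hashlib

def deduplicate(documents):
    """Drop exact duplicates by content hash (DCLM uses more sophisticated near-dedup)."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def quality_score(doc):
    """Toy quality proxy (lexical diversity); DCLM's baseline uses a trained fastText classifier."""
    words = doc.split()
    return len(set(words)) / max(len(words), 1)

def curate(documents, keep_fraction=0.1):
    """Dedup, score, and keep only the top-scoring fraction of documents."""
    documents = deduplicate(documents)
    ranked = sorted(documents, key=quality_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]

corpus = [
    "the cat sat on the mat",
    "the cat sat on the mat",  # exact duplicate, removed
    "a diverse, informative paragraph about data curation",
]
print(curate(corpus, keep_fraction=0.5))
```

The curated output would then feed the standardized training recipe at a chosen scale, and downstream evaluation scores would serve as the measure of how good the curation strategy was.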
DCLM has already yielded notable improvements in language model training. For example, the baseline dataset created with DCLM was used to train a 7B-parameter language model from scratch, and that model substantially outperformed MAP-Neo, the previous state-of-the-art open-data language model.
The scalability of the DCLM framework has been demonstrated through large-scale experiments on DCLM-Pool, a corpus of 240 trillion tokens derived from Common Crawl. These experiments highlighted the essential role of model-based filtering in assembling high-quality training sets, and the DCLM baseline dataset consistently surpassed other open-source datasets such as RefinedWeb and RedPajama across evaluations.
To isolate the effects of individual curation techniques, the research team compared text extraction methods and found that the choice of extractor significantly affects downstream performance. They also assessed several model-based quality-filtering strategies, with the fastText OH-2.5 + ELI5 classifier proving the most effective.
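The sketch below illustrates how such a fastText-based quality filter might be applied to candidate documents. The model filename, label string, and threshold are assumptions for illustration; the actual classifier artifact and the fraction of documents retained are determined by the DCLM pipeline.

```python
# Sketch of model-based quality filtering with a fastText classifier.
# The model path, label name, and threshold are assumptions; DCLM uses its own
# trained OH-2.5 + ELI5 classifier and keeps a top-scoring fraction of documents.
import fasttext

classifier = fasttext.load_model("oh25_eli5_quality.bin")  # hypothetical filename

def positive_probability(document: str) -> float:
    """Probability the classifier assigns to the 'high quality' label."""
    labels, probs = classifier.predict(document.replace("\n", " "), k=2)
    for label, prob in zip(labels, probs):
        if label == "__label__hq":  # assumed label name for the positive class
            return float(prob)
    return 0.0

def filter_documents(documents, threshold=0.5):
    """Keep only documents the classifier considers likely high quality."""
    return [d for d in documents if positive_probability(d) >= threshold]
```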
In summary, DCLM allows researchers to conduct controlled experiments and identify the most effective strategies for improving language models. It provides a standardized approach to dataset curation and sets a new benchmark for dataset quality, while demonstrating that significant performance gains are achievable with less compute. Training on carefully curated, high-quality data is what makes these gains possible.