Zyphra, a company specializing in data science, recently unveiled Zyda, a 1.3 trillion-token open dataset for language modeling. The company claims that Zyda will reshape language model training and research by offering an unmatched combination of size, quality, and accessibility.
Zyda combines several high-quality open datasets that have undergone stringent filtering and deduplication. The dataset was expressly designed to support large-scale language modeling experiments and training at a scale and quality previously difficult to achieve with open data. In Zyphra's evaluations, Zyda consistently outperforms existing datasets such as Dolma, FineWeb, the Pile, RefinedWeb, and SlimPajama, making it a valuable resource for researchers and developers in the field of language modeling.
One of Zyda's standout features is its scale: 1.3 trillion rigorously filtered and deduplicated tokens drawn from high-quality source datasets. This scale allows models trained on Zyda to achieve strong accuracy and robustness.
In comparative evaluations, Zyda outperforms all major open language modeling datasets, including each of their individual subsets, which supports the effectiveness of its holistic approach to data aggregation and processing. Another key feature is cross-dataset deduplication, which removes duplicates both within and between the constituent datasets, preserving the data's integrity and uniqueness.
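Zyphra's exact deduplication code is not included in this article, but the general idea behind cross-dataset deduplication can be illustrated with a minimal sketch. The example below uses MinHash locality-sensitive hashing via the `datasketch` library and maintains a single index across all sources, so near-duplicates are caught both within and between datasets. The corpora, thresholds, and document contents here are placeholders, not details drawn from Zyda itself.

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # number of hash permutations per MinHash signature


def minhash_of(text: str) -> MinHash:
    """Build a MinHash signature from a document's lowercased words."""
    m = MinHash(num_perm=NUM_PERM)
    for word in text.lower().split():
        m.update(word.encode("utf-8"))
    return m


# One LSH index shared across *all* source datasets, so duplicates are
# detected both within a dataset and between datasets.
lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)

# Placeholder corpora standing in for the real source datasets.
corpora = {
    "dataset_a": ["the quick brown fox jumps over the lazy dog"],
    "dataset_b": [
        "the quick brown fox jumps over the lazy dog",  # cross-dataset duplicate
        "an entirely different document about language models",
    ],
}

kept = []
for name, docs in corpora.items():
    for i, doc in enumerate(docs):
        sig = minhash_of(doc)
        if lsh.query(sig):          # a similar document is already indexed
            continue                # drop the duplicate
        lsh.insert(f"{name}/{i}", sig)
        kept.append((name, i))

print(kept)  # the duplicate in dataset_b is filtered out
```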
Zyda is released under an open and permissive license, making it freely available to the community, in line with Zyphra’s commitment to promoting open research and collaboration in Natural Language Processing (NLP).
Zyda was created by merging seven well-known open language modeling datasets: RefinedWeb, StarCoder, C4, the Pile, SlimPajama, peS2o, and arXiv. All of them passed through a uniform post-processing pipeline, improving overall quality and consistency.
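As a rough illustration of what applying a uniform post-processing step to several sources and then merging them might look like, the sketch below uses the Hugging Face `datasets` library with tiny in-memory stand-ins for the real corpora. The field names, normalization step, and sample documents are assumptions for demonstration only; Zyda's actual pipeline is considerably more involved.

```python
from datasets import Dataset, concatenate_datasets

# Tiny placeholder stand-ins for the seven source corpora.
sources = {
    "refinedweb": Dataset.from_dict({"text": ["  A web page about NLP.  "]}),
    "starcoder": Dataset.from_dict({"text": ["def hello():\n    return 'hi'"]}),
    "arxiv": Dataset.from_dict({"text": ["We study large-scale language models."]}),
}


def postprocess(example, source):
    """Uniform post-processing applied to every source: trim whitespace
    and record which dataset the document came from."""
    return {"text": example["text"].strip(), "source": source}


processed = [
    ds.map(postprocess, fn_kwargs={"source": name})
    for name, ds in sources.items()
]

merged = concatenate_datasets(processed)
print(merged[0])  # {'text': 'A web page about NLP.', 'source': 'refinedweb'}
```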
The pipeline applied thorough syntactic filtering to remove low-quality documents, followed by aggressive deduplication. Cross-dataset deduplication mattered in particular because many of the source datasets draw on the same underlying sources, such as Common Crawl, and therefore overlap heavily. This cleaning process reduced the initial roughly 2 trillion tokens to the final 1.3 trillion.
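The article does not specify which syntactic filters Zyphra used. The heuristics below (minimum document length, fraction of alphabetic characters, repeated-line ratio) are common document-quality checks in the dataset-curation literature and are shown purely as a sketch; the thresholds are assumptions.

```python
def passes_syntactic_filters(text: str,
                             min_words: int = 50,
                             min_alpha_frac: float = 0.7,
                             max_dup_line_frac: float = 0.3) -> bool:
    """Illustrative document-quality heuristics of the kind used to drop
    low-quality documents before deduplication (thresholds are assumptions)."""
    words = text.split()
    if len(words) < min_words:
        return False  # too short to be useful training text

    # Require mostly alphabetic content (screens out tables, markup, noise).
    alpha = sum(ch.isalpha() for ch in text)
    if alpha / max(len(text), 1) < min_alpha_frac:
        return False

    # Penalize heavily repeated lines (boilerplate, navigation menus).
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines:
        dup_frac = 1 - len(set(lines)) / len(lines)
        if dup_frac > max_dup_line_frac:
            return False

    return True
```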
Zyda's effectiveness is reflected in the performance of Zamba, a language model trained on it. Zamba performs strongly on a per-token basis compared with models trained on competing datasets, underscoring Zyda's quality and its potential to advance language modeling.
In summary, the introduction of Zyda marks a significant advance in language modeling, setting a new benchmark for what is achievable with open datasets. By providing a large-scale, high-quality open dataset, Zyphra continues to lead in the field and to shape the next generation of NLP research and applications.