
Are We Nearing the Capacity Limit for Large Language Model (LLM) Training Data?

The progress of Large Language Models (LLMs) in artificial intelligence and data science hinges on the volume and accessibility of training data. As data consumption accelerates and next-generation LLMs demand ever-larger corpora, concerns are growing that the world's reserves of text suitable for training these models may soon be exhausted.

Presently, the English portion of the FineWeb dataset, which is derived from Common Crawl web data, contains about 15 trillion tokens; including high-quality non-English web content could roughly double that figure. Code repositories contribute comparatively little: the publicly available Stack v2 dataset holds roughly 0.78 trillion tokens, although the total amount of code worldwide is estimated at tens of trillions of tokens.

Academic publications and patents form a distinct subset of textual data, together amounting to roughly one trillion tokens. Digital book collections such as Google Books and Anna’s Archive provide over 21 trillion tokens, and the figure could reach about 400 trillion tokens if every unique book in existence were counted.

However, privacy and ethical restrictions keep large amounts of online user-generated content and private communications out of reach for LLM training. Platforms such as Weibo, Twitter, and Facebook collectively account for an estimated 189 trillion tokens, while emails and stored instant messages hold around 1,800 trillion tokens. This data would be valuable, but it is largely inaccessible.

As current LLM training datasets approach the 15 trillion token mark, roughly the amount of available high-quality English text, further expansion runs into ethical and logistical hurdles. Tapping additional sources such as books, audio transcriptions, and non-English text would yield only modest gains, raising the ceiling of readable, high-quality data to perhaps 60 trillion tokens.
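As a rough illustration of the arithmetic behind these estimates, the Python sketch below simply tallies the token counts quoted in this article. The figures and category labels come from the text above and are order-of-magnitude estimates, not authoritative measurements.

```python
# Back-of-envelope tally of the token estimates quoted in this article.
# All figures are in trillions of tokens and are rough estimates, not
# authoritative measurements.

publicly_usable = {
    "FineWeb (English web text)": 15.0,
    "High-quality non-English web text (approx.)": 15.0,  # "could double" the English figure
    "Stack v2 (public code)": 0.78,
    "Academic publications and patents": 1.0,
    "Digitized books (Google Books, Anna's Archive)": 21.0,
}

largely_inaccessible = {
    "Social media posts (Weibo, Twitter, Facebook)": 189.0,
    "Emails and stored instant messages": 1800.0,
}

def report(label, sources):
    total = sum(sources.values())
    print(f"{label}: ~{total:,.1f} trillion tokens")
    for name, count in sources.items():
        print(f"  {name}: {count:,.2f}T")
    return total

accessible = report("Roughly accessible text", publicly_usable)
restricted = report("Restricted or private text", largely_inaccessible)
print(f"Restricted sources are roughly {restricted / accessible:.0f}x "
      f"larger than the accessible pool.")
```

Running this gives an accessible pool of about 53 trillion tokens, consistent with the 60 trillion token ceiling mentioned above, against nearly 2,000 trillion tokens locked behind privacy and access restrictions.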

Meanwhile, the private databases of companies such as Google and Facebook hold token counts stretching into the quadrillions, data that cannot ethically be exploited. This limitation makes synthetic data generation increasingly necessary for LLM development. The growing reliance on synthetic data marks a significant shift in AI research, underlining the need for new training approaches as data requirements keep rising while natural text resources run thin.
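To make the idea of synthetic training data concrete, here is a minimal, hypothetical sketch of one common pattern: prompting an existing model with seed instructions and filtering its outputs into a new training set. The `generate` function is a stand-in for a call to any text-generation model, not a specific API, and the seed prompts and quality filter are illustrative only.

```python
import json
import random

# Minimal, hypothetical sketch of a synthetic-data pipeline: prompt an
# existing model with seed instructions, collect its outputs, and filter
# them into a new training file. `generate` is a placeholder for a real
# model call and should be replaced with an actual model client.

SEED_PROMPTS = [
    "Explain what a token is in the context of language models.",
    "Summarize why training data availability matters for LLMs.",
]

def generate(prompt: str) -> str:
    # Placeholder for a real text-generation call.
    return f"[model response to: {prompt}]"

def keep(example: dict) -> bool:
    # Illustrative quality filter: discard very short responses.
    return len(example["response"].split()) >= 5

def build_synthetic_dataset(n_examples: int, path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(n_examples):
            prompt = random.choice(SEED_PROMPTS)
            example = {"prompt": prompt, "response": generate(prompt)}
            if keep(example):
                f.write(json.dumps(example) + "\n")

if __name__ == "__main__":
    build_synthetic_dataset(n_examples=10, path="synthetic_data.jsonl")
```

In practice, the filtering step is where most of the effort goes, since the value of synthetic data depends on keeping only outputs of sufficient quality and diversity.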

Hence, as existing datasets near saturation, synthetic data will play a growing role in overcoming the coming limits on LLM training data. This shift reflects the broader evolution of AI research, steering the field toward synthetic data generation as a way to sustain progress while respecting ethical constraints.
