
Common Corpus: A Vast Open-Source Database for Training LLMs

Whether copyrighted material is necessary to train top Artificial Intelligence (AI) models remains a hot topic within the AI industry. The discussion intensified when OpenAI told the UK Parliament in 2023 that it is ‘impossible’ to train these models without using copyrighted content, a position that sits at the centre of ongoing legal disputes and ethical dilemmas. Recent developments challenge that claim, however, suggesting that large language models (LLMs) can be trained without relying on copyrighted material.

An international collaboration called the Common Corpus initiative has released what is billed as the largest public domain dataset designed for training LLMs. Led by Pleias and supported by researchers specialising in LLM pretraining, AI ethics and cultural heritage, the project is charting a new path for AI practices. The diverse, multilingual dataset is intended to show that LLMs can be trained without concerns over copyright infringement, marking a notable shift in the AI industry.

At the same time, Fairly Trained, a prominent non-profit in the AI sector, has advanced equitable AI practices by awarding its first certification for an LLM built without copyright-infringing training data. The certified model, KL3M, was developed by 273 Ventures, a Chicago-based legal tech consultancy startup, and is seen as a symbol of hope for equitable AI.

Meanwhile, the Kelvin Legal DataPack, the rigorously curated training dataset that 273 Ventures used for KL3M, makes a strong case for the power of data selection. Although smaller than many internet-scraped datasets, it reportedly produces better results than its size would suggest. It contains billions of tokens from legal documents reviewed for compliance with copyright law, underlining the potential of well-curated datasets to improve AI models.

The Common Corpus project marks a significant step for the AI field: its training resource is roughly comparable in size to the data used for OpenAI’s GPT-3 model. This resource is now accessible via the open-source AI platform Hugging Face. Fairly Trained, for its part, envisions certification extending beyond LLMs, as shown by its recent certifications of the Spanish voice-modulation startup VoiceMod and the AI heavy-metal band Frostbite Orckings.

However, it’s critical to acknowledge the limitations of public domain datasets such as Common Corpus. Much public domain data is old: in the US, for example, copyright protection typically lasts for 70 years or more after the author’s death, so recent works are rarely included. Such collections may therefore be poorly suited to training AI models on current affairs.

Overall, the Common Corpus initiative and Fairly Trained’s certification of KL3M exemplify the changing paradigms in AI training. Although challenges remain, these efforts, particularly in developing copyright-compliant training datasets, hint at progressive pathways. The focus on ethical considerations and respect for copyright laws may herald a more balanced and fair future for AI practices.
