Large language models (LLMs) have gained significant traction in the AI community for their ability to learn from vast amounts of online data. But as these models exhaust the readily available sources of training data, questions arise about where new data comes from. One answer is a phenomenon known as “copyright laundering” or “data laundering”: a questionable practice in which AI systems take data from copyrighted sources, obscure its origin by regenerating it as synthetic text, and then offer the result for commercial use.
A recent case that sheds light on this practice involves HuggingFace and its dataset, Cosmopedia. Generative AI models require large datasets for training, and Cosmopedia is a gigantic one, offering 30 million text files and 25 billion “tokens” for training AI models. Yet its contents are not drawn directly from the web: they are synthetic text generated by another AI model, Mixtral. The origins of Mixtral’s own training data remain undisclosed, so Cosmopedia effectively repackages Mixtral’s output as a fresh dataset.
According to AI copyright expert Ed Newton-Rex, Cosmopedia’s approach is a form of copyright laundering. “You don’t want to train directly on copyrighted work for fear of being sued, so you train on text that was created by a model that itself was trained on copyrighted work,” he said. HuggingFace labels Cosmopedia’s content “synthetic data,” a designation that could potentially allow users of the dataset to sidestep copyright law.
However, Toby Walsh, an AI professor at the University of New South Wales, expressed skepticism toward such claims. He noted that synthetic data derived from copyrighted material can end up disconcertingly similar to the real-world samples it imitates. He also observed that this kind of disguised data generation further muddies the waters for copyright claims, calling it the “poison fruit” of the AI industry.
Another example of this dynamic is the squabble between AI image generators Midjourney and Stability AI. Midjourney banned Stability AI employees after accusing them of scraping its data. Yet both companies allegedly train on datasets built from artists’ works; Midjourney itself has discussed drawing on scraped artwork from numerous artists, which makes derivative works difficult to trace under copyright law.
Legal challenges to these practices are surfacing. In 2023, artists filed a lawsuit against Midjourney and Stability AI, accusing the companies of marketing their AI products as, in effect, copyright-laundering tools and of profiting from artists’ work without compensating them.
In summary, copyright laundering is a growing concern in the AI world. Companies that engage in it can bypass copyright law, exploit content creators, and compete unfairly. It remains to be seen how legislation and industry practice will evolve to address it.