Tech giants including Apple, NVIDIA, Anthropic, and Salesforce are allegedly pushing ethical boundaries by training their AI models on a dataset that includes subtitles from more than 170,000 YouTube videos, obtained without the consent of the content creators. The practice potentially violates YouTube’s terms of service and raises concerns about data privacy, fair competition, and the concentration of power and talent in the AI industry.
The dataset, which draws on content from institutions such as Harvard, popular YouTubers including MrBeast and PewDiePie, and major news outlets such as The Wall Street Journal and the BBC, was revealed by an investigation from Proof News and WIRED. YouTube has not yet responded, but it previously stated that OpenAI’s use of its videos to train the text-to-video model Sora would be a clear violation of its terms of service.
Tech companies have drawn criticism for their data practices before. In 2018, Facebook faced backlash over the Cambridge Analytica scandal, in which user data was harvested without consent. In 2023, it emerged that Books3, a dataset of more than 180,000 copyrighted books, had been used to train AI models without authors’ permission, fueling a surge of copyright-infringement lawsuits against AI companies. Industry heavyweights such as Universal Music Group, Sony Music, and Warner Records are among those that have filed suit.
In Microsoft’s case, the company’s mass hiring from AI startup Inflection has raised concerns about stifled competition in the AI sector. The UK’s Competition and Markets Authority (CMA) is investigating the movement of employees to determine whether it amounts to a de facto merger. Microsoft also gave up its observer seat on the board of OpenAI, in which it has invested roughly $13 billion, presumably to appease antitrust authorities.
These incidents have eroded public trust in big tech companies and led content creators to guard their work more fiercely against exploitation. The concentration of talent in a few large companies, and the homogenization of AI development that follows from it, is another worry for the industry. To rebuild trust, tech companies may need to do more than comply with future regulations and weather antitrust investigations; they may need to answer the question of whether AI’s potential can be harnessed while preserving ethics, fair competition, and public trust.