Perplexity AI, a company that combines a search engine with generative AI to produce AI-written answers to user search queries, has been accused of unethical data collection practices. It allegedly scraped content from several websites, including some that expressly disallow scraping. The controversy began on June 11th, when Forbes claimed that Perplexity had reproduced an entire article from its site, along with custom illustrations, and gave only minimal credit.
Subsequently, WIRED conducted an investigation and found evidence that Perplexity was scraping content from sites that do not allow automated data gathering. Websites use a “robots.txt” file to ask web crawlers not to scrape their content. The file is a widely recognized convention that crawlers are expected to respect, although it is not legally enforceable. It helps website owners manage access to their content and discourage unauthorized data collection.
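To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page, using Python's standard `urllib.robotparser` module. The rules, the crawler name `ExampleBot`, and the `example.com` URLs are hypothetical and chosen for illustration; they are not taken from any site or crawler mentioned in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is barred from the whole site,
# while every other crawler is barred only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/private/report"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

The check is entirely voluntary: nothing in the file stops a crawler that skips this step, which is why compliance depends on the crawler's operator.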
Jason Kint, CEO of Digital Content Next, a trade group representing online publishers, criticized Perplexity’s approach to web scraping, asserting, “AI companies should inherently assent that they do not have the right to take and reuse publishers’ content without first obtaining permission.” He added that if Perplexity was bypassing terms of service or robots.txt files, that should be treated as a significant warning sign of inappropriate conduct.
Following these revelations, Amazon Web Services (AWS), which hosts a server allegedly involved in Perplexity’s unauthorized scraping, has decided to investigate. AWS’s terms of service prohibit customers from engaging in abusive or unlawful activities.
Despite the allegations, Perplexity CEO Aravind Srinivas initially dismissed the concerns as a gross misunderstanding of the company’s operations and of the internet in general. However, in an interview with Fast Company, he admitted that Perplexity depends on an undisclosed third-party company for web crawling and indexing, suggesting that this company was likely responsible for any robots.txt violations. Perplexity’s management regards the AWS investigation as routine procedure and has not announced any changes to its operations.
Though Perplexity seems set to ride out the storm, its dismissive stance may become untenable if worry about AI companies’ data practices continues to mount. The controversy reflects a broader and difficult debate about the ethics of data collection in the age of AI and machine learning.