Reddit Escalates Legal War on AI Data Scraping, Sues Perplexity and Data Brokers

Reddit Takes Legal Action Against AI Firm and Data Suppliers

Reddit has intensified its campaign against unauthorized data harvesting by filing a federal lawsuit against Perplexity AI and three data scraping companies. The complaint, lodged in the Southern District of New York, alleges systematic theft of Reddit’s content through sophisticated evasion techniques.

The social media platform accuses Oxylabs UAB, AWM Proxy, and SerpApi of orchestrating a coordinated effort to bypass both Reddit’s and Google’s anti-scraping defenses. According to court documents, these companies allegedly employed identity masking, location hiding, and web scraper disguise techniques to illegally extract Reddit content and associated search results.

The “Data Laundering Economy” Behind AI Training

Reddit’s Chief Legal Officer Ben Lee described an emerging “industrial scale data laundering economy” driven by AI companies’ insatiable appetite for quality human-generated content. “Scrapers bypass technological protections to steal data, then sell it to clients hungry for training material,” Lee said in an emailed statement. He emphasized that Reddit is a particularly valuable target as “one of the largest and most dynamic collections of human conversation ever created.”

The lawsuit portrays the three data providers as key enablers in this ecosystem. Oxylabs UAB, operating from Lithuania, presents itself as an ethical data solutions provider, while AWM Proxy allegedly originated as a Russian botnet operation. SerpApi explicitly markets real-time access to scraped Google search results, according to the complaint.

Perplexity’s Alleged Role as Willing Customer

Reddit’s legal filing takes particular aim at Perplexity AI, characterizing the company as a “willing customer” of stolen data. The complaint alleges that rather than pursuing lawful licensing agreements with Reddit, Perplexity chose to purchase illicitly obtained content through these data brokers.

The lawsuit employs striking analogies, comparing the data providers to “would-be bank robbers, who, knowing they cannot get into the bank vault, break into the armored truck carrying the cash instead.” Even more pointedly, Reddit’s complaint echoes Cloudflare CEO Matthew Prince’s characterization of Perplexity’s approach to data acquisition as “more akin to a ‘North Korean hacker’”.

Legal Framework and Broader Industry Implications

Reddit’s complaint asserts multiple violations of the Digital Millennium Copyright Act, specifically targeting the circumvention of technological protections against automated access. The platform also brings claims of unfair competition, unjust enrichment, and civil conspiracy against the defendants.

This lawsuit is the latest escalation in Reddit’s broader strategy to protect and monetize its content. In June, the company filed a similar suit against Anthropic after failing to secure a content licensing agreement like the one it reached with OpenAI. The pattern demonstrates Reddit’s commitment to establishing that unauthorized scraping for AI training constitutes copyright infringement.

The legal action occurs against a backdrop of increasing industry-wide scrutiny of AI training practices. Recent months have seen multiple high-profile cases, including lawsuits alleging that Apple used pirated books from the Books3 dataset and that OpenAI improperly scraped YouTube videos. The New York Times case against Microsoft and OpenAI similarly addresses unauthorized use of copyrighted news content.

Industry Responses and Future Implications

Perplexity responded to the allegations through a spokesperson, stating: “Perplexity has not yet received the lawsuit, but we will always fight vigorously for users’ rights to freely and fairly access public knowledge. Our approach remains principled and responsible as we provide factual answers with accurate AI, and we will not tolerate threats against openness and the public interest.”

Neither Oxylabs nor SerpApi immediately responded to requests for comment regarding the allegations. Oxylabs’ website describes the company as “the largest ethical proxy network and advanced scraping solutions empowering the AI industry and beyond.”

As AI companies continue to seek high-quality training data, the outcome of this and similar cases will likely establish crucial precedents governing data scraping practices. The resolution could fundamentally reshape how AI firms access and utilize publicly available online content, potentially forcing greater transparency and licensing compliance across the industry.

Reddit seeks both injunctive relief to stop the alleged scraping activities and monetary damages for the unauthorized use of its content. The case represents a significant test of how traditional copyright frameworks apply to the rapidly evolving landscape of AI development and training methodologies.