AI Training Cheating Threatens LLM Quality

Workers paid to improve AI are using ChatGPT to train models, risking data collapse and lower performance, says New Scientist Tech.

It’s startling that the very people hired to supply future large language models with clean, human‑written conversation data are turning to the models themselves for a shortcut. This form of cheating in AI training is at the heart of a new whistleblower report, which suggests the practice could erode the quality of the next generation of AI.

Key Takeaways

  • Workers tasked with high‑quality data collection are using ChatGPT and similar tools to generate that data.
  • Company policies explicitly forbid the practice, yet monitoring tools only catch a minority of offenders.
  • Researchers warn that recursive training on AI‑generated text can cause “model collapse” or reduced human‑like performance.
  • Low‑pay, contract‑based work arrangements appear to be a key driver of the shortcuts.
  • Industry giants like Meta, Cisco, Google and Scale AI are implicated indirectly through their outsourcing pipelines.

Historical context: how data pipelines grew

Early large language models relied on small, carefully curated corpora. Teams of in‑house linguists annotated sentences, corrected grammar, and flagged bias. As model sizes exploded, that manual approach hit a ceiling. Companies turned to external platforms that could supply millions of conversational turns at a fraction of the cost. The shift happened gradually, but by the time models reached the scale of today’s flagship systems, the bulk of data collection was outsourced to gig‑economy workers.

Outsourcing introduced a new set of trade‑offs. On the one hand, it delivered the volume needed to train models that could understand nuanced prompts. On the other, it created a supply chain where contracts were short, wages modest, and supervision limited to screen captures or periodic audits. That environment laid the groundwork for the shortcuts described by the three workers quoted in this article: Alice, Bob and Carol. When the incentive to meet quotas outweighs the desire to follow strict guidelines, the temptation to let an LLM do the heavy lifting becomes hard to resist.

In parallel, the tools that workers now use (ChatGPT, Claude, Gemini) have become more capable than the early rule‑based generators that once powered data‑augmentation experiments. The output of the same models that power the final product is now being fed back into training, creating a feedback cycle that the industry never intended to close.

AI training cheating undermines data quality

When you hear that workers are feeding LLMs with the output of other LLMs, it feels like a paradox that could unwind the whole training pipeline. Alice, a veteran data‑collector, says the practice is “very widespread” and that even explicit guidelines can’t stop it.

“It’s very widespread; every company I’ve worked for has had explicit guidelines around it and they clearly do try to catch people out, so I think they do care. But I don’t think they can stop it,” she told New Scientist.

The problem isn’t just a breach of policy; it’s a contamination risk that could make future models less capable of handling genuinely human prompts.

Companies have been betting on third‑party platforms to scale up their data collection as models grow larger. Those platforms, like Outlier, employ workers on a piecemeal basis—often without full‑time contracts and for modest wages. The low‑pay structure, Alice notes, nudges people toward shortcuts: “If these companies want quality data, then they should offer quality contracts,” she says. In practice, many workers are simply prompting ChatGPT, tweaking the output to dodge obvious AI hallmarks, and then submitting that as original conversation.

Workers’ incentives and low‑pay contracts

Bob, who rose from a trainee to a supervisory role at Outlier, described a day‑to‑day reality where his team was tracked by Hubstaff screenshots. “People would have it [AI models like ChatGPT] open in other tabs, or minimised, so obviously we could see it in the task bar,” he explains. The surveillance was meant to enforce compliance, yet the very fact that a manager had to hunt for hidden AI windows shows how pervasive the problem is. Bob adds that the company vacillated between light tolerance and outright banning, a flip‑flop that left workers guessing what was acceptable.

Carol, who’s worked across several platforms, says she initially used AI to double‑check that her work met the lengthy guidelines. “I was terrified of not having an income source,” she admits, noting that the pressure to avoid expulsion drove her toward more extensive reliance on LLMs. She now runs entire scenario‑creation pipelines through one model, then uses another to generate supporting files. “I do feel guilty but like I said, in the beginning it was more about trying to make sure I wasn’t making any errors,” she says.

Company policies vs reality

Outlier, owned by Scale AI, didn’t respond to requests for comment, and Scale AI’s website simply lists a roster of tech giants it works with, including Meta and Cisco. Neither of those firms replied to New Scientist’s inquiries. The lack of response adds to the opacity surrounding how many layers of subcontractors might be involved in the data‑generation chain. What’s clear is that the guidelines exist on paper, but enforcement is spotty at best.

Bob’s screenshot‑monitoring approach illustrates one of the few concrete attempts to police the practice. He says that even folder names on a worker’s desktop could give away the use of AI. “Even stuff like folders on their desktop with names gave it [AI use] away,” he notes. The fact that such low‑tech signals are the primary detection method suggests that companies lack more sophisticated tools to differentiate human‑written from machine‑generated text.

Detection gaps

  • Random screenshots catch only a fraction of violations.
  • Workers can mask AI output by avoiding obvious markers like em‑dashes.
  • Current guidelines rely on manual review rather than automated detection.
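
A back‑of‑the‑envelope model helps show why random screenshots catch only a fraction of violations. The sketch below rests on assumptions that are not in the report: each capture happens at an independent random moment, and an AI tool is visible on screen for a fixed share of the worker's time. The function name and all numbers are hypothetical.

```python
def catch_probability(visible_share: float, screenshots: int) -> float:
    """Chance that at least one random capture shows the AI tool.

    Assumption (illustrative only): captures are independent and the
    tool is visible for `visible_share` (0.0 to 1.0) of working time.
    """
    if not 0.0 <= visible_share <= 1.0:
        raise ValueError("visible_share must be between 0 and 1")
    if screenshots < 0:
        raise ValueError("screenshots must be non-negative")
    # P(at least one hit) = 1 - P(every capture misses)
    return 1.0 - (1.0 - visible_share) ** screenshots


if __name__ == "__main__":
    # Hypothetical worker who has the tool visible 5% of the time.
    for n in (1, 5, 20):
        print(n, "captures ->", round(catch_probability(0.05, n), 3))
```

Under this simplified model, the less time a tool spends visibly on screen, the more captures a reviewer needs before a hit becomes likely, which fits Bob's description of having to hunt for minimised windows.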

Scientific warning: AI cannibalism

Mark Lee at the University of Birmingham warns that feeding models increasingly AI‑generated content could trigger what researchers call “model collapse” or “AI cannibalism.” He explains that when a model is recursively trained on its own output, its abilities can drop dramatically. “That’s the kind of worst‑case scenario. And that’s probably not what’s happening in the real world,” Lee says, noting that a small fraction of human‑generated data—around 10 per cent—can mitigate the risk.

Lee adds that the cheating described by Alice, Bob and Carol isn’t likely to cause an immediate catastrophe, but it will degrade performance over time. “Rather than it being catastrophic, you’ll see that the AI isn’t as good at doing human‑like tasks. It’s an issue, because I think the models aren’t as good as they could be,” he says. The implication is that the industry could be silently eroding the very edge that makes LLMs valuable for developers and end‑users.
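
The worst‑case scenario Lee describes can be illustrated with a toy experiment. The sketch below is not an LLM and not Lee's method: it assumes a "model" that is just a normal distribution, refitted each generation to samples drawn mostly from the previous generation, with an optional share of fresh "human" samples from the original distribution. The function name, sample size and generation count are hypothetical choices.

```python
import random
import statistics


def simulate(generations: int, sample_size: int,
             human_fraction: float, seed: int = 0) -> float:
    """Return the fitted spread (std dev) after repeated self-training.

    Toy assumption: "human" data is N(0, 1); each generation is refitted
    to a mix of its predecessor's samples and fresh human samples.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 matches the human distribution
    n_human = round(sample_size * human_fraction)
    n_model = sample_size - n_human
    for _ in range(generations):
        data = [rng.gauss(mean, std) for _ in range(n_model)]
        data += [rng.gauss(0.0, 1.0) for _ in range(n_human)]
        mean = statistics.fmean(data)
        std = statistics.pstdev(data)  # refit on the mixed sample
    return std


if __name__ == "__main__":
    print("no human data :", simulate(300, 20, 0.0))
    print("10% human data:", simulate(300, 20, 0.1))
```

In this toy setting the fitted spread shrinks toward zero when no human samples are mixed in, while a small human share keeps it from collapsing. Real training pipelines are far more complex, so this only illustrates the mechanism, not its size.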

Adoption timeline and risk mitigation

Most firms that deploy LLMs in commercial products have already shipped versions trained on data collected through the outsourcing model described above. The next wave of upgrades will likely reuse many of the same pipelines unless procurement strategy shifts decisively. That means the risk of “AI cannibalism” is not a distant theoretical concern; it is baked into the upgrade schedule that many product teams are already following.

Some companies are experimenting with hybrid approaches. They combine a core of verified human‑generated conversations with a larger, machine‑generated supplement. The idea is to keep the proportion of authentic data high enough to prevent the degradation highlighted by Lee. Early internal tests suggest that even a modest infusion of clean data can keep performance stable, but those tests are still private and not widely shared.
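
The arithmetic behind such a hybrid mix is simple enough to sketch. The helper below assumes a team picks a minimum share of verified human records (for illustration, the roughly 10 per cent figure Lee mentions) and wants to know how much machine‑generated supplement that allows. The function names and the integer‑percent interface are hypothetical.

```python
def max_machine_records(human_records: int, min_human_percent: int) -> int:
    """Largest machine-generated supplement that keeps the human share
    at or above `min_human_percent` (an integer from 1 to 100)."""
    if human_records < 0:
        raise ValueError("human_records must be non-negative")
    if not 0 < min_human_percent <= 100:
        raise ValueError("min_human_percent must be in 1..100")
    # human / (human + machine) >= p/100  =>  machine <= human * (100 - p) / p
    return human_records * (100 - min_human_percent) // min_human_percent


def human_percent(human_records: int, machine_records: int) -> float:
    """Share of human records in the combined dataset, in percent."""
    total = human_records + machine_records
    if total == 0:
        raise ValueError("dataset is empty")
    return 100.0 * human_records / total


if __name__ == "__main__":
    budget = max_machine_records(1000, 10)
    print(budget, human_percent(1000, budget))
```

Integer arithmetic is used so the budget never overshoots the threshold through floating‑point rounding.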

From a governance perspective, the industry is still figuring out which signals are reliable enough to trigger an automated flag. Text entropy, repetition patterns, and the presence of certain token sequences have been proposed, but none has proven universally effective. Until a dependable detection stack is in place, human reviewers will remain the last line of defense, and that reliance on manual oversight is a bottleneck that scales poorly.
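
Two of the proposed signals, text entropy and repetition patterns, can be computed in a few lines. The sketch below is one possible reading of those terms (word‑level Shannon entropy and the share of repeated word trigrams); the tokenizer and function names are assumptions, and the article is clear that no such signal is reliable on its own.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; a deliberately simple, assumed tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_entropy(tokens: list[str]) -> float:
    """Shannon entropy (bits per token) of the word distribution."""
    if not tokens:
        return 0.0
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def repeated_ngram_fraction(tokens: list[str], n: int = 3) -> float:
    """Share of n-grams that duplicate an earlier n-gram in the text."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


if __name__ == "__main__":
    sample = tokenize("The cat sat. The cat sat!")
    print(word_entropy(sample), repeated_ngram_fraction(sample))
```

Scores like these are at most inputs to a reviewer's judgement. Any threshold for raising a flag would need to be calibrated on data known to be human‑written.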

What This Means For You

If you’re building applications that rely on third‑party LLMs, you’ll want to audit the provenance of the training data. Look for contracts that stipulate human‑only data collection, and consider adding your own verification layer, for example by running a sample of supplied records through a detection model before you accept the batch as training material. Don’t assume that a vendor’s compliance badge guarantees clean data; the whistleblowers’ stories show that enforcement can be lax.
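
A verification layer of that kind might look like the sketch below. It assumes you already have some detector, passed in as a plain function that returns True for text it considers machine‑written; the detector, the function name `audit_batch` and the thresholds are all hypothetical, and the keyword check in the usage example is only a stand‑in.

```python
import random
from typing import Callable, Sequence


def audit_batch(records: Sequence[str],
                looks_machine_written: Callable[[str], bool],
                sample_size: int,
                max_flag_rate: float,
                seed: int = 0) -> bool:
    """Spot-check a random sample; accept the batch only if the share of
    flagged records stays at or below `max_flag_rate`."""
    if not records:
        raise ValueError("records must not be empty")
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")
    rng = random.Random(seed)  # fixed seed keeps the audit reproducible
    k = min(sample_size, len(records))
    sample = rng.sample(list(records), k)
    flagged = sum(1 for text in sample if looks_machine_written(text))
    return flagged / k <= max_flag_rate


if __name__ == "__main__":
    # Stand-in detector: a real one would be a trained classifier.
    def naive_detector(text: str) -> bool:
        return "as an ai language model" in text.lower()

    batch = ["Hi, my order never arrived.", "Can I change my address?"]
    print(audit_batch(batch, naive_detector, sample_size=2, max_flag_rate=0.1))
```

Sampling keeps the cost of the check bounded, but it shares the weakness of the detection methods described earlier: a spot check can miss contamination that a full review would find.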

Developers should also prepare for a potential dip in model performance as the industry grapples with this issue. That could mean adjusting expectations for accuracy, especially in nuanced, human‑centric tasks, and possibly budgeting for additional fine‑tuning on curated datasets. If you’re a founder, think about how much you’re willing to pay for genuinely high‑quality data versus cheaper, riskier pipelines. The cost of a model that underperforms could far outweigh the savings from low‑pay contracts.

Scenario 1: A startup uses a third‑party LLM to power a chatbot for customer support. The team assumes the vendor’s data is all human‑sourced. After a month of low satisfaction scores, they discover that the model frequently repeats phrasing that sounds mechanical. By running a small audit and switching to a vendor that guarantees human‑only conversations, they raise the bot’s relevance score and see a measurable boost in user retention.

Scenario 2: An enterprise integrates an LLM into its internal knowledge‑base search. The system returns answers that are factually correct but phrased in a way that feels “too perfect.” The IT group suspects the underlying training set contains AI‑generated text that lacks the subtlety of real‑world language. They introduce an additional fine‑tuning step on a vetted internal corpus, and the search results become more aligned with employee expectations.

Scenario 3: A researcher fine‑tunes a public model for a niche domain. The baseline model already contains a mix of human and machine‑generated content. By carefully curating a domain‑specific dataset that is entirely human‑annotated, the researcher avoids compounding the contamination risk and ends up with a model that outperforms the baseline on domain‑specific benchmarks.

Will the community double down on stricter oversight, or will it accept a slower, noisier path to scale? The answer will shape whether future LLMs retain their human‑like edge or drift toward a self‑reinforcing echo chamber.

Key questions remaining

  • How will regulators respond if evidence of widespread data contamination emerges?
  • Can automated detection keep pace with the rapid evolution of text‑generation models?
  • Will vendors be willing to overhaul contract terms to attract higher‑quality data workers?
  • What safeguards can open‑source communities put in place without stifling innovation?

Sources: New Scientist Tech, BBC News
