
Simon Paxton

Posted on • Originally published at novaknown.com

AI Content Feedback Loop: Why the Internet's Truth Is Fragile

Brendan “PlayerUnknown” Greene looked at the modern web and basically said: we’re feeding AI its own junk until the junk becomes truth. That’s the AI content feedback loop in one breath.

Here’s the thing: he’s right about the loop, but the real problem isn’t inside the model. It’s in the systems that decide what rises to the top of your screen.

We keep talking about hallucinations like they’re a bug in the brain. The scarier failure mode is that search, social feeds, and ad networks quietly promote synthetic nonsense until the world rearranges itself around it.

TL;DR

  • Greene’s “race to the middle of sh*t” is real, but the dangerous part of the AI content feedback loop is discovery and monetization, not just LLM hallucinations.
  • Technically, models can avoid training on their own exhaust; economically, the web often prefers that exhaust because it’s cheap, fast, and SEO-friendly.
  • Local compute and domain models help at the edges, but unless we fix how we index, attribute, and reward content, we’re just building fancier filters on a polluted river.

Greene’s warning in one line

Greene told PC Gamer that LLMs are “scanning this junk, and then that becomes truth… it's like a race to the middle of sh*t.”

Translated:

1) AI systems churn out low-quality, often hallucinated text.

2) That text floods the internet.

3) Future models train on that flood.

4) Over time, the average quality and reliability of online “truth” collapses.

He also points out the absurdity of tools that say “please fact-check this” while people increasingly use them as the fact source.

Look, that critique lands. But we’ll miss the real failure mode if we only picture a model training on its own mistakes like some cartoon snake eating its tail.

The question to ask is: who decided that junk should be on page one of Google, or in the top 10 TikToks, or in the AI’s retrieval index at all?

For deeper context on Greene’s concern, you can see how we framed model collapse earlier in our own AI content feedback loop piece.


Why the AI content feedback loop is technically plausible

OK, so imagine you run a model-training pipeline.

You scrape the web, filter obvious spam, toss it into a giant blender, and optimize the hell out of it — exactly the kind of process discussed in the Chinchilla paper and the “stochastic parrots” critique. The model gets good enough that people start using it to write blog posts, news rewrites, product descriptions, and homework.

Now round two: you scrape the web again next year.

Unless you’re very careful, your training data now includes:

  • Original human-written pages
  • AI rewrites of those pages
  • AI rewrites of AI rewrites, SEO’d to death
  • Entire “content farms” that are basically model outputs with ads

From the model’s point of view, these all look like plausible text strings. There’s no flag that says “this paragraph was hallucinated last February at 3:07 p.m.”

So yes, the AI content feedback loop is technically plausible:

  • The share of AI-generated content in the wild is rising. Greene is right that AI-written “news” is already “staggeringly” common.
  • Most large-scale crawlers don’t reliably tag synthetic vs. human-origin text.
  • Without strong de-duplication and source tracking, the easiest text to get is the most over-produced — which is now AI.

You can design around this. You can weight sources, track provenance, down-rank near-duplicates, do clever filtering. The research community has been yelling about data provenance since “Stochastic Parrots” in 2021.
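The down-ranking piece of that is simple enough to sketch. Here's a toy near-duplicate filter using shingled Jaccard similarity — the shingle size, threshold, and sample corpus are all made-up illustrative values, not anything from a real crawler pipeline:

```python
# Toy near-duplicate filter for a crawl batch (illustrative values throughout).

def shingles(text: str, k: int = 5) -> set:
    """Character k-grams of a lowercased, whitespace-normalized document."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(len(t) - k + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Set-overlap similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def filter_near_duplicates(docs, threshold: float = 0.7):
    """Keep each doc only if it isn't too similar to anything already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        sh = shingles(doc)
        if all(jaccard(sh, prev) < threshold for prev in kept_shingles):
            kept.append(doc)
            kept_shingles.append(sh)
    return kept

corpus = [
    "The original human-written article about model collapse.",
    "The original human-written article about model collapse. Updated.",  # near-dup rewrite
    "A completely different piece on search ranking incentives.",
]
print(filter_near_duplicates(corpus))  # drops the near-duplicate rewrite
```

Real pipelines use MinHash or embedding similarity at scale, but the principle is the same: the second, regurgitated copy never reaches the training set.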

So why isn’t that enough?

Because the real loop-closing mechanism isn’t just training data. It’s what we surface as authoritative in the first place.


It’s not the math — it’s indexing and incentives

Think about how you actually encounter “truth” online:

  • You don’t open a raw dump of Common Crawl.
  • You type a query into a search box, or scroll a feed, or ask an AI assistant.
  • Some ranking system, tuned by engagement and revenue, chooses 10 items for you.

That system is the real stomach acid of the internet. It digests everything — human and machine — and decides what reality looks like for most people.

Now layer AI into that:

  1. Cheap supply. AI-generated content is orders of magnitude cheaper than human work. If you can make 10,000 articles for the price of one, you will.
  2. Engagement-tuned ranking. Discovery systems optimize for clicks, shares, and ad yield — not ground truth. If AI sludge gets clicks, sludge rises.
  3. Retrieval over web. AI search and assistants increasingly skip linking you to a source at all. They synthesize and paraphrase, then maybe sprinkle links as decoration.

Put those together and you get a nasty dynamic:

  • The discovery layer treats AI text as first-class content because it’s plentiful and cheap.
  • Users see it high in rankings, share it, screenshot it, cite it.
  • Future crawls treat those user interactions as signals that “this is important.”
  • Models trained on that corpus now think this is what “authoritative” looks like.

The loop isn’t just “model trains on model output.”

It’s: discovery amplifies synthetic output → synthetic output shapes human belief and behavior → those signals teach the next generation of discovery and generation systems what to prefer.
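That dynamic is easy to see in a toy simulation. Every number below — supply counts, growth rates, the "visibility feeds supply" rule — is invented purely to show the shape of the loop, not measured from any real platform:

```python
# Toy, deterministic simulation of the discovery-driven feedback loop.
# All constants are illustrative assumptions, not empirical data.

def simulate(rounds: int = 5):
    human_items, ai_items = 100, 20  # start with mostly human supply
    shares = []
    for _ in range(rounds):
        # Origin-blind ranking: top-slot share is proportional to supply.
        total = human_items + ai_items
        ai_share = ai_items / total
        shares.append(round(ai_share, 2))
        # Cheap supply: AI output scales with the visibility it just earned;
        # human output grows slowly because it's expensive to produce.
        ai_items = int(ai_items * (1 + 2 * ai_share))
        human_items = int(human_items * 1.02)
    return shares

print(simulate())  # the synthetic share of the top slots climbs every round
```

The point isn't the specific numbers; it's that with origin-blind ranking plus asymmetric production costs, the synthetic share rises monotonically without anyone deciding it should.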

That’s why just “fixing hallucinations” is a partial answer. Sharper math on top of a poisoned ranking system still points you at the wrong hill.

If you want to go deeper on why this matters for business workflows specifically, our other AI content feedback loop piece looks at reliability in that context.


Why local compute and domain models are an incomplete remedy

Greene also argues for a more local, domain-specific future. Smaller models, closer to the user, trained on curated data. That’s a smart instinct.

Local or domain models can help in three big ways:

  • Tighter data curation. A hospital model trained on vetted studies is less likely to inherit garbage from random health blogs.
  • Determinism. You can tune them for more predictable behavior, instead of “vibes plus autocomplete.”
  • Reduced dependency on web slop. If your model mostly sees your own logs, docs, and decisions, the public internet has less influence.

But here’s the catch: they don’t fix the discovery problem for everyone else.

Your local model doesn’t control:

  • What your kid sees when they Google a homework question.
  • What your neighbor reads in AI-summarized news.
  • What policymakers see in AI-assisted briefings that quietly drew from contaminated corpora.

And even inside organizations, domain models are rarely hermits. They still pull in public knowledge — legal frameworks, medical guidelines, market data, academic papers. If those upstream sources have been “race to the middle of sh*t”-ified, your pristine local model is now doing high-precision work on degraded facts.

Local compute is like installing a high-end water filter in your house.

Helpful, absolutely. But if the river feeding your city is slowly turning to slurry, your filter is fighting a losing battle.

The real fix has to touch the indexing and incentive layer, not just where the math runs.


What a technically curious reader should do now

You can’t redesign Google or OpenAI today. But you do have levers — as a user, a publisher, or an engineer.

Here are the ones that actually matter:

  1. Treat AI answers as views, not facts.

    Use them like you’d use a smart colleague’s first draft: helpful, but never self-authenticating.

  2. Follow the links, not the summaries.

    When an AI or search snippet answers you, click through to original sources. Reward sites and authors that show methods, data, and named accountability.

  3. Publish with provenance.

    If you write, code, or post, be explicit when AI helped and link to your own sources. Make it easier for future filters to distinguish “derived from X” versus “pure word-soup.”

  4. Design retrieval before generation.

    If you build tools, put more engineering effort into what the model sees and cites than into prompt fireworks. Retrieval, citation, and source weighting are where the real defense lives.

  5. Pay for real work when you can.

    Subscribe to at least a couple of outlets, newsletters, or creators whose work you’d be sad to see replaced by gray slurry. Money is a ranking signal too.
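Point 4 above — retrieval before generation — can be made concrete with a minimal sketch: rank candidate passages by relevance times a per-source trust weight, so provenance shapes what the model ever sees. The trust table, the crude overlap scorer, and the sample passages are all hypothetical stand-ins, not a real API:

```python
# Source-weighted retrieval sketch: provenance decides what gets cited.
# Trust weights and the term-overlap scorer are illustrative assumptions.

TRUST = {"peer_reviewed": 1.0, "primary_docs": 0.9, "blog": 0.5, "content_farm": 0.1}

def relevance(query: str, passage: str) -> float:
    """Crude term-overlap relevance; a real system would use embeddings."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

def retrieve(query, passages, k=2):
    """passages: list of (text, source_type). Returns top-k (score, text, source)."""
    scored = [
        (relevance(query, text) * TRUST.get(source, 0.3), text, source)
        for text, source in passages
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]

passages = [
    ("model collapse occurs when models train on synthetic data", "peer_reviewed"),
    ("model collapse is fake news invented by big search", "content_farm"),
    ("notes on model collapse from our internal experiments", "primary_docs"),
]
for score, text, source in retrieve("model collapse synthetic data", passages):
    print(f"{score:.2f}  [{source}] {text}")
```

Note what the weighting buys you: the content-farm passage mentions the query terms too, but its low trust weight keeps it out of the context window entirely. That's the defense living in retrieval, not in the prompt.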

The key insight is this: we’re not just training models on the web; we’re training the web on the models.

If we keep rewarding whatever is cheapest to generate and easiest to rank, the AI content feedback loop will do exactly what Greene fears — not because the models are mystical, but because the economics are simple.


Key Takeaways

  • Greene’s “race to the middle of sh*t” is a fair description of how unchecked AI-generated content can flood the web.
  • The dangerous part of the AI content feedback loop is the discovery layer that boosts cheap synthetic text into de facto authority.
  • Technically, models can avoid eating their own tail; economically, web platforms are incentivized to amplify whatever is abundant and monetizable.
  • Local compute and domain-specific models reduce exposure but can’t fix polluted upstream sources or global search behavior.
  • The practical defense is to change how we index, attribute, and reward information — as users, builders, and publishers.


In the end, the fragility of online truth isn’t a mystical AI property. It’s a ranking problem, dressed up as autocomplete.


