You configure robots.txt like this:

```txt
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /
```
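If you want to confirm rules like these do what you intend, Python's standard-library `urllib.robotparser` evaluates them the same way a compliant crawler would. A quick sketch (the rules are an abbreviated copy, inlined for the example):

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the robots.txt rules, inlined for testing.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: *
Disallow: /
"""

def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if a COMPLIANT crawler with this user-agent may fetch url."""
    rp = RobotFileParser()
    rp.parse(RULES.splitlines())
    return rp.can_fetch(user_agent, url)
```

The catch, as the rest of this post shows, is that `can_fetch` only describes what a well-behaved bot should do — nothing here is enforced.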
You enable Cloudflare Bot Management. You set up Akamai. Maybe even a server-side paywall.
And then you query ChatGPT about your product and it cites your website as a source.
How?
I work on GEO (Generative Engine Optimization) projects where we audit how LLMs represent brands. We routinely analyze thousands of prompt-response pairs. Across multiple projects, we consistently find that 10–20% of LLM responses cite the brand's own website as a source — even when every known bot is blocked.
Here are the 8 technical vectors we documented, with academic sources and industry data.
1. Historical crawl data (Common Crawl)
This is the biggest one and the least understood.
Common Crawl is a nonprofit that has been archiving the web since 2007. The numbers:
- 9.5+ petabytes, 300+ billion documents
- ~2/3 of the 47 LLMs published between 2019–2023 use it as training data
- GPT-3, LLaMA, T5, Red Pajama all trained on it
- Google's C4 dataset: 750 GB filtered from Common Crawl
Blocking crawlers today does not retroactively remove content already captured. Those snapshots are permanent, public resources.
Source: ACM FAccT 2024 — "A Critical Analysis of Common Crawl"
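You can see for yourself what Common Crawl already holds on your domain: each crawl release exposes a CDX index API at index.commoncrawl.org. A minimal sketch that builds the query URL and parses one (shortened, invented) result record — `CC-MAIN-2023-50` is just an example release label, substitute a current one:

```python
import json
from urllib.parse import urlencode

# One crawl release's index endpoint (example label; pick a recent release).
INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

def cc_index_query(domain: str) -> str:
    """Build the index query URL listing every capture under a domain.

    What the index returns is independent of the domain's CURRENT
    robots.txt — captures from before you added the block are still there.
    """
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{INDEX}?{params}"

# The API streams one JSON object per line; a shortened, invented example:
sample_line = ('{"urlkey": "com,example)/article", "timestamp": "20231201000000",'
               ' "url": "https://example.com/article", "status": "200"}')
record = json.loads(sample_line)
```

Fetching that URL (with any HTTP client) lists every snapshot of your pages sitting in the public archive.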
2. Client-side paywall bypass
Common Crawl does not execute JavaScript. If your paywall depends on client-side JS:
```html
<!-- Your paywall loads after DOM ready -->
<script>
document.addEventListener('DOMContentLoaded', () => {
  showPaywall();
});
</script>
<!-- But the crawler already captured the full HTML -->
```
The crawler gets the complete article before JS even runs.
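You can reproduce the crawler's view yourself: take the raw HTML as delivered, never execute the JavaScript, and extract the text. An offline sketch with Python's built-in `html.parser`, using an invented stand-in page:

```python
from html.parser import HTMLParser

# The raw HTML a non-JS crawler receives. The full article is already in
# the DOM; the paywall only appears after DOMContentLoaded fires in a
# real browser. (Invented stand-in markup for the example.)
RAW_HTML = """
<html><body>
<article id="story">Full premium article text here.</article>
<script>document.addEventListener('DOMContentLoaded', () => showPaywall());</script>
</body></html>
"""

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> bodies."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script and data.strip():
            self.text.append(data.strip())

parser = TextExtractor()
parser.feed(RAW_HTML)
```

The extracted text contains the entire article; `showPaywall()` never ran because there was no browser to run it.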
Alex Reisner documented this for The Atlantic (Nov 2025): Common Crawl was capturing full articles from NYT, WSJ, The Economist and The Atlantic itself.
3. User-agent spoofing
Some AI bots change their identity when blocked.
Cloudflare documented (Aug 2024) that Perplexity was using:
```txt
# Declared user-agent
PerplexityBot/1.0

# What they actually sent
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0
```
Plus ASN rotation to evade IP-based blocking. The evasion ecosystem includes FlareSolverr (Selenium + undetected-chromedriver), Scrapfly (94–98% bypass rates), and residential proxy rotation.
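The practical defense is to stop trusting the `User-Agent` header at all and instead verify the source IP against ranges the bot operator publishes. A sketch of the idea — the CIDR below is a placeholder (TEST-NET-1), not Perplexity's real range:

```python
import ipaddress

# Placeholder range (TEST-NET-1). In production you would load the bot
# operator's actually published IP ranges and refresh them periodically.
PERPLEXITY_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

def is_verified_bot(remote_ip: str, user_agent: str) -> bool:
    """A UA string alone proves nothing; require the request's source IP
    to fall inside the operator's published ranges as well."""
    if "PerplexityBot" not in user_agent:
        return False
    ip = ipaddress.ip_address(remote_ip)
    return any(ip in net for net in PERPLEXITY_RANGES)
```

Anything claiming to be `PerplexityBot` from an unlisted IP — or a "regular Chrome on macOS" arriving from a datacenter ASN — gets treated as an unverified scraper.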
4. Syndication redistribution
Once your content leaves your domain through any syndication channel, your robots.txt is irrelevant:
```txt
Original domain (robots.txt: Disallow)
  → RSS feed (no robots.txt)
  → Apple News (different domain)
  → Email newsletter (archived on web)
  → Cross-posted to social (scraped by bots)
  → API aggregators (reformatted downstream)
```
Each channel creates a copy outside your control.
5. Web archives (Wayback Machine)
Internet Archive: 1+ billion pages, 99+ petabytes. web.archive.org is domain #187 in Google's C4 dataset.
Harvard's WARC-GPT lets you ingest WARC archives directly into RAG pipelines. As of Feb 2026, publishers like The Guardian and NYT started blocking Wayback Machine over AI concerns.
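The Wayback Machine exposes its captures through a public CDX API, so you can enumerate every archived snapshot of your own domain. A minimal sketch that only builds the query URL:

```python
from urllib.parse import urlencode

CDX = "https://web.archive.org/cdx/search/cdx"

def wayback_snapshots_url(domain: str, limit: int = 10) -> str:
    """Build a CDX query listing archived captures under a domain.

    What the archive holds is independent of what your live robots.txt
    says today; filter to successful (HTTP 200) captures.
    """
    params = urlencode({
        "url": f"{domain}/*",
        "output": "json",
        "limit": limit,
        "filter": "statuscode:200",
    })
    return f"{CDX}?{params}"
```

Run that query for your domain and you see exactly which historical versions of your pages are available for anyone — including RAG pipelines — to ingest.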
6. Real-time RAG access
Modern LLMs don't just rely on training data. They fetch content in real time:
| Bot | Growth 2024–2025 | Mechanism |
|---|---|---|
| ChatGPT-User | +2,825% | Fetch on user "search the web" |
| PerplexityBot | +157,490% | Fetch on every query |
| Meta-ExternalFetcher | New in 2024 | Meta AI features |
These bots classify the fetch as "user-initiated" rather than autonomous crawling, arguing that this exempts them from robots.txt.
Cloudflare reported that Anthropic's bots have crawl-to-referral ratios between 38,000:1 and 70,000:1: for every visit they refer back to a site, they crawl it tens of thousands of times.
Sources: Cloudflare Blog 2025, OpenAI Crawlers Overview
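These fetchers do show up in your access logs, so you can at least measure them. A toy sketch counting hits from known AI fetcher user-agents — the log lines are invented for the example, and the bot-name list is illustrative, not exhaustive:

```python
from collections import Counter

# Illustrative, not exhaustive: user-agent substrings of known AI fetchers.
AI_FETCHERS = ("ChatGPT-User", "PerplexityBot", "Meta-ExternalFetcher")

# Invented access-log lines standing in for your real server logs.
SAMPLE_LOG = [
    '203.0.113.7 - - [01/Mar/2025] "GET /pricing HTTP/1.1" 200 "-" "Mozilla/5.0 ChatGPT-User/1.0"',
    '198.51.100.2 - - [01/Mar/2025] "GET /blog/post HTTP/1.1" 200 "-" "PerplexityBot/1.0"',
    '192.0.2.9 - - [01/Mar/2025] "GET / HTTP/1.1" 200 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]

def count_ai_fetches(lines):
    """Tally requests whose user-agent matches a known AI fetcher."""
    hits = Counter()
    for line in lines:
        for bot in AI_FETCHERS:
            if bot in line:
                hits[bot] += 1
    return hits
```

Even a crude tally like this tells you whether "user-initiated" fetching is hitting your site — and which pages it wants.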
7. Content farms
Content farms — human or AI-operated — rewrite your articles on unrestricted domains:
1. Scrape/copy original article
2. Rewrite to avoid plagiarism detection
3. Publish on domain with no robots.txt restrictions
4. AI crawler indexes the rewrite
5. LLM absorbs the rewritten version
In Bartz v. Anthropic PBC, the court ruled that training AI with content from "pirate sites" constituted fair use. This sets precedent for rewritten content too.
8. Direct non-compliance
The simplest vector: bots just ignore robots.txt.
- 12.9% of bots now ignore it entirely, up from 3.3% — Paul Calvano, Aug 2025
- Duke University (2025): "several categories of AI-related crawlers never request robots.txt"
- Kim & Bock (ACM IMC 2025): scrapers are less likely to comply with more restrictive directives
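You can measure this on your own logs: a bot that never requests `/robots.txt` before fetching content plainly isn't honoring it. A minimal sketch over `(user_agent, path)` request tuples in arrival order:

```python
def robots_compliance(requests):
    """Map each user-agent to whether it fetched /robots.txt
    before requesting any other path.

    requests: iterable of (user_agent, path) tuples in arrival order.
    """
    seen_robots = set()     # UAs that have requested /robots.txt so far
    fetched_first = {}      # UA -> did it check robots.txt before content?
    for ua, path in requests:
        if path == "/robots.txt":
            seen_robots.add(ua)
        elif ua not in fetched_first:
            fetched_first[ua] = ua in seen_robots
    return fetched_first
```

This is a rough heuristic (caching and distributed crawlers complicate it), but it's enough to spot the crawlers that never ask.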
The legal status is clear: in Ziff Davis v. OpenAI (2025), the judge described robots.txt as "more like a sign than a fence" — not a technological measure that "effectively controls access" under the DMCA.
The compliance stats
| Metric | Value | Source |
|---|---|---|
| Bots ignoring robots.txt | 12.9% | Paul Calvano, 2025 |
| Top 10K sites with AI bot rules | Only 14% | Market analysis 2025 |
| Sites with any robots.txt | 94% (12.2M sites) | Global study 2025 |
So what do you do?
Blocking alone doesn't work. Defensive measures reduce direct crawling by 40–60% for compliant bots, but they can't touch historical data, syndicated copies, or content farm rewrites.
The alternative is offensive: control the narrative instead of trying to hide from it.
At 498 Advance we built tools for this: GEOdoctor for technical auditing of brand visibility in LLMs, and S.A.M. (Semantic Alignment Machine) for content alignment across owned media, UGC platforms (social GEO) and authority domains.
Full analysis with all academic sources: zoopa.es/en/blog
Have you run into this paradox? Blocking everything but still appearing in LLM outputs? I'd love to hear what you've observed in your own infrastructure.