You configure robots.txt like this:

```txt
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /
```
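If you want to confirm rules like these do what you intend, Python's standard-library `urllib.robotparser` evaluates them the same way a compliant crawler would. A quick sketch (the rules are an abbreviated copy, inlined for the example):

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the robots.txt rules, inlined for testing.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: *
Disallow: /
"""

def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if a COMPLIANT crawler with this user-agent may fetch url."""
    rp = RobotFileParser()
    rp.parse(RULES.splitlines())
    return rp.can_fetch(user_agent, url)
```

The catch, as the rest of this post shows, is that `can_fetch` only describes what a well-behaved bot should do — nothing here is enforced.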
You enable Cloudflare Bot Management. You set up Akamai. Maybe even a server-side paywall.
And then you query ChatGPT about your product and it cites your website as a source.
How?
I work on GEO (Generative Engine Optimization) projects where we audit how LLMs represent brands. We routinely analyze thousands of prompt-response pairs. Across multiple projects, we consistently find that 10–20% of LLM responses cite the brand's own website as a source — even when every known bot is blocked.
Here are the 8 technical vectors we documented, with academic sources and industry data.
1. Historical crawl data (Common Crawl)
This is the biggest one and the least understood.
Common Crawl is a nonprofit that has been archiving the web since 2007. The numbers:
- 9.5+ petabytes, 300+ billion documents
- ~2/3 of the 47 LLMs published between 2019–2023 use it as training data
- GPT-3, LLaMA, T5, Red Pajama all trained on it
- Google's C4 dataset: 750 GB filtered from Common Crawl
Blocking crawlers today does not retroactively remove content already captured. Those snapshots are permanent, public resources.
Source: ACM FAccT 2024 — "A Critical Analysis of Common Crawl"
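You can see for yourself what Common Crawl already holds on your domain: each crawl release exposes a CDX index API at index.commoncrawl.org. A minimal sketch that builds the query URL and parses one (shortened, invented) result record — `CC-MAIN-2023-50` is just an example release label, substitute a current one:

```python
import json
from urllib.parse import urlencode

# One crawl release's index endpoint (example label; pick a recent release).
INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

def cc_index_query(domain: str) -> str:
    """Build the index query URL listing every capture under a domain.

    What the index returns is independent of the domain's CURRENT
    robots.txt — captures from before you added the block are still there.
    """
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{INDEX}?{params}"

# The API streams one JSON object per line; a shortened, invented example:
sample_line = ('{"urlkey": "com,example)/article", "timestamp": "20231201000000",'
               ' "url": "https://example.com/article", "status": "200"}')
record = json.loads(sample_line)
```

Fetching that URL (with any HTTP client) lists every snapshot of your pages sitting in the public archive.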
2. Client-side paywall bypass
Common Crawl does not execute JavaScript. If your paywall depends on client-side JS:
```html
<!-- Your paywall loads after DOM ready -->
<script>
document.addEventListener('DOMContentLoaded', () => {
  showPaywall();
});
</script>
<!-- But the crawler already captured the full HTML -->
```
The crawler gets the complete article before JS even runs.
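You can reproduce the crawler's view yourself: take the raw HTML as delivered, never execute the JavaScript, and extract the text. An offline sketch with Python's built-in `html.parser`, using an invented stand-in page:

```python
from html.parser import HTMLParser

# The raw HTML a non-JS crawler receives. The full article is already in
# the DOM; the paywall only appears after DOMContentLoaded fires in a
# real browser. (Invented stand-in markup for the example.)
RAW_HTML = """
<html><body>
<article id="story">Full premium article text here.</article>
<script>document.addEventListener('DOMContentLoaded', () => showPaywall());</script>
</body></html>
"""

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> bodies."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script and data.strip():
            self.text.append(data.strip())

parser = TextExtractor()
parser.feed(RAW_HTML)
```

The extracted text contains the entire article; `showPaywall()` never ran because there was no browser to run it.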
Alex Reisner documented this for The Atlantic (Nov 2025): Common Crawl was capturing full articles from NYT, WSJ, The Economist and The Atlantic itself.
3. User-agent spoofing
Some AI bots change their identity when blocked.
Cloudflare documented (Aug 2024) that Perplexity was using:
```txt
# Declared user-agent
PerplexityBot/1.0

# What they actually sent
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0
```
Plus ASN rotation to evade IP-based blocking. The evasion ecosystem includes FlareSolverr (Selenium + undetected-chromedriver), Scrapfly (94–98% bypass rates), and residential proxy rotation.
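The practical defense is to stop trusting the `User-Agent` header at all and instead verify the source IP against ranges the bot operator publishes. A sketch of the idea — the CIDR below is a placeholder (TEST-NET-1), not Perplexity's real range:

```python
import ipaddress

# Placeholder range (TEST-NET-1). In production you would load the bot
# operator's actually published IP ranges and refresh them periodically.
PERPLEXITY_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

def is_verified_bot(remote_ip: str, user_agent: str) -> bool:
    """A UA string alone proves nothing; require the request's source IP
    to fall inside the operator's published ranges as well."""
    if "PerplexityBot" not in user_agent:
        return False
    ip = ipaddress.ip_address(remote_ip)
    return any(ip in net for net in PERPLEXITY_RANGES)
```

Anything claiming to be `PerplexityBot` from an unlisted IP — or a "regular Chrome on macOS" arriving from a datacenter ASN — gets treated as an unverified scraper.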
4. Syndication redistribution
Once your content leaves your domain through any syndication channel, your robots.txt is irrelevant:
```txt
Original domain (robots.txt: Disallow)
  → RSS feed (no robots.txt)
  → Apple News (different domain)
  → Email newsletter (archived on web)
  → Cross-posted to social (scraped by bots)
  → API aggregators (reformatted downstream)
```
Each channel creates a copy outside your control.
5. Web archives (Wayback Machine)
Internet Archive: 1+ billion pages, 99+ petabytes. web.archive.org is domain #187 in Google's C4 dataset.
Harvard's WARC-GPT lets you ingest WARC archives directly into RAG pipelines. As of Feb 2026, publishers like The Guardian and NYT started blocking Wayback Machine over AI concerns.
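The Wayback Machine exposes its captures through a public CDX API, so you can enumerate every archived snapshot of your own domain. A minimal sketch that only builds the query URL:

```python
from urllib.parse import urlencode

CDX = "https://web.archive.org/cdx/search/cdx"

def wayback_snapshots_url(domain: str, limit: int = 10) -> str:
    """Build a CDX query listing archived captures under a domain.

    What the archive holds is independent of what your live robots.txt
    says today; filter to successful (HTTP 200) captures.
    """
    params = urlencode({
        "url": f"{domain}/*",
        "output": "json",
        "limit": limit,
        "filter": "statuscode:200",
    })
    return f"{CDX}?{params}"
```

Run that query for your domain and you see exactly which historical versions of your pages are available for anyone — including RAG pipelines — to ingest.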
6. Real-time RAG access
Modern LLMs don't just rely on training data. They fetch content in real time:
| Bot | Growth 2024–2025 | Mechanism |
|---|---|---|
| ChatGPT-User | +2,825% | Fetch on user "search the web" |
| PerplexityBot | +157,490% | Fetch on every query |
| Meta-ExternalFetcher | New in 2024 | Meta AI features |
These bots classify the fetch as "user-initiated" rather than autonomous crawling, arguing that this exempts them from robots.txt.
Cloudflare reported that Anthropic's bots have crawl-to-referral ratios between 38,000:1 and 70,000:1: for every visit they refer back to a site, they crawl it tens of thousands of times.
Sources: Cloudflare Blog 2025, OpenAI Crawlers Overview
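These fetchers do show up in your access logs, so you can at least measure them. A toy sketch counting hits from known AI fetcher user-agents — the log lines are invented for the example, and the bot-name list is illustrative, not exhaustive:

```python
from collections import Counter

# Illustrative, not exhaustive: user-agent substrings of known AI fetchers.
AI_FETCHERS = ("ChatGPT-User", "PerplexityBot", "Meta-ExternalFetcher")

# Invented access-log lines standing in for your real server logs.
SAMPLE_LOG = [
    '203.0.113.7 - - [01/Mar/2025] "GET /pricing HTTP/1.1" 200 "-" "Mozilla/5.0 ChatGPT-User/1.0"',
    '198.51.100.2 - - [01/Mar/2025] "GET /blog/post HTTP/1.1" 200 "-" "PerplexityBot/1.0"',
    '192.0.2.9 - - [01/Mar/2025] "GET / HTTP/1.1" 200 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]

def count_ai_fetches(lines):
    """Tally requests whose user-agent matches a known AI fetcher."""
    hits = Counter()
    for line in lines:
        for bot in AI_FETCHERS:
            if bot in line:
                hits[bot] += 1
    return hits
```

Even a crude tally like this tells you whether "user-initiated" fetching is hitting your site — and which pages it wants.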
7. Content farms
Content farms — human or AI-operated — rewrite your articles on unrestricted domains:
1. Scrape/copy original article
2. Rewrite to avoid plagiarism detection
3. Publish on domain with no robots.txt restrictions
4. AI crawler indexes the rewrite
5. LLM absorbs the rewritten version
In Bartz v. Anthropic PBC, the court ruled that training AI with content from "pirate sites" constituted fair use. This sets precedent for rewritten content too.
8. Direct non-compliance
The simplest vector: bots just ignore robots.txt.
- 12.9% of bots now ignore it entirely, up from 3.3% — Paul Calvano, Aug 2025
- Duke University (2025): "several categories of AI-related crawlers never request robots.txt"
- Kim & Bock (ACM IMC 2025): scrapers are less likely to comply with more restrictive directives
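You can measure this on your own logs: a bot that never requests `/robots.txt` before fetching content plainly isn't honoring it. A minimal sketch over `(user_agent, path)` request tuples in arrival order:

```python
def robots_compliance(requests):
    """Map each user-agent to whether it fetched /robots.txt
    before requesting any other path.

    requests: iterable of (user_agent, path) tuples in arrival order.
    """
    seen_robots = set()     # UAs that have requested /robots.txt so far
    fetched_first = {}      # UA -> did it check robots.txt before content?
    for ua, path in requests:
        if path == "/robots.txt":
            seen_robots.add(ua)
        elif ua not in fetched_first:
            fetched_first[ua] = ua in seen_robots
    return fetched_first
```

This is a rough heuristic (caching and distributed crawlers complicate it), but it's enough to spot the crawlers that never ask.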
The legal status is clear: in Ziff Davis v. OpenAI (2025), the judge described robots.txt as "more like a sign than a fence" — not a technological measure that "effectively controls access" under the DMCA.
The compliance stats
| Metric | Value | Source |
|---|---|---|
| Bots ignoring robots.txt | 12.9% | Paul Calvano, 2025 |
| Top 10K sites with AI bot rules | Only 14% | Market analysis 2025 |
| Sites with any robots.txt | 94% (12.2M sites) | Global study 2025 |
So what do you do?
Blocking alone doesn't work. Defensive measures reduce direct crawling by 40–60% for compliant bots, but they can't touch historical data, syndicated copies, or content farm rewrites.
The alternative is offensive: control the narrative instead of trying to hide from it.
At 498 Advance we built tools for this: GEOdoctor for technical auditing of brand visibility in LLMs, and S.A.M. (Semantic Alignment Machine) for content alignment across owned media, UGC platforms (social GEO) and authority domains.
Full analysis with all academic sources: zoopa.es/en/blog
Have you run into this paradox? Blocking everything but still appearing in LLM outputs? I'd love to hear what you've observed in your own infrastructure.