Your robots.txt says GPTBot is welcome. Your server says 403.

Marius Orzaru — Fri, 22 May 2026 13:40:16 +0000

Your robots.txt lists User-agent: GPTBot - Allow: /. The page loads fine in a browser. The "AI crawler" checkers say you're configured correctly. But every time ChatGPT-User actually fetches your site, it gets a 403. You don't show up in ChatGPT when people ask about your product. You don't show up in Perplexity. The standard tools can't see why, because they're reading the wrong file.

This is the most common AI crawler accessibility failure in 2026, and almost nothing on the open web explains it correctly. Most write-ups stop at "here are five user-agents, add them to your robots.txt." That's table stakes. The actual blocks happen one layer up - at the CDN, at the WAF, in the JS shell of an SPA - and you can configure robots.txt perfectly while still being invisible to every model that matters.

The three ways your site gets blocked

Three layers. They fail for different reasons, they need different fixes, and from the outside they all look the same: your site, missing from ChatGPT, no obvious cause. Most write-ups treat them as one thing. That's how readers end up patching the wrong layer.

Layer 1: robots.txt disallow (application layer)

The obvious case. Your robots.txt explicitly disallows an AI user-agent, or disallows * and never re-enabled the bots you actually wanted.

# Common failure mode: copied from a staging config
User-agent: *
Disallow: /

# Or the version that explicitly blocks AI bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

How to test: fetch /robots.txt directly and grep for AI user-agents. This is what every "AI crawler" tool already does. If this is your problem, the fix takes thirty seconds. The reason it gets so much airtime is that it's the easiest failure to detect and explain. Not because it's the most common.
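The same check can be scripted instead of grepped. A minimal sketch using Python's standard `urllib.robotparser`; the `AI_AGENTS` list and the `blocked_agents` helper are illustrative names, not part of any tool, and the stdlib matcher is simpler than what each crawler really implements, so treat the result as a first pass.

```python
from urllib.robotparser import RobotFileParser

# Illustrative list; extend it with whichever user-agents you care about.
AI_AGENTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot",
             "ClaudeBot", "Claude-User", "PerplexityBot", "Perplexity-User"]

def blocked_agents(robots_txt: str, path: str = "/", agents=AI_AGENTS) -> list[str]:
    """Return the agents that this robots.txt text disallows for `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [ua for ua in agents if not parser.can_fetch(ua, path)]

# The two failure modes shown above, as strings.
STAGING = "User-agent: *\nDisallow: /\n"
SELECTIVE = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"

print(blocked_agents(STAGING))    # every agent in AI_AGENTS
print(blocked_agents(SELECTIVE))  # ['GPTBot']
```

A clean result here only clears Layer 1. It says nothing about what the edge or the origin does with the actual request.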

Layer 2: CDN / WAF edge block

This is the failure mode that's killing 2026 AI visibility for most sites that "did everything right." Your origin never sees the request. Cloudflare, AWS WAF, Fastly, or a custom edge rule (the one someone added at 2am after a scraper incident and nobody has touched since) returns a 403 on its own, whatever robots.txt says.

The tell: your robots.txt is permissive. The bots get blocked anyway. Parsers say everything is fine.

# What a healthy response looks like
$ curl -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" -I https://your-site.com
HTTP/2 200
content-type: text/html; charset=utf-8

# What an edge-level block looks like
$ curl -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" -I https://your-site.com
HTTP/2 403
server: cloudflare
cf-ray: 8b9c2f1e4a8d3c12-FRA

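server: cloudflare plus a 403 or 429 for the bot user-agent, next to a 200 for a browser, usually means the request died at the edge. (Cloudflare stamps that header on responses it proxies from your origin too, so confirm in the zone's security events.) Same shape with server: CloudFront and x-cache: Error from cloudfront behind an AWS WAF rule, or via: 1.1 fastly. We'll go deep on Cloudflare below; it's the biggest source of silent blocks we see in practice.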
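If you are checking more than a couple of URLs, the header reading can be turned into a rough classifier. This is a heuristic sketch, not a detector: `classify_response` is a made-up helper, the CloudFront markers are my assumption about what an AWS-fronted block looks like, and a CDN header on a 403 can also come from an origin response that was merely proxied.

```python
def classify_response(status: int, headers: dict[str, str]) -> str:
    """Guess where a blocked response came from, using status and headers only."""
    h = {name.lower(): value.lower() for name, value in headers.items()}
    if status not in (403, 429):
        return "not blocked"
    if "cloudflare" in h.get("server", "") or "cf-ray" in h:
        return "likely edge block (Cloudflare)"
    if "fastly" in h.get("via", ""):
        return "likely edge block (Fastly)"
    # Assumption: CloudFront-fronted sites expose these markers on WAF blocks.
    if "cloudfront" in h.get("server", "") or "cloudfront" in h.get("x-cache", ""):
        return "likely edge block (CloudFront / AWS WAF)"
    return "origin or unknown"

print(classify_response(403, {"server": "cloudflare",
                              "cf-ray": "8b9c2f1e4a8d3c12-FRA"}))
# likely edge block (Cloudflare)
```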

Layer 3: origin or application block

The rest happens at your server. Less common than edge blocks. Easier to hide:

Custom user-agent filtering. Someone added if (ua.includes("Bot")) return 403 to middleware years ago. It catches GPTBot, ClaudeBot, and anything else with "Bot" in its name.
Rate limiting. Per-IP limits hit AI crawlers harder than human traffic because the crawler IPs are concentrated in a handful of datacenters.
Geo-blocking. AI bots fetch from regions your geo rules don't trust.
JS-rendering invisibility. 200 OK, empty body, model walks away with nothing. Worth its own section, coming up.
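The user-agent filter in the first item is worth a closer look, because substring checks rarely block what their author had in mind. A small sketch, porting that one-liner to Python (both `includes` and `in` are case-sensitive); the agent list is illustrative.

```python
def naive_filter_blocks(user_agent: str) -> bool:
    """Python equivalent of the middleware check `ua.includes("Bot")`."""
    return "Bot" in user_agent

AGENTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
          "Claude-User", "PerplexityBot", "Perplexity-User", "Googlebot"]

for ua in AGENTS:
    print(f"{ua:16} {'403' if naive_filter_blocks(ua) else 'pass'}")
```

With this exact check, GPTBot, OAI-SearchBot, ClaudeBot, and PerplexityBot get the 403, while the `-User` agents and Googlebot (lowercase "b") pass. A lowercased or regex variant of the same filter would behave differently, which is why you test the real middleware rather than reason about it.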

How to test: curl your page with each AI user-agent and read the response body. Don't just check the status code. A 200 with no content is a 200 that means nothing to a language model.
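For repeated checks, a scripted fetch that returns the status and the body together makes it harder to stop at the status code. A minimal sketch with Python's `urllib`; `fetch_as` and `summarize` are hypothetical helpers, and the `opener` parameter exists only so the function can be exercised without a network.

```python
import urllib.error
import urllib.request

def fetch_as(url, user_agent, opener=urllib.request.urlopen, timeout=10.0):
    """GET `url` with the given User-Agent and return (status, body bytes)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(request, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the error still carries the response.
        return err.code, err.read()

def summarize(status: int, body: bytes) -> str:
    """One-line summary: status code plus how much actually came back."""
    return f"{status} ({len(body)} bytes)"
```

Call it as `fetch_as("https://your-site.com", "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)")` and pass the result to `summarize`. Keep in mind that a request from your laptop only imitates the bot's user-agent, not its IP range, so rules keyed on IP or verified-bot status may treat you differently than the real crawler.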

The bots that matter in 2026

Most "AI crawler" lists copy each other and never explain what each bot actually does. Here's the practical version, sorted by what it costs you to block each one.

USER-AGENT              · PURPOSE                          · BLOCKING IMPACT                       · SEVERITY
─────────────────────────────────────────────────────────────────────────────────────────────────────────────
GPTBot                  · OpenAI training crawler          · Excluded from future GPT training     · low–medium *
ChatGPT-User            · ChatGPT live retrieval           · Invisible in ChatGPT answers          · CRITICAL
OAI-SearchBot           · ChatGPT Search index             · Excluded from ChatGPT Search          · HIGH
ClaudeBot               · Anthropic training crawler       · Excluded from Claude training         · low–medium *
Claude-User             · Claude live retrieval            · Invisible in Claude answers           · CRITICAL
anthropic-ai            · Legacy Anthropic UA              · Same as ClaudeBot (legacy token)      · low–medium *
PerplexityBot           · Perplexity index                 · Excluded from Perplexity              · HIGH
Perplexity-User         · Perplexity live retrieval        · Invisible to Perplexity queries       · CRITICAL
Google-Extended         · Gemini training + grounding      · Excluded from Gemini, not from Search · MEDIUM
Applebot-Extended       · Apple Intelligence training      · Excluded from Apple AI features       · LOW
meta-externalagent      · Meta AI training / retrieval     · Excluded from Meta AI                 · MEDIUM
Bytespider              · ByteDance crawler                · Excluded from ByteDance AI products   · LOW

* Blocking a training crawler is a legitimate choice; lots of sites opt out and consider that fine. Blocking a live retrieval crawler is almost always an accident that destroys your AI visibility.

This distinction is the only AI crawler concept that actually matters. Everything else is footnotes. Two categories, opposite blast radius (and yes, the naming is genuinely awful — ChatGPT-User and GPTBot sound interchangeable, they aren't):

Training crawlers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended). They collect your content for future model training. Opting out keeps you out of training data. Your live AI visibility doesn't move.
Live-retrieval crawlers (ChatGPT-User, Claude-User, Perplexity-User). They fetch a page right now because a human asked a question that needs it. Blocking these is what makes you invisible in the answer.

Every "should I block AI?" debate that skips this distinction is wasted oxygen. You can opt out of training and still show up in answers. The configuration is just different.
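Here is roughly what that different configuration looks like. A sketch that generates a robots.txt opting out of training while leaving live retrieval open, then checks it with the standard-library parser; the two lists mirror the categories above, and `build_robots_txt` is an illustrative helper, not a standard tool.

```python
from urllib.robotparser import RobotFileParser

TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended"]
LIVE_RETRIEVAL = ["ChatGPT-User", "Claude-User", "Perplexity-User"]

def build_robots_txt(disallow=TRAINING, allow=LIVE_RETRIEVAL) -> str:
    """One group per agent, then a permissive catch-all for everyone else."""
    groups = [f"User-agent: {ua}\nDisallow: /" for ua in disallow]
    groups += [f"User-agent: {ua}\nAllow: /" for ua in allow]
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"

robots = build_robots_txt()
parser = RobotFileParser()
parser.parse(robots.splitlines())
print(parser.can_fetch("GPTBot", "/"))        # False
print(parser.can_fetch("ChatGPT-User", "/"))  # True
```

This only expresses intent. It takes effect once the request reaches a crawler that honors robots.txt, which an edge block can prevent.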

The Cloudflare default block problem

In July 2024, Cloudflare shipped a one-click "Block AI Scrapers and Crawlers" toggle, available on every plan including free. In July 2025 it went further and made blocking AI crawlers the default for newly onboarded domains. The block runs at the edge, before your origin sees the request, so nothing in your robots.txt can override it.

This single setting is likely responsible for more silent AI invisibility in 2026 than every misconfigured robots.txt combined. Three things make it especially destructive:

It's on by default for many zones. Anyone who created a Cloudflare site in the last 18 months may have it enabled without knowing.
Standard tools can't see it. Checkers fetch robots.txt and report what the file says. They don't replay the request as the bot, so they never see the edge return a 403, and your origin logs show nothing because the blocked request never arrived.
It blocks live retrieval too. It doesn't distinguish training crawlers from live-retrieval ones. ChatGPT-User gets the same 403 as GPTBot.

Picture Cloudflare as a bouncer at the door. The bouncer checks the user-agent on the ID and decides whether to let the request through. Your robots.txt is a sign on the wall inside the building. The bouncer never reads it. The bot never gets close enough to.

Anyone running a Cloudflare zone created since mid-2024 who hasn't checked their AI Audit settings should assume the bots are blocked until they've proven otherwise.

How to verify

Run the curl tests from the Layer 2 section above. If you see server: cloudflare with a 403 on bot user-agents but a 200 on a regular browser UA, this is what's happening:

# Browser UA — passes
$ curl -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36" -I https://your-site.com
HTTP/2 200

# AI bot UA — blocked at the edge
$ curl -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" -I https://your-site.com
HTTP/2 403
server: cloudflare
cf-mitigated: challenge

To confirm in the dashboard: Cloudflare → Security → Bots → Configure. Look at the "Block AI Scrapers and Crawlers" toggle (it may sit under AI Audit) and the Super Bot Fight Mode settings.
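The browser-versus-bot comparison generalizes to any number of user-agents. A sketch of the diffing logic only: `find_ua_blocks` is a hypothetical helper, `get_status` is whatever function you use to fetch a status code for a given User-Agent string, and the bot UA strings are simplified placeholders built from the token rather than each vendor's published string.

```python
from typing import Callable

BROWSER_UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"

def find_ua_blocks(get_status: Callable[[str], int],
                   bot_tokens: list[str]) -> dict[str, int]:
    """Return {token: status} for bots whose status differs from the browser's."""
    baseline = get_status(BROWSER_UA)
    differing = {}
    for token in bot_tokens:
        # Placeholder UA string; real crawlers send longer, vendor-specific ones.
        status = get_status(f"Mozilla/5.0 (compatible; {token}/1.0)")
        if status != baseline:
            differing[token] = status
    return differing

# Stand-in for a real fetch: pretend the edge blocks two OpenAI agents.
def fake_status(user_agent: str) -> int:
    return 403 if "GPTBot" in user_agent or "ChatGPT-User" in user_agent else 200

print(find_ua_blocks(fake_status, ["GPTBot", "ChatGPT-User", "ClaudeBot"]))
# {'GPTBot': 403, 'ChatGPT-User': 403}
```

An empty result means no user-agent-based difference at the status level. It does not rule out an empty body.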

How to fix

Three options, in increasing order of granularity:

1. Turn the global AI block off
   → Cloudflare → Security → Bots → turn off "Block AI Scrapers and Crawlers"
   → Use this if you want to be discoverable in all AI surfaces.

2. Allow specific bots, block the rest
   → Cloudflare → Security → WAF → Create a custom rule
   → Match: http.user_agent contains "ChatGPT-User"
   → Action: Skip
   → Useful if you want to block training but allow live retrieval.

3. Use the verified-bot allowlist
   → Cloudflare maintains a list of verified AI bots that bypass blocks.
   → Settings → Bots → Verified Bots → review which AI categories you trust.
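For option 2, one rule can cover several live-retrieval agents instead of one rule per bot. A sketch that builds the match expression as a string to paste into the custom rule editor; `skip_rule_expression` is an illustrative helper, and whether a Skip action bypasses the AI block on your plan is something to verify with the curl test afterwards.

```python
LIVE_RETRIEVAL = ["ChatGPT-User", "Claude-User", "Perplexity-User"]

def skip_rule_expression(tokens=LIVE_RETRIEVAL) -> str:
    """Build one expression that matches any of the given user-agent tokens."""
    clauses = [f'(http.user_agent contains "{token}")' for token in tokens]
    return " or ".join(clauses)

print(skip_rule_expression())
```

Matching on user-agent alone lets through anyone who claims to be ChatGPT-User, so treat this as an allow rule for a header, not for an identity.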

The same pattern shows up across edge providers. AWS WAF has managed rule groups that block AI bots by user-agent (AWS-AWSManagedRulesBotControlRuleSet). Fastly customers have written custom VCL to do the same. If you're on any CDN, the question to ask is: is anything filtering by user-agent before my origin sees the request?

The JS-rendering trap

There's a fourth way to be invisible that isn't technically a block, and it's worse because every diagnostic lies. Your server returns 200. Your headers look healthy. The bot fetches your page and walks away with nothing.

Most AI crawlers don't run JavaScript. They read the initial HTML payload and stop. (Yes, Googlebot has rendered JS for search for years, on an evergreen Chromium since 2019. The AI crawlers haven't caught up. They probably won't soon.) If your site is a client-rendered SPA, the bot is reading a nearly empty <body> and a div#root that never gets populated.

# Server returns 200, but the body has almost nothing in it
$ curl -s -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" https://your-spa.com | wc -c
1247
# 1247 bytes — basically just the shell. The actual content is rendered by JS.

If you want to be more rigorous:

# Pipe the response through a text extractor and count actual content
$ curl -s -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" https://your-spa.com \
    | sed 's/<[^>]*>//g' | tr -s '[:space:]' ' ' | wc -w
14
# 14 words of content visible to GPTBot. The page actually has 1,200.

Or just open dev tools, disable JavaScript, and reload. If your content disappears, the AI crawlers are seeing the same blank page.
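The sed pipeline above strips tags but still counts whatever sits inside inline <script> and <style> blocks. A slightly stricter sketch using Python's standard `html.parser`; `VisibleText` and `visible_word_count` are illustrative names, and the SPA shell is a made-up example.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text that sits outside <script> and <style> elements."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())

def visible_word_count(html: str) -> int:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(parser.words)

SPA_SHELL = """<!doctype html><html><head><title>My App</title>
<script>window.__CONFIG__ = {"api": "/v1"};</script></head>
<body><div id="root"></div><script src="/assets/app.js"></script></body></html>"""

print(visible_word_count(SPA_SHELL))  # 2: only the <title> text
```

Feed it the body returned for a bot user-agent and compare the count with what you see in the rendered page.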

The fix is server-rendering or static generation. Remix and SvelteKit server-render by default, and Astro ships static HTML by default. The pattern that breaks is the pure SPA: CRA, Vite without SSR, anything that ships an empty index.html and renders everything in the browser.

Not a quick fix. But if you've ruled out robots.txt and edge blocks and you're still invisible, this is probably what's happening. The bot can't see your content because there isn't any to see.

The opt-outs that actually matter

Not every AI bot deserves the same answer. Treating "block AI" as a binary instead of a per-bot judgment call is how you end up either too permissive (free training data, no upside) or too restrictive (invisible in answers you wanted to be in). A short opinionated breakdown:

Always allow live-retrieval bots. ChatGPT-User, Claude-User, Perplexity-User. There is no downside. These fetch your page only when a human is actively asking a question that points at your content. Blocking them is a self-inflicted wound.

Allow training bots only if you want to be in training data. GPTBot, ClaudeBot, anthropic-ai. Opting out is a legitimate choice that more sites are making, especially publishers and SaaS companies who'd rather not have their docs used as gradient updates. Your live AI visibility isn't affected either way.

Google-Extended is the complicated one. It's a robots.txt token, not a separate crawler: Google fetches with its regular user agents, and the token controls whether your content is used for Gemini training and grounding. It does not control AI Overviews, which are part of Search and follow Googlebot and the usual snippet controls. It's also separate from Googlebot (disallowing Google-Extended has zero effect on your regular Google rankings). Lots of sites that thought they were opting out of "AI" cut themselves out of Gemini and change nothing about AI Overviews or their search traffic. Which is dumb.

Applebot-Extended, meta-externalagent, and Bytespider are lower-stakes. These bots feed AI surfaces with much smaller market share. Decide on principle, not blast radius.

The framing that helps: every AI bot is either a customer (live retrieval, sends users back to you) or a vendor (training, builds models that may or may not link back to you). Most sites should let the customers in.

The 30-second version

Configuring robots.txt right in 2026 keeps you from being trivially invisible. It doesn't make you visible. The failure mode killing AI visibility for most sites isn't a missing robots.txt directive. It's an edge-level block they didn't know was on, or a JS-rendered page the crawler can't read. If robots.txt is the only thing you test, you're checking the layer where almost nothing actually goes wrong.

Test the live fetch. Run it under every bot UA you care about. And don't just check the status code — read what came back. Your site might be one Cloudflare toggle away from being invisible to half the web in 2027 - and that toggle is already there, waiting.

DEV Community: Marius Orzaru