<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Marius Orzaru</title>
    <description>The latest articles on DEV Community by Marius Orzaru (@orzmar).</description>
    <link>https://dev.to/orzmar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3946154%2F4aa12528-5908-4c97-9f83-26d04f931e4e.png</url>
      <title>DEV Community: Marius Orzaru</title>
      <link>https://dev.to/orzmar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/orzmar"/>
    <language>en</language>
    <item>
      <title>Your robots.txt says GPTBot is welcome. Your server says 403.</title>
      <dc:creator>Marius Orzaru</dc:creator>
      <pubDate>Fri, 22 May 2026 13:40:16 +0000</pubDate>
      <link>https://dev.to/orzmar/your-robotstxt-says-gptbot-is-welcome-your-server-says-403-9f2</link>
      <guid>https://dev.to/orzmar/your-robotstxt-says-gptbot-is-welcome-your-server-says-403-9f2</guid>
      <description>&lt;p&gt;Your &lt;code&gt;robots.txt&lt;/code&gt; pairs &lt;code&gt;User-agent: GPTBot&lt;/code&gt; with &lt;code&gt;Allow: /&lt;/code&gt;. The page loads fine in a browser. The "AI crawler" checkers say you're configured correctly. But every time &lt;code&gt;ChatGPT-User&lt;/code&gt; actually fetches your site, it gets a &lt;code&gt;403&lt;/code&gt;. You don't show up in ChatGPT when people ask about your product. You don't show up in Perplexity. The standard tools can't see why, because they read &lt;code&gt;robots.txt&lt;/code&gt; and never look at what the server actually returns to the bot.&lt;/p&gt;

&lt;p&gt;This is the most common AI crawler accessibility failure in 2026, and almost nothing on the open web explains it correctly. Most write-ups stop at "here are five user-agents, add them to your &lt;code&gt;robots.txt&lt;/code&gt;." That's table stakes. The actual blocks happen one layer up - at the CDN, at the WAF, in the JS shell of an SPA - and you can configure &lt;code&gt;robots.txt&lt;/code&gt; perfectly while still being invisible to every model that matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three ways your site gets blocked
&lt;/h2&gt;

&lt;p&gt;Three layers. They fail for different reasons, they need different fixes, and from the outside they all look the same: your site, missing from ChatGPT, no obvious cause. Most write-ups treat them as one thing. That's how readers end up patching the wrong layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: robots.txt disallow (application layer)
&lt;/h3&gt;

&lt;p&gt;The obvious case. Your &lt;code&gt;robots.txt&lt;/code&gt; explicitly disallows an AI user-agent, or disallows &lt;code&gt;*&lt;/code&gt; and never re-enables the bots you actually wanted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Common failure mode: copied from a staging config
User-agent: *
Disallow: /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Or the version that explicitly blocks AI bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How to test: fetch &lt;code&gt;/robots.txt&lt;/code&gt; directly and grep for AI user-agents. This is what every "AI crawler" tool already does. If this is your problem, the fix takes thirty seconds. The reason it gets so much airtime is that it's the easiest failure to detect and explain. Not because it's the most common.&lt;/p&gt;
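
&lt;p&gt;A minimal sketch of that check, using Python's standard-library &lt;code&gt;urllib.robotparser&lt;/code&gt; in place of grep. The &lt;code&gt;AI_USER_AGENTS&lt;/code&gt; list and the &lt;code&gt;robots_verdicts&lt;/code&gt; helper are illustrative names, not part of any tool mentioned here, and the parser's matching rules may differ in detail from what each crawler implements.&lt;/p&gt;

```python
from urllib.robotparser import RobotFileParser

# Illustrative subset of the user-agents discussed in this article.
AI_USER_AGENTS = [
    "GPTBot",
    "ChatGPT-User",
    "OAI-SearchBot",
    "ClaudeBot",
    "Claude-User",
    "PerplexityBot",
    "Perplexity-User",
]


def robots_verdicts(robots_txt, path="/"):
    """Return {user_agent: allowed} for a robots.txt body passed as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {ua: parser.can_fetch(ua, path) for ua in AI_USER_AGENTS}


if __name__ == "__main__":
    # The staging config from the first example above.
    staging_copy = "User-agent: *\nDisallow: /\n"
    for ua, allowed in robots_verdicts(staging_copy).items():
        print(ua, "allowed" if allowed else "DISALLOWED")
```

&lt;p&gt;This only reads the file. A clean result here says nothing about whether the live fetch succeeds, which is the point of the next two layers.&lt;/p&gt;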

&lt;h3&gt;
  
  
  Layer 2: CDN / WAF edge block
&lt;/h3&gt;

&lt;p&gt;This is the failure mode that's killing 2026 AI visibility for most sites that "did everything right." Your origin never sees the request. Cloudflare, AWS WAF, Fastly, or a custom edge rule (the one someone added at 2am after a scraper incident and nobody has touched since) returns a &lt;code&gt;403&lt;/code&gt; before &lt;code&gt;robots.txt&lt;/code&gt; gets read.&lt;/p&gt;

&lt;p&gt;The tell: your &lt;code&gt;robots.txt&lt;/code&gt; is permissive. The bots get blocked anyway. Parsers say everything is fine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# What a healthy response looks like&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="s2"&gt;"Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"&lt;/span&gt; &lt;span class="nt"&gt;-I&lt;/span&gt; https://your-site.com
HTTP/2 200
content-type: text/html&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nv"&gt;charset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;utf-8

&lt;span class="c"&gt;# What an edge-level block looks like&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="s2"&gt;"Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"&lt;/span&gt; &lt;span class="nt"&gt;-I&lt;/span&gt; https://your-site.com
HTTP/2 403
server: cloudflare
cf-ray: 8b9c2f1e4a8d3c12-FRA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;server: cloudflare&lt;/code&gt; plus a &lt;code&gt;403&lt;/code&gt; or &lt;code&gt;429&lt;/code&gt; usually means the request died at the edge. Same shape with &lt;code&gt;server: CloudFront&lt;/code&gt; in front of an AWS WAF rule, or a Fastly &lt;code&gt;via: 1.1 varnish&lt;/code&gt; header. We'll go deep on Cloudflare below; it's the biggest source of silent blocks by a wide margin.&lt;/p&gt;
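
&lt;p&gt;The header reading above can be sketched as a small classifier. The &lt;code&gt;EDGE_MARKERS&lt;/code&gt; list and the function name are assumptions made for illustration, and the result is a hint rather than proof: a &lt;code&gt;403&lt;/code&gt; generated by your origin and passed through the CDN carries the same &lt;code&gt;server&lt;/code&gt; header.&lt;/p&gt;

```python
# Header substrings that suggest a CDN or WAF answered the request.
# This list is an assumption for illustration; adjust it to your own stack.
EDGE_MARKERS = [
    ("server", "cloudflare"),
    ("server", "cloudfront"),
    ("via", "varnish"),
    ("via", "fastly"),
]


def classify_response(status, headers):
    """Label a response as 'edge block', 'origin block', 'passed', or 'other'."""
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    at_edge = "cf-mitigated" in lowered or any(
        marker in lowered.get(name, "") for name, marker in EDGE_MARKERS
    )
    if status in (403, 429):
        return "edge block" if at_edge else "origin block"
    if status == 200:
        return "passed"
    return "other"


if __name__ == "__main__":
    blocked = {"server": "cloudflare", "cf-ray": "8b9c2f1e4a8d3c12-FRA"}
    print(classify_response(403, blocked))
```

&lt;p&gt;Feed it the status and headers from each curl run. A "passed" verdict only covers the status code, so the body still needs reading.&lt;/p&gt;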

&lt;h3&gt;
  
  
  Layer 3: origin or application block
&lt;/h3&gt;

&lt;p&gt;The rest happens at your server. Less common than edge blocks. Easier to hide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom user-agent filtering.&lt;/strong&gt; Someone added &lt;code&gt;if (ua.includes("Bot")) return 403&lt;/code&gt; to middleware years ago. It catches &lt;code&gt;GPTBot&lt;/code&gt; along with everything else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limiting.&lt;/strong&gt; Per-IP limits hit AI crawlers harder than human traffic because the crawler IPs are concentrated in a handful of datacenters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geo-blocking.&lt;/strong&gt; AI bots fetch from regions your geo rules don't trust.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JS-rendering invisibility.&lt;/strong&gt; &lt;code&gt;200 OK&lt;/code&gt;, empty body, model walks away with nothing. Worth its own section, coming up.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How to test: curl your page with each AI user-agent and read the response body. Don't just check the status code. A &lt;code&gt;200&lt;/code&gt; with no content is a &lt;code&gt;200&lt;/code&gt; that means nothing to a language model.&lt;/p&gt;
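
&lt;p&gt;To make the first bullet concrete, here is a hedged Python sketch of that legacy substring check next to one possible repair. Both function names and the &lt;code&gt;ALLOWED_AI_AGENTS&lt;/code&gt; tuple are hypothetical, standing in for whatever your middleware actually looks like.&lt;/p&gt;

```python
# Crawler tokens you have decided to let through; an illustrative subset.
ALLOWED_AI_AGENTS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot")


def naive_filter(user_agent):
    """Mimics the legacy middleware: reject anything containing 'Bot'."""
    return 403 if "Bot" in user_agent else 200


def allowlist_filter(user_agent):
    """Keep the blanket rule, but consult an explicit allowlist first."""
    if any(token in user_agent for token in ALLOWED_AI_AGENTS):
        return 200
    return naive_filter(user_agent)


if __name__ == "__main__":
    gptbot = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
    print("naive:", naive_filter(gptbot))
    print("allowlist:", allowlist_filter(gptbot))
```

&lt;p&gt;User-agent strings are trivially spoofed, so treat this as a sketch of the logic only. A production allowlist would normally pair the token check with some verification of where the request came from.&lt;/p&gt;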

&lt;h2&gt;
  
  
  The bots that matter in 2026
&lt;/h2&gt;

&lt;p&gt;Most "AI crawler" lists copy each other and never explain what each bot actually does. Here's the practical version, sorted by what it costs you to block each one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;USER-AGENT              · PURPOSE                          · BLOCKING IMPACT                       · SEVERITY
─────────────────────────────────────────────────────────────────────────────────────────────────────────────
GPTBot                  · OpenAI training crawler          · Excluded from future GPT training     · low–medium *
ChatGPT-User            · ChatGPT live retrieval           · Invisible in ChatGPT answers          · CRITICAL
OAI-SearchBot           · ChatGPT Search index             · Excluded from ChatGPT Search          · HIGH
ClaudeBot               · Anthropic training crawler       · Excluded from Claude training         · low–medium *
Claude-User             · Claude live retrieval            · Invisible in Claude answers           · CRITICAL
anthropic-ai            · Legacy Anthropic UA              · Same as Claude-User (older clients)   · HIGH
PerplexityBot           · Perplexity index                 · Excluded from Perplexity              · HIGH
Perplexity-User         · Perplexity live retrieval        · Invisible to Perplexity queries       · CRITICAL
Google-Extended         · Gemini training + grounding      · Excluded from Gemini (not Search)     · HIGH
Applebot-Extended       · Apple Intelligence training      · Excluded from Apple AI features       · LOW
meta-externalagent      · Meta AI training / retrieval     · Excluded from Meta AI                 · MEDIUM
Bytespider              · ByteDance crawler                · Excluded from ByteDance AI products   · LOW
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;*&lt;/code&gt; Blocking a training crawler is a legitimate choice; lots of sites opt out and consider that fine. Blocking a &lt;em&gt;live retrieval&lt;/em&gt; crawler is almost always an accident that destroys your AI visibility.&lt;/p&gt;

&lt;p&gt;This distinction is the only AI crawler concept that actually matters. Everything else is footnotes. Two categories, opposite blast radius (and yes, the naming is genuinely awful — &lt;code&gt;ChatGPT-User&lt;/code&gt; and &lt;code&gt;GPTBot&lt;/code&gt; sound interchangeable, they aren't):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Training crawlers&lt;/strong&gt; (&lt;code&gt;GPTBot&lt;/code&gt;, &lt;code&gt;ClaudeBot&lt;/code&gt;, &lt;code&gt;Google-Extended&lt;/code&gt;, &lt;code&gt;Applebot-Extended&lt;/code&gt;). They collect your content for future model training. Opting out keeps you out of training data. Your live AI visibility doesn't move.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live-retrieval crawlers&lt;/strong&gt; (&lt;code&gt;ChatGPT-User&lt;/code&gt;, &lt;code&gt;Claude-User&lt;/code&gt;, &lt;code&gt;Perplexity-User&lt;/code&gt;). They fetch a page right now because a human asked a question that needs it. Blocking these is what makes you invisible in the answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every "should I block AI?" debate that skips this distinction is wasted oxygen. You can opt out of training and still show up in answers. The configuration is just different.&lt;/p&gt;
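
&lt;p&gt;One way that different configuration could look, sketched in Python: generate a &lt;code&gt;robots.txt&lt;/code&gt; that disallows the training tokens and explicitly allows the live-retrieval ones, then round-trip it through the standard-library parser. The two lists come from the bullets above; &lt;code&gt;build_robots_txt&lt;/code&gt; is a hypothetical helper, and this covers the &lt;code&gt;robots.txt&lt;/code&gt; layer only.&lt;/p&gt;

```python
from urllib.robotparser import RobotFileParser

TRAINING_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended"]
LIVE_RETRIEVAL_AGENTS = ["ChatGPT-User", "Claude-User", "Perplexity-User"]


def build_robots_txt(blocked, allowed):
    """Emit one robots.txt group per agent: Disallow for blocked, Allow for allowed."""
    groups = [f"User-agent: {ua}\nDisallow: /" for ua in blocked]
    groups += [f"User-agent: {ua}\nAllow: /" for ua in allowed]
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    text = build_robots_txt(TRAINING_AGENTS, LIVE_RETRIEVAL_AGENTS)
    print(text)
    # Round-trip through the standard-library parser as a sanity check.
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    print("GPTBot allowed:", parser.can_fetch("GPTBot", "/"))
    print("ChatGPT-User allowed:", parser.can_fetch("ChatGPT-User", "/"))
```

&lt;p&gt;Writing the allow groups out is redundant for most parsers, since an unlisted agent is allowed anyway, but it documents the intent for whoever edits the file next.&lt;/p&gt;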

&lt;h2&gt;
  
  
  The Cloudflare default block problem
&lt;/h2&gt;

&lt;p&gt;In July 2024, Cloudflare shipped a one-click "Block AI Scrapers and Crawlers" toggle for every plan, including free. In July 2025 it went further and started blocking AI crawlers by default on newly added domains. The block runs at the edge, before your origin sees the request, and takes effect no matter what &lt;code&gt;robots.txt&lt;/code&gt; says.&lt;/p&gt;

&lt;p&gt;This single setting is likely responsible for more silent AI invisibility in 2026 than every misconfigured &lt;code&gt;robots.txt&lt;/code&gt; combined. Three things make it especially destructive:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It's on by default for many zones.&lt;/strong&gt; Anyone who created a Cloudflare site in the last 18 months may have it enabled without knowing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standard tools can't see it.&lt;/strong&gt; Cloudflare blocks before your origin runs. Checkers fetch &lt;code&gt;robots.txt&lt;/code&gt; with their own user-agent, which the edge lets through, so they see a permissive file and report success. They never replay the request as the bot, so they never hit the rule that blocks it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It blocks live retrieval too.&lt;/strong&gt; It doesn't distinguish training crawlers from live-retrieval ones. &lt;code&gt;ChatGPT-User&lt;/code&gt; gets the same &lt;code&gt;403&lt;/code&gt; as &lt;code&gt;GPTBot&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Picture Cloudflare as a bouncer at the door. The bouncer checks the user-agent on the ID and decides whether to let the request through. Your &lt;code&gt;robots.txt&lt;/code&gt; is a sign on the wall inside the building. The bouncer never reads it. The bot never gets close enough to.&lt;/p&gt;

&lt;p&gt;Anyone running a Cloudflare zone created since mid-2024 who hasn't checked their AI Audit settings should assume the bots are blocked until they've proven otherwise.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to verify
&lt;/h3&gt;

&lt;p&gt;Run the curl tests from the Layer 2 section above. If you see &lt;code&gt;server: cloudflare&lt;/code&gt; with a &lt;code&gt;403&lt;/code&gt; on bot user-agents but a &lt;code&gt;200&lt;/code&gt; on a regular browser UA, this is what's happening:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Browser UA — passes&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="s2"&gt;"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"&lt;/span&gt; &lt;span class="nt"&gt;-I&lt;/span&gt; https://your-site.com
HTTP/2 200

&lt;span class="c"&gt;# AI bot UA — blocked at the edge&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="s2"&gt;"Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"&lt;/span&gt; &lt;span class="nt"&gt;-I&lt;/span&gt; https://your-site.com
HTTP/2 403
server: cloudflare
cf-mitigated: challenge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To confirm in the dashboard: &lt;strong&gt;Cloudflare → Security → Bots → Configure&lt;/strong&gt;. Look at the AI Audit / "Block AI Crawlers" toggle and the Super Bot Fight Mode settings.&lt;/p&gt;
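
&lt;p&gt;If you would rather script the browser-versus-bot comparison than eyeball two curl runs, a rough Python equivalent might look like this. &lt;code&gt;head_status&lt;/code&gt; and &lt;code&gt;ua_verdict&lt;/code&gt; are hypothetical helper names, the user-agent strings are the ones used in the curl examples, and the verdict labels are this sketch's own wording.&lt;/p&gt;

```python
import urllib.error
import urllib.request

BROWSER_UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
GPTBOT_UA = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"


def head_status(url, user_agent, timeout=10):
    """Send a HEAD request (like curl -I) and return the HTTP status code."""
    request = urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="HEAD"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx and 5xx; the status code is still what we want.
        return error.code


def ua_verdict(browser_status, bot_status):
    """Compare the two status codes and describe what they suggest."""
    if browser_status == 200 and bot_status in (403, 429):
        return "user-agent block"
    if browser_status == bot_status:
        return "no user-agent difference"
    return "inconclusive"
```

&lt;p&gt;Usage would be along the lines of &lt;code&gt;ua_verdict(head_status(url, BROWSER_UA), head_status(url, GPTBOT_UA))&lt;/code&gt; with your own URL. A "user-agent block" verdict tells you something is filtering on the header; the response headers and the dashboard tell you where.&lt;/p&gt;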

&lt;h3&gt;
  
  
  How to fix
&lt;/h3&gt;

&lt;p&gt;Three options, in increasing order of granularity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Turn the global AI block off
   → Cloudflare → Security → Bots → turn off "Block AI Scrapers and Crawlers"
   → Use this if you want to be discoverable in all AI surfaces.

2. Allow specific bots, block the rest
   → Cloudflare → Security → WAF → Create a custom rule
   → Match: http.user_agent contains "ChatGPT-User"
   → Action: Skip
   → Useful if you want to block training but allow live retrieval.

3. Use the verified-bot allowlist
   → Cloudflare maintains a list of verified AI bots that bypass blocks.
   → Settings → Bots → Verified Bots → review which AI categories you trust.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same pattern shows up across edge providers. AWS WAF has managed rule groups that block AI bots by user-agent (&lt;code&gt;AWS-AWSManagedRulesBotControlRuleSet&lt;/code&gt;). Fastly customers have written custom VCL to do the same. If you're on any CDN, the question to ask is: is anything filtering by user-agent before my origin sees the request?&lt;/p&gt;

&lt;h2&gt;
  
  
  The JS-rendering trap
&lt;/h2&gt;

&lt;p&gt;There's a fourth way to be invisible that isn't technically a block, and it's worse because every diagnostic lies. Your server returns &lt;code&gt;200&lt;/code&gt;. Your headers look healthy. The bot fetches your page and walks away with nothing.&lt;/p&gt;

&lt;p&gt;Most AI crawlers don't run JavaScript. They read the initial HTML payload and stop. (Yes, Googlebot has rendered JS for search for years, on an evergreen Chromium since 2019. The AI crawlers haven't caught up. They probably won't soon.) If your site is a client-rendered SPA, the bot is reading a &lt;code&gt;&amp;lt;body&amp;gt;&lt;/code&gt; that holds nothing but a &lt;code&gt;div#root&lt;/code&gt; that never gets populated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Server returns 200, but the body is only the app shell&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="s2"&gt;"Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"&lt;/span&gt; https://your-spa.com | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt;
1247
&lt;span class="c"&gt;# 1247 bytes — basically just the shell. The actual content is rendered by JS.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to be more rigorous:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pipe the response through a text extractor and count actual content&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="s2"&gt;"Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"&lt;/span&gt; https://your-spa.com &lt;span class="se"&gt;\&lt;/span&gt;
    | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/&amp;lt;[^&amp;gt;]*&amp;gt;//g'&lt;/span&gt; | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s1"&gt;'[:space:]'&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt; | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-w&lt;/span&gt;
14
&lt;span class="c"&gt;# 14 words of content visible to GPTBot. The page actually has 1,200.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or just open dev tools, disable JavaScript, and reload. If your content disappears, the AI crawlers are seeing the same blank page.&lt;/p&gt;
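
&lt;p&gt;The &lt;code&gt;sed&lt;/code&gt; pipeline above also counts whatever sits inside inline scripts and stylesheets as words. A slightly stricter sketch, using only Python's standard-library &lt;code&gt;html.parser&lt;/code&gt;, skips those. The class and function names are made up for this example, and the count is a rough proxy for what a non-rendering crawler gets, not a model of any specific bot.&lt;/p&gt;

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of script and style tags."""

    SKIPPED = ("script", "style")

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth != 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside script and style blocks.
        if self._skip_depth == 0:
            self.words.extend(data.split())


def visible_word_count(html):
    """Count whitespace-separated words in the text a non-JS client would see."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(parser.words)
```

&lt;p&gt;Save the curl output to a file, read it into a string, and pass it to &lt;code&gt;visible_word_count&lt;/code&gt;. If the number is a tiny fraction of what the rendered page shows, the content lives in JavaScript.&lt;/p&gt;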

&lt;p&gt;The fix is server-rendering or static generation. Astro prerenders static HTML by default, and Remix and SvelteKit server-render by default. The pattern that breaks is the pure SPA: CRA, Vite without SSR, anything that ships an empty &lt;code&gt;index.html&lt;/code&gt; and builds the page client-side from there.&lt;/p&gt;

&lt;p&gt;Not a quick fix. But if you've ruled out &lt;code&gt;robots.txt&lt;/code&gt; and edge blocks and you're still invisible, this is probably what's happening. The bot can't see your content because there isn't any to see.&lt;/p&gt;

&lt;h2&gt;
  
  
  The opt-outs that actually matter
&lt;/h2&gt;

&lt;p&gt;Not every AI bot deserves the same answer. Treating "block AI" as a binary instead of a per-bot judgment call is how you end up either too permissive (free training data, no upside) or too restrictive (invisible in answers you wanted to be in). A short opinionated breakdown:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always allow live-retrieval bots.&lt;/strong&gt; &lt;code&gt;ChatGPT-User&lt;/code&gt;, &lt;code&gt;Claude-User&lt;/code&gt;, &lt;code&gt;Perplexity-User&lt;/code&gt;. There is no downside. These fetch your page only when a human is actively asking a question that points at your content. Blocking them is a self-inflicted wound.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Allow training bots only if you want to be in training data.&lt;/strong&gt; &lt;code&gt;GPTBot&lt;/code&gt;, &lt;code&gt;ClaudeBot&lt;/code&gt;, &lt;code&gt;anthropic-ai&lt;/code&gt;. Opting out is a legitimate choice that more sites are making, especially publishers and SaaS companies who'd rather not have their docs used as gradient updates. Your live AI visibility isn't affected either way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Google-Extended&lt;/code&gt; is the complicated one.&lt;/strong&gt; It is a &lt;code&gt;robots.txt&lt;/code&gt; control token rather than a separate crawler: per Google's crawler documentation, it governs whether content Google has already crawled may be used for Gemini model training and grounding. It is separate from &lt;code&gt;Googlebot&lt;/code&gt; (disallowing it has zero effect on your regular Google rankings), and it does not switch off AI Overviews, which are a Search feature. Lots of sites that thought they were opting out of "AI" tank their Gemini visibility and change nothing about how they appear in Google Search. Which is dumb.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Applebot-Extended&lt;/code&gt;, &lt;code&gt;meta-externalagent&lt;/code&gt;, &lt;code&gt;Bytespider&lt;/code&gt; are lower-stakes.&lt;/strong&gt; These bots feed AI surfaces with much smaller market share. Decide on principle, not blast radius.&lt;/p&gt;

&lt;p&gt;The framing that helps: every AI bot is either a customer (live retrieval, sends users back to you) or a vendor (training, builds models that may or may not link back to you). Most sites should let the customers in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 30-second version
&lt;/h2&gt;

&lt;p&gt;Configuring &lt;code&gt;robots.txt&lt;/code&gt; right in 2026 keeps you from being trivially invisible. It doesn't make you visible. The failure mode killing AI visibility for most sites isn't a missing &lt;code&gt;robots.txt&lt;/code&gt; directive. It's an edge-level block they didn't know was on, or a JS-rendered page the crawler can't read. If &lt;code&gt;robots.txt&lt;/code&gt; is the only thing you test, you're checking the layer where almost nothing actually goes wrong.&lt;/p&gt;

&lt;p&gt;Test the live fetch. Run it under every bot UA you care about. And don't just check the status code — read what came back. Your site might be one Cloudflare toggle away from being invisible to half the web in 2027 - and that toggle is already there, waiting.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>seo</category>
      <category>cloudflare</category>
    </item>
  </channel>
</rss>
