Mike

Posted on Jun 10 • Originally published at brandswarm.io

The Cloudflare toggle that's blocking your brand from ChatGPT

#ai #seo #webdev #cloudflare


We build Brandswarm, a SaaS that tracks how AI assistants (ChatGPT, Claude, Perplexity, Gemini, AI Overviews) describe your brand. The product is supposed to be the canonical place a marketing team goes when it wants to be visible to AI search engines.

An hour ago we discovered, while auditing our own setup, that we were ourselves invisible to ChatGPT. Cloudflare — our CDN — was silently blocking GPTBot, ClaudeBot, Google-Extended, PerplexityBot, Bytespider, and seven others from crawling our site. The control that did it is a single toggle in a dashboard most users have never opened.

We flipped it off, and our robots.txt went from 70 lines (most of them Disallow: / against AI crawlers) to 15 lines. Now AI engines can do what we built the entire company to help them do.

The reason we're writing this, instead of just quietly fixing it and moving on, is that we don't think we're an isolated case. Cloudflare sits in front of roughly 20% of the web. The toggle in question is on by default for a large fraction of Cloudflare zones, the dashboard for it is hidden three levels deep in the navigation, and most teams have no idea it's there. If you're a marketing leader asking why ChatGPT and Perplexity don't seem to know your brand exists, this is the first thing to rule out, and there's a good chance it's the answer.

The one-line check

Before you do anything else, run this:

curl https://yourbrand.com/robots.txt

If the output contains a block like the one below, with Disallow: / rules under AI user-agents, your site is blocking AI crawlers:

# BEGIN Cloudflare Managed content

User-agent: *
Content-Signal: search=yes,ai-train=no
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# ...and so on for ~10 other AI user-agents

The Content-Signal: search=yes,ai-train=no line on its own is the polite "don't train on us" signal — most teams are happy with that. But the User-agent: GPTBot and User-agent: ClaudeBot blocks underneath go further: they tell the crawlers not to fetch the page at all. That kills retrieval-layer inclusion, which is the entire basis of being mentioned by ChatGPT and Claude when someone asks about your category.
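If you'd rather script the check than eyeball the curl output, here is a minimal Python sketch using the standard library's urllib.robotparser. It is only an illustration: the sample text is modeled on the managed block shown above (truncated to three agents), the agent list comes from this post, and the blocked_agents helper and the yourbrand.com URL are placeholders of ours, not part of any Cloudflare tooling.

```python
from urllib.robotparser import RobotFileParser

# Sample modeled on the managed block shown above (truncated to three agents).
SAMPLE_ROBOTS_TXT = """\
# BEGIN Cloudflare Managed content

User-agent: *
Content-Signal: search=yes,ai-train=no
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

# Agent names taken from this post; extend the tuple as needed.
AI_AGENTS = ("GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot")


def blocked_agents(robots_txt, agents=AI_AGENTS, url="https://yourbrand.com/"):
    """Return the agents that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    # The parser skips directives it does not know, such as Content-Signal.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


print(blocked_agents(SAMPLE_ROBOTS_TXT))
```

An empty list means none of the listed agents is disallowed at the root. PerplexityBot is not in the truncated sample, so it falls through to the User-agent: * rule and is reported as allowed here.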

What's happening, technically

In 2024 Cloudflare launched a feature called AI Audit (sometimes called "AI Crawl Control" in the current UI). It included a toggle called "Managed robots.txt". The toggle's own description, verbatim:

When enabled, Cloudflare creates or updates your robots.txt file to signal that your content should not be used for AI training.

That's a defensible product. Plenty of site owners have good reason to opt out of being training data for someone else's commercial model. The problem is twofold:

For many zones, the toggle defaults to on. New customers inherit it without knowing.
The toggle conflates two very different things — training inclusion (which a brand might reasonably opt out of) and real-time retrieval inclusion (which is how ChatGPT browsing, Perplexity, AI Overviews, and Gemini decide what to mention in their answers). Blocking GPTBot blocks both. Most marketers who opt out of training don't realize they're also opting out of being visible.

Cloudflare is signalling, on your behalf and over the wire, that you don't want to be part of AI search results. If you sell anything, you almost certainly don't want that. But unless you've been to a specific dashboard page, you have no idea this is happening.

The fix

The fix is one toggle:

Log in to the Cloudflare dashboard.
Select the zone for your domain.
Left sidebar → AI Crawl Control (some accounts: "AI Audit").
Find the "Managed robots.txt" card — there's a blue toggle in the top-right.
Switch it off.
Wait ~30 seconds for the change to propagate.
Re-curl your robots.txt to confirm. It should go from ~70 lines back to whatever your application actually serves — usually 10–20 lines.

That's it. From this moment, AI crawlers can read your site.
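To make the confirmation step repeatable, something like the following Python sketch could stand in for re-running curl by hand. It assumes the managed section is always introduced by the "BEGIN Cloudflare Managed content" comment shown earlier; fetch_robots and summarize_robots are hypothetical helper names, and the default Python user-agent may itself be challenged by some bot-protection setups.

```python
from urllib.request import urlopen

# Marker comment as it appears in the managed block quoted earlier in this post.
MANAGED_MARKER = "BEGIN Cloudflare Managed content"


def fetch_robots(domain):
    """Download robots.txt for a domain (network call, not run automatically)."""
    with urlopen(f"https://{domain}/robots.txt", timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def summarize_robots(robots_txt):
    """Return (non-empty line count, whether the managed marker is present)."""
    lines = [line for line in robots_txt.splitlines() if line.strip()]
    has_managed = any(MANAGED_MARKER.lower() in line.lower() for line in lines)
    return len(lines), has_managed


# Example usage (requires network access):
# print(summarize_robots(fetch_robots("yourbrand.com")))
```

After the toggle is off, you would expect the second value to be False and the line count to drop to whatever your application serves.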

What we expect to happen next on our own site

Since we just flipped it, we don't have a "before/after" graph yet. What we expect, based on what we've seen with customers:

Week 1–2: Bing's crawler (which feeds ChatGPT's browsing tool) starts re-indexing pages it had given up on. AI Overviews start seeing the site as a legitimate source.
Week 3–4: Mentions in Perplexity for category-relevant queries begin to appear. Citation rate climbs.
Month 2–3: ChatGPT and Claude begin synthesizing the site into answers when the underlying content is retrieval-worthy (good structured data, real category-positioning content, third-party citations).

We'll publish the actual before/after numbers from our own scans in a follow-up. If you're affected by this and you fix it, we'd love to see your data too.

The bigger pattern

The CDN configuration is one of three places we routinely find AI-visibility problems hiding:

CDN-injected robots.txt rules (this post). Cloudflare's is the most prevalent because of Cloudflare's market share, but other CDNs have similar features.
Bing exclusion. ChatGPT's web-browsing tool runs through Bing, and a surprising number of brands with strong Google rankings are completely absent from Bing's index. Verify in Bing Webmaster Tools.
Structured-data gaps. AI retrieval layers heavily weight pages with proper Schema.org JSON-LD (Organization, Product, FAQ, HowTo). Sites that rely entirely on rendered HTML without structured data get retrieved less.
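For the structured-data item, a hedged sketch of what a minimal Organization block might look like when generated from Python. The organization_jsonld helper is hypothetical and every field value is a placeholder; which Schema.org properties matter for a given site is something to verify against the Schema.org and search-engine documentation.

```python
import json


def organization_jsonld(name, url, logo=None, same_as=()):
    """Build a <script> tag holding Schema.org Organization JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
    }
    if logo:
        data["logo"] = logo
    if same_as:
        # sameAs lists other profiles that refer to the same organization.
        data["sameAs"] = list(same_as)
    body = json.dumps(data, indent=2)
    # Escape "</" so a stray "</script>" in a value cannot end the tag early.
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Placeholder values for illustration only.
print(organization_jsonld("Your Brand", "https://yourbrand.com"))
```

The resulting tag would go in the page's head; Product, FAQ, and HowTo types follow the same pattern with different properties.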

We wrote the longer playbook on this in How to appear in ChatGPT answers (the full 2026 playbook). Fixing your CDN is step one in that playbook for a reason — it's the highest-impact and the easiest.

Why we found this on ourselves

The reason we caught this on Brandswarm is that we ran a real AI-visibility scan against our own domain before our public launch. The scan came back with very low visibility scores, which we initially attributed to the site being too young. When we dug in, we noticed our own marketing tool was reporting that GPTBot couldn't reach us. Pulled robots.txt — there it was.

The mildly embarrassing version of this story is that a SaaS built specifically to help brands appear in AI search was, for its entire pre-launch window, blocked from appearing in AI search. The useful version is that if it happened to us — a team that thinks about this category every day — it's almost certainly happening to your team too.

One more place to check

While you're in the Cloudflare dashboard, also check:

Security → Bots → AI Audit → Block AI Bots. This is a separate toggle that blocks AI bots at the firewall layer (not just via robots.txt). If it's on, turn it off. The dashboard will show "Unsuccessful requests" counters if this is the active block.
Security → WAF → Managed Rules. Look for any rule with "AI" or "bot" in the name. Disable any that are blocking AI user-agents.
Cache Rules. Some teams add a rule that bypasses the cache for AI crawlers. That is fine in itself, but every bypassed request goes to your origin, and if the origin is slow, response times can suffer enough that crawlers give up.
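A firewall-layer block does not show up in robots.txt, so one rough way to probe for it is to request the same page with a browser-like user-agent and with a crawler-like one, then compare status codes. The Python sketch below is only suggestive: both user-agent strings are illustrative test values (not the exact strings the vendors publish), fetch_status and classify are hypothetical helpers, and bot protection may treat a spoofed crawler user-agent from an unverified IP differently from the real crawler.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Illustrative test strings, not the vendors' exact user-agents.
BROWSER_UA = "Mozilla/5.0 (compatible; visibility-check)"
CRAWLER_UA = "Mozilla/5.0 (compatible; GPTBot/1.0; illustrative test string)"


def fetch_status(url, user_agent):
    """Return the HTTP status for url, or None if the request never completed."""
    request = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(request, timeout=10) as response:
            return response.status
    except HTTPError as err:  # 4xx/5xx responses arrive as exceptions
        return err.code
    except URLError:
        return None


def classify(browser_status, bot_status):
    """Turn the two status codes into a rough, non-authoritative verdict."""
    if browser_status is None or bot_status is None:
        return "inconclusive"
    if browser_status >= 400:
        return "site error for everyone"
    if bot_status >= 400:
        return "bot-specific block likely"
    return "no user-agent block observed"


# Example usage (requires network access):
# url = "https://yourbrand.com/"
# print(classify(fetch_status(url, BROWSER_UA), fetch_status(url, CRAWLER_UA)))
```

Treat a "bot-specific block likely" verdict as a prompt to look at the dashboard settings listed above, not as proof.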

If you're not on Cloudflare

Other CDNs have similar features. AWS CloudFront has a "Block AI bots" managed rule. Akamai has "AI Bot Manager." Fastly has bot-policy rules that can include AI crawlers. Vercel and Netlify allow domain-level robots.txt overrides; check that yours doesn't disallow GPTBot, ClaudeBot, Google-Extended, and PerplexityBot.

FAQ

I do actually want to opt out of being training data. What's the right way?

Use the Content-Signal mechanism (which Cloudflare supports cleanly) rather than the full Disallow: / block. The Content Signal taxonomy distinguishes three uses: search (indexing for retrieval), ai-input (real-time RAG / retrieval into AI answers), and ai-train (training data). If you only want to block training, set search=yes, ai-input=yes, ai-train=no. Well-behaved crawlers honor it. You stay visible in AI search while opting out of being baked into next year's model.
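As a sketch of that setting, the following Python helper renders a robots.txt group that allows crawling while opting out of training. The function names are hypothetical, and the comma-separated, no-space value format is copied from the managed block quoted earlier in this post; check the Content Signals documentation for the exact syntax it currently accepts.

```python
def yes_no(flag):
    """Render a boolean as the yes/no token used in Content-Signal values."""
    return "yes" if flag else "no"


def content_signal_robots(search=True, ai_input=True, ai_train=False):
    """Build a robots.txt group that allows crawling and declares usage signals."""
    signal = ",".join(
        [
            f"search={yes_no(search)}",
            f"ai-input={yes_no(ai_input)}",
            f"ai-train={yes_no(ai_train)}",
        ]
    )
    lines = [
        "User-agent: *",
        f"Content-Signal: {signal}",
        "Allow: /",  # no Disallow rules, so crawlers can still fetch pages
    ]
    return "\n".join(lines) + "\n"


print(content_signal_robots())
```

The defaults correspond to the combination described above: visible to search and retrieval, opted out of training.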

How do I know if a crawler honors Content-Signal vs. robots.txt?

The major ones (OpenAI, Anthropic, Google, Perplexity, Microsoft, ByteDance) all honor robots.txt strictly. Content-Signal is newer (2025), and adoption is increasing but not yet universal. For now, the conservative approach is to use robots.txt to allow the crawlers and Content-Signal to restrict the use.

Will this hurt my Google rankings?

No. None of the AI crawlers we discussed are Googlebot (Google's regular search crawler). They're a separate set of bots specifically for AI features. Unblocking AI crawlers doesn't affect classical search at all.

What does Brandswarm do here?

We run a scan of your domain across all five major AI surfaces — ChatGPT, Claude, Perplexity, Gemini, AI Overviews — and tell you exactly where you're visible, where you're not, and what's likely causing the gap. The free scan doesn't require a credit card and runs in 60 seconds. If the cause is something like the Cloudflare block above, we'll flag it.

Bottom line

If your robots.txt has a "Cloudflare Managed content" block with Disallow: / rules for AI crawlers, you're invisible to AI search engines. The fix is one toggle. If you've been blocked for any length of time, visibility takes weeks to recover, but it does recover. Go check yours.
