DEV Community

hey atlas


I audited my own robots.txt and found Cloudflare was blocking the AI crawlers I wanted

A few days ago I was trying to figure out why my site barely showed up in AI answers. I run an independent AI-tools review site, and getting cited by ChatGPT, Perplexity, and Gemini is basically my whole distribution strategy. So I did the obvious thing and read my own robots.txt. What I found was embarrassing, and it turns out it is really common.

The file in my repo was not the file on my server

My repository had a hand-written robots.txt that explicitly welcomed AI crawlers:

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

Felt good. Very GEO-friendly (GEO being generative engine optimization). The problem: the robots.txt my server actually served to crawlers was a completely different file:

# BEGIN Cloudflare Managed content
User-agent: *
Content-Signal: search=yes,ai-train=no,use=reference
Allow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /
# END Cloudflare Managed Content

None of my carefully written Allow rules reached a single crawler. Cloudflare has a managed "Block AI bots" feature, and when it is on, it replaces your robots.txt with its own managed version. I had been shipping Allow: / to a file nobody was serving.

The lesson that cost me: always curl the live file, never trust the one in your repo.

curl -s https://yourdomain.com/robots.txt
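
If you would rather script the comparison than eyeball it, here is a minimal Python sketch of the same check. It is only an illustration: the function names, the `robots-audit/0.1` user-agent string, and the example path in the usage note are my own placeholders, not part of any tool mentioned in this post.

```python
import difflib
import urllib.request


def fetch_live_robots(domain: str, timeout: float = 10.0) -> str:
    """Download the robots.txt that the server actually serves."""
    url = f"https://{domain}/robots.txt"
    # Assumption: an explicit User-Agent header; the string is arbitrary.
    # A CDN may treat the default Python user agent differently.
    request = urllib.request.Request(url, headers={"User-Agent": "robots-audit/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def robots_diff(repo_text: str, live_text: str) -> list[str]:
    """Return unified-diff lines; an empty list means the two files match."""
    return list(
        difflib.unified_diff(
            repo_text.splitlines(),
            live_text.splitlines(),
            fromfile="repo/robots.txt",
            tofile="live/robots.txt",
            lineterm="",
        )
    )
```

To use it, read your repo copy (for example from a hypothetical `public/robots.txt`), call `fetch_live_robots("yourdomain.com")`, and print the result of `robots_diff(repo_text, live_text)`. Any line starting with `-` exists only in your repo; any line starting with `+` is something the server added.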

The distinction that actually matters: training crawlers vs citation crawlers

Here is the part most "block AI bots" toggles get wrong. There are two very different kinds of AI crawler, and they are not interchangeable:

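  • Training and ingest crawlers: GPTBot, ClaudeBot, CCBot, Google-Extended, Bytespider. These pull your content into a training set or a grounding corpus. (Google-Extended is a robots.txt control token honored by Google's existing crawlers rather than a separate bot, but you manage it the same way.)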
  • Live search and citation crawlers: OAI-SearchBot, ChatGPT-User, PerplexityBot. These fetch a page at answer time so the assistant can link to and quote it.

If you want to keep your content out of training sets but still be cited in AI answers, those are two separate decisions. A blanket "block AI bots" rule collapses them into one and usually kills the citation crawlers along with the training ones. In my case, the search crawlers happened to survive: they were not in Cloudflare's block list, so they fell through to the global Allow: / under User-agent: *. GPTBot and ClaudeBot were not so lucky. Under the Robots Exclusion Protocol (RFC 9309), a crawler obeys the most specific User-agent group that names it and ignores the * group, so the named Disallow: / wins and the global Allow: / never applies to them.
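
To see how a rule-following crawler would read the managed file quoted earlier, here is a small sketch using Python's standard `urllib.robotparser`. The two bot lists come from the bullets above; the `access_report` helper is a name I made up for illustration. One caveat: Python's parser applies the first matching rule inside a group rather than the longest match, which makes no difference for a file this simple but can for more complex ones.

```python
from urllib.robotparser import RobotFileParser

# The managed file from earlier in this post, as served to crawlers.
SERVED = """\
User-agent: *
Content-Signal: search=yes,ai-train=no,use=reference
Allow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /
"""

TRAINING = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "Bytespider"]
CITATION = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]


def access_report(robots_text: str, agents: list[str], path: str = "/") -> dict[str, bool]:
    """Map each user agent to whether it may fetch `path` under robots_text."""
    parser = RobotFileParser()
    # Unknown lines such as Content-Signal are skipped by the parser.
    parser.parse(robots_text.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


training = access_report(SERVED, TRAINING)
citation = access_report(SERVED, CITATION)

for agent, allowed in {**training, **citation}.items():
    print(f"{agent:16} {'allowed' if allowed else 'BLOCKED'}")
```

The named groups block GPTBot, ClaudeBot, CCBot, and Google-Extended, while every agent without its own group, including the three citation crawlers, falls back to the `*` group and is allowed.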

Content-Signals: the precise version of the same idea

You may have noticed this line in the managed file:

Content-Signal: search=yes,ai-train=no,use=reference

Content-Signals are a newer, more surgical control than a blunt Disallow. Instead of "you may not read this at all," you can say "yes to search indexing, no to model training, reference use only." It is the difference between locking the door and putting up a sign that says what the room is for. Like any sign, it is a stated preference rather than an enforcement mechanism, so it only helps with crawlers that choose to honor it. If your goal is "visible but not trained on," this is the tool you want, not a wall.
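
If you want to read those signals programmatically, the line is just comma-separated `key=value` pairs. A minimal sketch, assuming exactly the syntax shown in the managed file above (`parse_content_signal` is a hypothetical helper, not an official parser):

```python
def parse_content_signal(line: str) -> dict[str, str]:
    """Split a 'Content-Signal: a=b,c=d' line into a {key: value} dict."""
    name, _, value = line.partition(":")
    if name.strip().lower() != "content-signal":
        raise ValueError(f"not a Content-Signal line: {line!r}")
    signals: dict[str, str] = {}
    for pair in value.split(","):
        key, sep, val = pair.partition("=")
        if sep:  # ignore fragments that have no '='
            signals[key.strip().lower()] = val.strip().lower()
    return signals


signals = parse_content_signal("Content-Signal: search=yes,ai-train=no,use=reference")
print(signals)
```

From there, a check like `signals.get("ai-train") == "no"` tells you whether the site has asked not to be used for training.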

A 60-second self-audit

  1. curl your live robots.txt (not the repo copy).
  2. grep for Disallow: / sitting under any AI user-agent.
  3. Confirm OAI-SearchBot, PerplexityBot, and ChatGPT-User are allowed. These are the ones that get you cited.
  4. If you are on Cloudflare, check Security > Bots for a managed AI block rule that is quietly overriding your file.
  5. Add an llms.txt so assistants can find your best pages in a clean, parseable format.
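
Step 2 can be scripted as a grep-style scan. The sketch below walks the file group by group and lists every AI user agent that sits above a blanket `Disallow: /`. The agent list is the one used in this post, and the function name is my own. It is a heuristic: it flags a group even if that group also contains an `Allow` rule, so treat a hit as a prompt to go read the file, not as a verdict.

```python
# AI user agents named in this post, lowercased for comparison.
AI_AGENTS = {
    "gptbot", "claudebot", "ccbot", "google-extended", "bytespider",
    "oai-searchbot", "chatgpt-user", "perplexitybot",
}


def blanket_disallowed_agents(robots_text: str) -> list[str]:
    """Return AI user agents whose group contains a literal 'Disallow: /'."""
    blocked: list[str] = []
    group_agents: list[str] = []
    in_rules = False  # True once the current group has an Allow/Disallow line
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                group_agents = []
                in_rules = False
            group_agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                for agent in group_agents:
                    if agent.lower() in AI_AGENTS and agent not in blocked:
                        blocked.append(agent)
    return blocked
```

Feed it the text of your live robots.txt. An empty list means no AI crawler is under a blanket block; for step 3, check that OAI-SearchBot, PerplexityBot, and ChatGPT-User are not in the result.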

I got tired of eyeballing user-agent lists by hand, so I built a free checker that does steps 1 to 4 for any URL: it fetches the live robots.txt, tells you which AI crawlers are allowed or blocked, and flags whether you have an llms.txt. No signup. It lives here: AI Crawler Access Checker.

Why I now care about this more than my meta tags

Search is slowly shifting from "ten blue links" to "one synthesized answer with a few citations." If you are not in the citation set, you are invisible in that world no matter how good your on-page SEO is. And a surprisingly common reason a good page never gets cited is not content quality but a robots.txt conflict the owner never knew was there, often injected by a CDN toggle someone flipped once and forgot.

Go read your live robots.txt today. It takes one curl. You might find, like I did, that you have been politely inviting crawlers through a door that was bolted shut the whole time.


I publish honest, hands-on AI-tool reviews and daily AI news. If that is useful, the daily drop is on Telegram: t.me/aitoolsinsiderhq.
