hey atlas

Posted on • Originally published at aitoolsinsiderhq.com

I read the robots.txt of 41 top AI tools. 88% block nothing, and the rest mostly block the wrong bots.

Everyone assumes the big AI tools are racing to wall their data off from crawlers. I got curious about whether that's actually true, so I read the live robots.txt of 41 well-known AI and SaaS tools and scored each against the 10 biggest AI crawlers. The result surprised me.

The headline

36 of 41 tools (88%) block no AI crawler at all. GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, the lot, all welcome. Only 5 tools block anything, and GPTBot is the single most-blocked bot, at a whopping 7% of sites.
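To make the percentages concrete, here is a quick arithmetic sketch. The 41 and 36 come from the study; the `pct` helper and the step of solving for the GPTBot site count are my own illustration, and they assume the published figures are rounded to the nearest whole percent.

```python
TOTAL_TOOLS = 41   # sample size from the study
NO_BLOCKS = 36     # tools that block no AI crawler


def pct(count: int, total: int = TOTAL_TOOLS) -> int:
    """Share of the sample, rounded to the nearest whole percent."""
    return round(100 * count / total)


print("block nothing:  ", pct(NO_BLOCKS), "%")
print("block something:", pct(TOTAL_TOOLS - NO_BLOCKS), "%")

# Assumption: the 7% GPTBot figure is a rounded share of the same 41 tools.
# Solve for the whole-number site counts that are consistent with it.
gptbot_candidates = [n for n in range(TOTAL_TOOLS + 1) if pct(n) == 7]
print("site counts that round to 7%:", gptbot_candidates)
```

Under that rounding assumption, only one whole-number count fits the 7% figure, which shows how few sites are behind the "most-blocked" label.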

So the "AI companies are locking everything down" narrative basically doesn't show up in the robots.txt of the tools themselves. For a marketing site, being readable is how you get cited in an AI answer, and a citation is the new top of the funnel. Most of them seem to have figured that out.

The distinction almost everyone gets wrong

There are two kinds of AI crawler, and they are not the same thing:

  • Citation crawlers (OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User, Claude-User) fetch a page so an answer engine can quote and link it. Block these and you vanish from AI search. That's pure lost traffic.
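  • Training crawlers (GPTBot, ClaudeBot, CCBot, Bytespider, Google-Extended) collect pages to train a model. Block these and you lose... nothing in traffic. (Strictly speaking, Google-Extended is a robots.txt control token rather than a separate crawler: Google fetches with its usual user agents and uses the token to decide whether your content can feed its AI models.)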

The smart robots.txt blocks training bots and allows citation bots. In my sample, each citation bot was blocked by at most one site, while the blocks that did exist were aimed mostly at training bots. So collectively, the field is getting it right.
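Here is a minimal sketch of how that citation-vs-training check could be automated with Python's standard `urllib.robotparser`. The bot lists are the ones named above. The `audit` function and the Canva-style sample text are hypothetical, and the sketch only tests the site root (`/`), so a file that blocks a bot on some subpaths only would count as "not blocked" here. It is not necessarily how the study itself scored sites.

```python
from urllib.robotparser import RobotFileParser

# The two categories described above.
CITATION_BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
                 "Perplexity-User", "Claude-User"]
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider",
                 "Google-Extended"]


def audit(robots_txt: str) -> dict[str, list[str]]:
    """Return which citation and training bots are denied the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def blocked(bots: list[str]) -> list[str]:
        return [bot for bot in bots if not parser.can_fetch(bot, "/")]

    return {
        "citation_blocked": blocked(CITATION_BOTS),
        "training_blocked": blocked(TRAINING_BOTS),
    }


# Hypothetical file that blocks only training bots (not a real site's file).
sample = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Bytespider
Disallow: /
"""

result = audit(sample)
print("citation bots blocked:", result["citation_blocked"])
print("training bots blocked:", result["training_blocked"])
```

If `citation_blocked` comes back non-empty, the file is cutting off AI-search visibility, which is the case worth fixing first.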

The two outliers tell the whole story

  • Figma is the strictest, and for AI search the most self-defeating: it blocks six crawlers including the citation bots. Net effect: Figma's own pages can't be surfaced or cited inside ChatGPT or Perplexity answers.
  • Canva blocks four bots, all of them training crawlers (GPTBot, ClaudeBot, CCBot, Bytespider), and leaves the citation bots open. That's the textbook-correct move: deny the free training data, keep the AI-search visibility.

Same instinct ("don't feed the machines for free"), opposite outcome, because one of them knew which bots to target.

A robots.txt that does it right

# Deny free training
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /

# Keep AI-search citations
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
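If you'd rather generate a file like the one above from two lists than hand-edit it, something along these lines should work. It's a rough sketch: `build_robots_txt` and `round_trips` are hypothetical helper names, and the round-trip check leans on Python's standard `urllib.robotparser`, whose matching can differ in edge cases from what each crawler actually implements.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(deny: list[str], allow: list[str]) -> str:
    """Emit one group per bot: Disallow for training, Allow for citation."""
    lines = ["# Deny free training"]
    for bot in deny:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    lines.append("# Keep AI-search citations")
    for bot in allow:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    return "\n".join(lines).rstrip() + "\n"


def round_trips(robots_txt: str, deny: list[str], allow: list[str]) -> bool:
    """Parse the generated text back and confirm each bot gets the intended rule."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    denied_ok = all(not parser.can_fetch(bot, "/") for bot in deny)
    allowed_ok = all(parser.can_fetch(bot, "/") for bot in allow)
    return denied_ok and allowed_ok


# Same bots as the example file above.
deny = ["GPTBot", "CCBot", "Google-Extended"]
allow = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]

text = build_robots_txt(deny, allow)
print(text)
print("round-trip check passed:", round_trips(text, deny, allow))
```

The explicit `Allow: /` groups are technically redundant, since anything not disallowed is allowed by default, but they document the intent and keep the citation bots open if someone later adds a blanket `User-agent: *` block.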

Check your own site

I pulled this together with a free tool I built that does the parsing for you. Paste your robots.txt and it flags all 18 major AI bots, tags each as citation vs training, and spits out a recommended file: AI Crawler & robots.txt Access Checker.

Full study, per-bot table, and the dataset (CC BY 4.0, reuse it freely): Who Blocks ChatGPT, Claude & Perplexity?

What does your robots.txt block? Most people I've shown this to were blocking the wrong half.