Mike

Posted on Jun 4 • Originally published at brandswarm.io

robots.txt for AI search: the 2026 cheat sheet (GPTBot, ClaudeBot, and the rest)

#ai #webdev #seo #tutorial

Originally published at brandswarm.io/blog/robots-txt-for-ai-search/.

Your robots.txt is the first place AI crawlers look when they
arrive at your site. Get it wrong and you're invisible to ChatGPT, Claude,
Perplexity, Gemini, and AI Overviews regardless of how good your content,
schema, or backlinks are. Get it right and the cost is zero — it's just a
text file.

This is the cheat sheet. Every AI crawler that matters in 2026, whether
to allow them, and a copy-pasteable robots.txt file you can
drop in today.

The user-agents that matter

User-agent	Operator	What it does	Allow?
`GPTBot`	OpenAI	Crawls content for training future models. Does NOT do real-time retrieval for ChatGPT.	Yes, for presence in future models; it does not drive ChatGPT search citations
`OAI-SearchBot`	OpenAI	Retrieval for ChatGPT search / SearchGPT.	Yes — direct ChatGPT visibility
`ChatGPT-User`	OpenAI	Used when a user invokes ChatGPT's browsing tool. Fetches a single URL.	Yes — required for browsing
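`ClaudeBot`	Anthropic	Anthropic's main crawler; collects content that may be used to train Claude. Anthropic documents separate Claude-SearchBot and Claude-User agents for search and user-initiated fetches.	Yes, for Claude visibility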
`Claude-Web` / `anthropic-ai`	Anthropic	Older / alternate user-agent variants.	Yes — same reason
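`Google-Extended`	Google	A robots.txt control token, not a separate crawler. It governs whether content fetched by Google's existing crawlers is used for Gemini training. Separate from Googlebot.	Optional: yes if you want training inclusion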
`Googlebot`	Google	Powers regular Google search + AI Overviews. Do not block.	Always yes
`PerplexityBot`	Perplexity	Retrieval for Perplexity answers.	Yes
`Perplexity-User`	Perplexity	Fetches single URLs when users follow Perplexity links.	Yes
`Bytespider`	ByteDance	Crawls for Doubao / TikTok AI features.	Yes if you have TikTok/APAC audience
`Amazonbot`	Amazon	Powers Alexa / Q / Amazon AI features.	Optional
`Applebot-Extended`	Apple	A control token for whether content fetched by Applebot is used for Apple Intelligence training; it does not crawl on its own. Separate from Applebot (search).	Optional
`Applebot`	Apple	Powers Spotlight + Siri suggestions.	Always yes
`meta-externalagent`	Meta	Crawls for Meta AI training.	Optional
`CCBot`	Common Crawl	Open-source crawl used as training data by many models.	Optional — wide influence
`Bingbot`	Microsoft	Regular Bing search + ChatGPT browsing tool retrieval.	Always yes
`DuckAssistBot`	DuckDuckGo	Powers DuckAssist, DuckDuckGo's AI answers feature.	Yes
`Diffbot` / `BrandBot` / etc.	Various	Niche crawlers used by enterprise AI tools.	Optional — minor traffic

Quick decision: 3 policies that cover 95% of cases

Policy A: maximum AI visibility (recommended for SaaS, content brands, B2B)

# Maximum AI visibility. Allows training + retrieval for all major engines,
# while keeping private/auth surfaces off-limits to every crawler.
# One wildcard group only; Disallow lines come first so parsers that
# stop at the first matching rule still see them.
User-agent: *
Disallow: /admin/
Disallow: /app/
Disallow: /billing/
Disallow: /accounts/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

This is the right policy if your business benefits from being discovered
in AI answers. Almost every SaaS, B2B company, and brand that sells anything
falls into this category. The wildcard User-agent: * applies to
every crawler that has no more specific group of its own, which here
includes all the AI ones.
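One way to sanity-check a policy like this before you deploy it is to run it through Python's standard-library robots.txt parser. This is a minimal sketch, not what any real crawler runs. The parser stops at the first rule that matches instead of picking the longest match, so the policy text below is an assumed single-group form of Policy A with the Disallow lines listed before Allow: /. The agent list and the /pricing and /admin/users paths are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Assumed single-group form of Policy A (Disallow lines before Allow: /).
POLICY_A = """\
User-agent: *
Disallow: /admin/
Disallow: /app/
Disallow: /billing/
Disallow: /accounts/
Allow: /
"""

# A few of the user-agent tokens from the table; extend as needed.
AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot"]


def check_access(robots_txt, agents, url):
    """Return {agent: allowed} for one URL, as Python's parser sees it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Hypothetical URLs: one public page, one page under a blocked prefix.
public = check_access(POLICY_A, AI_AGENTS, "https://yourdomain.com/pricing")
private = check_access(POLICY_A, AI_AGENTS, "https://yourdomain.com/admin/users")

for agent in AI_AGENTS:
    print(f"{agent:15} public={public[agent]} private={private[agent]}")
```

Treat the result as a smoke test. Each crawler has its own parser, and some resolve Allow/Disallow conflicts differently.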

Policy B: allow AI retrieval, block AI training (the Content-Signal compromise)

# Allow real-time retrieval (so AI can cite you when users ask)
# but signal that content should not be used for model training.
# Keep the group contiguous: some parsers treat a blank line as the end
# of the group and would ignore Disallow lines that follow it.
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Disallow: /admin/
Disallow: /app/
Disallow: /billing/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

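Use this if you want to be discoverable in ChatGPT/Perplexity/Gemini answers
but you don't want your content baked into next year's model training data.
Content-Signal is a robots.txt directive (not an HTTP header) that Cloudflare
proposed in 2025. It states a preference rather than enforcing one, so check
each operator's documentation before assuming it is honored. If you want
citations without training use, it's the closest thing to a middle ground.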
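To confirm that the Content-Signal line you shipped says what you meant, you can read it back out of the file. The sketch below is a simplified reader written for this post, not an official parser. It assumes the comma-separated name=yes/no syntax shown in Policy B and ignores which User-agent group the line belongs to.

```python
def parse_content_signals(robots_txt):
    """Collect Content-Signal preferences as {signal_name: bool}.

    Simplification: signals from every group are merged into one dict.
    """
    signals = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "content-signal":
            continue
        for pair in value.split(","):
            name, eq, setting = pair.partition("=")
            if eq:
                signals[name.strip().lower()] = setting.strip().lower() == "yes"
    return signals


POLICY_B = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
"""

signals = parse_content_signals(POLICY_B)
print(signals)
```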

Policy C: block everything (only for sites that genuinely don't want AI visibility)

# Block all AI crawlers explicitly. Allow Googlebot/Bingbot for traditional search.
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: CCBot
Disallow: /

# Allow traditional search
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Use this only if you have a strong reason — premium paid content, news org
with monetization concerns, sensitive material. Be aware: blocking AI
retrieval means your brand will not be cited when users ask AI assistants
about your category. For most businesses, this is a strategic mistake.
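A block list this long is easy to get wrong by hand, so one option is to generate it from a list of tokens. This is a rough sketch. The shortened agent list and the build_block_policy helper are assumptions for illustration, and the check at the end only reflects Python's standard-library parser, not any crawler's real behavior.

```python
from urllib.robotparser import RobotFileParser

# Shortened, illustrative subset of the tokens listed in Policy C.
BLOCKED_AGENTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User",
                  "ClaudeBot", "PerplexityBot", "CCBot"]


def build_block_policy(blocked_agents, sitemap_url):
    """Emit one 'Disallow: /' group per agent, separated by blank lines."""
    lines = []
    for agent in blocked_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


policy_c = build_block_policy(BLOCKED_AGENTS, "https://yourdomain.com/sitemap.xml")

# Read the generated file back and confirm who is blocked.
parser = RobotFileParser()
parser.parse(policy_c.splitlines())
blocked = [a for a in BLOCKED_AGENTS
           if not parser.can_fetch(a, "https://yourdomain.com/")]
googlebot_ok = parser.can_fetch("Googlebot", "https://yourdomain.com/")

print(policy_c)
print("blocked:", blocked, "| Googlebot allowed:", googlebot_ok)
```

Crawlers with no group of their own, such as Googlebot here, fall through to the default, which is "allowed" when the file has no wildcard group.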

The Cloudflare gotcha

If your site is behind Cloudflare, there's a feature called "AI Crawl
Control → Managed robots.txt" that injects a Policy-C-style block
into your robots.txt on the wire, regardless of what your origin
serves. The toggle is on by default for many zones. Many brands are blocking
every AI crawler without knowing.

To check: curl https://yourdomain.com/robots.txt. If you see a
block titled "# BEGIN Cloudflare Managed content", you're
affected. Turn the toggle off in Cloudflare → AI Crawl Control → Managed
robots.txt. We wrote up the full story
here.
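If you'd rather script that check than eyeball the curl output, something like the following could work. It is a sketch under assumptions: the marker string is the one quoted above, the sample file is invented for the demo and is not a copy of what Cloudflare serves, and fully_blocked_agents is a hypothetical helper that only looks for sitewide "Disallow: /" rules.

```python
MANAGED_MARKER = "# BEGIN Cloudflare Managed content"


def has_managed_block(robots_txt):
    """True if the served file contains the managed-content marker."""
    return MANAGED_MARKER.lower() in robots_txt.lower()


def fully_blocked_agents(robots_txt):
    """List user-agent tokens whose group contains 'Disallow: /'."""
    blocked, current, seen_rule = [], [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if not sep:
            continue
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if seen_rule:  # a rule line closed the previous group
                current, seen_rule = [], False
            current.append(value)
        elif field in ("allow", "disallow"):
            seen_rule = True
            if field == "disallow" and value == "/":
                blocked.extend(a for a in current if a not in blocked)
    return blocked


# Invented sample of a served file; real injected content will differ.
SERVED = """\
# BEGIN Cloudflare Managed content
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

print(has_managed_block(SERVED), fully_blocked_agents(SERVED))
```

Run it against the body you get from the public URL, not the file on your origin, since the whole point is to see what the CDN actually serves.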

Validating your `robots.txt`

Three quick checks:

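1. Use Search Console. Google retired its standalone robots.txt tester; the robots.txt report (under Settings) shows the file Google fetched and any parse problems, and the URL Inspection tool tells you whether a specific page is blocked by robots.txt.
2. Curl with each crawler's user-agent and inspect the response. The file should be identical for everyone, so what you're checking is that your server or CDN returns it (HTTP 200, same body) instead of blocking or altering it for that user-agent:

   curl -A "GPTBot" https://yourdomain.com/robots.txt
   curl -A "ClaudeBot" https://yourdomain.com/robots.txt

3. Watch Bing Webmaster Tools' Crawl Errors. Bing reports robots.txt-blocked URLs there. Other engines don't surface this as cleanly.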
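The curl check can be looped over every user-agent you care about. Below is a minimal sketch using only the standard library. The function names, the baseline user-agent string, and the report shape are assumptions made for this example. Real crawlers send longer user-agent strings and some CDNs also verify source IPs, so a clean result is a good sign but not proof.

```python
import urllib.error
import urllib.request


def fetch_robots(url, user_agent, timeout=10):
    """Fetch url with the given User-Agent; return (status, body)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, body
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses land here, e.g. a firewall rule returning 403.
        return error.code, ""


def probe(url, agents, fetch=fetch_robots, baseline_agent="curl/8.0"):
    """Compare what each agent receives with a baseline fetch."""
    _, baseline_body = fetch(url, baseline_agent)
    report = {}
    for agent in agents:
        status, body = fetch(url, agent)
        report[agent] = {
            "status": status,
            "matches_baseline": body == baseline_body,
        }
    return report
```

Calling probe("https://yourdomain.com/robots.txt", ["GPTBot", "ClaudeBot"]) returns one entry per agent. Anything other than status 200 with matches_baseline set to True is worth a closer look.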

Three rules

1. A specific user-agent group replaces the wildcard for that crawler. If you have User-agent: * Allow: / and, anywhere else in the file, User-agent: GPTBot Disallow: /, GPTBot is blocked. The wildcard is a fallback for crawlers with no group of their own; a crawler that matches a specific group ignores the wildcard group entirely, including its Disallow lines.
2. One User-agent block per crawler. Some sites repeat User-agent: GPTBot with different rules in different blocks. RFC 9309 tells crawlers to merge such groups, but not every parser does, and some honor only the first. Consolidate.
3. Don't block Googlebot when you mean Google-Extended. These are different tokens. Googlebot powers Search + AI Overviews. Google-Extended controls whether your content is used for Gemini training. Blocking Googlebot tanks your traditional Google traffic.
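The "one block per crawler" rule is easy to check mechanically. The sketch below flags any token that appears on more than one User-agent line, then shows why it matters: Python's standard-library parser uses only the first matching group, so the second GPTBot group is silently ignored. The repeated_agents helper and the sample file are assumptions for illustration.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser


def repeated_agents(robots_txt):
    """Return user-agent tokens listed on more than one User-agent line."""
    tokens = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            tokens.append(value.strip().lower())
    return sorted(t for t, count in Counter(tokens).items() if count > 1)


# Illustrative file with GPTBot split across two groups.
SPLIT_GROUPS = """\
User-agent: GPTBot
Disallow: /private/

User-agent: GPTBot
Disallow: /
"""

duplicates = repeated_agents(SPLIT_GROUPS)

parser = RobotFileParser()
parser.parse(SPLIT_GROUPS.splitlines())
# Only the first GPTBot group is consulted here, so /blog stays allowed.
blog_allowed = parser.can_fetch("GPTBot", "https://yourdomain.com/blog")

print("repeated:", duplicates, "| /blog allowed for GPTBot:", blog_allowed)
```

A parser that merges the two groups would block /blog instead. That disagreement is the reason to consolidate.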

FAQ

I want to be in ChatGPT but not in Claude. Can I?

Yes. Allow GPTBot, OAI-SearchBot, and
ChatGPT-User; disallow ClaudeBot,
Claude-Web, and anthropic-ai, plus Claude-SearchBot and Claude-User,
which Anthropic documents for search and user-initiated fetches.
Practical impact is modest because most brands want presence everywhere
AI assistants exist, but the option is there.
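As a rough illustration, the split could look like the file below, checked here with Python's standard-library parser. The file is a sketch that uses the three Anthropic tokens from the table; add any other Anthropic tokens you want covered to the same group. The result shows how one parser reads the file, not how Anthropic or OpenAI will behave.

```python
from urllib.robotparser import RobotFileParser

# Sketch: one group that blocks the Anthropic tokens, wildcard for the rest.
CHATGPT_NOT_CLAUDE = """\
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: anthropic-ai
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(CHATGPT_NOT_CLAUDE.splitlines())

HOME = "https://yourdomain.com/"
openai_access = {agent: parser.can_fetch(agent, HOME)
                 for agent in ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]}
anthropic_access = {agent: parser.can_fetch(agent, HOME)
                    for agent in ["ClaudeBot", "Claude-Web", "anthropic-ai"]}

print(openai_access)
print(anthropic_access)
```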

What about `noai` and `noimageai` meta tags?

These are page-level signals rather than robots.txt rules. They ask crawlers
not to use the page's content for AI training. They are non-standard and
not widely honored; useful as defense-in-depth on
pages where you really care.
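To audit which of your pages actually carry these tags, you can read the robots meta directives with the standard library's HTML parser. This is a minimal sketch. It assumes the tags are written as a meta element with name="robots" and a comma-separated content attribute, and the sample page is made up.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        self.directives.update(
            token.strip().lower() for token in content.split(",") if token.strip()
        )


# Made-up page for the demo.
PAGE = """\
<html><head>
<title>Pricing</title>
<meta name="robots" content="noai, noimageai">
</head><body>Hello</body></html>
"""

meta_parser = RobotsMetaParser()
meta_parser.feed(PAGE)
print(sorted(meta_parser.directives))
```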

What about `llms.txt`?

A proposed standard for "here's a curated text version of my content for
LLMs to ingest cleanly." Adoption is uneven; OpenAI and Anthropic both
said publicly in 2025 that they prefer to crawl normally. Worth shipping
if it's easy to generate, but don't rely on it as your primary AI-visibility
strategy.

Do I need to also add an `X-Robots-Tag` HTTP header?

Only if you want per-page granularity that robots.txt can't express (e.g.,
"noindex this specific PDF without listing it"). For broad AI-visibility
policy, robots.txt is sufficient.

Bottom line

Most brands win by shipping Policy A. Some by shipping Policy B. Very few
should ship Policy C. Whichever you choose, do it deliberately — and
re-check after every CDN configuration change. The most common reason
brands lose AI visibility isn't a strategy decision; it's a CDN feature
that flipped a switch they didn't notice.
