
Mike

Posted on • Originally published at brandswarm.io

robots.txt for AI search: the 2026 cheat sheet (GPTBot, ClaudeBot, and the rest)

Originally published at brandswarm.io/blog/robots-txt-for-ai-search/.

Your robots.txt is the first place AI crawlers look when they
arrive at your site. Get it wrong and you're invisible to ChatGPT, Claude,
Perplexity, Gemini, and AI Overviews regardless of how good your content,
schema, or backlinks are. Get it right and the cost is zero — it's just a
text file.

This is the cheat sheet. Every AI crawler that matters in 2026, whether
to allow them, and a copy-pasteable robots.txt file you can
drop in today.

The user-agents that matter

| User-agent | Operator | What it does | Allow? |
|---|---|---|---|
| GPTBot | OpenAI | Trains future models. Does NOT do real-time retrieval for ChatGPT. | Yes: visibility in future models via training, not live retrieval |
| OAI-SearchBot | OpenAI | Retrieval for ChatGPT search / SearchGPT. | Yes: direct ChatGPT visibility |
| ChatGPT-User | OpenAI | Used when a user invokes ChatGPT's browsing tool. Fetches a single URL. | Yes: required for browsing |
| ClaudeBot | Anthropic | Crawl for Claude (training + retrieval). | Yes: direct Claude visibility |
| Claude-Web / anthropic-ai | Anthropic | Older / alternate user-agent variants. | Yes: same reason |
| Google-Extended | Google | Crawls for Gemini training. Separate from Googlebot. | Optional: yes if you want training inclusion |
| Googlebot | Google | Powers regular Google search + AI Overviews. Do not block. | Always yes |
| PerplexityBot | Perplexity | Retrieval for Perplexity answers. | Yes |
| Perplexity-User | Perplexity | Fetches single URLs when users follow Perplexity links. | Yes |
| Bytespider | ByteDance | Crawls for Doubao / TikTok AI features. | Yes if you have a TikTok/APAC audience |
| Amazonbot | Amazon | Powers Alexa / Q / Amazon AI features. | Optional |
| Applebot-Extended | Apple | Crawls for Apple Intelligence training. Separate from Applebot (search). | Optional |
| Applebot | Apple | Powers Spotlight + Siri suggestions. | Always yes |
| meta-externalagent | Meta | Crawls for Meta AI training. | Optional |
| CCBot | Common Crawl | Open-source crawl used as training data by many models. | Optional: wide influence |
| Bingbot | Microsoft | Regular Bing search + ChatGPT browsing tool retrieval. | Always yes |
| DuckAssistBot | DuckDuckGo | Powers DuckDuckGo's AI Assist. | Yes |
| Diffbot / BrandBot / etc. | Various | Niche crawlers used by enterprise AI tools. | Optional: minor traffic |

Quick decision: 3 policies that cover 95% of cases

Policy A: maximum AI visibility (recommended for SaaS, content brands, B2B)

# Maximum AI visibility. Allows training + retrieval for all major engines.
# Private/auth surfaces stay blocked for every crawler.
# Everything lives in ONE User-agent: * group; some parsers ignore a second one.
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /app/
Disallow: /billing/
Disallow: /accounts/

Sitemap: https://yourdomain.com/sitemap.xml

This is the right policy if your business benefits from being discovered
in AI answers. Almost every SaaS, B2B company, and brand that sells anything
falls into this category. The wildcard User-agent: * applies to every
crawler that has no more specific group of its own, which under this
policy is all of them, including the AI ones.
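
To see how a crawler weighs Allow: / against the Disallow lines, here is a minimal Python sketch of the longest-match rule from RFC 9309. It assumes all Policy A rules sit in a single User-agent: * group and it ignores path wildcards (* and $). The names POLICY_A_RULES and is_allowed are made up for this example.

```python
# Sketch of RFC 9309 path matching: the longest matching rule wins,
# and Allow wins a tie. Path wildcards (* and $) are not handled here.

POLICY_A_RULES = [
    ("allow", "/"),
    ("disallow", "/admin/"),
    ("disallow", "/app/"),
    ("disallow", "/billing/"),
    ("disallow", "/accounts/"),
]


def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Return True if `path` may be fetched under `rules` (hypothetical helper)."""
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for kind, prefix in rules:
        # An empty rule value matches nothing; skip it.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = kind == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


print(is_allowed("/pricing", POLICY_A_RULES))      # True
print(is_allowed("/admin/users", POLICY_A_RULES))  # False
```

Not every parser implements longest-match. Some apply the first rule that matches, in file order, so it is worth testing against the crawlers you care about.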

Policy B: allow AI retrieval, block AI training (the Content-Signal compromise)

# Allow real-time retrieval (so AI can cite you when users ask)
# but signal that content should not be used for model training.
# No blank line inside the group: some parsers end a group at a blank line.
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
Disallow: /admin/
Disallow: /app/
Disallow: /billing/

Sitemap: https://yourdomain.com/sitemap.xml

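Use this if you want to be discoverable in ChatGPT/Perplexity/Gemini answers
but you don't want your content baked into next year's model training data.
Content-Signal is a robots.txt directive, not an HTTP header. Cloudflare
introduced it in 2025 as part of its Content Signals Policy. It states a
preference and does not technically block anything, and crawler operators
differ in whether they honor it, so check each operator's current
documentation before relying on it. It's a reasonable middle ground.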
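
If you want to confirm what your deployed file actually declares, a small Python sketch like the one below can read the Content-Signal line back out. It assumes the comma-separated key=yes/no syntax shown in Policy B. The function names are hypothetical, and the result only tells you what the file says, not whether any crawler honors it.

```python
# Sketch: extract Content-Signal values from robots.txt text.
# Assumes the "key=yes|no, key=yes|no" syntax used in Policy B.

POLICY_B = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml
"""


def parse_content_signal(value: str) -> dict[str, bool]:
    """Turn 'search=yes, ai-train=no' into {'search': True, 'ai-train': False}."""
    signals: dict[str, bool] = {}
    for part in value.split(","):
        key, sep, raw = part.partition("=")
        if sep:  # ignore fragments without '='
            signals[key.strip().lower()] = raw.strip().lower() == "yes"
    return signals


def content_signals(robots_txt: str) -> list[dict[str, bool]]:
    """Return one parsed dict per Content-Signal line found in the file."""
    found = []
    for line in robots_txt.splitlines():
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "content-signal":
            found.append(parse_content_signal(value))
    return found


print(content_signals(POLICY_B))
```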

Policy C: block everything (only for sites that genuinely don't want AI visibility)

# Block all AI crawlers explicitly. Allow Googlebot/Bingbot for traditional search.
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: CCBot
Disallow: /

# Allow traditional search
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Use this only if you have a strong reason — premium paid content, news org
with monetization concerns, sensitive material. Be aware: blocking AI
retrieval means your brand will not be cited when users ask AI assistants
about your category. For most businesses, this is a strategic mistake.
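
A block list this long is easier to maintain when it is generated. The Python sketch below builds the same one-group-per-crawler layout from a list of user-agent tokens taken from Policy C. The function name build_block_list and the sitemap URL are placeholders for illustration.

```python
# Sketch: generate a Policy-C-style block list from a list of crawler tokens.

AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-Web", "anthropic-ai",
    "Google-Extended", "PerplexityBot", "Perplexity-User",
    "Bytespider", "Amazonbot", "Applebot-Extended",
    "meta-externalagent", "CCBot",
]
SEARCH_CRAWLERS = ["Googlebot", "Bingbot"]


def build_block_list(blocked: list[str], allowed: list[str], sitemap_url: str) -> str:
    """Return robots.txt text with one group per crawler (hypothetical helper)."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked]
    groups += [f"User-agent: {agent}\nAllow: /" for agent in allowed]
    groups.append(f"Sitemap: {sitemap_url}")
    # Blank lines between groups keep the file readable.
    return "\n\n".join(groups) + "\n"


print(build_block_list(AI_CRAWLERS, SEARCH_CRAWLERS,
                       "https://yourdomain.com/sitemap.xml"))
```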

The Cloudflare gotcha

If your site is behind Cloudflare, there's a feature called "AI Crawl
Control → Managed robots.txt" that injects a Policy-C-style block into
your robots.txt on the wire, regardless of what your origin serves. The
toggle is on by default for many zones. Many brands are blocking every AI
crawler without knowing.

To check: curl https://yourdomain.com/robots.txt. If you see a
block titled "# BEGIN Cloudflare Managed content", you're
affected. Turn the toggle off in Cloudflare → AI Crawl Control → Managed
robots.txt. We wrote up the full story
here.
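
The same check can be scripted so it runs after every CDN change. This is a minimal Python sketch that looks for the marker comment quoted above. The helper names and the robots-check user-agent string are made up, and it assumes the marker text appears exactly as quoted.

```python
# Sketch: detect the Cloudflare managed block in a served robots.txt.
import urllib.request

# Marker text as quoted in the article; treated here as an assumption.
MARKER = "# BEGIN Cloudflare Managed content"


def has_managed_block(robots_txt: str) -> bool:
    """True if any line of the file starts with the managed-content marker."""
    return any(
        line.strip().lower().startswith(MARKER.lower())
        for line in robots_txt.splitlines()
    )


def fetch_robots(domain: str, timeout: float = 10.0) -> str:
    """Fetch https://<domain>/robots.txt as seen from outside the CDN."""
    request = urllib.request.Request(
        f"https://{domain}/robots.txt",
        headers={"User-Agent": "robots-check/0.1"},  # hypothetical UA string
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Usage would be something like has_managed_block(fetch_robots("yourdomain.com")), run from a machine outside your own network so the request passes through Cloudflare.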

Validating your robots.txt

Three quick checks:

  1. Use the URL Inspection tool in Google Search Console: enter a URL and it reports whether Googlebot is blocked by robots.txt. Google retired the standalone robots.txt tester. The robots.txt report under Settings shows which file Google last fetched and any parse errors.
  2. Curl with each crawler's user-agent and inspect the response:
   curl -A "GPTBot" https://yourdomain.com/robots.txt
   curl -A "ClaudeBot" https://yourdomain.com/robots.txt
  3. Watch Bing Webmaster Tools' Crawl Errors: Bing reports robots.txt-blocked URLs there. Other engines don't surface this as cleanly.
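
The curl check in step 2 can also be automated. The Python sketch below requests the same URL under several user-agent strings and flags whether the status or body differs between them, which would suggest the CDN or server treats crawlers differently. The function names are hypothetical, and sending a bare token such as GPTBot only imitates the header, so a CDN that verifies crawler IP ranges may respond differently to the real bot.

```python
# Sketch: compare robots.txt responses across user-agent strings.
import hashlib
import urllib.error
import urllib.request
from typing import Callable

Fetcher = Callable[[str, str], tuple[int, bytes]]


def fetch_as(url: str, user_agent: str) -> tuple[int, bytes]:
    """Return (status, body) for `url` requested with the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status and body worth comparing.
        return err.code, err.read()


def compare_agents(url: str, agents: list[str],
                   fetch: Fetcher = fetch_as) -> dict[str, tuple[int, str]]:
    """Map each agent to (status, short SHA-256 digest of the body)."""
    results = {}
    for agent in agents:
        status, body = fetch(url, agent)
        results[agent] = (status, hashlib.sha256(body).hexdigest()[:12])
    return results


def responses_differ(results: dict[str, tuple[int, str]]) -> bool:
    """True if at least two agents received a different status or body."""
    return len(set(results.values())) > 1
```

A call such as compare_agents("https://yourdomain.com/robots.txt", ["GPTBot", "ClaudeBot", "Mozilla/5.0"]) gives a quick table to eyeball. The fetch parameter exists so the comparison logic can be tested without network access.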

Three rules

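  1. Specific user-agents override the wildcard. If you have User-agent: * Allow: / and below it User-agent: GPTBot Disallow: /, GPTBot is blocked. A crawler with its own group follows only that group and ignores the wildcard group entirely, so any Disallow lines under User-agent: * must be repeated in the specific group if you want them to apply there.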
  2. One User-agent block per crawler. Some sites repeat User-agent: GPTBot with different rules in different blocks. RFC 9309 requires crawlers to combine such groups, but not every parser does, and some honor only the first block. Consolidate.
  3. Don't block Googlebot when you mean Google-Extended. These are different crawlers. Googlebot powers Search + AI Overviews. Google-Extended powers Gemini training. Blocking Googlebot tanks your traditional Google traffic.
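
The first two rules can be illustrated with a short Python sketch of group selection as RFC 9309 describes it: a crawler uses the group that names it and falls back to User-agent: * only when no such group exists. The RFC asks crawlers to combine duplicate groups. Parsers that do not will see only part of the rules, which is why consolidating is the safe choice. The function names and the sample file are made up, and the parser ignores everything except User-agent, Allow, and Disallow.

```python
# Sketch: RFC 9309 group selection with duplicate groups merged.

SAMPLE = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /private/

User-agent: GPTBot
Disallow: /drafts/
"""


def parse_groups(text: str) -> dict[str, list[tuple[str, str]]]:
    """Map each lowercased user-agent token to its combined rule list."""
    groups: dict[str, list[tuple[str, str]]] = {}
    current: list[str] = []  # agents named by the group being read
    seen_rule = False        # True once the current group has a rule
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a User-agent line after rules starts a new group
                current, seen_rule = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            seen_rule = True
            for agent in current:
                groups[agent].append((field, value))
    return groups


def rules_for(agent: str, groups: dict[str, list[tuple[str, str]]]) -> list[tuple[str, str]]:
    """Rules for `agent`: its own group if present, else the wildcard group."""
    return groups.get(agent.lower(), groups.get("*", []))


groups = parse_groups(SAMPLE)
print(rules_for("GPTBot", groups))     # [('disallow', '/private/'), ('disallow', '/drafts/')]
print(rules_for("ClaudeBot", groups))  # [('allow', '/')]
```

Note that GPTBot's result contains nothing from the wildcard group, and both of its duplicate blocks are present.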

FAQ

I want to be in ChatGPT but not in Claude. Can I?

Yes. Allow GPTBot, OAI-SearchBot, and
ChatGPT-User; disallow ClaudeBot,
Claude-Web, and anthropic-ai. Practical impact is
modest because most brands want presence everywhere AI assistants exist,
but the option is there.

What about noai and noimageai meta tags?

These are the page-level equivalent of robots.txt rules. They tell crawlers
not to use the page's content for AI training. Less widely honored than
Content-Signal directives in robots.txt; useful as defense-in-depth on
pages where you really care.

What about llms.txt?

A proposed standard for "here's a curated text version of my content for
LLMs to ingest cleanly." Adoption is uneven; OpenAI and Anthropic both
said publicly in 2025 that they prefer to crawl normally. Worth shipping
if it's easy to generate, but don't rely on it as your primary AI-visibility
strategy.

Do I need to also add an X-Robots-Tag HTTP header?

Only if you want per-page granularity that robots.txt can't express (e.g.,
"noindex this specific PDF without listing it"). For broad AI-visibility
policy, robots.txt is sufficient.

Bottom line

Most brands win by shipping Policy A. Some by shipping Policy B. Very few
should ship Policy C. Whichever you choose, do it deliberately — and
re-check after every CDN configuration change. The most common reason
brands lose AI visibility isn't a strategy decision; it's a CDN feature
that flipped a switch they didn't notice.
