Nikhil Goyal

The robots.txt Mistake That's Killing Your AI Search Visibility

There's a good chance your website is invisible to ChatGPT, Perplexity, and every other AI search engine — and the fix takes about 2 minutes.

I've been auditing sites for AI readability for the past year, and the single most common issue I find isn't bad content or missing schema. It's robots.txt blocking AI crawlers entirely. The site owner has no idea. They're optimizing content, writing FAQ pages, adding structured data — and none of it matters because the front door is locked.

Here's how to check yours and fix it.


The 30-second check

Run this right now:

curl -s https://yoursite.com/robots.txt

Now look for any of these bot names in Disallow rules:

  • GPTBot — OpenAI's crawler (powers ChatGPT citations)
  • OAI-SearchBot — OpenAI's search indexer (powers ChatGPT search)
  • ChatGPT-User — fetches pages when a ChatGPT user asks for live info
  • ClaudeBot — Anthropic's training crawler
  • Claude-SearchBot — Anthropic's search indexer (powers Claude's web search)
  • PerplexityBot — Perplexity's search crawler
  • Google-Extended — Google's AI training crawler (feeds Gemini and AI Overviews)
  • Applebot-Extended — Apple's AI training crawler (feeds Apple Intelligence)

Or just grep for it:

curl -s https://yoursite.com/robots.txt | grep -iE "gptbot|oai-searchbot|chatgpt-user|claudebot|claude-searchbot|perplexitybot|google-extended|applebot-extended"

If any of those names appears in a group with Disallow: /, that crawler is blocked from your site.
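If you want to see exactly which rule group governs a given bot, a small awk sketch pulls out the block that applies. This runs against a hypothetical local copy of robots.txt — on a real site, fetch yours with curl first:

```shell
# Hypothetical sample — replace with: curl -s https://yoursite.com/robots.txt > /tmp/robots.txt
cat > /tmp/robots.txt <<'EOF'
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
EOF

# Print the rule group for GPTBot: start at its User-agent line, stop at the next blank line
awk 'tolower($0) ~ /^user-agent: gptbot/ {grp=1} grp && /^$/ {grp=0} grp' /tmp/robots.txt
```

For the sample above this prints the GPTBot group, including its Disallow: / — the wildcard Allow does not save you, because crawlers follow their own named group when one exists.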


Why this happens more than you'd think

In about 4 out of 10 sites I audit, at least one major AI crawler is blocked. Here's how it happens:

1. The wildcard block

The most common culprit. Someone added this years ago and forgot about it:

User-agent: *
Disallow: /

This blocks everything — Googlebot, AI crawlers, all of it. Sometimes it was intentional for a staging site and got copied to production. Sometimes it's a CMS default that nobody changed.
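One nuance worth knowing (per the Robots Exclusion Protocol, RFC 9309): a crawler obeys only the most specific group that matches its user agent, and groups don't combine. So if you must keep a wildcard block — say, on a members-only area you're slowly opening up — you can still carve out a specific bot:

```text
# Everything is blocked by default...
User-agent: *
Disallow: /

# ...but GPTBot matches this more specific group, so the wildcard block doesn't apply to it
User-agent: GPTBot
Allow: /
```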

2. WordPress security plugins

Plugins like Wordfence, Sucuri, and All In One Security sometimes add bot-blocking rules automatically. I've seen configs that specifically block GPTBot and ClaudeBot because they were categorized as "scrapers" in early 2024 when AI crawling was more controversial.

Check your security plugin settings — some have an "AI bot blocking" toggle that's enabled by default.

3. The copy-paste robots.txt

A lot of robots.txt files in the wild were copied from blog posts written in 2023-2024, when the default recommendation was to block AI crawlers to "protect your content." The landscape has shifted. If your goal is visibility, those rules are now working against you.

4. CDN or hosting-level blocks

Cloudflare, Vercel, and other platforms offer bot management settings. Some templates or one-click security configs block AI user agents at the infrastructure level, before robots.txt even gets read. If your robots.txt looks clean but AI crawlers still aren't hitting your server logs, check your CDN or hosting settings.


The distinction I wish someone had explained to me earlier

When I first started looking into this, I treated all AI crawlers the same. That was a mistake. They fall into two very different categories:

Training bots scrape your content to train AI models:

  • GPTBot (OpenAI)
  • ClaudeBot (Anthropic)
  • Google-Extended (Google)
  • Applebot-Extended (Apple)
  • Bytespider (ByteDance)
  • CCBot (Common Crawl)

Search bots fetch your pages in real time to answer user queries and cite you:

  • OAI-SearchBot (OpenAI — powers ChatGPT search results)
  • ChatGPT-User (OpenAI — fetches pages during live conversations)
  • Claude-SearchBot (Anthropic — powers Claude's web search)
  • PerplexityBot (Perplexity — indexes for AI search)

The distinction matters. If you block the search bots, you won't get cited when someone asks ChatGPT or Perplexity for a recommendation in your space. That's live traffic you're turning away.

Training bots are a different calculation. Some site owners are comfortable contributing to model training; others aren't. That's a legitimate choice. But blocking training bots doesn't necessarily remove you from AI answers — models are already trained on historical data, and search bots work independently.


A robots.txt that works for AI visibility

Here's what I recommend as a starting point. It allows all search-related AI bots while giving you explicit control over training bots:

# Search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# AI search bots — allow these for AI citation visibility
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# AI training bots — your call on these
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

# Block training-only bots you're less comfortable with
User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

# Default: allow everything else
User-agent: *
Allow: /

# Sitemap
Sitemap: https://yoursite.com/sitemap.xml

If you want AI visibility but don't want to contribute training data, you can Disallow the training bots while keeping the search bots open. Just know that the line between training and search is blurry and getting blurrier — OpenAI's GPTBot description says it's for "improving AI models," but model improvements directly affect how well ChatGPT cites you in the future.
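For that posture — search open, training closed — the relevant groups would look like this (a fragment, not a full file):

```text
# Keep AI search/citation bots open
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Opt out of AI training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```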

My take: unless you have a specific reason to block training bots, allow them all. In my experience, sites that allow both training and search bots tend to get cited more consistently than sites that only allow search bots — though I'll admit the sample size is small and I'm still tracking this.


Verifying it's actually working

This bit tripped me up at first — I updated a client's robots.txt and assumed we were done. Took me a week to realize the CDN was still blocking at the edge. Always verify with server logs:

# Check for AI crawler activity in the last 7 days
grep -iE "gptbot|oai-searchbot|chatgpt-user|claudebot|claude-searchbot|perplexitybot" /var/log/nginx/access.log | tail -20
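To get a per-bot tally instead of raw log lines, pipe the matches through uniq -c. Shown here against a hypothetical sample log — point it at your real access log instead:

```shell
# Hypothetical sample — swap in /var/log/nginx/access.log on a real server
cat > /tmp/access.log <<'EOF'
1.2.3.4 - - [10/May/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.2"
1.2.3.4 - - [10/May/2025:10:05:00 +0000] "GET /docs HTTP/1.1" 200 900 "-" "OAI-SearchBot/1.0"
5.6.7.8 - - [10/May/2025:11:00:00 +0000] "GET /blog HTTP/1.1" 200 512 "-" "GPTBot/1.2"
EOF

# Hits per AI crawler, most active first
grep -ioE "gptbot|oai-searchbot|chatgpt-user|claudebot|claude-searchbot|perplexitybot" /tmp/access.log \
  | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn
```

For the sample log this counts 2 hits for gptbot and 1 for oai-searchbot, which is the kind of at-a-glance summary that makes week-over-week comparison easy.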

If you're on a managed hosting platform without raw log access, check:

  • Cloudflare: Security → Bots → look for verified bot traffic
  • Vercel: Analytics → check for known bot user agents
  • GA4: Won't show bot traffic directly, but watch for referrals from chatgpt.com, perplexity.ai, gemini.google.com

A few things I've noticed in the logs:

  • AI crawlers hit fewer pages than Googlebot, but spend more time per page
  • They tend to favor pages with structured data and clean HTML
  • ChatGPT-User shows up in bursts — someone is asking ChatGPT about your topic and it's fetching your page live
  • If you see OAI-SearchBot hitting your site regularly, that's a good sign — you're being indexed for ChatGPT search

Don't panic if you don't see activity immediately. AI crawlers don't re-index on a fixed schedule. Give it 2-4 weeks after opening up your robots.txt before expecting consistent crawler traffic.


What I've seen happen after unblocking

One thing I didn't expect: the effects aren't instant, but they compound. After unblocking AI crawlers on a few client sites, we noticed OAI-SearchBot started hitting pages within 1-2 weeks. Actual citations in ChatGPT responses took another 2-4 weeks after that.

But the interesting part was what happened to sites that stayed blocked. We ran the same queries monthly, and sites that were blocked for 6+ months essentially didn't exist in AI answers — even when their content was objectively better than what was getting cited. The crawlers had built indexing patterns around the sites that were consistently accessible, and the blocked sites had no history to draw on.

It's similar to how Googlebot works — if your site has been returning 403s for months, you don't just flip a switch and rank tomorrow. There's a trust ramp.


Quick note: robots.txt is a request, not a wall

Well-behaved crawlers (GPTBot, ClaudeBot, PerplexityBot) respect robots.txt. But it's not a security mechanism — nothing enforces it. Page-level opt-out signals exist, like an X-Robots-Tag: noai header or <meta name="robots" content="noai">, but be aware that noai is a community proposal, not part of any robots standard, and only some crawlers honor it. If you need hard enforcement, block the user agents at the server or CDN level instead.
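If you want to experiment with the noai signal anyway, emitting it is one line in nginx. This is a sketch: add_header is standard nginx, but noai itself is a non-standard hint that most major AI crawlers do not document support for:

```nginx
# Send a noai hint on every response, including errors ("always").
# Non-standard directive — treat it as a request, not an enforcement mechanism.
add_header X-Robots-Tag "noai, noimageai" always;
```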


I've been digging into this stuff while building PageX, and robots.txt misconfiguration is genuinely the most common issue we see — more than bad schema, more than thin content, more than any of the fancy optimization stuff. The boring infrastructure problem is usually the one that matters most.

Has anyone here found surprising blocks in their robots.txt? Or noticed AI crawler activity change after opening things up? Curious what patterns others are seeing in their logs.

Top comments (1)

david duymelinck

There is an issue with your robots.txt example. The last rule allows all bots, so it removes the need to add the individual Allow groups.

In this era of endless bots scraping websites, a whitelist in robots.txt is a good thing. The only problem is that the "private" bots just don't care about robots.txt and scrape the website without consent.