A friend messaged me last week asking why his documentation site wasn't showing up in ChatGPT's search results. He'd been doing all the right things. Good content, proper meta tags, decent domain authority.
Took me about 30 seconds to find the problem. His robots.txt was blocking GPTBot. Not intentionally. His hosting provider's default template included a block on several AI crawlers, and he never noticed.
Turns out this is way more common than you'd think.
The New Crawlers You Probably Don't Know About
Most developers know about Googlebot and Bingbot. But there's a whole new generation of AI crawlers that are indexing the web for LLM training and AI search products. And if your robots.txt is blocking them, your content is invisible to a growing chunk of how people find information.
Here are the ones that matter right now:
- GPTBot (OpenAI) - Powers ChatGPT search
- ChatGPT-User (OpenAI) - ChatGPT browsing mode
- ClaudeBot (Anthropic) - Claude's web access
- PerplexityBot (Perplexity) - Perplexity AI search
- Bytespider (TikTok/ByteDance) - AI training
- CCBot (Common Crawl) - Used by many AI companies
- Google-Extended - Gemini training (separate from Googlebot)
According to OpenAI's documentation, GPTBot respects robots.txt directives. Same for ClaudeBot per Anthropic's docs. So if you block them, they actually stay away.
Check Your robots.txt Right Now
Go look at yoursite.com/robots.txt. Seriously, do it right now. I'll wait.
Here's what a problematic robots.txt looks like:
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
See that? The catch-all User-agent: * allows everything, but each AI crawler matches its own, more specific group, and that group disallows everything. This is surprisingly common in default configs from hosting providers and CMS platforms.
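The precedence rule is worth internalizing: per RFC 9309, a crawler obeys the single most specific User-agent group that matches it and ignores all the others, including `*`. Here's a minimal sketch of that selection logic (the `selectGroup` helper is illustrative, not a full spec-compliant parser; it skips details like grouped User-agent lines):

```typescript
// Illustrative sketch: return the rules a given crawler would actually obey.
// Per RFC 9309, a crawler uses its own User-agent group if one exists,
// and falls back to the `*` group only when no specific group matches.
function selectGroup(robotsTxt: string, agent: string): string[] {
  const groups = new Map<string, string[]>();
  let current: string[] | null = null;

  for (const raw of robotsTxt.split('\n')) {
    const line = raw.trim();
    const lower = line.toLowerCase();

    if (lower.startsWith('user-agent:')) {
      const name = line.slice('user-agent:'.length).trim().toLowerCase();
      current = groups.get(name) ?? [];
      groups.set(name, current);
    } else if (current && line) {
      current.push(line);
    }
  }

  return groups.get(agent.toLowerCase()) ?? groups.get('*') ?? [];
}
```

Run it against the problematic file above and you can see why the catch-all is irrelevant: GPTBot gets `Disallow: /` while a crawler with no specific group, like Bingbot, falls back to `Allow: /`.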
Some WordPress security plugins add AI crawler blocks by default. Cloudflare offers AI bot blocking as an option that's easy to turn on accidentally. And a bunch of robots.txt generators from 2024 include AI blocks, a leftover from the big wave of "protect your content from AI training" sentiment.
A Quick Audit Script
Here's a script I use to check whether a site is blocking AI crawlers:
// Check if a site blocks AI crawlers
const AI_CRAWLERS = [
  'GPTBot',
  'ChatGPT-User',
  'ClaudeBot',
  'PerplexityBot',
  'Bytespider',
  'CCBot',
  'Google-Extended',
  'Amazonbot',
  'anthropic-ai',
  'FacebookBot',
];

interface CrawlerStatus {
  crawler: string;
  blocked: boolean;
  rule: string | null;
}

async function checkAICrawlerAccess(domain: string): Promise<CrawlerStatus[]> {
  const robotsUrl = `https://${domain}/robots.txt`;
  const response = await fetch(robotsUrl);

  if (!response.ok) {
    // No robots.txt means everything is allowed
    return AI_CRAWLERS.map(c => ({ crawler: c, blocked: false, rule: null }));
  }

  const robotsTxt = await response.text();
  const results: CrawlerStatus[] = [];

  for (const crawler of AI_CRAWLERS) {
    const blocked = isBlocked(robotsTxt, crawler, '/');
    results.push({
      crawler,
      blocked,
      rule: blocked ? findMatchingRule(robotsTxt, crawler) : null,
    });
  }

  return results;
}

function isBlocked(robotsTxt: string, userAgent: string, path: string): boolean {
  const lines = robotsTxt.split('\n');
  let inAgentBlock = false;
  let isDisallowed = false;

  for (const line of lines) {
    const trimmed = line.trim().toLowerCase();

    if (trimmed.startsWith('user-agent:')) {
      const agent = trimmed.replace('user-agent:', '').trim();
      inAgentBlock = agent === userAgent.toLowerCase() || agent === '*';
    }

    if (inAgentBlock && trimmed.startsWith('disallow:')) {
      const disallowPath = trimmed.replace('disallow:', '').trim();
      // An empty Disallow means "allow everything", so skip it
      if (disallowPath && (disallowPath === '/' || path.startsWith(disallowPath))) {
        isDisallowed = true;
      }
    }
  }

  return isDisallowed;
}

// Return the first non-empty Disallow line that applies to this crawler
function findMatchingRule(robotsTxt: string, userAgent: string): string | null {
  let inAgentBlock = false;

  for (const line of robotsTxt.split('\n')) {
    const trimmed = line.trim();
    const lower = trimmed.toLowerCase();

    if (lower.startsWith('user-agent:')) {
      const agent = lower.replace('user-agent:', '').trim();
      inAgentBlock = agent === userAgent.toLowerCase() || agent === '*';
    }

    if (inAgentBlock && lower.startsWith('disallow:') && lower.replace('disallow:', '').trim()) {
      return trimmed;
    }
  }

  return null;
}
Note: this is simplified. Real robots.txt parsing has a lot of edge cases with wildcards and precedence rules. But for a quick check it works fine.
The Numbers That Should Scare You
ChatGPT passed 200 million weekly active users in 2024, per OpenAI, and Perplexity handles millions of queries daily. These are real traffic sources now, not just novelty toys.
If your site blocks GPTBot, none of those ChatGPT search users will ever see your content. It's like blocking Googlebot in 2010. You could do it, but why would you?
This is exactly why I built the crawler analysis feature in SiteCrawlIQ. I ran it on about 200 developer-focused sites last month: nearly 30% had at least one major AI crawler blocked, and about half of those blocks were unintentional (the site owner didn't know). You can check yours in about 30 seconds.
When You SHOULD Block AI Crawlers
Not gonna lie, there are legitimate reasons to block some AI crawlers:
- Protecting proprietary content: if your content is behind a paywall, you probably don't want AI models training on it
- Bandwidth concerns: Some AI crawlers are aggressive and can spike your server costs
- Legal/compliance: Some industries have data sharing restrictions
But for most developer blogs, documentation sites, and SaaS landing pages? You WANT these crawlers to access your content. It's free visibility.
The Recommended Setup
Here's what I recommend for most sites:
User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
# Allow all AI crawlers for search visibility
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
# Block aggressive training-only crawlers if you want
User-agent: Bytespider
Disallow: /
The key insight is to be intentional about it. Don't just accept whatever default your hosting provider gives you. Actually decide which crawlers you want accessing your content, and why.
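One way to force that intentionality is to generate robots.txt from an explicit per-crawler decision list instead of hand-editing a template. A sketch (the `buildRobotsTxt` helper and `CrawlerPolicy` shape are my own invention, not from any library; the crawler names are real):

```typescript
// Illustrative: build robots.txt from explicit allow/deny decisions,
// so every AI crawler listed is there on purpose, not inherited from a default.
interface CrawlerPolicy {
  userAgent: string;
  allow: boolean;
}

function buildRobotsTxt(policies: CrawlerPolicy[], sitemapUrl: string): string {
  const lines: string[] = ['User-agent: *', 'Allow: /', ''];

  for (const p of policies) {
    lines.push(`User-agent: ${p.userAgent}`);
    lines.push(p.allow ? 'Allow: /' : 'Disallow: /');
    lines.push('');
  }

  lines.push(`Sitemap: ${sitemapUrl}`);
  return lines.join('\n');
}

const robotsTxt = buildRobotsTxt(
  [
    { userAgent: 'GPTBot', allow: true },
    { userAgent: 'ClaudeBot', allow: true },
    { userAgent: 'PerplexityBot', allow: true },
    { userAgent: 'Bytespider', allow: false },
  ],
  'https://yoursite.com/sitemap.xml',
);
```

Drop the output into your build step and the policy list becomes the single place where these decisions live, and show up in code review.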
Also Check Your HTTP Headers
robots.txt isn't the only way crawlers get blocked. Some CDNs and WAFs block AI crawlers at the HTTP level using the X-Robots-Tag header or by checking user agents and returning 403s.
Check your server logs for requests from AI crawler user agents. If you see a bunch of 403 responses, your WAF might be blocking them even though your robots.txt allows access.
// Quick check if your server is actually serving content to AI crawlers
// (real crawlers send longer UA strings; this is enough to trip most UA-based rules)
async function testCrawlerAccess(url: string, crawler: string) {
  const response = await fetch(url, {
    headers: {
      'User-Agent': `${crawler}/1.0`,
    },
  });

  console.log(`${crawler}: ${response.status} ${response.statusText}`);

  const xRobotsTag = response.headers.get('x-robots-tag');
  if (xRobotsTag) {
    console.log(`  X-Robots-Tag: ${xRobotsTag}`);
  }
}
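To turn those raw responses into a verdict, a small classifier helps: a 401/403 usually means a WAF or bot-protection rule fired, while an X-Robots-Tag of noindex or none blocks indexing even when the status is 200. This helper is illustrative, not from any library:

```typescript
// Illustrative: interpret what a crawler-User-Agent request got back.
type AccessResult = 'allowed' | 'blocked-by-waf' | 'blocked-by-header';

function classifyResponse(status: number, xRobotsTag: string | null): AccessResult {
  // 401/403 on a public page usually means UA-based blocking at the WAF/CDN
  if (status === 401 || status === 403) return 'blocked-by-waf';
  // noindex/none in X-Robots-Tag removes the page from indexes even on a 200
  if (xRobotsTag && /\b(noindex|none)\b/i.test(xRobotsTag)) return 'blocked-by-header';
  return 'allowed';
}
```

Feed it the status and header from `testCrawlerAccess` above and you get a per-crawler answer instead of a wall of status codes.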
Do This Today
- Check your robots.txt for AI crawler blocks
- Check your CDN/WAF settings for bot blocking rules
- Review any WordPress plugins that might be adding blocks
- Decide intentionally which AI crawlers you want to allow
- Monitor your server logs for AI crawler access patterns
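For that last item, a quick pass over your access logs is usually enough. Here's a sketch assuming Apache/Nginx combined log format; the `tallyCrawlerHits` helper and the field positions are assumptions about your log setup, so adjust for yours:

```typescript
// Illustrative: tally AI crawler hits and denials from access-log lines.
// Assumes combined log format, where the status code follows the quoted request.
const AI_AGENTS = ['GPTBot', 'ChatGPT-User', 'ClaudeBot', 'PerplexityBot', 'Bytespider', 'CCBot'];

function tallyCrawlerHits(logLines: string[]): Map<string, { total: number; denied: number }> {
  const tally = new Map<string, { total: number; denied: number }>();

  for (const line of logLines) {
    const agent = AI_AGENTS.find(a => line.includes(a));
    if (!agent) continue;

    // Status code is the field right after the quoted request, e.g. `"GET / HTTP/1.1" 403`
    const m = line.match(/" (\d{3}) /);
    if (!m) continue;
    const status = parseInt(m[1], 10);

    const entry = tally.get(agent) ?? { total: 0, denied: 0 };
    entry.total++;
    if (status === 401 || status === 403) entry.denied++;
    tally.set(agent, entry);
  }

  return tally;
}
```

If a crawler shows up with a high denied count, your robots.txt says "welcome" but something downstream is slamming the door.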
The web is changing fast. AI search is a legitimate traffic channel now and it's only getting bigger. Make sure you're not accidentally hiding from it.