A friend messaged me last week asking why his documentation site wasn't showing up in ChatGPT's search results. He'd been doing all the right things. Good content, proper meta tags, decent domain authority.
Took me about 30 seconds to find the problem. His robots.txt was blocking GPTBot. Not intentionally. His hosting provider's default template included a block on several AI crawlers, and he never noticed.
Turns out this is way more common than you'd think.
The New Crawlers You Probably Don't Know About
Most developers know about Googlebot and Bingbot. But there's a whole new generation of AI crawlers that are indexing the web for LLM training and AI search products. And if your robots.txt is blocking them, your content is invisible to a growing chunk of how people find information.
Here are the ones that matter right now:
- GPTBot (OpenAI) - Powers ChatGPT search
- ChatGPT-User (OpenAI) - ChatGPT browsing mode
- ClaudeBot (Anthropic) - Claude's web access
- PerplexityBot (Perplexity) - Perplexity AI search
- Bytespider (TikTok/ByteDance) - AI training
- CCBot (Common Crawl) - Used by many AI companies
- Google-Extended - Gemini training (separate from Googlebot)
According to OpenAI's documentation, GPTBot respects robots.txt directives. Same for ClaudeBot per Anthropic's docs. So if you block them, they actually stay away.
Check Your robots.txt Right Now
Go look at yoursite.com/robots.txt. Seriously, do it right now. I'll wait.
Here's what a problematic robots.txt looks like:
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
See that? The catch-all User-agent: * allows everything, but each AI crawler matches its own, more specific group, and that group disallows everything. This is surprisingly common in default configs from hosting providers and CMS platforms.
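The precedence rule is worth internalizing: per RFC 9309, a crawler obeys the single most specific User-agent group that matches it and ignores all the others, including `*`. Here's a minimal sketch of that selection logic (the `selectGroup` helper is illustrative, not a full spec-compliant parser; it skips details like grouped User-agent lines):

```typescript
// Illustrative sketch: return the rules a given crawler would actually obey.
// Per RFC 9309, a crawler uses its own User-agent group if one exists,
// and falls back to the `*` group only when no specific group matches.
function selectGroup(robotsTxt: string, agent: string): string[] {
  const groups = new Map<string, string[]>();
  let current: string[] | null = null;

  for (const raw of robotsTxt.split('\n')) {
    const line = raw.trim();
    const lower = line.toLowerCase();

    if (lower.startsWith('user-agent:')) {
      const name = line.slice('user-agent:'.length).trim().toLowerCase();
      current = groups.get(name) ?? [];
      groups.set(name, current);
    } else if (current && line) {
      current.push(line);
    }
  }

  return groups.get(agent.toLowerCase()) ?? groups.get('*') ?? [];
}
```

Run it against the problematic file above and you can see why the catch-all is irrelevant: GPTBot gets `Disallow: /` while a crawler with no specific group, like Bingbot, falls back to `Allow: /`.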
Some WordPress security plugins add AI crawler blocks by default. Cloudflare offers AI bot blocking as an option that's easy to turn on accidentally. And a bunch of robots.txt generators from 2024 include AI blocks, a leftover from the big wave of "protect your content from AI training" sentiment.
A Quick Audit Script
Here's a script I use to check whether a site is blocking AI crawlers:
// Check if a site blocks AI crawlers
const AI_CRAWLERS = [
  'GPTBot',
  'ChatGPT-User',
  'ClaudeBot',
  'PerplexityBot',
  'Bytespider',
  'CCBot',
  'Google-Extended',
  'Amazonbot',
  'anthropic-ai',
  'FacebookBot',
];

interface CrawlerStatus {
  crawler: string;
  blocked: boolean;
  rule: string | null;
}

async function checkAICrawlerAccess(domain: string): Promise<CrawlerStatus[]> {
  const robotsUrl = `https://${domain}/robots.txt`;
  const response = await fetch(robotsUrl);

  if (!response.ok) {
    // No robots.txt means everything is allowed
    return AI_CRAWLERS.map(c => ({ crawler: c, blocked: false, rule: null }));
  }

  const robotsTxt = await response.text();
  const results: CrawlerStatus[] = [];

  for (const crawler of AI_CRAWLERS) {
    const blocked = isBlocked(robotsTxt, crawler, '/');
    results.push({
      crawler,
      blocked,
      rule: blocked ? findMatchingRule(robotsTxt, crawler) : null,
    });
  }

  return results;
}

function isBlocked(robotsTxt: string, userAgent: string, path: string): boolean {
  const lines = robotsTxt.split('\n');
  let inAgentBlock = false;
  let isDisallowed = false;

  for (const line of lines) {
    const trimmed = line.trim().toLowerCase();

    if (trimmed.startsWith('user-agent:')) {
      const agent = trimmed.replace('user-agent:', '').trim();
      inAgentBlock = agent === userAgent.toLowerCase() || agent === '*';
    }

    if (inAgentBlock && trimmed.startsWith('disallow:')) {
      const disallowPath = trimmed.replace('disallow:', '').trim();
      // An empty Disallow means "allow everything", so skip it
      if (disallowPath && (disallowPath === '/' || path.startsWith(disallowPath))) {
        isDisallowed = true;
      }
    }
  }

  return isDisallowed;
}

// Return the first non-empty Disallow line that applies to this crawler
function findMatchingRule(robotsTxt: string, userAgent: string): string | null {
  let inAgentBlock = false;

  for (const line of robotsTxt.split('\n')) {
    const trimmed = line.trim();
    const lower = trimmed.toLowerCase();

    if (lower.startsWith('user-agent:')) {
      const agent = lower.replace('user-agent:', '').trim();
      inAgentBlock = agent === userAgent.toLowerCase() || agent === '*';
    }

    if (inAgentBlock && lower.startsWith('disallow:') && lower.replace('disallow:', '').trim()) {
      return trimmed;
    }
  }

  return null;
}
Note: this is simplified. Real robots.txt parsing has a lot of edge cases with wildcards and precedence rules. But for a quick check it works fine.
The Numbers That Should Scare You
ChatGPT passed 200 million weekly active users in 2024, per OpenAI, and Perplexity handles millions of queries daily. These are real traffic sources now, not just novelty toys.
If your site blocks GPTBot, none of those ChatGPT search users will ever see your content. It's like blocking Googlebot in 2010. You could do it, but why would you?
This is exactly why I built the crawler analysis feature in SiteCrawlIQ. I ran it on about 200 developer-focused sites last month: nearly 30% had at least one major AI crawler blocked, and about half of those blocks were unintentional (the site owner didn't know). You can check yours in about 30 seconds.
When You SHOULD Block AI Crawlers
Not gonna lie, there are legitimate reasons to block some AI crawlers:
- Protecting proprietary content: if your content is behind a paywall, you probably don't want AI models training on it
- Bandwidth concerns: Some AI crawlers are aggressive and can spike your server costs
- Legal/compliance: Some industries have data sharing restrictions
But for most developer blogs, documentation sites, and SaaS landing pages? You WANT these crawlers to access your content. It's free visibility.
The Recommended Setup
Here's what I recommend for most sites:
User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
# Allow all AI crawlers for search visibility
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
# Block aggressive training-only crawlers if you want
User-agent: Bytespider
Disallow: /
The key insight is to be intentional about it. Don't just accept whatever default your hosting provider gives you. Actually decide which crawlers you want accessing your content, and why.
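One way to force that intentionality is to generate robots.txt from an explicit per-crawler decision list instead of hand-editing a template. A sketch (the `buildRobotsTxt` helper and `CrawlerPolicy` shape are my own invention, not from any library; the crawler names are real):

```typescript
// Illustrative: build robots.txt from explicit allow/deny decisions,
// so every AI crawler listed is there on purpose, not inherited from a default.
interface CrawlerPolicy {
  userAgent: string;
  allow: boolean;
}

function buildRobotsTxt(policies: CrawlerPolicy[], sitemapUrl: string): string {
  const lines: string[] = ['User-agent: *', 'Allow: /', ''];

  for (const p of policies) {
    lines.push(`User-agent: ${p.userAgent}`);
    lines.push(p.allow ? 'Allow: /' : 'Disallow: /');
    lines.push('');
  }

  lines.push(`Sitemap: ${sitemapUrl}`);
  return lines.join('\n');
}

const robotsTxt = buildRobotsTxt(
  [
    { userAgent: 'GPTBot', allow: true },
    { userAgent: 'ClaudeBot', allow: true },
    { userAgent: 'PerplexityBot', allow: true },
    { userAgent: 'Bytespider', allow: false },
  ],
  'https://yoursite.com/sitemap.xml',
);
```

Drop the output into your build step and the policy list becomes the single place where these decisions live, and show up in code review.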
Also Check Your HTTP Headers
robots.txt isn't the only way crawlers get blocked. Some CDNs and WAFs block AI crawlers at the HTTP level using the X-Robots-Tag header or by checking user agents and returning 403s.
Check your server logs for requests from AI crawler user agents. If you see a bunch of 403 responses, your WAF might be blocking them even though your robots.txt allows access.
// Quick check if your server is actually serving content to AI crawlers
// (real crawlers send longer UA strings; this is enough to trip most UA-based rules)
async function testCrawlerAccess(url: string, crawler: string) {
  const response = await fetch(url, {
    headers: {
      'User-Agent': `${crawler}/1.0`,
    },
  });

  console.log(`${crawler}: ${response.status} ${response.statusText}`);

  const xRobotsTag = response.headers.get('x-robots-tag');
  if (xRobotsTag) {
    console.log(`  X-Robots-Tag: ${xRobotsTag}`);
  }
}
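To turn those raw responses into a verdict, a small classifier helps: a 401/403 usually means a WAF or bot-protection rule fired, while an X-Robots-Tag of noindex or none blocks indexing even when the status is 200. This helper is illustrative, not from any library:

```typescript
// Illustrative: interpret what a crawler-User-Agent request got back.
type AccessResult = 'allowed' | 'blocked-by-waf' | 'blocked-by-header';

function classifyResponse(status: number, xRobotsTag: string | null): AccessResult {
  // 401/403 on a public page usually means UA-based blocking at the WAF/CDN
  if (status === 401 || status === 403) return 'blocked-by-waf';
  // noindex/none in X-Robots-Tag removes the page from indexes even on a 200
  if (xRobotsTag && /\b(noindex|none)\b/i.test(xRobotsTag)) return 'blocked-by-header';
  return 'allowed';
}
```

Feed it the status and header from `testCrawlerAccess` above and you get a per-crawler answer instead of a wall of status codes.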
Do This Today
- Check your robots.txt for AI crawler blocks
- Check your CDN/WAF settings for bot blocking rules
- Review any WordPress plugins that might be adding blocks
- Decide intentionally which AI crawlers you want to allow
- Monitor your server logs for AI crawler access patterns
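For that last item, a quick pass over your access logs is usually enough. Here's a sketch assuming Apache/Nginx combined log format; the `tallyCrawlerHits` helper and the field positions are assumptions about your log setup, so adjust for yours:

```typescript
// Illustrative: tally AI crawler hits and denials from access-log lines.
// Assumes combined log format, where the status code follows the quoted request.
const AI_AGENTS = ['GPTBot', 'ChatGPT-User', 'ClaudeBot', 'PerplexityBot', 'Bytespider', 'CCBot'];

function tallyCrawlerHits(logLines: string[]): Map<string, { total: number; denied: number }> {
  const tally = new Map<string, { total: number; denied: number }>();

  for (const line of logLines) {
    const agent = AI_AGENTS.find(a => line.includes(a));
    if (!agent) continue;

    // Status code is the field right after the quoted request, e.g. `"GET / HTTP/1.1" 403`
    const m = line.match(/" (\d{3}) /);
    if (!m) continue;
    const status = parseInt(m[1], 10);

    const entry = tally.get(agent) ?? { total: 0, denied: 0 };
    entry.total++;
    if (status === 401 || status === 403) entry.denied++;
    tally.set(agent, entry);
  }

  return tally;
}
```

If a crawler shows up with a high denied count, your robots.txt says "welcome" but something downstream is slamming the door.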
The web is changing fast. AI search is a legitimate traffic channel now and it's only getting bigger. Make sure you're not accidentally hiding from it.