Parsing robots.txt for 10 AI Crawlers: Wildcards, Partial Blocks, Line Numbers


robots.txt parsing looks like a weekend job. It is a flat text file. Each line is a directive. You split on the colon, match the user agent, check whether a path is disallowed. How hard can it be?

Then you start feeding it real files. You hit a group that opens with three User-agent lines and one rule block. You hit a Disallow: /*? that means more than its author thought. You hit a file that 404s over HTTPS but loads over HTTP. You hit comments mid-line, mixed casing, and a Disallow: with nothing after it. The weekend job grows teeth.

We built the AI Crawler Checker to answer one narrow question well: for a given domain, which of the major AI crawlers can read it, and which cannot. We grade against ten specific user agents:

GPTBot: OpenAI, model training
ChatGPT-User: ChatGPT live browsing
Google-Extended: Gemini training and grounding (a control token, not a crawler of its own)
Googlebot: Google Search and AI Overviews
PerplexityBot: Perplexity
Anthropic-AI: Claude training
ClaudeBot: Claude web crawler
Bytespider: ByteDance and TikTok
CCBot: Common Crawl, which feeds many AI training sets
Applebot-Extended: Apple Intelligence

This is a write-up of the parts that were not trivial.

Grouping directives by user-agent

The thing that trips up a naive parser is that robots.txt is not a flat list of rules. It is a sequence of groups. A group opens with one or more User-agent lines, and the rule lines that follow apply to every user agent named in that opening run. So this:

User-agent: GPTBot
User-agent: CCBot
Disallow: /private/

is one group of rules shared by two bots, not two separate groups. The * group is the fallback that applies to any agent without its own group. Get the grouping wrong and you misattribute every rule.

The parser we settled on tracks state as it reads. While it is still seeing User-agent lines, it accumulates names. The first non-User-agent directive closes the agent list and starts collecting rules for all of them. Stripped down, it looks like this:

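def strip_comment(raw):
    # Drop everything from the first "#" to the end of the line.
    return raw.split("#", 1)[0]

def parse_groups(lines):
    groups = []
    agents, rules = [], []
    reading_agents = False              # True while inside the opening User-agent run
    for lineno, raw in enumerate(lines, start=1):
        line = strip_comment(raw).strip()
        if not line:
            continue
        field, _, value = line.partition(":")
        field = field.strip().lower()   # User-agent, user-agent, USER-AGENT all match
        value = value.strip()
        if field == "user-agent":
            if agents and not reading_agents:   # a rule was seen, so the previous group is closed
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
            reading_agents = True
        elif field in ("allow", "disallow"):
            if not agents:
                continue                # rule before any User-agent line: nothing to attach it to
            reading_agents = False
            rules.append((field, value, lineno))
    if agents:
        groups.append((agents, rules))
    return groups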

Illustrative, not the production code, but the shape is real. The detail that earns its keep is lineno, carried on every rule from the moment it is read. More on that below.
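Once the groups exist, picking the rules that govern a given bot is a small lookup. Here is a minimal sketch, assuming groups in the (agents, rules) shape shown above. The name rules_for is hypothetical, and the exact-match comparison is a simplification, since real crawlers match more loosely on the product token. A bot named in any group uses only its own rules; everything else falls back to the * group.

```python
def rules_for(groups, bot):
    """Return the rules that apply to `bot`, falling back to the * group."""
    name = bot.lower()
    # A bot named in its own group never inherits from *, even if that group is empty.
    if any(name in agents for agents, _ in groups):
        return [rule for agents, rules in groups if name in agents for rule in rules]
    return [rule for agents, rules in groups if "*" in agents for rule in rules]


# Hand-written sample in the parser's output shape: (agents, [(field, value, lineno)]).
groups = [
    (["gptbot", "ccbot"], [("disallow", "/private/", 3)]),
    (["*"], [("disallow", "/tmp/", 6)]),
]
print(rules_for(groups, "CCBot"))      # [('disallow', '/private/', 3)]
print(rules_for(groups, "ClaudeBot"))  # [('disallow', '/tmp/', 6)]
```

If a bot is named in more than one group, this sketch merges the rules from all of them.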

Three verdicts, not two

The obvious model is binary: a bot is allowed or it is blocked. Real files do not split that cleanly, so we report three states: allowed, blocked, and partial.

Blocked is a blanket Disallow: /. The bot gets nothing:

User-agent: Bytespider
Disallow: /

Partial is a scoped disallow. The bot can read most of the site but is shut out of specific paths:

User-agent: Googlebot
Disallow: /admin/
Disallow: /cart/

That is "partial" rather than "blocked", and the distinction is the whole point. A bot blocked only from /admin/ is fine. A bot blocked from /docs/ might be cut off from exactly the content you want it to read. Collapsing partial into blocked would cry wolf; collapsing it into allowed would hide a real problem. Three states is the smallest model that tells the truth, so that is what we report, with the matching path shown for every partial.
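The three-way split can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's actual logic: verdict is a hypothetical name, rules are (field, value, lineno) tuples, and Allow lines are ignored, so a blanket Disallow counts as blocked even if an Allow carves out an exception. An empty Disallow: blocks nothing, so it is dropped before classifying.

```python
def verdict(rules):
    """Classify a bot's rules as 'allowed', 'blocked', or 'partial'.

    Returns (state, matching_rules) so the caller can show the paths
    and line numbers behind the verdict.
    """
    # An empty Disallow value blocks nothing, so it is filtered out here.
    disallows = [r for r in rules if r[0] == "disallow" and r[1]]
    # Assumption: "/*" is treated as equivalent to a blanket "/".
    blanket = [r for r in disallows if r[1] in ("/", "/*")]
    if blanket:
        return "blocked", blanket
    if disallows:
        return "partial", disallows
    return "allowed", []


print(verdict([("disallow", "/", 2)]))
# ('blocked', [('disallow', '/', 2)])
print(verdict([("disallow", "/admin/", 2), ("disallow", "/cart/", 3)]))
# ('partial', [('disallow', '/admin/', 2), ('disallow', '/cart/', 3)])
print(verdict([("disallow", "", 2)]))
# ('allowed', [])
```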

Line-number attribution

This was the UX decision that mattered most, and it shaped the parser.

A verdict of "GPTBot: blocked" is technically correct and operationally useless. The person reading it now has to open robots.txt, scan it, figure out which group applies to GPTBot, and find the offending line. For a long file with shared groups and wildcards, that is a few minutes of grepping and a decent chance of editing the wrong line.

A verdict of "GPTBot: blocked by Disallow: / on line 42" is a different object. It points at one line. The fix is one edit. There is nothing to investigate.

The implementation note that makes this work: you have to track line numbers during the parse, not reconstruct them after. Once you have normalized a file into groups and rules, the original line positions are gone, and trying to find them again by re-matching strings breaks the moment the file has duplicate directives. So the line number rides along with each rule from the first read, as the lineno element of the rule tuple in the parser above. It costs almost nothing to carry and cannot be reliably recovered later, which is the whole argument for doing it up front.
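To see the duplicate-directive problem concretely, take a file where the same Disallow: / appears in two groups. The file contents below are made up for illustration. Searching for the directive text after the fact lands on the first occurrence, which belongs to the wrong bot, while a line number recorded during the read points at the right one.

```python
robots = """\
User-agent: Bytespider
Disallow: /

User-agent: GPTBot
Disallow: /
""".splitlines()

# Reconstructing after the fact: first line whose text matches the directive.
rematched = next(i for i, line in enumerate(robots, start=1) if line == "Disallow: /")

# Carrying it from the read: note the line number while GPTBot's group is current.
carried = None
current = None
for lineno, line in enumerate(robots, start=1):
    field, _, value = line.partition(":")
    field = field.strip().lower()
    if field == "user-agent":
        current = value.strip().lower()
    elif field == "disallow" and current == "gptbot":
        carried = lineno

print(rematched)  # 2, which is Bytespider's line
print(carried)    # 5, which is GPTBot's line
```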

Fetch fallbacks and the messy middle

Before you parse anything you have to fetch the file, and fetching is where the real-world mess starts.

We try HTTPS first and fall back to plain HTTP, because a non-trivial number of sites serve robots.txt correctly over one protocol and 404 over the other. A checker that only tries HTTPS reports "no rules" for a site that has plenty.

Then there is the missing file. If robots.txt does not exist at all, the spec (RFC 9309) is clear: absence means allowed. No file is permission for everything, so a 404 over both protocols resolves to allowed-by-default for all ten bots, not to an error.
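The fetch order and the missing-file rule can be sketched with the standard library. This is a rough illustration, not the checker's code: http_get and fetch_robots are hypothetical names, and reporting "unknown" for anything other than a 200 or a double 404 is our assumption for the sketch. The getter is passed in as a parameter so the fallback logic can be exercised without touching the network.

```python
import urllib.error
import urllib.request


def http_get(url, timeout=10):
    """Return (status_code, body_text); HTTP error responses become status codes."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, ""


def fetch_robots(domain, get=http_get):
    """Try HTTPS, then HTTP. Return (body, note); body is None if nothing was fetched."""
    statuses = []
    for scheme in ("https", "http"):
        try:
            status, body = get(f"{scheme}://{domain}/robots.txt")
        except OSError:  # DNS failure, refused connection, TLS error, timeout
            statuses.append(None)
            continue
        if status == 200:
            return body, f"fetched over {scheme}"
        statuses.append(status)
    if all(s == 404 for s in statuses):
        return None, "no robots.txt: allowed by default"
    return None, "robots.txt unreachable: verdict unknown"


def fake_get(url):
    # Stand-in for a site that 404s over HTTPS but serves the file over HTTP.
    if url.startswith("https://"):
        return 404, ""
    return 200, "User-agent: *\nDisallow: /tmp/\n"


body, note = fetch_robots("example.com", get=fake_get)
print(note)  # fetched over http
```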

And the file itself is rarely clean. Comments appear at the end of otherwise valid lines, so you strip from the first # before parsing the directive. Field names arrive in every casing, so User-agent, user-agent, and USER-AGENT all have to match, which is why the parser lowercases the field. None of this is hard individually. It is the accumulation that turns a flat-file parser into something you actually test against captured real-world files.

What this method cannot see

The honest boundary, stated in the tool rather than buried: this reads robots.txt and nothing else. robots.txt is permission, not discovery, and permission is only one layer of access control.

If a site blocks a crawler at the CDN or WAF, or by IP, or through a bot-management product, none of that lives in robots.txt. It happens at the network edge, and our checker never sees it. So a clean report means your robots.txt is not blocking the bot. It does not prove the bot can reach you. We say exactly that in the result rather than letting a green checkmark overclaim, because a tool that quietly overstates what it knows is worse than no tool.

Try it

If you want to run your own domain through it, the AI Crawler Checker is free and takes a root domain, no signup. You get the per-bot verdict, the directive, and the line number for all ten crawlers.

The companion on the discovery side is the Sitemap Checker, which validates that your sitemap is findable and healthy. Permission and discovery are different problems, and a bot you have allowed in robots.txt still has to be able to find your pages. Geology is a full-stack SEO and GEO agency, and these free tools are the same checks we run in client audits, factored out so you can run them yourself.

Mehul Jain is an AI entrepreneur and product builder. He works on Geology, a GEO platform.