SEN LLC

Python's urllib.robotparser Is Subtly Wrong — and Why That Matters for SEO

A pre-deploy linter for robots.txt that also shows you where Python's stdlib parser disagrees with RFC 9309 and, therefore, with Googlebot.

robots.txt is the first file most SEO-aware backend engineers ship and the last file they ever look at again. It's three directives, a handful of paths, and a vague memory that "Google's bot honors it." Until one day Search Console tells you a critical page has been blocked for a week, and you stare at your 12-line file and cannot figure out which rule is responsible.

🔗 GitHub: https://github.com/sen-ltd/robots-lint

robots-lint is a small Python CLI that does two things. First, it lints robots.txt files for the common mistakes — wildcards in User-agent, missing trailing slashes on directory paths, conflicting Allow/Disallow pairs, UTF-8 BOMs, unknown directives, and a few others. Second, it implements RFC 9309's matching algorithm in pure Python (about 120 lines in matcher.py) and lets you ask "can this user-agent fetch this URL?" against a local file or a fetched URL. If you pass --compare-parsers, it runs the same question through both the RFC 9309 matcher and Python's stdlib urllib.robotparser, and flags any disagreement.

The --compare-parsers mode is the reason I built this. Python's stdlib parser is subtly wrong, in a way that matters for real sites, and nobody talks about it.

The problem: robots.txt has way more edge cases than you think

Here is an abbreviated list of things that make robots.txt parsing non-trivial:

  • Case sensitivity. Directive names are case-insensitive (USER-AGENT is fine), and user-agent values are matched case-insensitively too, but the paths in Allow: and Disallow: are case-sensitive.
  • Line endings and BOM. Plenty of production robots.txt files are served with CRLF or with a stray UTF-8 BOM. Some parsers handle them, some don't.
  • Wildcards in paths. * is a glob that matches zero or more of any character. $ is an end-of-match anchor. These are not in the original 1994 robots spec — they're Google extensions that RFC 9309 later standardized.
  • Group selection. Crawlers pick the most specific group matching their user-agent token: Googlebot-News takes precedence over Googlebot, which takes precedence over *.
  • Rule precedence within a group. This is the one everyone gets wrong. RFC 9309 section 2.2.2 mandates longest-match-wins: among all rules that match a URL in the selected group, the rule with the longest pattern wins. If two rules match with equal length, Allow beats Disallow.
  • Crawl-delay. Not in RFC 9309. Some crawlers honor it, some ignore it, some misparse it.
  • Host. Yandex-only, deprecated, but you still find it in the wild.
  • Redirects. If you fetch robots.txt and it 301s to /robots/, different crawlers treat that differently.
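
The longest-match rule can be made concrete in a few lines. Here is a minimal sketch over literal path prefixes only (no wildcard support; rfc_decision is an illustrative name, not the tool's API):

```python
def rfc_decision(rules, path):
    """Longest-match-wins: among the rules that match, the longest
    pattern decides; Allow beats Disallow on ties; default is allow.
    Sketch only -- literal prefixes, no '*' or '$' handling."""
    best = None  # (pattern_length, is_allow)
    for kind, pattern in rules:
        if path.startswith(pattern):
            cand = (len(pattern), kind == "allow")
            if best is None or cand[0] > best[0] or (cand[0] == best[0] and cand[1]):
                best = cand
    return True if best is None else best[1]

rules = [("disallow", "/admin/"), ("allow", "/admin/public/")]
print(rfc_decision(rules, "/admin/public/page"))  # → True: the 14-octet Allow outranks the 7-octet Disallow
print(rfc_decision(rules, "/admin/secret"))       # → False: only the Disallow matches
```

Note that source order never enters the computation; that property is exactly what the stdlib comparison in the next section hinges on.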

Most sites hand-write their robots.txt once, commit it, and never think about it again. The problems only surface when something has been quietly disallowed for Googlebot for long enough that Search Console notices. A linter catches the common mistakes before they ship.

The interesting bug: Python's stdlib parser is first-match, not longest-match

Here is the thing that surprised me, and the reason this project has a --compare-parsers mode at all.

Python ships with urllib.robotparser in the standard library. It has been there since Python 2. It is widely used — anything from quick scrapers to production crawlers reach for it without a second thought. And it is wrong, in a way that matters.

Consider this robots.txt:

User-agent: *
Disallow: /admin/
Allow: /admin/public/

You have an admin area, but one subdirectory (/admin/public/) is intentionally public. Under RFC 9309, the rule for /admin/public/page is clear: the longest matching rule is Allow: /admin/public/ (14 characters), which beats Disallow: /admin/ (7 characters). Googlebot will happily crawl /admin/public/page. That's the whole reason you write the rules this way.

Now ask Python:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
    "Allow: /admin/public/",
])
print(rp.can_fetch("Googlebot", "http://example.com/admin/public/page"))
# → False

Python says False. Googlebot says True. Your Python-based crawler and Google disagree.

The reason is that Python's RobotFileParser walks the rules in source order and returns the first one that matches. Reorder the file — put Allow: /admin/public/ before Disallow: /admin/ — and Python starts saying True. RFC 9309 says order must not matter; only the longest match wins, with Allow breaking ties. Python doesn't implement that.
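That order sensitivity is easy to demonstrate directly. The helper below is mine, not part of the tool:

```python
from urllib.robotparser import RobotFileParser

def stdlib_can_fetch(lines):
    # Ask the stdlib parser the same question as above.
    rp = RobotFileParser()
    rp.parse(lines)
    return rp.can_fetch("Googlebot", "http://example.com/admin/public/page")

# Identical rules, two orders -- first-match gives two different answers.
print(stdlib_can_fetch(["User-agent: *", "Disallow: /admin/", "Allow: /admin/public/"]))  # → False
print(stdlib_can_fetch(["User-agent: *", "Allow: /admin/public/", "Disallow: /admin/"]))  # → True
```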

This is not a bug in your robots.txt. This is a bug in Python's parser. And if you've ever used urllib.robotparser to pre-check URLs before hitting someone's site — which is exactly what the Python documentation suggests — you've been getting wrong answers on files that look completely normal.

The matcher: RFC 9309 in 80 lines

Here's the heart of matcher.py. It's a pure function from (AST, user_agent, url) → Decision, no IO, no global state:

def decide(ast: RobotsAST, user_agent: str, url: str) -> Decision:
    group = _select_group(ast, user_agent)
    path = _url_path(url)

    if group is None:
        return Decision(True, None, None, "no group matched — default allow")

    best_rule = None
    best_len = -1
    best_is_allow = False

    for rule in group.rules:
        if rule.kind == "disallow" and rule.path == "":
            # An empty Disallow line means "allow everything".
            if best_len < 0:
                best_rule, best_len, best_is_allow = rule, 0, True
            continue

        m = _match_length(rule, path)
        if m < 0:
            continue
        is_allow = rule.kind == "allow"
        if m > best_len:
            best_rule, best_len, best_is_allow = rule, m, is_allow
        elif m == best_len and is_allow and not best_is_allow:
            # Allow wins ties per RFC 9309 section 2.2.2.
            best_rule, best_is_allow = rule, True

    allowed = best_is_allow if best_rule else True
    return Decision(allowed, best_rule, group, _reason(best_rule, best_is_allow, best_len))

_match_length compiles the rule's path pattern to a regex (expanding * to .* and anchoring $ at end-of-match) and returns the length of the pattern in octets if it matches, or -1 otherwise. The $ anchor doesn't count toward the length because the spec says specificity is measured in matched octets, not regex syntax.
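A hypothetical reconstruction of that computation (the actual matcher.py may differ in its details):

```python
import re

def match_length(pattern: str, path: str) -> int:
    """Return the pattern's specificity in octets if it matches the
    path, else -1. '*' expands to '.*'; a trailing '$' anchors the
    match but does not count toward the length."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = re.escape(core).replace(r"\*", ".*") + ("$" if anchored else "")
    if re.match(regex, path):
        return len(core.encode("utf-8"))
    return -1

print(match_length("/admin/public/", "/admin/public/page"))  # → 14
print(match_length("/*.pdf$", "/a/b.pdf"))                   # → 6
print(match_length("/*.pdf$", "/a/b.pdf?x=1"))               # → -1 ($ anchor fails)
```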

Everything else — the parser, the linter, the comparator — is built around this one function. Separating the pure matcher from the IO made it trivial to write high-confidence tests: give it a known robots.txt, a user-agent, and a URL, and assert the decision. I lifted several of the test cases directly from RFC 9309 section 2.2.2.

The linter: static checks

The linter runs after the parser and emits a list of Finding objects with severity error, warning, or info. Here's the detection for wildcards in User-agent lines, one of the most common mistakes:

for group in ast.groups:
    for ua in group.user_agents:
        if "*" in ua and ua.strip() != "*":
            findings.append(Finding(
                line_no=group.start_line,
                severity="warning",
                code="wildcard-ua",
                message=(
                    f'wildcard in User-agent line: "User-agent: {ua}"; '
                    "specific crawlers are preferred"
                ),
            ))

User-agent: *bot looks like it should match "any bot", and some very old crawlers did treat it that way, but RFC 9309 is explicit: the only valid wildcard user-agent is bare *. Every other entry is matched as a literal product token prefix. So User-agent: *bot matches exactly zero real crawlers. It's pure dead code, and every time I see it in a real site's robots.txt, I know the author was guessing at the spec.

The parser itself is small and boring. The one subtlety is that consecutive User-agent lines should form a single group — so User-agent: Googlebot followed by User-agent: Googlebot-News followed by Disallow: /x means both bots get the rule. But if a non-User-agent directive comes in between, the next User-agent opens a new group. The parser tracks this with a single boolean:

if name == "user-agent":
    if current is None or just_saw_rule:
        current = Group(start_line=idx)
        ast.groups.append(current)
        just_saw_rule = False
    current.user_agents.append(value)

That one-boolean state machine is the whole grouping logic. I was ready for it to be hairy. It wasn't.
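To see the boolean at work end to end, here's a toy version of the grouping loop (Group and group_lines are illustrative names, not the project's parser):

```python
from dataclasses import dataclass, field

@dataclass
class Group:
    user_agents: list = field(default_factory=list)
    rules: list = field(default_factory=list)

def group_lines(lines):
    """Consecutive User-agent lines share one group; once a rule line
    is seen, the next User-agent line opens a fresh group."""
    groups, current, just_saw_rule = [], None, False
    for raw in lines:
        name, _, value = raw.partition(":")
        name, value = name.strip().lower(), value.strip()
        if name == "user-agent":
            if current is None or just_saw_rule:
                current = Group()
                groups.append(current)
                just_saw_rule = False
            current.user_agents.append(value)
        elif name in ("allow", "disallow") and current is not None:
            current.rules.append((name, value))
            just_saw_rule = True
    return groups

groups = group_lines([
    "User-agent: Googlebot",
    "User-agent: Googlebot-News",
    "Disallow: /x",
    "User-agent: Bingbot",
    "Disallow: /y",
])
print([g.user_agents for g in groups])
# → [['Googlebot', 'Googlebot-News'], ['Bingbot']]
```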

Trade-offs and things I intentionally skipped

Crawl-delay is not in RFC 9309. It's in the de facto robots.txt spec that predates the RFC, and Bing honors it, and Yandex honors it, and Google ignores it and uses Search Console's crawl rate setting instead. The linter recognizes the directive and flags it as info so you know it's not part of the standard, but it doesn't try to validate the value's units. Some sites write Crawl-delay: 10, some write Crawl-delay: 0.5, some write Crawl-delay: 10s. There is no consistent answer.

Host: is Yandex-only. I flag it as info. I briefly considered rejecting it outright, but it's still in production at enough Russian-market sites that a warning would be annoying.

User-agent matching is prefix-based on the product token, not substring. RFC 9309 says to compare the token before the first slash, case-insensitive. So Googlebot/2.1 (+http://www.google.com/bot.html) matches a group with User-agent: Googlebot. But User-agent: bot in the file would NOT match Googlebot — it's a product-token comparison, not a substring search. The matcher implements this correctly.
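As a sketch, that comparison boils down to a couple of lines (ua_matches is an assumed name, not the tool's API):

```python
def ua_matches(group_name: str, crawler_ua: str) -> bool:
    """Prefix match on the product token: take everything before the
    first '/', lowercase both sides, and require the group's name to
    be a prefix of the token."""
    token = crawler_ua.split("/", 1)[0].strip().lower()
    return token.startswith(group_name.strip().lower())

print(ua_matches("Googlebot", "Googlebot/2.1 (+http://www.google.com/bot.html)"))  # → True
print(ua_matches("bot", "Googlebot/2.1 (+http://www.google.com/bot.html)"))        # → False
```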

Redirects are not followed when fetching a URL source. If example.com/robots.txt 301s, httpx follows the redirect (with follow_redirects=True) and lints the final body, but the linter doesn't flag the chain or warn about it. In a future version I'd want to lint the redirect chain itself — some crawlers refuse to follow more than 5 hops.

No network in tests. httpx has a MockTransport that lets you inject a handler function and intercept requests without binding a port. Every test that touches fetcher.load uses a mock transport. The test suite runs offline and in under a second.

Try it in 30 seconds

git clone https://github.com/sen-ltd/robots-lint
cd robots-lint
docker build -t robots-lint .

# Lint a fixture with several known issues:
docker run --rm -v "$(pwd)/tests/fixtures:/work" robots-lint sample.robots.txt

# Ask whether Googlebot can fetch a URL:
docker run --rm -v "$(pwd)/tests/fixtures:/work" robots-lint sample.robots.txt \
  --fail-on error --test Googlebot /private/public/page

# Show where Python's stdlib disagrees with the spec:
docker run --rm -v "$(pwd)/tests/fixtures:/work" robots-lint sample.robots.txt \
  --fail-on error --compare-parsers Googlebot /private/public/page

The last command is the one I ran maybe fifty times while writing this. The fixture sample.robots.txt has a Googlebot-specific group with Disallow: /private/ followed by Allow: /private/public/. RFC 9309 says /private/public/page is allowed (longer Allow wins). Python's stdlib says it is disallowed (first-match hit Disallow: /private/).

RFC 9309 matcher:   ALLOWED  (Allow: '/private/public/' matched (length 16, line 11))
urllib.robotparser: DISALLOWED
⚠ disagreement — RFC 9309 says allowed, stdlib says disallowed for Googlebot /private/public/page

That's the bug. Now you can spot it in your own files before Google does.

What I'd do differently next time

If I were starting over, I'd build the comparator first and the linter second. The linter is the sellable thing, but the comparator is the interesting thing — it turned out to be the cleanest way to explain RFC 9309 to myself. "Here are two implementations. When do they disagree? Why?" is a better teacher than any spec reading.

I'd also be more aggressive about the default exit code. The current default is --fail-on warning, which means a file with a single wildcard-ua warning fails CI. That's the right default for pre-deploy gating but it can bite you on legacy sites. Maybe --fail-on error by default with a --strict alias for warning.

For now, the tool does what I wanted it to do: it answers "will Googlebot fetch this URL?" correctly, it shows me where Python's stdlib is wrong, and it finds the kinds of mistakes that you don't notice until Search Console complains.

robots-lint is MIT-licensed, Dockerized, and has a 57-case test suite that encodes every edge case I could find in RFC 9309. Drop it in a pre-deploy CI step and stop guessing.
