DEV Community

SEN LLC

Writing a Polite Crawler: sitemap-gen for the Sites SSGs Forgot

A small async CLI that crawls a site and emits a valid sitemap.xml. Written for legacy sites, hand-maintained CMSes, and the long tail of sites that don't have a build-time sitemap plugin. The interesting part wasn't the XML — it was deciding how to be polite.

📦 GitHub: https://github.com/sen-ltd/sitemap-gen

The problem nobody talks about

If you work with modern static-site generators, sitemaps are a solved problem. Astro has @astrojs/sitemap. Next.js has next-sitemap. Hugo emits one by default. You never think about it.

But there is a long tail of sites where sitemaps are not a solved problem:

  • A WordPress install whose SEO plugin stopped working three years ago.
  • A PHP site someone's uncle built in 2011 and that still quietly serves the leads.
  • A folder of static HTML that lives in an S3 bucket with no build step.
  • A staging environment where you need a sitemap right now for a launch checklist because marketing just found out you don't have one.

For any of these you have three options: pay for Screaming Frog, pay for a SaaS crawler that rate-limits you behind a plan tier, or write one yourself. I wrote one myself — sitemap-gen, a single-file-ish async Python CLI that crawls a site and emits a valid sitemap.xml in under 30 seconds for a typical small site.

The code is short. The interesting design question was: how do you crawl a site without being a jerk about it? That's what this post is really about.

The contract

Before writing any code I wrote down what the tool must and must not do:

  • Must respect robots.txt by default. Opt-out via a flag, but the default has to be polite.
  • Must stay on the seed origin. No drift to external sites.
  • Must hard-cap both page count and BFS depth. No infinite loops.
  • Must rate-limit itself per-host. No accidental DoS.
  • Must emit XML that matches the actual sitemaps.org schema search engines consume.
  • Must not pull in a heavy parser like BeautifulSoup. Stdlib only, except for httpx.
  • Must not hit the real network in tests.

That last point shaped a lot of the code. Every side-effect path is injectable so tests can wire up an httpx.MockTransport and drive a fake site deterministically.

The module shape

The crawler is six small modules:

```
src/sitemap_gen/
├── url_utils.py   # pure: normalization, origin matching, priority maps
├── extractor.py   # pure: HTML -> list of absolute URLs + noindex flag
├── robots.py      # async: robots.txt loader with per-origin cache
├── crawler.py     # async: BFS loop with rate limit + hard caps
├── emitter.py     # pure: urlset -> xml / json / text
└── cli.py         # argparse glue
```

The url_utils, extractor, and emitter modules are pure. No I/O, no async. Each takes bytes or strings and returns bytes or strings. That's 80% of the test surface and they run in a few milliseconds.

URL normalization: the most boring bug surface

Half of your crawler bugs will be URL normalization bugs. Not deep ones — trivial, "did you trim the fragment?" bugs that make you re-crawl the same page three ways.

Here's the normalization function in full:

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit

def normalize_url(url: str, *, keep_query: bool = True) -> str:
    url, _ = urldefrag(url)                       # drop #section
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    port = parts.port
    if port is not None and (
        (scheme == "http" and port == 80)
        or (scheme == "https" and port == 443)
    ):
        port = None                                # drop default ports

    netloc = host
    if port is not None:
        netloc = f"{netloc}:{port}"

    path = parts.path or "/"
    while "//" in path:                            # collapse //
        path = path.replace("//", "/")

    query = parts.query if keep_query else ""
    return urlunsplit((scheme, netloc, path, query, ""))
```

A few decisions worth calling out:

  1. Drop the fragment. #section never makes a URL a different URL from the server's point of view. Same content, same sitemap entry.
  2. Lowercase scheme and host, but not the path. HTTP://Example.COM/AboutMe and http://example.com/AboutMe are the same URL, but http://example.com/AboutMe and http://example.com/aboutme are not. Many servers serve them differently.
  3. Don't drop the trailing slash. On nginx, /blog and /blog/ can resolve to different pages depending on try_files. Conflating them is a common source of broken sitemaps.
  4. Keep the query string by default. This is a calculated risk — query strings can explode your URL space — but dropping them conflates ?page=1, ?page=2, etc. into one entry, which is worse for search engines than a few extra rows.

The fragment and query decisions go the opposite way in different crawlers. sitemap-gen errs on the side of "more entries is better than wrong entries", and lets you trim with --exclude.

Same-origin: www.example.com is not example.com

This one catches almost every crawler author. The naive check is "same host". The correct check is:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Origin:
    scheme: str
    host: str
    port: int | None

    def matches(self, other: "Origin") -> bool:
        return (
            self.scheme == other.scheme
            and self.host == other.host
            and self.port == other.port
        )
```

www.example.com and example.com are different hosts. If the user seeds the crawl at the apex but the site links to www. internally, a naive crawler will follow the link and start emitting www. URLs into what was supposed to be an apex sitemap. That's a real SEO footgun: Google will see the two sets of URLs, pick one as canonical, and ignore the other.

sitemap-gen is strict by default. If you want the www-collapsing behavior, pass --canonical and it'll strip the www. prefix before comparing:

```python
def strip_www(self) -> "Origin":
    host = self.host[4:] if self.host.startswith("www.") else self.host
    return Origin(self.scheme, host, self.port)
```

This is a small helper but it documents the intent clearly — if Origin.strip_www() is ever called in a hot loop, grep finds it immediately.

Respecting robots.txt without drama

Python's stdlib has urllib.robotparser.RobotFileParser. It's completely adequate, but it's synchronous. The usual advice for dealing with sync I/O in an async program is "run it in a thread pool", but we don't even need that — robots.txt is a single file fetch per origin, then it's cached in memory forever. We can use httpx to fetch the bytes async and then feed them into the parser:

```python
from urllib.robotparser import RobotFileParser

import httpx

class RobotsCache:
    def __init__(self, client: httpx.AsyncClient, user_agent: str) -> None:
        self._client = client
        self._ua = user_agent
        self._cache: dict[str, RobotFileParser] = {}

    async def allowed(self, url: str) -> bool:
        key = self._origin_key(url)
        parser = self._cache.get(key)
        if parser is None:
            parser = await self._fetch(key)
            self._cache[key] = parser
        return parser.can_fetch(self._ua, url)
```

The interesting decision is what to do on failure. RFC 9309 says a 4xx from robots.txt means "no restrictions" and a 5xx means "full disallow until fixed". We follow the 4xx rule strictly but take liberties with 5xx: treating every 5xx as a full disallow means one transient blip in the origin's infrastructure aborts the whole crawl, which is worse than occasionally crawling a page the owner wanted restricted. The liberty is documented, and flaggable if someone disagrees.

One thing people don't think about: don't treat your own failure to parse robots.txt as denial. If the file is malformed, permissive-default is kinder than crawl-abort.
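The whole failure policy fits in a small factory around the stdlib parser. `parser_from_response` is a hypothetical helper sketching the rules above, not the tool's actual API:

```python
from urllib.robotparser import RobotFileParser

def parser_from_response(status: int, body: str) -> RobotFileParser:
    # Hypothetical helper illustrating the failure policy described above.
    rp = RobotFileParser()
    if status >= 400:
        # 4xx per RFC 9309: no restrictions. 5xx: the documented liberty,
        # treating flaky infra as permissive rather than aborting the crawl.
        rp.allow_all = True
        return rp
    try:
        rp.parse(body.splitlines())
    except Exception:
        rp.allow_all = True  # malformed robots.txt: permissive, not denial
    return rp
```

Collapsing every failure mode into "permissive" keeps the decision table small enough to state in one README paragraph.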

BFS with real limits

The crawl loop is a bog-standard BFS, with one wrinkle: it batches up to concurrency items at a time and dispatches them as a group before popping the next batch. This keeps depth ordering mostly honest without the complexity of a full per-level barrier.

```python
while queue and len(visited) < config.max_pages:
    batch: list[_QueueItem] = []
    while (
        queue
        and len(batch) < config.concurrency
        and (len(visited) + len(batch)) < config.max_pages
    ):
        batch.append(queue.popleft())
    if not batch:
        break

    tasks = [asyncio.create_task(_fetch_one(item)) for item in batch]
    for task in asyncio.as_completed(tasks):
        item, resp, skip = await task
        visited.add(item.url)
        # ... process response, enqueue children at depth+1 ...
```

Three hard limits are always active:

  1. max_pages — total pages visited. Default 5000. This is the one that saves you from the calendar archive trap.
  2. max_depth — BFS depth from the seed. Default 10. This is the one that saves you from /page/1, /page/2, ..., /page/9999 pagination hell.
  3. concurrency — max in-flight requests. Default 8. This is the one that saves you from the hosting company's rate limiter.
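The per-host politeness behind the concurrency cap can be sketched as an async context manager that combines a semaphore with a minimum delay between request starts. `HostThrottle` and its defaults are illustrative, not the crawler's real limiter:

```python
import asyncio
import time

class HostThrottle:
    # Hypothetical sketch: cap in-flight requests and space out request
    # starts to the same host by at least min_delay seconds.
    def __init__(self, concurrency: int = 8, min_delay: float = 0.1) -> None:
        self._sem = asyncio.Semaphore(concurrency)
        self._lock = asyncio.Lock()
        self._min_delay = min_delay
        self._next_ok = 0.0  # monotonic time of the next allowed start

    async def __aenter__(self) -> "HostThrottle":
        await self._sem.acquire()
        async with self._lock:
            now = time.monotonic()
            wait = self._next_ok - now
            self._next_ok = max(now, self._next_ok) + self._min_delay
        if wait > 0:
            await asyncio.sleep(wait)
        return self

    async def __aexit__(self, *exc) -> None:
        self._sem.release()
```

Wrapping each fetch in `async with throttle:` means the BFS loop itself never has to know about timing; the limits live in one place.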

The calendar archive trap

Old blogs have calendar widgets. "< Older" and "Newer >" navigation. "Posts from September 2003". If your crawler follows every "< Older" link, and the site keeps generating older archive pages all the way back to 1970, you will crawl forever.

The trap is that these pages are not identical, so URL deduping doesn't save you. Every month is a new path. You need either a depth limit, a page-count limit, or a pattern-based skip. sitemap-gen gives you all three: --max-depth, --max-pages, and --exclude '/archive/*'.

The depth limit alone catches most of these. If you're 10 hops deep from the seed, you are almost certainly down a long tail the user doesn't care about.
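A pattern skip like `--exclude '/archive/*'` can be as simple as `fnmatch` over the URL path. This is a sketch of the idea; `is_excluded` is invented here and the real flag's semantics may differ:

```python
import fnmatch
from urllib.parse import urlsplit

def is_excluded(url: str, patterns: list[str]) -> bool:
    # Match shell-style patterns against the path component only,
    # so the host and query string never affect exclusion.
    path = urlsplit(url).path
    return any(fnmatch.fnmatch(path, pattern) for pattern in patterns)
```

Note that `fnmatch`'s `*` matches across `/`, which is exactly what you want for archive trees of arbitrary depth.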

Filtering: what doesn't belong in a sitemap

Four categories of pages get fetched but not emitted:

  • Non-HTML responses. Content-Type filtering. If the server says application/pdf, we don't want it in the sitemap even though the crawler dutifully fetched it.
  • 4xx/5xx responses. Broken. Putting broken pages in a sitemap is an active negative signal to search engines.
  • <meta name="robots" content="noindex">. The page literally asked not to be indexed. Rude to override that.
  • Redirects off-origin. If a link goes to /out?url=https://evil.com and the server redirects us there, we're no longer on the user's site. Drop it.
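The Content-Type check is a one-liner once you strip parameters like the charset. `is_html` is illustrative of the rule, assuming the crawler treats XHTML as HTML too:

```python
def is_html(content_type: str) -> bool:
    # Keep only the media type, dropping parameters such as ";charset=utf-8".
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in ("text/html", "application/xhtml+xml")
```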

The noindex check is in the extractor:

```python
def handle_starttag(self, tag: str, attrs):
    attr_map = {key: (value or "") for key, value in attrs}
    # ... <a href> extraction elided ...
    if tag == "meta":
        name = attr_map.get("name", "").lower()
        if name in ("robots", "googlebot"):
            content = attr_map.get("content", "").lower()
            tokens = [t.strip() for t in content.split(",")]
            if "noindex" in tokens or "none" in tokens:
                self.noindex = True
```

The "none" token is equivalent to "noindex, nofollow" per the robots meta tag spec. Almost nobody uses it, but if you're building a tool that claims to respect robots directives, you should respect all of them.

Emitting valid XML

xml.etree.ElementTree has been in the stdlib forever and is exactly what you need for sitemap.xml:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def emit_xml(entries: list[UrlEntry]) -> str:
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for entry in entries:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = entry.loc
        if entry.lastmod is not None:
            ET.SubElement(url_el, "lastmod").text = entry.lastmod
        if entry.changefreq is not None:
            ET.SubElement(url_el, "changefreq").text = entry.changefreq
        if entry.priority is not None:
            pri_str = f"{entry.priority:.2f}".rstrip("0").rstrip(".")
            ET.SubElement(url_el, "priority").text = pri_str or "0"

    ET.indent(urlset, space="  ")
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"
```

A few notes:

  • ET.indent landed in Python 3.9 and formats the tree in place. Before that you had to pretty-print by hand; now you don't.
  • The namespace matters. Google and Bing both refuse sitemaps that don't declare xmlns="http://www.sitemaps.org/schemas/sitemap/0.9". Round-tripping the output through ET.fromstring in the tests catches this.
  • Priority formatting. The spec wants one decimal, ideally. 0.5, not 0.50. rstrip("0").rstrip(".") is an ugly but correct way to trim trailing zeros without switching to decimal.
  • XML declaration prepended manually. ET.tostring can emit one (xml_declaration=True), but it formats the declaration with single quotes and the output is less diffable; prepending the canonical double-quoted declaration is simpler.

The tradeoffs I chose not to solve

A few things sitemap-gen deliberately does not do, because each would be a second project:

  • JavaScript rendering. If your site is a React SPA that boots empty and fills itself in, this crawler will see an empty page. The only honest solution is a headless browser (Playwright, Puppeteer), and that's a 10x jump in complexity and image size. Out of scope.
  • <link rel="canonical"> detection. The right way to handle canonicals is: fetch the page, find the canonical, and emit that URL instead. It'd be a nice addition; today the tool just emits whatever URL the crawler arrived at after redirects.
  • Sitemap index files. If you have more than 50,000 URLs you need multiple sitemap files and an index. This tool is for small-to-medium sites. Bigger sites need a different design.
  • Image / video / news extensions. Valid per sitemaps.org spec but rarely needed. Easy to bolt on if someone asks.

Documenting the limits matters as much as documenting the features. The failure mode of "users assume it does something it doesn't" is worse than "users read the README and reach for a bigger tool".

Try it in 30 seconds

```shell
docker build -t sitemap-gen https://github.com/sen-ltd/sitemap-gen.git
docker run --rm -v "$PWD":/work sitemap-gen https://example.com/ --out sitemap.xml -v
```

You'll have a schema-valid sitemap.xml in your cwd. The image is multi-stage Alpine, non-root, under 90 MB.

The whole project is about 700 lines of Python plus tests. The part that took the longest was not the code — it was deciding the right defaults for depth, rate, and robots behavior. Most of the code exists to keep the polite defaults easy to use and the rude overrides explicit.

If you're shipping a sitemap generator, the real feature is the restraint.
