Building a Sitemap Health Checker: Discovery, Index Files, Parallel Sampling

We built the Sitemap Checker around a budget before we wrote a line of parsing code. The tool is free, it takes anonymous users with no signup, and someone pasting a domain expects an answer in under a minute, not a progress bar they walk away from. That single constraint decided most of the architecture, because it ruled out the obvious approach.

The obvious approach is to crawl. Fetch the sitemap, request every URL it lists, report what is broken. That works fine for a 40-page brochure site and falls apart everywhere else. A sitemap can hold up to 50,000 URLs per file, and an index can point at many such files. Requesting all of them, even fast, blows the time budget and hammers the target server with traffic it did not ask for. So full crawls were off the table by design, not as a limitation we apologize for. The interesting work was figuring out what you can learn about a sitemap's health without reading every page in it.

Finding the sitemap before you can check it

You cannot check a file you cannot locate, and there is no single guaranteed address for a sitemap. So discovery runs as an ordered chain, and the order encodes which source to trust.

First we read robots.txt and look for a Sitemap: directive. This is the authoritative answer. If the site owner declared where their sitemap lives, that declaration wins, because they know something a convention does not: where they actually put it. A Sitemap: line can point anywhere, including a path you would never guess.

If robots.txt names nothing, we fall back to the two conventional locations in turn: /sitemap.xml, then /sitemap_index.xml. These are guesses, but they are good guesses, because most generators emit one or the other by default.

The order matters because the alternative gives wrong answers. If you checked /sitemap.xml first and found a stale leftover, you would report on the wrong file while the real one sat declared in robots.txt, unread. Declared beats conventional, conventional beats nothing, and the chain stops at the first hit.
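The chain above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function names and the `fetch` helper are hypothetical, and `fetch` is assumed to return the response body for a 200 and None for anything else.

```python
from typing import Callable, Optional
from urllib.parse import urljoin

# Conventional fallback paths, tried in this order.
FALLBACK_PATHS = ["/sitemap.xml", "/sitemap_index.xml"]


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return every URL declared on a 'Sitemap:' line, in file order."""
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split on the first colon only,
        # because the URL itself contains colons.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def discover_sitemap(
    root: str, fetch: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Declared beats conventional; the chain stops at the first hit.

    `fetch` is a hypothetical helper: body text for a 200, else None.
    """
    robots = fetch(urljoin(root, "/robots.txt"))
    if robots:
        declared = sitemaps_from_robots(robots)
        if declared:
            return declared[0]  # assumption: use the first declared sitemap
    for path in FALLBACK_PATHS:
        candidate = urljoin(root, path)
        if fetch(candidate) is not None:
            return candidate
    return None
```

Passing `fetch` in as a parameter keeps the ordering logic testable without touching the network: a dictionary's `get` method is enough to stand in for a site.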

Two XML shapes, one of them recursive

Once you have a sitemap, it is one of two shapes, and you do not know which until you parse the root element.

A <urlset> is a flat sitemap: a list of <url> entries, each with a <loc> and optionally <lastmod>, <changefreq>, and <priority>. You read the entries and you are done.

A <sitemapindex> is an index: a list of <sitemap> entries, each pointing at a child sitemap that is itself a <urlset>. To get the actual URLs you have to fetch the children and parse each one. An index can list dozens of children, and fetching all of them serially blows the budget.

So we cap it: for an index, we fetch up to five child sitemaps and parse those. Five is enough to characterize the sitemap's health and freshness without turning a one-minute check into a five-minute one. A site whose first five child sitemaps are clean and a site whose first five are full of broken entries are telling you very different things, and you do not need to read all forty children of a large index to know which you are looking at. From whichever shape we land on, we extract loc, lastmod, changefreq, and priority per URL, which is everything the later checks run on.
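A minimal sketch of the shape detection and the five-child cap, using Python's standard library parser. It assumes the file declares the standard sitemaps.org namespace; the function name and return shape are illustrative, not the tool's own.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemap uses the standard sitemaps.org 0.9 namespace.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_CHILDREN = 5  # cap on child sitemaps fetched from an index
FIELDS = ("loc", "lastmod", "changefreq", "priority")


def parse_sitemap(xml_text: str) -> tuple[str, list]:
    """Return ("urlset", entries) or ("sitemapindex", child_urls)."""
    root = ET.fromstring(xml_text)
    if root.tag == NS + "urlset":
        entries = []
        for url in root.findall(NS + "url"):
            # findtext returns None when an optional element is missing.
            raw = {field: url.findtext(NS + field) for field in FIELDS}
            entries.append({k: v.strip() if v else None for k, v in raw.items()})
        return "urlset", entries
    if root.tag == NS + "sitemapindex":
        locs = [s.findtext(NS + "loc") for s in root.findall(NS + "sitemap")]
        children = [loc.strip() for loc in locs if loc]
        # Keep only the first few children; the caller fetches and parses these.
        return "sitemapindex", children[:MAX_CHILDREN]
    raise ValueError(f"unexpected root element: {root.tag}")
```

For an index, the caller would fetch each returned child URL and run `parse_sitemap` on it again, collecting the entries from every child that comes back as a urlset.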

Sampling instead of crawling

The liveness check is where the budget constraint becomes a statistics decision.

We do not request every URL. We take a random sample of min(50, total) URLs and check those. Fifty for any sitemap large enough to have fifty; the whole thing if it is smaller. Each sampled URL gets a HEAD request, run through a pool of ten workers in parallel, so the fifty checks finish in roughly the time of five requests made back to back, not the sum of all fifty.

HEAD rather than GET because we only want the status code, not the body. We sort each response into one of three buckets: healthy for a 2xx, redirected for a 3xx, broken for a 4xx or 5xx. The broken ones come back with their status codes attached, because "broken" alone is not actionable and "404 on /old-page" is.
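The sampling and the worker pool might look like the sketch below. The names are hypothetical and two details are assumptions the article does not specify: a five-second timeout per request, and counting network failures (no status code at all) as broken. Redirects are deliberately not followed, so a 3xx is visible as a 3xx.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Optional
from urllib.error import HTTPError, URLError
from urllib.request import HTTPRedirectHandler, Request, build_opener

SAMPLE_CAP = 50
WORKERS = 10


class _NoRedirect(HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # do not follow; the 3xx surfaces as an HTTPError


_OPENER = build_opener(_NoRedirect)


def head_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Send one HEAD request; return the status code, or None on network failure."""
    try:
        with _OPENER.open(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:  # 3xx, 4xx and 5xx arrive here with a code
        return err.code
    except (URLError, OSError):  # DNS failure, refused connection, timeout
        return None


def classify(status: Optional[int]) -> str:
    if status is not None and 200 <= status < 300:
        return "healthy"
    if status is not None and 300 <= status < 400:
        return "redirected"
    return "broken"  # 4xx, 5xx, and (assumption) anything without a status


def check_sample(urls: list, probe=head_status, rng=random) -> dict:
    """Probe a random sample of min(50, total) URLs with ten parallel workers."""
    sample = rng.sample(urls, min(SAMPLE_CAP, len(urls)))
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        statuses = list(pool.map(probe, sample))
    report = {"healthy": [], "redirected": [], "broken": []}
    for url, status in zip(sample, statuses):
        report[classify(status)].append((url, status))
    return report
```

Each bucket holds (url, status) pairs, so the broken list carries its status codes with it. The `probe` parameter exists so the pool and bucketing can be exercised with a fake checker instead of live requests.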

To be honest about the framing: a random fifty-URL sample estimates the broken rate well enough to assign a health grade. If 4 of 50 sampled URLs are broken, the file is in a different state than one where 0 of 50 are, and that difference is what a grade should capture. What the sample does not do is enumerate every broken URL in a 50,000-entry file. It is a measurement of the rate, not a complete inventory of the failures, and we say so rather than implying the broken list is exhaustive.
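To put rough numbers on how much a fifty-URL sample pins down the rate, here is a sketch using the Wilson score interval at 95% confidence. The choice of interval is ours for illustration; the article does not say which method, if any, the tool uses.

```python
import math


def wilson_interval(broken: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for the true broken rate, given a sample."""
    p = broken / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

With 4 broken out of 50, the interval runs from roughly 3% to 19%. With 0 out of 50, it runs from 0 to about 7%. Those ranges are wide, which is the point of calling the result a rate estimate suitable for a grade rather than an inventory.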

Servers that lie to HEAD requests

Here is the mess the clean version of this story leaves out: not every server answers a HEAD request honestly.

A HEAD is supposed to return exactly what a GET would, minus the body, so the status code should be identical. Plenty of servers do not implement it that way. Some reject HEAD outright with a 405, some return a 403, some behave differently than they would for the GET that a real visitor sends. So a URL that serves a perfectly good page to a browser can come back as broken in our sample purely because the server mishandles the method we used to probe it. That is a false negative for liveness: a live URL fails the check. We cannot fully eliminate it without sending full GETs, which puts us back over the time budget.

What saves the grade from this is the scoring shape. We never let a single failed check zero anything, because the score is built on the broken ratio, not a broken count. The liveness contribution scales as one minus the broken ratio. A couple of false negatives in a fifty-URL sample nudge the ratio slightly; they do not collapse the score. A sitemap that is genuinely fine but lives behind a HEAD-hostile server loses a few points, not the whole grade. The math degrades gracefully on purpose, because the failure mode we most wanted to avoid was confidently telling someone their healthy site is broken.

The rubric, in full

The score is additive to 100, and every component maps to something the checks above actually measured:

Sitemap found: +20. The file exists and was located. Discoverability is the precondition for everything else, so it earns real points on its own.
lastmod coverage: up to +20 (coverage × 20). The fraction of URLs carrying a real last-modified date, scaled to 20. Full coverage is the full 20; half coverage is 10.
Has date info: +15. A flat bonus when the file carries date metadata at all, separate from how complete that coverage is.
Liveness: up to +20 ((1 minus broken ratio) × 20). A clean sample earns the full 20; the score scales down with the broken rate rather than dropping off a cliff.
No duplicates: +10. No URL listed more than once.
Within the 50,000-URL size limit: +15. No single file exceeds the protocol's per-file URL ceiling.

That sums to 100: 20 + 20 + 15 + 20 + 10 + 15. The choice worth defending is the +20 for simply being found. It looks generous for a check that did no real work yet. But discovery is the hardest gate to clear: a sitemap a crawler cannot find is worth zero regardless of how pristine its contents are. Awarding points for "found at all" reflects that the rest of the rubric is meaningless until this passes, so it deserves weight, not a footnote.
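The rubric translates almost directly into a scoring function. The sketch below is an illustration under stated assumptions: the `SitemapStats` fields are hypothetical names, "has date info" is taken to mean at least one URL carries a lastmod, an unfound sitemap scores zero overall, and an empty sample earns no liveness points.

```python
from dataclasses import dataclass

URL_LIMIT = 50_000  # protocol ceiling on URLs per sitemap file


@dataclass
class SitemapStats:
    found: bool
    total_urls: int
    urls_with_lastmod: int
    sampled: int            # URLs probed in the liveness sample
    broken: int             # of those, how many came back 4xx/5xx
    has_duplicates: bool
    max_urls_in_file: int   # largest single file that was parsed


def score(stats: SitemapStats) -> float:
    if not stats.found:
        return 0.0  # assumption: nothing else is scored without a sitemap
    total = 20.0  # sitemap found
    if stats.total_urls > 0:
        total += 20.0 * stats.urls_with_lastmod / stats.total_urls
    if stats.urls_with_lastmod > 0:
        total += 15.0  # flat bonus for carrying any date metadata
    if stats.sampled > 0:
        # Ratio, not count: a few failures shave points instead of zeroing the term.
        total += 20.0 * (1 - stats.broken / stats.sampled)
    if not stats.has_duplicates:
        total += 10.0
    if stats.max_urls_in_file <= URL_LIMIT:
        total += 15.0
    return total
```

A perfect file scores 100. The same file with 2 of 50 sampled URLs misreported as broken by a HEAD-hostile server scores 99.2, which is the graceful degradation the ratio-based liveness term is meant to provide.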

Try it

The Sitemap Checker is free and takes a root domain, no signup. You get the totals, the lastmod coverage, the freshest and stalest entries, the sampled broken list with status codes, and the score broken down by the components above. The companion on the permission side is the AI Crawler Checker, which reads robots.txt to tell you which AI bots are allowed in at all.

If you build sitemap tooling yourself, the one decision we would press on is the sampling cap. It is tempting to treat fifty as a placeholder and crank it up to "be thorough." Resist that. The reason fifty holds up is that you are estimating a rate, and the precision of a rate estimate improves only with the square root of the sample size, so going from 50 to 500 buys you a factor of about three in precision for ten times the requests and ten times the load on someone else's server. The honest version of thorough is a sample sized to the question you are answering, not the largest number your timeout allows. We picked fifty and a one-minute budget on purpose, and the rubric was designed to mean something inside those limits rather than pretend they do not exist.
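The square-root claim is easy to check with the standard error of a sampled proportion. This sketch uses an 8% broken rate purely as an example value, matching the 4-of-50 case discussed earlier; the function names are illustrative.

```python
import math


def standard_error(rate: float, n: int) -> float:
    """Standard error of a proportion estimated from n independent samples."""
    return math.sqrt(rate * (1 - rate) / n)


def precision_gain(rate: float, n_small: int, n_large: int) -> float:
    """How many times tighter the estimate gets when the sample grows."""
    return standard_error(rate, n_small) / standard_error(rate, n_large)
```

At an 8% rate the standard error is about 3.8 percentage points with 50 samples and about 1.2 with 500, so `precision_gain(0.08, 50, 500)` comes out at about 3.16, the square root of ten, whatever rate you plug in.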

Mehul Jain is an AI entrepreneur and product builder. He works on Geology, a GEO platform.