@astrojs/sitemap generates /sitemap-0.xml not /sitemap-index.xml on small sites

What the smoke check found

This morning I ran a manual smoke check on all three Cloudflare Pages sites — aiappdex.com, findindiegame.com, ossfind.com — as part of investigating an unrelated robots.txt issue. Every robots.txt declared:

Sitemap: https://aiappdex.com/sitemap-index.xml

Standard declaration. But hitting the URL directly returned 404. Not a redirect, not a permissions error — 404. I checked all three domains. Same result everywhere.

The sites launched 2026-04-23. They've been live for 15 days. Every crawl that honored robots.txt and tried to fetch the sitemap got a dead end. I don't know how much this affected indexing velocity — I'll have actual crawl coverage data in 30 days — but it's the kind of silent failure that validates doing manual checks even when automated pipelines look green.

The file that did exist was /sitemap-0.xml. Every site had that, with proper XML structure and valid URLs. It just wasn't at the path anything was looking for.

Why @astrojs/sitemap generates /sitemap-0.xml

The @astrojs/sitemap integration uses a chunked output format inherited from the sitemaps spec. A sitemap index (/sitemap-index.xml) is a container file that references individual sitemap chunks. The chunks are numbered sequentially: /sitemap-0.xml, /sitemap-1.xml, and so on.
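
When the index does get generated, it's a small container file defined by the sitemaps.org protocol. A minimal one for a single-chunk site would look like this (host taken from the sites above):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://aiappdex.com/sitemap-0.xml</loc>
  </sitemap>
</sitemapindex>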

The index file only gets generated when you have enough URLs to need more than one chunk. The default chunk size is 45,000 URLs. If your site is below that threshold — which describes every small and medium site — the integration generates /sitemap-0.xml only, with no index wrapper.
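
The threshold is exposed as the integration's entryLimit option. A minimal config sketch showing where that knob lives (standard Astro boilerplate, with one of my sites as the value):

// astro.config.mjs
import { defineConfig } from 'astro/config';
import sitemap from '@astrojs/sitemap';

export default defineConfig({
  site: 'https://aiappdex.com',
  integrations: [
    // entryLimit defaults to 45000 URLs per chunk; a site under that
    // limit produces a single /sitemap-0.xml
    sitemap({ entryLimit: 45000 }),
  ],
});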

This is technically correct behavior. The spec allows a standalone sitemap file without an index. But the documentation examples and almost every "how to add a sitemap to Astro" tutorial tell you to put /sitemap-index.xml in robots.txt. That's the right declaration for large sites that use the multi-file format. For small sites, it points crawlers at a file that doesn't exist.

I set up the robots.txt files early on, copying the pattern from the integration docs. I never checked whether the generated build output matched those paths. That's on me.

The two-line Cloudflare Pages fix

The fastest fix for Cloudflare Pages is a _redirects file in public/. Cloudflare Pages processes this file at the edge; it supports transparent rewrites with status 200, where the content is served from the target path while the requested URL stays the same.

I added this to public/_redirects in all three apps:

/sitemap.xml         /sitemap-0.xml  200
/sitemap-index.xml   /sitemap-0.xml  200

After the next deploy, both /sitemap.xml and /sitemap-index.xml return the actual sitemap content. Crawlers following robots.txt now get a real file instead of 404.
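
A quick way to confirm once the deploy goes through; both paths should now answer 200 with XML content:

curl -I https://aiappdex.com/sitemap.xml
curl -I https://aiappdex.com/sitemap-index.xml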

One thing I got wrong: I assumed Cloudflare Pages would hot-update edge rules without a deploy. It doesn't. The _redirects file is baked into the deployment artifact, so you need to push and wait for the build pipeline. I lost about 4 minutes triple-checking whether the fix had taken effect before realizing I needed to trigger a build manually.

The rewrite approach is future-proof. If the sites eventually grow past 45,000 URLs, the integration will start generating a real sitemap-index.xml. At that point the _redirects rule becomes redundant — the real file exists at that path and Cloudflare serves it directly — and can be removed or left harmlessly in place.

How the IndexNow script amplified the problem

I added IndexNow URL submission to the CI pipeline earlier this week. The scripts/indexnow.mjs script runs after every article publish and after daily content refreshes. It fetches each site's sitemap, collects all URLs, and POSTs them to api.indexnow.org, which distributes to Bing, Yandex, Naver, and Seznam.
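
The submission half is simple. A rough sketch of that POST, assuming Node 18+ (global fetch); the env var name is made up for illustration and this is not the actual scripts/indexnow.mjs:

// sketch of the IndexNow submission step
const host = "aiappdex.com";
const urls = new Set();                // populated by the sitemap walk
const res = await fetch("https://api.indexnow.org/indexnow", {
  method: "POST",
  headers: { "Content-Type": "application/json; charset=utf-8" },
  body: JSON.stringify({
    host,
    key: process.env.INDEXNOW_KEY,     // hypothetical env var name
    urlList: [...urls],
  }),
});
console.log(`indexnow ${host}: HTTP ${res.status} for ${urls.size} urls`);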

The script tries /sitemap-index.xml first, then falls back to /sitemap.xml:

for (const path of ["/sitemap-index.xml", "/sitemap.xml"]) {
  await walk(`https://${host}${path}`);
  if (urls.size > 0) break;
}
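
walk isn't shown above; roughly, it fetches a sitemap, recurses into any index entries, and collects page URLs into a shared Set. A sketch of that shape (the regex parsing here is my illustration, not the real script):

const urls = new Set();

async function walk(sitemapUrl) {
  const res = await fetch(sitemapUrl);
  if (!res.ok) return; // the silent dead end: a 404 here meant zero URLs collected

  const xml = await res.text();
  // a sitemap index nests chunks under <sitemap><loc>; recurse into each
  for (const [, child] of xml.matchAll(/<sitemap>\s*<loc>([^<]+)<\/loc>/g)) {
    await walk(child.trim());
  }
  // a plain sitemap lists pages under <url><loc>
  for (const [, loc] of xml.matchAll(/<url>\s*<loc>([^<]+)<\/loc>/g)) {
    urls.add(loc.trim());
  }
}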

Both paths were returning 404 before the fix. The fallback logic never found URLs. Each IndexNow ping was logging "no urls in sitemap, skipping" and exiting with a nonzero code — but the article publish step had already succeeded, so CI reported green overall.

That's a subtle failure mode: the IndexNow step runs under if: success(), and errors inside the ping are swallowed so they don't fail the workflow. I did that intentionally — a failed ping shouldn't block article publishing — but it means a misconfigured sitemap produces silent no-ops in CI without alerting anyone. I'll add an explicit warning log when the collected URL count is zero, so it at least shows up visibly in the GitHub Actions output.
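
Something like this guard after URL collection; the message wording is just an example, and host/urls stand in for the script's own variables:

const host = "aiappdex.com";  // per-site, as elsewhere in the script
const urls = new Set();       // populated by the sitemap walk

if (urls.size === 0) {
  // a line printed as ::warning:: shows up as an annotation in the Actions UI
  console.log(`::warning::indexnow: no urls collected for ${host}, skipping ping`);
  process.exit(1); // still nonzero; the workflow is set up to swallow it
}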

What I'd do differently next time

The _redirects rewrite is fine as a hotfix. The proper long-term options are:

Change robots.txt to declare /sitemap-0.xml directly. Works immediately, but you'd need to update it if the site ever scales past one chunk and the integration starts generating an index naturally.

Add a custom sitemap redirect in astro.config.mjs. Astro supports static redirect routes; putting the fix at the framework level keeps it out of hosting-specific config, though it has the same maintenance caveat.

Generate a real index file unconditionally post-build. There's no built-in option in @astrojs/sitemap to force index generation for small sites. You'd need a custom Vite plugin or build script that wraps /sitemap-0.xml in a minimal index file (sketched below). This is the cleanest but takes more careful work — it's in my backlog.
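
For that third option, the shape I have in mind is a tiny post-build step that writes an index only when the integration didn't. A rough sketch; the file name, output path, and hard-coded site URL are all assumptions:

// scripts/wrap-sitemap-index.mjs (hypothetical), run after `astro build`
import { access, writeFile } from "node:fs/promises";

const site = "https://aiappdex.com"; // should match `site` in astro.config.mjs
const out = "dist/sitemap-index.xml";

try {
  await access(out); // the integration already wrote a real index; leave it alone
} catch {
  const xml = `<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>${site}/sitemap-0.xml</loc></sitemap>
</sitemapindex>`;
  await writeFile(out, xml);
  console.log(`wrote minimal ${out}`);
}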

For right now, two lines of _redirects config across three apps is the right call. The sites are 15 days old, traffic is effectively zero, and I'd rather spend the next hour on something that moves the needle more than sitemap plumbing.

The operational lesson is simple: after any deploy, actually fetch the URLs that robots.txt declares. If a path returns 404, you have this problem. It takes 30 seconds to verify with curl -I and is easy to miss entirely if you're only checking build logs.
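
Scripted, that lesson is a few lines of shell: pull every Sitemap: line out of robots.txt and print the status code of whatever it declares:

for host in aiappdex.com findindiegame.com ossfind.com; do
  curl -s "https://$host/robots.txt" | awk 'tolower($1) == "sitemap:" { print $2 }' |
  while read -r url; do
    printf '%s %s\n' "$(curl -s -o /dev/null -w '%{http_code}' "$url")" "$url"
  done
done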

I'll publish actual crawl coverage numbers at the 30-day mark. I don't know yet whether 15 days of 404 sitemaps meaningfully delayed indexing, or whether Googlebot discovered pages through other signals anyway. That's what the data will show.

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.
