How to Fix "Couldn't Fetch" in Google Search Console for a Next.js Sitemap

For about a week, Google Search Console showed "Couldn't fetch" on my sitemap. The file loaded fine in a browser, returned 200 OK to curl, validated as XML, and Googlebot's user agent saw the same response. Nothing was wrong. Except something was clearly wrong, because pages weren't being discovered.

The actual problem turned out to have nothing to do with my server, my hosting, or my XML structure. It was a single category of characters inside the URLs themselves. Here's the full debugging path, because I wish someone had written this post a week ago.

The setup

I'm running YumCha, a Cantonese learning site. The web side is Next.js 16 with PayloadCMS (Postgres) and a 124,000-entry Cantonese dictionary. Each dictionary entry is its own page at /cantonese-dictionary/[slug], so the sitemap is genuinely large.

Initial sitemap implementation used the Next.js metadata route convention:

// src/app/sitemap.ts
import type { MetadataRoute } from "next";
import { getDictionaryEntries } from "@/lib/dictionary";

export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  const entries = await getDictionaryEntries();
  return entries.map((e) => ({
    url: `https://www.yumcha.fun/cantonese-dictionary/${e.slug}`,
    changeFrequency: "monthly",
    priority: 0.5,
  }));
}

That returned ~50,000 URLs in a single file. Google's per-file limit is 50,000 URLs, so I was right at the edge. Time to split.

Attempt 1: generateSitemaps()

Next.js supports splitting a sitemap natively:

export async function generateSitemaps() {
  const count = await getDictionaryCount();
  const chunks = Math.ceil(count / 40000); // 40k URLs per chunk, safely under the 50k limit
  // id 0 is the core sitemap; ids 1..chunks are dictionary chunks
  return Array.from({ length: 1 + chunks }, (_, i) => ({ id: i }));
}

export default async function sitemap({ id }: { id: number }) {
  if (id === 0) return coreSitemap();
  return dictionaryChunk(id - 1);
}

This generates /sitemap/0.xml, /sitemap/1.xml, etc. — and according to the docs, an index at /sitemap.xml that lists them all.

Except the index didn't exist. Sub-sitemaps returned 200 fine. The index returned 404. Verified locally with next start after a clean build: /sitemap.xml is genuinely missing when you use generateSitemaps() together with force-dynamic. (The behavior may also affect static mode in Turbopack — I observed 404 in both cases on Next.js 16.2.3.)

I tried adding a route handler at app/sitemap.xml/route.ts to manually serve the index. Build error:

Conflicting route and metadata at /sitemap.xml: route at /sitemap.xml/route and metadata at /sitemap.xml/route

Next.js reserves the path even though it doesn't actually serve it. The metadata route claims the URL but only fills sub-sitemap slots, leaving the index empty.

Attempt 2: full manual control

I deleted app/sitemap.ts and replaced it with two route handlers:

app/sitemap.xml/route.ts           sitemap index
app/sitemap/[id]/route.ts          individual sub-sitemaps

Plus a small XML helper library so both routes share generation logic:

// src/lib/sitemap-xml.ts
export interface UrlEntry {
  url: string;
  lastmod?: Date;
  priority?: number;
}

// Standard XML entity escaping, per the sitemap protocol's escaping table
export function xmlEscape(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

export function urlsetXml(urls: UrlEntry[]): string {
  const entries = urls
    .map((u) => {
      const parts = [`    <loc>${xmlEscape(u.url)}</loc>`];
      if (u.lastmod) parts.push(`    <lastmod>${u.lastmod.toISOString()}</lastmod>`);
      if (u.priority !== undefined) parts.push(`    <priority>${u.priority}</priority>`);
      return `  <url>\n${parts.join("\n")}\n  </url>`;
    })
    .join("\n");
  return `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
${entries}
</urlset>`;
}
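The index handler itself can stay tiny. A minimal sketch, assuming the same getDictionaryCount helper and 40,000-URL chunk size as Attempt 1; the real handler may differ:

// src/app/sitemap.xml/route.ts (sketch, not the exact production code)
import { getDictionaryCount } from "@/lib/dictionary";

export const dynamic = "force-dynamic";

export async function GET() {
  const count = await getDictionaryCount();
  const chunks = Math.ceil(count / 40000);
  // One <sitemap> entry per sub-sitemap: id 0 is core, 1..chunks are dictionary
  const refs = Array.from(
    { length: 1 + chunks },
    (_, i) => `  <sitemap>\n    <loc>https://www.yumcha.fun/sitemap/${i}.xml</loc>\n  </sitemap>`,
  ).join("\n");
  const xml = `<?xml version="1.0" encoding="UTF-8"?>\n<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${refs}\n</sitemapindex>`;
  return new Response(xml, { headers: { "Content-Type": "application/xml" } });
}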

The dynamic route accepts both /sitemap/0 and /sitemap/0.xml:

const idStr = raw.replace(/\.xml$/, "");
const id = Number(idStr);
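In context, the whole handler looks roughly like this. A sketch: coreSitemap and dictionaryChunk are the same helpers from Attempt 1, assumed here to return UrlEntry[]; their module path and the 404 guard are my assumptions:

// src/app/sitemap/[id]/route.ts (sketch)
import { urlsetXml } from "@/lib/sitemap-xml";
import { coreSitemap, dictionaryChunk } from "@/lib/sitemap-data"; // hypothetical path

export async function GET(
  _req: Request,
  { params }: { params: Promise<{ id: string }> }, // params are async in recent Next.js
) {
  const raw = (await params).id;
  const idStr = raw.replace(/\.xml$/, "");
  const id = Number(idStr);
  if (!Number.isInteger(id) || id < 0) {
    return new Response("Not found", { status: 404 });
  }
  const urls = id === 0 ? await coreSitemap() : await dictionaryChunk(id - 1);
  return new Response(urlsetXml(urls), {
    headers: { "Content-Type": "application/xml" },
  });
}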

Build succeeded. /sitemap.xml returned 200 with a proper sitemap index. Sub-sitemaps returned 200 with valid urlset XML. xmllint validated everything. Googlebot's user agent saw the same. I deployed.

GSC still said "Couldn't fetch."

The real culprit

Out of frustration I started looking at the URLs themselves. Of the 71,588 dictionary URLs in the sitemap at the time, most were clean slugs like /cantonese-dictionary/jat1-one-or-two-10. But running the file through grep -P '[^\x00-\x7F]' surfaced something:

/cantonese-dictionary/凝:jing4-focus-one-s-eyes-upon-2429
/cantonese-dictionary/ngaam1ｆｅｅｌ-to-feel-like-someone-2025
/cantonese-dictionary/aa3ｓｉｒ-the-appellation-for-a-police-19004
/cantonese-dictionary/noi6…m4-cannot-intervene-23166

The dictionary's seed script had preserved Chinese characters, full-width Latin (those are real Unicode codepoints, not styled letters), and the Unicode horizontal ellipsis from CC-Canto data. The slugs went into Postgres unmodified, and the sitemap dumped them into <loc> elements as raw UTF-8.

Per the sitemap protocol spec:

The URL [in the loc element] must be URL-encoded.

Google fetches the file, parses the XML, encounters non-URL-encoded characters in <loc>, and treats the entire sitemap as malformed. The error surfaces in Search Console as "Couldn't fetch" — which sounds like a network error but is actually a parse rejection. xmllint, Chrome, and curl all accept the file fine because they're more permissive than Google's sitemap parser.

There were only 22 of these out of 124k entries. But 22 is enough.
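If you'd rather audit the deployed file than a local copy, a quick Node sketch does the same check as the grep (the sub-sitemap URL here is illustrative):

// Flag non-ASCII <loc> values in a live sitemap (Node 18+, run as an ES module)
const xml = await fetch("https://www.yumcha.fun/sitemap/1.xml").then((r) => r.text());
const locs = [...xml.matchAll(/<loc>([^<]*)<\/loc>/g)].map((m) => m[1]);
for (const loc of locs) {
  if (/[^\x00-\x7F]/.test(loc)) console.log(loc);
}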

The two-pronged fix

First, exclude the malformed slugs at the database query level. They actually 404 in Next.js routing too (Chinese characters in dynamic segments don't match cleanly), so they were dead URLs anyway:

// src/lib/dictionary.ts
sql`SELECT slug FROM dictionary_entries
    WHERE ((example_traditional IS NOT NULL AND example_traditional <> '')
       OR LENGTH(english) > 20)
      AND slug ~ '^[a-z0-9-]+$'
    ORDER BY id
    LIMIT ${limit} OFFSET ${offset}`

The regex filter (POSIX, with - at the end of the character class to avoid escaping) keeps only slugs that are pure ASCII lowercase + digits + hyphens.
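The equivalent JavaScript regex behaves the same way for this pattern, which makes the filter easy to sanity-check against the slugs above:

const clean = /^[a-z0-9-]+$/;
clean.test("jat1-one-or-two-10");                      // true
clean.test("ngaam1ｆｅｅｌ-to-feel-like-someone-2025"); // false: full-width letters
clean.test("noi6…m4-cannot-intervene-23166");          // false: Unicode ellipsis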

Second, URL-encode the value in the XML helper as a defensive measure for any future fields. Order matters here — encodeURI keeps & raw, then xmlEscape turns it into &amp;:

function encodeUrlForXml(url: string): string {
  return xmlEscape(encodeURI(url));
}
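A made-up URL exercising both steps:

encodeUrlForXml("https://www.yumcha.fun/cantonese-dictionary/凝:a&b-1");
// -> "https://www.yumcha.fun/cantonese-dictionary/%E5%87%9D:a&amp;b-1"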

After deploy: 71,574 URLs (vs. 71,588 before — 14 dropped from the kept set, plus 8 already excluded by other filters), zero non-ASCII URLs in the output, valid XML. Now waiting on Google to re-fetch.

Takeaways

  • "Couldn't fetch" in GSC is not always a network error. It can be a parser rejection that the rest of the world accepts. If your sitemap loads everywhere except in Google, suspect content validation.

  • generateSitemaps() + force-dynamic doesn't generate the index in Next.js 16. Sub-sitemaps work, but /sitemap.xml is silently missing. Verify with curl after deploy. If you need dynamic sitemaps with an index, use manual route handlers instead.

  • <loc> values must be URL-encoded. Even if your slugs look fine in the database, audit them. A grep for [^\x00-\x7F] against your live sitemap takes seconds and would have saved me a week.

  • Filter dead URLs at query time. If a slug 404s, don't put it in your sitemap. Saves crawl budget and avoids exactly this class of parser rejection.

Sitemap saga over. Now I just have to wait for Google to refetch. Time will tell.


I'm building YumCha, a Cantonese learning app. Web side is Next.js 16 + PayloadCMS, mobile is Expo, backend is NestJS.
