Imtiaz ali

Your XML Sitemap Is Probably Broken — Here's How to Check and Fix It

Most developers treat sitemaps as a "set it and forget it" thing. Generate once, submit to Google, move on. The problem: sitemaps break in ways you can't see. New pages are added with the wrong format. Deleted pages stay listed. HTTP URLs appear after an SSL migration. And Google quietly stops crawling large sections of your site.

This guide covers everything you need to know to validate, fix, and maintain an XML sitemap that actually does its job.


Why Your Sitemap Matters More Than You Think

A sitemap is not a ranking factor. Google won't rank you higher just because you have one. But a broken sitemap actively harms you — pages Google can't discover can't rank, regardless of how good the content is.

Sitemaps matter most for:

  • New sites — before you've built enough internal links for Google to find everything through crawling
  • Large sites (1,000+ pages) — Google's crawl budget is finite; a clean sitemap helps allocate it efficiently
  • Sites with weak internal linking — orphan pages that aren't linked from anywhere won't get crawled without a sitemap
  • After major URL changes — a sitemap tells Google exactly where things moved

The Correct XML Sitemap Structure

Here's a valid, minimal sitemap you can use as a template:

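```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2025-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2024-11-20</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```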
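If you generate the file from code rather than by hand, a minimal sketch with Python's standard library could look like this. The `build_sitemap` helper and its `(loc, lastmod)` input shape are assumptions for illustration, not part of any sitemap tool.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML bytes from (loc, lastmod) pairs; lastmod may be None."""
    # Passing xmlns as a plain attribute yields <urlset xmlns="...">
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes &, < and > in text when serializing
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    # xml_declaration=True puts the <?xml ...?> line at the very start
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


xml_bytes = build_sitemap([
    ("https://example.com/", "2025-01-15"),
    ("https://example.com/about", "2024-11-20"),
])
print(xml_bytes.decode("utf-8"))
```

The sketch leaves out `<changefreq>` and `<priority>` because they are optional.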

Required:

  • <?xml version="1.0" encoding="UTF-8"?> — must be the very first line
  • xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" — exact namespace, must match
  • <loc> — the page URL, inside every <url> block. This is the only required child element.

Optional (but useful):

  • <lastmod> — date of last meaningful content change, in YYYY-MM-DD format
  • <changefreq> — hint about update frequency (Google largely ignores this)
  • <priority> — relative priority 0.0–1.0 (Google largely ignores this too)

For Large Sites: The Sitemap Index

If you have more than 50,000 URLs or your sitemap exceeds 50MB uncompressed, you need multiple sitemaps referenced by an index file:

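```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2025-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2025-01-14</lastmod>
  </sitemap>
</sitemapindex>
```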

This is also useful organizationally — separate sitemaps for pages, blog posts, products, and images let you diagnose which section has crawl issues quickly.
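As a rough sketch of the splitting step, the code below chunks a URL list into groups of at most 50,000 and builds an index that points at one file per chunk. The `sitemap-1.xml`, `sitemap-2.xml` naming scheme and the helper names are hypothetical; use whatever naming fits your site.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit from the sitemap protocol


def chunk(urls, size=MAX_URLS):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_index(sitemap_urls):
    """Build a sitemap index that references each child sitemap URL."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = loc
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


# Hypothetical site with 120,000 pages
urls = [f"https://example.com/p/{i}" for i in range(120_000)]
parts = list(chunk(urls))
names = [f"https://example.com/sitemap-{n}.xml" for n in range(1, len(parts) + 1)]
index_xml = build_index(names)
```

Each entry in `parts` would then be written to its own sitemap file under the matching name.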


The 10 Most Common Sitemap Errors

1. Wrong or Missing Namespace

Broken:
```xml
<urlset>
```

Fixed:
```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
```

The namespace must match exactly. Missing it or using a slightly different URL causes Google Search Console to reject the sitemap with a parsing error.
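One way to check this yourself is to parse the file and look at the root tag, which ElementTree reports with the namespace in braces. This is a sketch; `has_correct_namespace` is an illustrative name.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def has_correct_namespace(xml_text):
    """True if the root is <urlset> or <sitemapindex> in the sitemap namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree renders namespaced tags as "{namespace}localname"
    expected = {f"{{{SITEMAP_NS}}}urlset", f"{{{SITEMAP_NS}}}sitemapindex"}
    return root.tag in expected
```

A root of plain `<urlset>` produces the tag `urlset` with no braces, so the check fails for both a missing and a mistyped namespace.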


2. Missing XML Declaration

The file must start with this — not a blank line, not a space, this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
```
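A quick byte-level check catches stray whitespace before the declaration. The sketch below tolerates a UTF-8 byte order mark, which is an assumption on my part: XML permits one, but you may prefer to flag it too.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def starts_with_xml_declaration(raw):
    """True if the file's first bytes are the XML declaration."""
    # Assumption: skip a UTF-8 byte order mark that some editors prepend
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]
    return raw.startswith(b"<?xml")
```

Read the file in binary mode (`open(path, "rb")`) so leading whitespace is not silently normalized.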


3. Relative URLs in <loc>

Broken:
```xml
<loc>/about</loc>
```

Fixed:
```xml
<loc>https://example.com/about</loc>
```

Every URL must be absolute, including the correct protocol.
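A small sketch of that check using `urllib.parse`; the helper name `is_absolute` is illustrative.

```python
from urllib.parse import urlparse


def is_absolute(loc):
    """True if loc has an http(s) scheme and a host."""
    parts = urlparse(loc.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

This flags `/about`, `example.com/about`, and protocol-relative `//example.com/about` alike, since none of them carries both a scheme and a host.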


4. HTTP URLs on an HTTPS Site

After migrating to SSL, your sitemap might still have http:// URLs. Google sees these as different pages from your https:// pages.

```xml
<!-- Wrong: still points at the HTTP version -->
<loc>http://example.com/about</loc>

<!-- Correct -->
<loc>https://example.com/about</loc>
```
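To find leftovers after a migration, you could list every `<loc>` that still starts with `http://`. This is a sketch that assumes the sitemap already parses and uses the standard namespace.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def insecure_locs(xml_text):
    """Return every <loc> value that still uses plain http://."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iterfind(".//sm:loc", NS)
        if el.text and el.text.strip().lower().startswith("http://")
    ]
```

An empty list means no HTTP URLs remain in that file.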


5. Wrong Date Format in <lastmod>

Broken:
```xml
<lastmod>January 15, 2025</lastmod>
<lastmod>01/15/2025</lastmod>
```

Fixed:
```xml
<lastmod>2025-01-15</lastmod>
```

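Use the W3C Datetime format, which is a profile of ISO 8601. The date-only form YYYY-MM-DD is the simplest; a full timestamp with a time zone offset, such as 2025-01-15T09:30:00+00:00, is also valid. Google is lenient with this, but other crawlers may reject non-standard formats.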
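Here is a minimal validator for the date-only form. It is deliberately narrow: it accepts only YYYY-MM-DD, so extend it if your sitemap uses full timestamps.

```python
import re
from datetime import date

DATE_ONLY = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_valid_lastmod(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not DATE_ONLY.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 2025-02-30
    except ValueError:
        return False
    return True
```

The regex handles the shape and `date.fromisoformat` handles the calendar, so a well-formed but impossible date still fails.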


6. Unescaped Special Characters in URLs

URLs containing & must be escaped in XML:

Broken:
```xml
<loc>https://example.com/search?cat=1&page=2</loc>
```

Fixed:
```xml
<loc>https://example.com/search?cat=1&amp;page=2</loc>
```

Also escape < as &lt;, > as &gt;, ' as &apos;, and " as &quot; if they appear in URLs (rare but possible).
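If you build `<loc>` values as strings, Python's `xml.sax.saxutils.escape` handles `&`, `<`, and `>`, and takes a mapping for the two quote characters. A sketch, with `escape_loc` as an illustrative name:

```python
from xml.sax.saxutils import escape


def escape_loc(url):
    """Entity-escape a raw URL for use inside <loc>."""
    # escape() covers &, < and > by default; quotes are added via the mapping
    return escape(url, {"'": "&apos;", '"': "&quot;"})


print(escape_loc("https://example.com/search?cat=1&page=2"))
# https://example.com/search?cat=1&amp;page=2
```

Only pass raw URLs to it. Running it on an already escaped URL turns `&amp;` into `&amp;amp;`.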


7. Including Non-Indexable Pages

Your sitemap should only contain pages you want Google to index. Never include:

  • Pages with <meta name="robots" content="noindex">
  • Pages blocked by robots.txt
  • 404 error pages
  • Redirect URLs (only include the final destination)
  • Non-canonical versions of pages (duplicate content)

Including noindex pages in your sitemap sends Google a contradictory signal and wastes crawl budget.
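The list above can be sketched as a single check per URL. The function below is an illustration, not a complete audit: it takes a response you have already fetched (status, headers, HTML) plus the text of robots.txt, uses a rough regex for the meta tag, and does not cover canonical tags. All helper names are hypothetical.

```python
import re
from urllib.robotparser import RobotFileParser

# Rough heuristic: assumes name="robots" appears before content="..."
META_NOINDEX = re.compile(
    r"<meta[^>]+name=[\"']robots[\"'][^>]*content=[\"'][^\"']*noindex",
    re.IGNORECASE,
)


def blocked_by_robots(robots_txt, url, agent="Googlebot"):
    """True if robots.txt disallows `agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


def sitemap_problems(url, status, headers, html, robots_txt):
    """Return reasons why `url` should not be listed in the sitemap."""
    problems = []
    if blocked_by_robots(robots_txt, url):
        problems.append("blocked by robots.txt")
    if 300 <= status < 400:
        problems.append(f"redirects (HTTP {status})")
    elif status != 200:
        problems.append(f"HTTP {status}")
    # Assumes `headers` is a plain dict keyed exactly as "X-Robots-Tag"
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        problems.append("noindex in X-Robots-Tag header")
    if META_NOINDEX.search(html):
        problems.append("noindex in meta robots tag")
    return problems
```

An empty list means the URL passed every check in this sketch. Fetch each URL without following redirects so that 3xx responses stay visible.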


8. Inflated <lastmod> Dates

Setting every page's <lastmod> to today's date to trick Google into recrawling is a known pattern — and Google knows it too. If your <lastmod> dates are consistently inaccurate, Google stops trusting them and ignores them entirely.

Only update <lastmod> when the page's content genuinely changed.
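One way to keep `<lastmod>` honest in a build script is to tie it to a hash of the page content, so the date moves only when the content does. This is a sketch under the assumption that you store a `(hash, lastmod)` pair per page between builds; the storage itself is left out.

```python
import hashlib


def updated_lastmod(content, today, previous):
    """Return (sha256_hex, lastmod) for a page.

    `previous` is the (sha256_hex, lastmod) pair saved by the last build,
    or None for a page that has not been seen before.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if previous and previous[0] == digest:
        return digest, previous[1]  # unchanged content keeps its old date
    return digest, today


first = updated_lastmod("Hello", "2025-01-15", None)
second = updated_lastmod("Hello", "2025-02-01", first)       # no change
third = updated_lastmod("Hello, world", "2025-02-01", second)  # real edit
```

Hash the main content rather than the full rendered page. Otherwise a changed footer or timestamp would bump every date.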


9. Exceeding the 50,000 URL or 50MB Limit

A single sitemap file is limited to 50,000 URLs and 50MB uncompressed. Beyond that, Google will process only part of it. Split into multiple files and use a sitemap index.
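Both limits are easy to check before you upload a file. The sketch below counts `<url>` entries and bytes; the limits are parameters only so the function is easy to try on small inputs.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50MB, measured on the uncompressed file


def limit_violations(raw, max_urls=MAX_URLS, max_bytes=MAX_BYTES):
    """Return a list of limit problems for an uncompressed sitemap (bytes)."""
    problems = []
    if len(raw) > max_bytes:
        problems.append(f"file is {len(raw)} bytes (limit {max_bytes})")
    root = ET.fromstring(raw)
    count = len(root.findall("sm:url", NS))
    if count > max_urls:
        problems.append(f"{count} URLs (limit {max_urls})")
    return problems
```

If your server delivers the sitemap gzipped, decompress it first, since the size limit applies to the uncompressed file.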


10. Sitemap Not Declared in robots.txt

Most crawlers check robots.txt first. Add this line to help them find your sitemap:

```plaintext
Sitemap: https://example.com/sitemap.xml
```
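To confirm the directive is picked up, you can parse your robots.txt with Python's built-in `urllib.robotparser`, whose `site_maps()` method (Python 3.8+) returns the declared sitemap URLs. A sketch, with `declared_sitemaps` as an illustrative name:

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt):
    """Return the sitemap URLs declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present
    return parser.site_maps() or []
```

An empty list means crawlers that rely on robots.txt won't learn about your sitemap from it.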


Validate Your Sitemap Before Submitting

Don't discover these errors after Google's already tried (and failed) to parse your sitemap. Validate first.

OurToolkit's free XML Sitemap Validator checks your sitemap against 9 error types — missing namespace, relative URLs, wrong date formats, unclosed tags, HTTP on HTTPS sites, oversized files, and more. Paste your XML directly or enter your sitemap URL.

No account, no signup, instant results.


Where to Find Your Sitemap

Not sure where your sitemap lives? Try these in order:

  1. https://yourdomain.com/sitemap.xml
  2. https://yourdomain.com/sitemap_index.xml
  3. Check your robots.txt at https://yourdomain.com/robots.txt — look for a Sitemap: directive
  4. WordPress (Yoast): https://yourdomain.com/sitemap_index.xml
  5. WordPress (Rank Math): https://yourdomain.com/sitemap_index.xml
  6. Shopify: https://yourstore.myshopify.com/sitemap.xml (automatic, always exists)

Submitting to Google Search Console

Once your sitemap validates:

  1. Go to Google Search Console
  2. Select your property
  3. In the left sidebar → Sitemaps
  4. Enter your sitemap URL (just the path, e.g. sitemap.xml)
  5. Click Submit

Search Console will show:

  • How many URLs were submitted
  • How many Google successfully indexed
  • Any errors found during processing

Check back 48–72 hours after submission for initial results. After a site migration or major change, check weekly for 4–6 weeks.


After Submission: What to Monitor

In Search Console → Pages report (called Coverage in older versions of Search Console):

  • "Excluded" URLs — pages in your sitemap Google chose not to index (usually because of redirects, noindex, or canonicalization issues)
  • "Error" URLs — pages that returned 404 or 5xx when Google tried to crawl them

The gap between "submitted" and "indexed":
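If you submitted 500 URLs but only 200 are indexed, first rule out the technical causes above (redirects, noindex, canonicalization). If the remaining URLs load cleanly and are indexable, Google has most likely decided they aren't worth indexing. At that point the gap is a content quality signal, not a technical sitemap issue.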

The sitemap is just the invitation. The content still has to be worth showing up for.


Running into specific sitemap errors in Search Console? Drop them in the comments — I read every one.