137Foundry

Posted on Jul 3

How to Build a Sitemap-vs-Indexed URL Diff Script in Python

#python #seo #productivity #webdev

The core diagnostic in a sitemap audit is: which URLs in the sitemap are not indexed by Google, and which URLs Google has indexed are not in the sitemap. Answer that in a table and the fix list writes itself.

You can eyeball this in Search Console for a small site. For anything over a few hundred URLs, a Python script is faster, produces a spreadsheet you can hand to a stakeholder, and re-runs on demand as the audit progresses. Here is how to build a minimal version in about 50 lines.


What the script does

Input:

A sitemap URL (or a sitemap-index URL)
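An export of the indexed URLs from the Search Console Pages report, downloaded as CSV. Use the unfiltered "All Pages" view rather than one filtered to the sitemap; a sitemap-filtered export contains only submitted URLs, so it can never surface "indexed only" rows.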

Output:

A CSV listing every URL, with columns for "in sitemap," "indexed by Google," and "gap category" (either "sitemap only," "indexed only," or "both")

That output is your working document. The "sitemap only" rows are candidates for either indexation work or removal from the sitemap. The "indexed only" rows are pages Google is indexing that the sitemap does not know about - often a sign of legacy URLs, canonical mistakes, or content the CMS is publishing without registering.

The dependencies

You need requests and the standard library. That is it.

pip install requests

Step one: fetch and flatten the sitemap

Sitemaps can be either a single file listing URLs directly, or an index file that lists other sitemap files. The script has to handle both.

import requests
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS_URI = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SITEMAP_NS_URI}

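def fetch_urls_from_sitemap(sitemap_url):
    """Return the set of page URLs in a sitemap, following sitemap indexes."""
    urls = set()
    response = requests.get(sitemap_url, timeout=30)
    # Fail loudly on 4xx/5xx instead of trying to parse an error page as XML
    response.raise_for_status()
    root = ET.fromstring(response.content)

    # Sitemap index: list of child sitemaps, fetched recursively
    for child_sitemap in root.findall(".//sm:sitemap/sm:loc", NS):
        if child_sitemap.text:
            urls |= fetch_urls_from_sitemap(child_sitemap.text.strip())

    # Direct URL entries (skip empty <loc> elements)
    for url_loc in root.findall(".//sm:url/sm:loc", NS):
        if url_loc.text:
            urls.add(url_loc.text.strip())

    return urls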

The recursion handles arbitrarily nested sitemap indexes, which some large sites use. The set ensures duplicate URLs across child sitemaps only count once. The timeout keeps a hung sitemap from blocking the whole script.
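To see the two sitemap shapes without touching the network, here is a minimal sketch that parses inline XML. The helper name split_sitemap_entries and the example.com URLs are illustrative assumptions, not part of the script above; the XPath expressions are the same ones the fetch function uses.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def split_sitemap_entries(xml_bytes):
    """Return (child_sitemap_urls, page_urls) found in one sitemap document."""
    root = ET.fromstring(xml_bytes)
    children = [el.text.strip()
                for el in root.findall(".//sm:sitemap/sm:loc", NS) if el.text]
    pages = [el.text.strip()
             for el in root.findall(".//sm:url/sm:loc", NS) if el.text]
    return children, pages

# Hypothetical sitemap index: it lists other sitemap files, not pages
INDEX_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>"""

# Hypothetical plain sitemap: it lists pages directly
URLSET_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc> https://example.com/about </loc></url>
</urlset>"""

print(split_sitemap_entries(INDEX_XML))
print(split_sitemap_entries(URLSET_XML))
```

An index yields only child sitemap URLs and a plain sitemap yields only page URLs, which is why the fetch function can run both loops on every document and recurse only when the first loop finds something.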

Step two: read the Search Console CSV

Google Search Console exports Pages report data as CSV. Download the "All Pages" export (the button is in the upper right of the report), then read it in Python.

import csv

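def read_gsc_indexed_urls(csv_path):
    """Return the set of URLs listed in a Search Console CSV export."""
    indexed = set()
    # utf-8-sig reads plain UTF-8 and also tolerates a byte-order mark,
    # which would otherwise end up glued to the first column name
    with open(csv_path, encoding="utf-8-sig", newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            # GSC exports URL as the first column, typically named "URL"
            url = row.get("URL") or row.get("Page")
            if url:
                indexed.add(url.strip())
    return indexed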

If your Search Console export column name differs (some regions use localized column names), print reader.fieldnames to see what to reference.
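If you would rather not hard-code the column name, one option is to pick it from the header row. The sketch below is an assumption-laden illustration: find_url_column is a hypothetical helper, the sample rows are made up, and the fallback to the first column relies on the URL being exported first, as the comment in the reader states.

```python
import csv
import io

CANDIDATE_COLUMNS = ("URL", "Page")  # the names the reader above already checks

def find_url_column(fieldnames):
    """Pick the header that holds URLs, falling back to the first column."""
    # Strip a stray byte-order mark and whitespace before comparing
    cleaned = [(name or "").lstrip("\ufeff").strip() for name in fieldnames]
    for candidate in CANDIDATE_COLUMNS:
        for original, clean in zip(fieldnames, cleaned):
            if clean.lower() == candidate.lower():
                return original
    # Assumption: the URL is the first column when the header is localized
    return fieldnames[0]

# Made-up export with a byte-order mark stuck to the first header
sample = io.StringIO("\ufeffURL,Last crawled\nhttps://example.com/,2024-06-30\n")
reader = csv.DictReader(sample)
column = find_url_column(reader.fieldnames)
urls = [row[column].strip() for row in reader]
print(repr(column), urls)
```

The helper returns the header exactly as it appears in the file, so it can be used directly as the key into each row.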

Step three: compute the diff

Set operations do the work.

def compute_diff(sitemap_urls, indexed_urls):
    both = sitemap_urls & indexed_urls
    sitemap_only = sitemap_urls - indexed_urls
    indexed_only = indexed_urls - sitemap_urls
    return {
        "both": both,
        "sitemap_only": sitemap_only,
        "indexed_only": indexed_only,
    }
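A toy run makes the three buckets concrete. The URLs below are invented for illustration, and compute_diff is repeated only so the snippet runs on its own.

```python
def compute_diff(sitemap_urls, indexed_urls):
    """Split two URL sets into the three gap categories."""
    return {
        "both": sitemap_urls & indexed_urls,
        "sitemap_only": sitemap_urls - indexed_urls,
        "indexed_only": indexed_urls - sitemap_urls,
    }

# Invented example data
sitemap_urls = {
    "https://example.com/",
    "https://example.com/pricing",
    "https://example.com/blog/new-post",
}
indexed_urls = {
    "https://example.com/",
    "https://example.com/pricing",
    "https://example.com/old-landing-page",
}

diff = compute_diff(sitemap_urls, indexed_urls)
for category, urls in diff.items():
    print(category, sorted(urls))
```

Every URL lands in exactly one bucket, so the three bucket sizes always add up to the size of the union of the two inputs.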

Note the subtle URL-matching gotcha: Search Console usually reports URLs with the canonical hostname (including https and trailing slash rules), while sitemaps can occasionally include variants (http vs https, with vs without trailing slash). Normalize both sets before diffing:

def normalize_url(url):
    parsed = urlparse(url)
    # Force https, lowercase host, remove trailing slash from non-root paths.
    # Query strings and fragments are dropped.
    scheme = "https"
    netloc = parsed.netloc.lower()
    # Treat a bare host ("https://example.com") as the root path
    path = parsed.path or "/"
    if path != "/" and path.endswith("/"):
        path = path[:-1]
    return f"{scheme}://{netloc}{path}"

sitemap_urls_normalized = {normalize_url(u) for u in sitemap_urls}
indexed_urls_normalized = {normalize_url(u) for u in indexed_urls}

Without this step, the diff will produce false positives in both directions - URLs that are actually the same page appearing in "sitemap only" or "indexed only" because of a trailing slash.
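Here is a quick check of what the normalization collapses. The snippet restates the same rules so it runs on its own, with one added assumption: an empty path is treated as the root. The example.com variants are invented.

```python
from urllib.parse import urlparse

def normalize_url(url):
    """Force https, lowercase the host, drop trailing slash, query, fragment."""
    parsed = urlparse(url)
    netloc = parsed.netloc.lower()
    path = parsed.path or "/"  # assumption: a bare host means the root page
    if path != "/" and path.endswith("/"):
        path = path[:-1]
    return f"https://{netloc}{path}"

# Invented variants that all refer to the same page
variants = [
    "http://Example.com/blog/",
    "https://example.com/blog",
    "https://EXAMPLE.com/blog/?utm_source=x",
]
normalized = {normalize_url(u) for u in variants}
print(normalized)
```

All three variants collapse to a single entry. The flip side is that dropping the query string also merges URLs that differ only by a meaningful parameter (pagination, for example), so on sites that rely on such parameters it may be safer to keep the query.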

Step four: write the output CSV

def write_diff_csv(diff, output_path):
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "in_sitemap", "indexed", "gap_category"])
        for url in sorted(diff["both"]):
            writer.writerow([url, "yes", "yes", "both"])
        for url in sorted(diff["sitemap_only"]):
            writer.writerow([url, "yes", "no", "sitemap_only"])
        for url in sorted(diff["indexed_only"]):
            writer.writerow([url, "no", "yes", "indexed_only"])

Open the output in a spreadsheet, sort by gap category, and start working through the "sitemap_only" rows first.
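To preview the layout before opening a spreadsheet, the same rows can be written to an in-memory buffer. This is a sketch under assumptions: write_diff_rows is a hypothetical variant of the function above that takes a file object instead of a path, and the toy diff is invented.

```python
import csv
import io
from collections import Counter

def write_diff_rows(diff, file_obj):
    """Write the diff to any file-like object, one category block at a time."""
    writer = csv.writer(file_obj)
    writer.writerow(["url", "in_sitemap", "indexed", "gap_category"])
    flags = {
        "both": ("yes", "yes"),
        "sitemap_only": ("yes", "no"),
        "indexed_only": ("no", "yes"),
    }
    for category, (in_sitemap, indexed) in flags.items():
        for url in sorted(diff[category]):
            writer.writerow([url, in_sitemap, indexed, category])

# Invented example diff
toy_diff = {
    "both": {"https://example.com/"},
    "sitemap_only": {"https://example.com/pricing",
                     "https://example.com/blog/new-post"},
    "indexed_only": {"https://example.com/old-landing-page"},
}

buffer = io.StringIO()
write_diff_rows(toy_diff, buffer)
print(buffer.getvalue())

# Read the rows back and count them per category
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
counts = Counter(row["gap_category"] for row in rows)
print(counts)
```

The per-category counts should match the numbers the main block prints, which is a cheap sanity check that nothing was lost on the way to disk.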

Step five: main

if __name__ == "__main__":
    SITEMAP_URL = "https://yoursite.example/sitemap.xml"
    GSC_EXPORT = "gsc_pages_export.csv"
    OUTPUT = "sitemap_diff.csv"

    sitemap_urls = {normalize_url(u) for u in fetch_urls_from_sitemap(SITEMAP_URL)}
    indexed_urls = {normalize_url(u) for u in read_gsc_indexed_urls(GSC_EXPORT)}
    diff = compute_diff(sitemap_urls, indexed_urls)
    write_diff_csv(diff, OUTPUT)

    print(f"Sitemap: {len(sitemap_urls)} URLs")
    print(f"Indexed: {len(indexed_urls)} URLs")
    print(f"Both:    {len(diff['both'])} URLs")
    print(f"Sitemap only: {len(diff['sitemap_only'])} URLs")
    print(f"Indexed only: {len(diff['indexed_only'])} URLs")

Total script: about 50 lines. Runs in seconds on any reasonably-sized sitemap.

What to do with the output

The three gap categories in the output CSV correspond to three action buckets.

Both (in sitemap and indexed). These are healthy. No action.

Sitemap only (in sitemap, not indexed). These are URLs the sitemap lists but Google has not indexed, either because it chose not to or because it has not crawled them yet. Investigate. Either fix the URL (thin content, missing canonical, no internal links) or remove it from the sitemap.

Indexed only (indexed but not in sitemap). These are URLs Google is indexing that the sitemap does not know about. Almost always this indicates a canonical mistake, legacy URLs that should be redirected, or CMS-generated URLs the sitemap generator is not aware of. Fix the underlying issue.

The full audit workflow that this script enables is written up in the 137Foundry guide on how to audit an XML sitemap so Google indexes only ranked pages, including the decision tree for each gap category.

Extensions worth adding later

Three enhancements that this minimal script lacks but that are worth adding once the workflow is proven:

Bulk URL Inspection API calls to fetch the specific "not indexed" reason from Google for each sitemap_only URL. The Google Search Central developer docs cover auth and quotas.
On-page content fetching to pull word count, title, and canonical tag for each URL so the audit spreadsheet has enough context for you to categorize URLs without opening each one.
Historical tracking by re-running the script weekly and diffing the outputs so you can see progress over time.
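The historical-tracking idea can be sketched with the standard library alone. The snippet below assumes two output files from consecutive runs, represented here as inline strings with invented rows; load_categories and compare_runs are hypothetical helper names, and "absent" is a label chosen for URLs missing from one of the runs.

```python
import csv
import io

def load_categories(file_obj):
    """Map each URL to its gap_category from one run's output CSV."""
    return {row["url"]: row["gap_category"] for row in csv.DictReader(file_obj)}

def compare_runs(previous, current):
    """List (url, old_category, new_category) for URLs whose status changed."""
    changes = []
    for url in sorted(previous.keys() | current.keys()):
        old = previous.get(url, "absent")
        new = current.get(url, "absent")
        if old != new:
            changes.append((url, old, new))
    return changes

# Invented outputs from two runs a week apart
LAST_WEEK = """url,in_sitemap,indexed,gap_category
https://example.com/,yes,yes,both
https://example.com/pricing,yes,no,sitemap_only
https://example.com/old-page,no,yes,indexed_only
"""
THIS_WEEK = """url,in_sitemap,indexed,gap_category
https://example.com/,yes,yes,both
https://example.com/pricing,yes,yes,both
https://example.com/new-post,yes,no,sitemap_only
"""

previous = load_categories(io.StringIO(LAST_WEEK))
current = load_categories(io.StringIO(THIS_WEEK))
changes = compare_runs(previous, current)
for url, old, new in changes:
    print(f"{url}: {old} -> {new}")
```

In real use the two io.StringIO objects would be replaced by open() calls on dated copies of sitemap_diff.csv. Rows moving from sitemap_only to both are the progress signal to watch.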

Even without those, the minimal 50-line version produces the diagnostic that makes the rest of the audit possible.


When to build your own vs use a paid tool

Paid tools (Screaming Frog, Ahrefs Site Audit, Sitebulb) all include sitemap-vs-indexed comparisons. They are faster to set up than writing your own script.

Build your own when:

You need to run the audit repeatedly on a schedule and integrate the output with other data (backlinks, analytics, custom scoring).
You need to normalize URLs in ways the paid tool does not (e.g., stripping tracking parameters, handling subdomain moves).
The site is small enough that the paid tool overhead is not worth it.
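As an illustration of the custom normalization mentioned in the list above, here is a minimal sketch that strips tracking parameters while keeping the rest of the query string. The function name and the list of parameters treated as tracking (utm_*, gclid, fbclid) are assumptions; adjust them to whatever your analytics setup appends.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Assumed tracking parameters; extend for your own stack
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"gclid", "fbclid"}

def strip_tracking_params(url):
    """Remove tracking parameters and the fragment, keep other query params."""
    parsed = urlparse(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parsed.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_PARAMS
    ]
    return urlunparse(parsed._replace(query=urlencode(kept), fragment=""))

print(strip_tracking_params(
    "https://example.com/shoes?color=red&utm_source=newsletter&gclid=abc"
))
print(strip_tracking_params("https://example.com/?utm_medium=email"))
```

Unlike the normalize_url function in step three, which drops the whole query string, this variant preserves parameters that identify distinct pages, at the cost of maintaining the tracking list.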

Otherwise the paid tool wins on time. The script above is for the cases where you want the output to fit into a bigger workflow. For sitemap-audit consulting engagements 137Foundry runs, the script is usually the first artifact we ship the client so they can re-run the audit themselves in three months.

137Foundry also offers a full technical SEO service that includes running audits like this at scale, along with related writeups on how the sitemap audit fits into a broader indexation workflow.

Fifty lines of Python is enough to answer the diagnosis question. Everything after is either follow-up work or automation.
