How to Crawl Sitemaps with Python

Websites hold treasures—thousands of URLs waiting to be uncovered. Instead of hopping from page to page, why not tap directly into the sitemap? Think of it as the website’s own roadmap, showing exactly which pages it wants search engines to find. This approach doesn’t just save time. It flips the script on traditional crawling.
However, sitemaps aren’t always simple. Many sites use index sitemaps that link out to multiple smaller ones. Some list thousands of URLs. Parsing these manually can become a maze of XML files and nested structures. Tedious, error-prone, and definitely not efficient.
Enter ultimate-sitemap-parser (usp) — a Python library built to handle these headaches. It does the heavy lifting by automatically fetching and parsing XML sitemaps, navigating nested index sitemaps with zero extra code, and extracting all URLs quickly with one simple function call. Sound good? Let’s show you exactly how to use it with the ASOS sitemap.

What You Need First

Python installed
If it’s missing, grab it from python.org. Check with:

python3 --version

ultimate-sitemap-parser installed
Run:

pip install ultimate-sitemap-parser

Crawl Every URL from ASOS in a Flash

This tiny snippet grabs every page URL from ASOS’s sitemap:

from usp.tree import sitemap_tree_for_homepage

url = "https://www.asos.com/"
tree = sitemap_tree_for_homepage(url)

for page in tree.all_pages():
    print(page.url)

Simple. Clean. Powerful. usp fetches the sitemap, parses it, and hands you every URL on a silver platter.

Handle Nested Sitemaps Without Lifting a Finger

Big sites break down sitemaps by sections — products here, blogs there, categories somewhere else. Normally, you’d have to dig through each one manually.

usp makes this effortless. It spots index sitemaps, fetches their children, and pulls URLs recursively. All in one go.
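You can also inspect that hierarchy yourself, e.g. to see which child sitemap a URL came from. The sketch below assumes the parsed tree exposes child sitemaps via a sub_sitemaps list on index nodes (the helper iter_sitemaps is ours, not part of usp), and demonstrates the walk on a tiny stand-in structure so it runs without hitting the network:

```python
def iter_sitemaps(sitemap):
    """Yield a sitemap node followed by all of its descendants, depth-first."""
    yield sitemap
    # Index sitemaps carry a list of children; leaf sitemaps have none.
    for child in getattr(sitemap, "sub_sitemaps", []):
        yield from iter_sitemaps(child)

# With a real tree you would write: for s in iter_sitemaps(tree): print(s.url)
# Demonstrated here on a hand-built stand-in:
class FakeSitemap:
    def __init__(self, url, sub_sitemaps=()):
        self.url = url
        self.sub_sitemaps = list(sub_sitemaps)

root = FakeSitemap("sitemap_index.xml",
                   [FakeSitemap("sitemap-products.xml"),
                    FakeSitemap("sitemap-blog.xml")])
print([s.url for s in iter_sitemaps(root)])
```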

Filter Your URLs to Focus on What Matters

Want only product pages? Easy. If product URLs contain /product/, just filter:

product_urls = [page.url for page in tree.all_pages() if "/product/" in page.url]

for url in product_urls:
    print(url)

Instantly narrow your focus. No fluff.
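Before filtering, it can help to see how the site's URLs break down by section. A minimal sketch, using only the standard library (the sample URLs are made up for illustration; in practice you would pass [p.url for p in tree.all_pages()]):

```python
from collections import Counter
from urllib.parse import urlparse

def section_counts(urls):
    """Count URLs by their first path segment (e.g. /product/... -> 'product')."""
    counter = Counter()
    for u in urls:
        path = urlparse(u).path.strip("/")
        counter[path.split("/")[0] if path else "(root)"] += 1
    return counter

sample = [
    "https://www.asos.com/product/shirt-123",
    "https://www.asos.com/product/jeans-456",
    "https://www.asos.com/blog/spring-trends",
]
print(section_counts(sample))  # Counter({'product': 2, 'blog': 1})
```

The counts tell you which substring filters (like "/product/") are worth writing.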

Save URLs for Later Analysis

Printing URLs is great. Saving them? Even better. Export to CSV like this:

import csv
from usp.tree import sitemap_tree_for_homepage

url = "https://www.asos.com/"
tree = sitemap_tree_for_homepage(url)
urls = [page.url for page in tree.all_pages()]

csv_filename = "asos_sitemap_urls.csv"
with open(csv_filename, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["URL"])
    for page_url in urls:
        writer.writerow([page_url])

print(f"Extracted {len(urls)} URLs and saved to {csv_filename}")

Now your data is ready for whatever comes next.
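Sitemaps often carry more than bare URLs, and usp's page objects expose optional metadata such as last_modified and priority where the sitemap provides them. A sketch of saving those fields too, shown with a hypothetical stand-in page class so it runs offline (in practice, pass tree.all_pages() to the helper):

```python
import csv

def page_rows(pages):
    """Build CSV rows of (url, last_modified, priority) from sitemap pages."""
    rows = []
    for page in pages:
        last_mod = getattr(page, "last_modified", None)
        rows.append([
            page.url,
            last_mod.isoformat() if last_mod else "",
            getattr(page, "priority", ""),
        ])
    return rows

# Stand-in for a real usp page object, for demonstration only:
class FakePage:
    def __init__(self, url, last_modified=None, priority=None):
        self.url, self.last_modified, self.priority = url, last_modified, priority

pages = [FakePage("https://www.asos.com/product/shirt-123", priority=0.8)]
with open("sitemap_pages.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["URL", "Last modified", "Priority"])
    writer.writerows(page_rows(pages))
```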

Final Thoughts

ultimate-sitemap-parser transforms sitemap crawling from a chore into a breeze. It eliminates the XML complexity and handles nested sitemaps automatically. If you’re doing SEO analysis, web scraping, or website audits, usp is a must-have tool.
