Sitemap Parser That Auto-Discovers from robots.txt

#seo #xml #javascript #webdev

Most websites publish a sitemap, but it does not always live at a predictable path. Here's a parser that auto-discovers it, starting from robots.txt.

Discovery Logic

1. Check robots.txt for Sitemap: directives
2. If none are listed, try common paths: /sitemap.xml, /sitemap_index.xml
3. Parse the XML with cheerio in xmlMode
4. Handle sitemap indexes recursively
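
The first two discovery steps can be sketched as plain string handling. This is a minimal sketch, not the tool's actual code: the function names extractSitemapUrls, fallbackSitemapUrls, and discoverCandidates are hypothetical, and it assumes the robots.txt body has already been downloaded as a string.

```javascript
// Pull every "Sitemap:" directive out of a robots.txt body.
// The field name is matched case-insensitively, and a trailing
// "# comment" (preceded by whitespace) is dropped first.
function extractSitemapUrls(robotsTxt) {
  const urls = [];
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/(^|\s)#.*$/, "").trim();
    const match = /^sitemap\s*:\s*(\S+)/i.exec(line);
    if (match) urls.push(match[1]);
  }
  return urls;
}

// Common locations to probe when robots.txt declares nothing.
function fallbackSitemapUrls(origin) {
  return ["/sitemap.xml", "/sitemap_index.xml"].map(
    (path) => new URL(path, origin).href
  );
}

// Prefer declared sitemaps; otherwise fall back to the common paths.
function discoverCandidates(origin, robotsTxt) {
  const declared = extractSitemapUrls(robotsTxt || "");
  return declared.length > 0 ? declared : fallbackSitemapUrls(origin);
}
```

The fallback list only yields candidates. Each one still has to be requested and checked for a successful response before it is parsed.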

Recursive Parsing

Sitemap indexes contain links to child sitemaps:

<sitemapindex>
  <sitemap><loc>https://site.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://site.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>

The parser follows these recursively until it reaches actual URL entries.
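
A rough sketch of that recursion follows. The article's parser uses cheerio in xmlMode. To stay dependency-free, this example swaps in regular expressions, which do not handle XML entities, CDATA, or namespace prefixes. The names locsInside and collectPageUrls, the injected fetchXml callback, and the depth limit of 5 are all assumptions made for illustration.

```javascript
// Return the <loc> text of every <parentTag>...</parentTag> block.
// The \b keeps "sitemap" from matching "<sitemapindex>" and "url" from "<urlset>".
function locsInside(xml, parentTag) {
  const blockRe = new RegExp(
    `<${parentTag}\\b[^>]*>([\\s\\S]*?)</${parentTag}>`,
    "g"
  );
  const locs = [];
  for (const block of xml.matchAll(blockRe)) {
    const loc = /<loc>\s*([\s\S]*?)\s*<\/loc>/.exec(block[1]);
    if (loc) locs.push(loc[1]);
  }
  return locs;
}

// Follow sitemap indexes until real <url> entries are reached.
// fetchXml is an async function (url) => XML string supplied by the caller.
// The seen set and depth cap guard against indexes that reference each other.
async function collectPageUrls(sitemapUrl, fetchXml, seen = new Set(), depth = 0) {
  if (seen.has(sitemapUrl) || depth > 5) return [];
  seen.add(sitemapUrl);

  const xml = await fetchXml(sitemapUrl);
  if (/<sitemapindex\b/.test(xml)) {
    const pages = [];
    for (const child of locsInside(xml, "sitemap")) {
      for (const page of await collectPageUrls(child, fetchXml, seen, depth + 1)) {
        pages.push(page);
      }
    }
    return pages;
  }
  return locsInside(xml, "url");
}
```

In Node 18 or later, fetchXml could be as small as (url) => fetch(url).then((res) => res.text()). Passing it in as a parameter keeps the recursion testable without network access.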

What You Get from Each URL

{
  "url": "https://site.com/products/widget",
  "lastmod": "2026-03-20",
  "changefreq": "weekly",
  "priority": "0.8"
}
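
One way to produce records of that shape from a urlset document is sketched below. As before, it uses regular expressions in place of cheerio so that it runs without dependencies. The helper names tagText and parseUrlEntries are hypothetical, and returning null for a missing field is an assumption rather than documented behavior of the tool.

```javascript
// Text content of the first <tag>...</tag> inside a block, or null if absent.
function tagText(block, tag) {
  const match = new RegExp(`<${tag}>\\s*([\\s\\S]*?)\\s*</${tag}>`).exec(block);
  return match ? match[1] : null;
}

// Turn each <url> element into { url, lastmod, changefreq, priority }.
// Values stay strings, as in the sitemap itself; entries without <loc> are skipped.
function parseUrlEntries(xml) {
  const entries = [];
  for (const match of xml.matchAll(/<url\b[^>]*>([\s\S]*?)<\/url>/g)) {
    const block = match[1];
    const url = tagText(block, "loc");
    if (!url) continue;
    entries.push({
      url,
      lastmod: tagText(block, "lastmod"),
      changefreq: tagText(block, "changefreq"),
      priority: tagText(block, "priority"),
    });
  }
  return entries;
}
```

Only loc is required by the sitemap protocol, so downstream code should expect the other three fields to be missing on many sites.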

Why Sitemaps Matter for Scrapers

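Complete URL discovery. No need to crawl and guess. The sitemap is the site owner's own index of the pages it wants found, so it is usually the most complete starting list you can get, though owners can leave pages out or let it go stale.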

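Change detection. lastmod tells you which pages were updated since your last run. Scraping only what changed can make repeat runs up to 10x faster when only a small share of pages change, provided the site keeps lastmod accurate.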
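
A minimal sketch of that filter, assuming entries shaped like the record shown earlier and a stored timestamp for the previous run. The function name changedSince is hypothetical. Entries without a usable lastmod are kept, on the assumption that re-scraping an unchanged page is cheaper than missing an update.

```javascript
// Keep only entries modified after the previous run.
// lastRun is anything the Date constructor accepts (ISO string, Date, or epoch ms).
function changedSince(entries, lastRun) {
  const cutoff = new Date(lastRun).getTime();
  return entries.filter((entry) => {
    const modified = entry.lastmod ? Date.parse(entry.lastmod) : NaN;
    // Missing or unparseable lastmod: we cannot prove the page is unchanged.
    return Number.isNaN(modified) || modified > cutoff;
  });
}
```

Date-only values such as 2026-03-20 parse as midnight UTC. If a site publishes lastmod at day granularity, comparing against the start of the last-run day avoids skipping pages that were edited later that same day.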

Content categorization. URL patterns reveal structure: /products/ vs /blog/ vs /docs/.
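
Grouping by the first path segment is one simple way to surface that structure. This is a sketch with a hypothetical function name, groupBySection, and an assumed "(root)" label for the home page. It expects absolute URLs, which is what a sitemap's loc values should be.

```javascript
// Bucket URLs by their first path segment: /products/widget -> "products".
// A Map avoids collisions with Object.prototype keys such as "constructor".
function groupBySection(urls) {
  const groups = new Map();
  for (const raw of urls) {
    const segments = new URL(raw).pathname.split("/").filter(Boolean);
    const section = segments[0] || "(root)";
    if (!groups.has(section)) groups.set(section, []);
    groups.get(section).push(raw);
  }
  return groups;
}

// Example: count pages per section.
const sample = groupBySection([
  "https://site.com/products/widget",
  "https://site.com/blog/hello",
  "https://site.com/products/gadget",
]);
for (const [section, pages] of sample) {
  console.log(section, pages.length);
}
```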

SEO competitive analysis. Total number of listed pages, URL hierarchy, and update frequency.

The Tool

Sitemap Scraper on Apify handles all of this automatically. Enter a domain, get every URL with metadata.

Combine with Robots.txt Analyzer for complete crawl intelligence.

Part of 77 free tools.

Custom site analysis — $20: Order via Payoneer

DEV Community