DEV Community

Alex Spinov


Sitemap Parser That Auto-Discovers from robots.txt

Most websites have sitemaps, but finding them can be tricky. Here's a parser that auto-discovers them, starting from robots.txt.

Discovery Logic

  1. Check robots.txt for Sitemap: directive
  2. Try common paths: /sitemap.xml, /sitemap_index.xml
  3. Parse XML with cheerio xmlMode
  4. Handle sitemap indexes recursively
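
Steps 1 and 2 can be sketched as a small pure function. This is a minimal illustration in Python rather than the tool's actual code: `sitemap_candidates` and `COMMON_PATHS` are hypothetical names, and the robots.txt body is passed in as a string so the logic stays independent of how you fetch it.

```python
from urllib.parse import urljoin

# Fallback locations from step 2 of the list above.
COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml"]


def sitemap_candidates(base_url: str, robots_txt: str) -> list[str]:
    """Return sitemap URLs declared in robots.txt, or common fallback paths."""
    declared = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "https:" survives.
        key, sep, value = line.partition(":")
        # The directive name is matched case-insensitively.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            declared.append(value.strip())
    if declared:
        return declared
    # Nothing declared: guess the conventional locations instead.
    return [urljoin(base_url, path) for path in COMMON_PATHS]
```

Each candidate still has to be fetched and checked, since a fallback path may simply return a 404.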

Recursive Parsing

Sitemap indexes contain links to child sitemaps:

<sitemapindex>
  <sitemap><loc>https://site.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://site.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>

The parser follows these recursively until it reaches actual URL entries.
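
A rough sketch of that recursion, assuming Python's standard `xml.etree.ElementTree` in place of cheerio. `collect_urls` and `local_name` are hypothetical names, and `fetch` is any function you supply that returns the XML text for a URL. The `seen` set is an added safeguard against indexes that reference each other.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional


def local_name(tag: str) -> str:
    """Strip an XML namespace: '{http://...}loc' becomes 'loc'."""
    return tag.rsplit("}", 1)[-1]


def collect_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    seen: Optional[set] = None,
) -> list[str]:
    """Follow sitemap indexes recursively and return the page URLs found."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # already visited: avoid infinite loops
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    # Collect the <loc> directly inside each <sitemap> or <url> element.
    locs = []
    for entry in root:
        for child in entry:
            if local_name(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())

    if local_name(root.tag) == "sitemapindex":
        # Each <loc> points at a child sitemap: recurse into it.
        urls = []
        for child_url in locs:
            urls.extend(collect_urls(child_url, fetch, seen))
        return urls
    # A plain <urlset>: the <loc> values are the actual pages.
    return locs
```

In a real run, `fetch` could wrap an HTTP client of your choice. Passing it in as a parameter also makes the recursion easy to test against canned XML.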

What You Get from Each URL

{
  "url": "https://site.com/products/widget",
  "lastmod": "2026-03-20",
  "changefreq": "weekly",
  "priority": "0.8"
}
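
One way to produce records of that shape from a `<urlset>` document is sketched below, again with Python's standard library standing in for cheerio. `parse_urlset` is a hypothetical helper. Fields the sitemap omits come back as `None`, which is a choice made for this sketch rather than documented behavior of the tool.

```python
import xml.etree.ElementTree as ET

FIELDS = ("loc", "lastmod", "changefreq", "priority")


def parse_urlset(xml_text: str) -> list[dict]:
    """Turn each <url> element into a dict with url and optional metadata."""
    root = ET.fromstring(xml_text)
    entries = []
    for url_el in root:
        # Ignore the namespace prefix when matching tag names.
        if url_el.tag.rsplit("}", 1)[-1] != "url":
            continue
        values = {}
        for child in url_el:
            name = child.tag.rsplit("}", 1)[-1]
            if name in FIELDS and child.text:
                values[name] = child.text.strip()
        if "loc" not in values:
            continue  # an entry without a URL is unusable
        entries.append({
            "url": values["loc"],
            "lastmod": values.get("lastmod"),
            "changefreq": values.get("changefreq"),
            "priority": values.get("priority"),
        })
    return entries
```

All values stay as strings, matching the JSON above, so the caller decides how to interpret dates and priorities.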

Why Sitemaps Matter for Scrapers

Complete URL discovery. No need to crawl and guess. The sitemap is the site's own index of the pages it wants found, though it can still omit pages the owner chose not to list.

Change detection. lastmod tells you which pages were updated since your last run. Scraping only what changed can make repeat runs much faster, as long as the site keeps lastmod accurate.
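
A minimal sketch of that filter in Python, assuming entries shaped like the JSON shown earlier. `changed_since` is a hypothetical name. Entries with a missing or unparseable lastmod are kept, on the assumption that re-scraping is safer than skipping.

```python
from datetime import date


def changed_since(entries: list[dict], last_run: date) -> list[dict]:
    """Keep entries modified on or after last_run, plus any with no usable date."""
    changed = []
    for entry in entries:
        lastmod = entry.get("lastmod")
        if not lastmod:
            changed.append(entry)  # no date given: re-scrape to be safe
            continue
        try:
            # lastmod may be a date or a full timestamp; compare the date part.
            modified = date.fromisoformat(lastmod[:10])
        except ValueError:
            changed.append(entry)  # e.g. "2026-03": too coarse to compare
            continue
        if modified >= last_run:
            changed.append(entry)
    return changed
```

The comparison uses `>=` so that pages modified on the same day as the previous run are not missed.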

Content categorization. URL patterns reveal structure: /products/ vs /blog/ vs /docs/.
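
As an illustration, URLs can be bucketed by their first path segment. `group_by_section` is a hypothetical helper, and treating the first segment as the "section" is an assumption that fits sites laid out like /products/... and /blog/....

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_section(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs by their first path segment, e.g. 'products' or 'blog'."""
    groups = defaultdict(list)
    for url in urls:
        # Drop empty pieces produced by leading or trailing slashes.
        segments = [s for s in urlparse(url).path.split("/") if s]
        section = segments[0] if segments else "(root)"
        groups[section].append(url)
    return dict(groups)
```

Counting the URLs in each group gives a quick picture of where a site keeps most of its content.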

SEO competitive analysis. A competitor's sitemap shows how many pages they list, how their URL hierarchy is organized, and how often they update.

The Tool

Sitemap Scraper on Apify handles all of this automatically. Enter a domain, get every URL with metadata.

Combine with Robots.txt Analyzer for complete crawl intelligence.

Part of 77 free tools.

Custom site analysis — $20: Order via Payoneer


