Many websites publish a sitemap, but finding it can be tricky. Here's a parser that auto-discovers it.
Discovery Logic
- Check robots.txt for a Sitemap: directive
- Try common paths: /sitemap.xml, /sitemap_index.xml
- Parse XML with cheerio in xmlMode
- Handle sitemap indexes recursively
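The first two steps can be sketched without any network code. This is a minimal illustration, not the tool's actual implementation: the function names `sitemapsFromRobots` and `candidateSitemaps` and the `COMMON_PATHS` constant are hypothetical, and the robots.txt body is assumed to have been downloaded already.

```typescript
// Fallback locations, taken from the common paths listed above.
const COMMON_PATHS: string[] = ["/sitemap.xml", "/sitemap_index.xml"];

// Extract sitemap URLs declared in a robots.txt body.
// The "Sitemap:" field name is matched case-insensitively on any line.
function sitemapsFromRobots(robotsTxt: string): string[] {
  const found: string[] = [];
  robotsTxt.split(/\r?\n/).forEach((rawLine: string) => {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop trailing comments
    const match = /^sitemap\s*:\s*(\S+)/i.exec(line);
    const url = match ? match[1] : undefined;
    if (url) found.push(url);
  });
  return found;
}

// Build the ordered, de-duplicated list of sitemap URLs to try:
// robots.txt declarations first, then the common fallback paths.
function candidateSitemaps(origin: string, robotsTxt: string): string[] {
  const base = origin.replace(/\/+$/, ""); // "https://site.com/" -> "https://site.com"
  const fallbacks = COMMON_PATHS.map((p: string) => base + p);
  const seen: { [url: string]: boolean } = {};
  return sitemapsFromRobots(robotsTxt).concat(fallbacks).filter((u: string) => {
    if (seen[u]) return false;
    seen[u] = true;
    return true;
  });
}
```

Trying the robots.txt entries first makes sense because they are declared by the site itself, while the common paths are only guesses.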
Recursive Parsing
Sitemap indexes contain links to child sitemaps:
<sitemapindex>
  <sitemap><loc>https://site.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://site.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>
The parser follows these recursively until it reaches actual URL entries.
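Here is a rough sketch of that recursion. To stay dependency-free it pulls `<loc>` values out with a regular expression instead of cheerio's xmlMode, so it does not handle CDATA sections or decode XML entities. The names `FetchText`, `extractLocs`, and `collectPageUrls`, the visited list, and the depth limit are all assumptions made for illustration.

```typescript
// Assumed fetcher shape: anything that resolves a URL to its XML text.
type FetchText = (url: string) => Promise<string>;

// Simplified <loc> extraction (regex stand-in for a real XML parser).
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const re = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let m: RegExpExecArray | null = re.exec(xml);
  while (m !== null) {
    const loc = m[1];
    if (loc) locs.push(loc);
    m = re.exec(xml);
  }
  return locs;
}

// Follow sitemap indexes depth-first until plain URL sets are reached.
// `visited` guards against indexes that reference each other;
// `maxDepth` is an arbitrary safety limit.
async function collectPageUrls(
  sitemapUrl: string,
  fetchText: FetchText,
  visited: string[] = [],
  maxDepth: number = 5
): Promise<string[]> {
  if (maxDepth < 0 || visited.indexOf(sitemapUrl) !== -1) return [];
  visited.push(sitemapUrl);
  const xml = await fetchText(sitemapUrl);
  const locs = extractLocs(xml);
  // Not an index: the <loc> values are the actual page URLs.
  if (!/<sitemapindex[\s>]/.test(xml)) return locs;
  let pages: string[] = [];
  for (let i = 0; i < locs.length; i++) {
    const child = locs[i];
    if (child) {
      pages = pages.concat(await collectPageUrls(child, fetchText, visited, maxDepth - 1));
    }
  }
  return pages;
}
```

Passing the fetcher in as a parameter keeps the recursion testable offline. On Node 18 or later, something like `(u) => fetch(u).then((r) => r.text())` should work as the real fetcher.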
What You Get from Each URL
{
  "url": "https://site.com/products/widget",
  "lastmod": "2026-03-20",
  "changefreq": "weekly",
  "priority": "0.8"
}
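A record like the one above could be built from each `<url>` block roughly as follows. This is a sketch under assumptions: `SitemapEntry`, `tagText`, and `parseUrlset` are hypothetical names, and the regex-based tag reading is a simplification standing in for cheerio's xmlMode. Only `loc` is required by the sitemap protocol, so the other three fields are optional.

```typescript
// Shape of one record, mirroring the JSON example above.
interface SitemapEntry {
  url: string;
  lastmod?: string;
  changefreq?: string;
  priority?: string;
}

// Read the trimmed text of a child tag inside one <url> block.
// Returns undefined when the tag is missing or empty.
function tagText(block: string, tag: string): string | undefined {
  const m = new RegExp("<" + tag + ">\\s*([^<]*?)\\s*</" + tag + ">").exec(block);
  const text = m ? m[1] : undefined;
  return text ? text : undefined;
}

// Turn a <urlset> document into one record per <url> entry.
function parseUrlset(xml: string): SitemapEntry[] {
  const entries: SitemapEntry[] = [];
  const re = /<url>([\s\S]*?)<\/url>/g;
  let m: RegExpExecArray | null = re.exec(xml);
  while (m !== null) {
    const block = m[1] || "";
    const url = tagText(block, "loc");
    if (url) {
      const entry: SitemapEntry = { url: url };
      const lastmod = tagText(block, "lastmod");
      const changefreq = tagText(block, "changefreq");
      const priority = tagText(block, "priority");
      // Only set optional fields that are actually present.
      if (lastmod) entry.lastmod = lastmod;
      if (changefreq) entry.changefreq = changefreq;
      if (priority) entry.priority = priority;
      entries.push(entry);
    }
    m = re.exec(xml);
  }
  return entries;
}
```

Values are kept as strings, matching the example record, so `priority` stays `"0.8"` rather than becoming a number.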
Why Sitemaps Matter for Scrapers
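URL discovery without guesswork. No need to crawl and guess: the sitemap lists the pages the site owner wants indexed. Pages left out of it will still need a crawl.
Change detection. lastmod tells you which pages were updated since your last run, so you can scrape only what changed and skip the rest. On large sites that rarely change, this can cut run time substantially, provided the site keeps lastmod accurate.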
Content categorization. URL patterns reveal structure: /products/ vs /blog/ vs /docs/.
SEO competitive analysis. Total pages listed, URL hierarchy, update frequency.
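Two of these uses, change detection and content categorization, are easy to sketch over the parsed records. The names `Entry`, `changedSince`, and `groupBySection` are hypothetical, and keeping entries that have no usable lastmod is a design assumption, not something the sitemap format requires.

```typescript
// Minimal record shape assumed for this sketch.
interface Entry {
  url: string;
  lastmod?: string;
}

// Keep entries modified after the previous run. Entries with a missing or
// unparseable lastmod are kept, since nothing proves they are unchanged.
function changedSince(entries: Entry[], lastRun: Date): Entry[] {
  return entries.filter((e: Entry) => {
    if (!e.lastmod) return true;
    const t = Date.parse(e.lastmod); // date-only values parse as midnight UTC
    return isNaN(t) || t > lastRun.getTime();
  });
}

// Group URLs by first path segment: "/products/widget" -> "products".
// URLs with no sub-path (e.g. "/about" or "/") land in "(root)".
function groupBySection(urls: string[]): { [section: string]: string[] } {
  const groups: { [section: string]: string[] } = Object.create(null);
  urls.forEach((u: string) => {
    const path = u.replace(/^[a-z][a-z0-9+.-]*:\/\/[^\/]+/i, ""); // strip scheme and host
    const m = /^\/([^\/?#]+)\//.exec(path);
    const section = (m && m[1]) || "(root)";
    const existing = groups[section];
    if (existing) existing.push(u);
    else groups[section] = [u];
  });
  return groups;
}
```

Because a date-only lastmod is read as midnight UTC, a page edited later on the same day as the last run would be skipped. A cautious caller might pass a `lastRun` one day earlier than the real one.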
The Tool
Sitemap Scraper on Apify handles all of this automatically. Enter a domain, get every URL with metadata.
Combine with Robots.txt Analyzer for complete crawl intelligence.
Part of 77 free tools.