DEV Community

Alex Spinov

Every Website Has a Secret Index — How to Use Sitemaps for Smarter Scraping

Nearly every website publishes a sitemap.xml: an index of all the pages the site wants search engines to find. For scraping, it's a roadmap.

Finding the Sitemap

https://example.com/sitemap.xml
https://example.com/sitemap_index.xml
https://example.com/robots.txt  # Often lists sitemap URL

Most sites follow these conventions. If not, check robots.txt — it usually contains a Sitemap: directive.
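Checking robots.txt can be automated in a few lines. A minimal sketch in Python that scans the text of an already-fetched robots.txt for Sitemap: directives (the sample content is made up):

```python
def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Collect every Sitemap: directive from a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # The directive name is matched case-insensitively
        if line.strip().lower().startswith("sitemap:"):
            urls.append(line.split(":", 1)[1].strip())
    return urls

robots = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""
print(sitemaps_from_robots(robots))  # ['https://example.com/sitemap.xml']
```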

What Sitemaps Contain

<url>
  <loc>https://example.com/product/widget-pro</loc>
  <lastmod>2026-03-20</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.8</priority>
</url>
  • loc — the URL of a page the site considers important
  • lastmod — when the page was last updated
  • changefreq — how often the page is expected to change
  • priority — how important the site thinks each page is relative to the rest
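Parsing these entries is a few lines with the standard library. One gotcha: sitemap XML lives in the http://www.sitemaps.org/schemas/sitemap/0.9 namespace, so unprefixed findall calls return nothing. A sketch, with sample XML mirroring the snippet above:

```python
import xml.etree.ElementTree as ET

# Sitemap elements are namespaced; findall/findtext need the prefix map
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text: str) -> list[dict]:
    """Return one dict per <url> entry with its loc and lastmod."""
    root = ET.fromstring(xml_text)
    return [
        {
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
        }
        for url in root.findall("sm:url", NS)
    ]

xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/product/widget-pro</loc>
    <lastmod>2026-03-20</lastmod>
  </url>
</urlset>"""

print(parse_sitemap(xml))
```

The same approach works for sitemap index files; just match sm:sitemap instead of sm:url and recurse into each child sitemap.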

Why This Matters for Scrapers

Complete URL discovery. Instead of crawling and guessing links, the sitemap tells you exactly which pages exist.

Change detection. lastmod shows which pages changed recently. Scrape only the updated pages and each run gets faster and cheaper.
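That filter is easy to sketch, assuming entries have already been parsed into dicts with loc and lastmod keys. lastmod is a W3C datetime, so the first ten characters are the date:

```python
from datetime import date

def changed_since(entries: list[dict], cutoff: date) -> list[dict]:
    """Keep entries whose lastmod date is on or after cutoff."""
    recent = []
    for e in entries:
        if not e.get("lastmod"):
            continue  # no lastmod: can't tell, so skip (or scrape defensively)
        if date.fromisoformat(e["lastmod"][:10]) >= cutoff:
            recent.append(e)
    return recent

entries = [
    {"loc": "https://example.com/a", "lastmod": "2026-03-20"},
    {"loc": "https://example.com/b", "lastmod": "2026-01-05"},
]
print(changed_since(entries, date(2026, 3, 1)))  # only /a remains
```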

Content categorization. URL patterns reveal site structure: /products/, /blog/, /category/ — know what to extract.
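A rough way to map that out: bucket URLs by their first path segment. A sketch with made-up example URLs:

```python
from collections import Counter
from urllib.parse import urlparse

def categorize(urls: list[str]) -> Counter:
    """Count URLs by first path segment, e.g. /products/x -> 'products'."""
    counts = Counter()
    for u in urls:
        segments = urlparse(u).path.strip("/").split("/")
        counts[segments[0] or "(root)"] += 1
    return counts

urls = [
    "https://example.com/products/widget-pro",
    "https://example.com/products/widget-lite",
    "https://example.com/blog/launch-notes",
]
print(categorize(urls))  # Counter({'products': 2, 'blog': 1})
```

Sorting that counter gives you an instant picture of where a site's content mass sits before you write a single extractor.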

SEO analysis. Total indexed pages, URL hierarchy, update frequency — competitive intelligence gold.

Tools

All on Apify Store.

Custom site analysis — $20: Order via Payoneer
