How to Extract Clean Content From Any Website Sitemap
Ever needed to inventory every page on a website? Extract clean text content for AI training? Or audit meta tags across an entire domain?
I built a Sitemap Content Extractor that does exactly this — feed it a sitemap.xml URL and it crawls every page, extracting structured content.
What It Does
- Parses sitemap indexes: follows nested sitemaps recursively
- Handles gzip sitemaps: .xml.gz files work out of the box
- Extracts full content: clean article text using trafilatura
- Captures metadata: title, meta description, meta keywords, H1 headings
- Word counts: for every page
- URL filtering: include/exclude patterns via regex
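To make the URL filtering idea concrete, here is a minimal sketch of include/exclude matching with Python's re module. It is not the extractor's actual code: the function name filter_urls and its include/exclude parameters are hypothetical, and I am assuming a URL is kept when it matches the include pattern (if one is given) and does not match the exclude pattern.

```python
import re
from typing import Iterable, List, Optional


def filter_urls(
    urls: Iterable[str],
    include: Optional[str] = None,
    exclude: Optional[str] = None,
) -> List[str]:
    """Keep URLs that match `include` (if set) and do not match `exclude` (if set)."""
    include_re = re.compile(include) if include else None
    exclude_re = re.compile(exclude) if exclude else None

    kept = []
    for url in urls:
        # re.search matches anywhere in the URL, not only at the start.
        if include_re and not include_re.search(url):
            continue
        if exclude_re and exclude_re.search(url):
            continue
        kept.append(url)
    return kept


urls = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/tag/python",
    "https://example.com/about",
]

blog_posts = filter_urls(urls, include=r"/blog/", exclude=r"/tag/")
print(blog_posts)  # ['https://example.com/blog/post-1']
```

Applying the exclude pattern after the include pattern means an exclude always wins, which is usually what you want when trimming tag or archive pages out of a broader section.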
How to Use It
You can run it directly on the Apify Store, with no setup required.
Just provide:
- A sitemap URL (e.g., https://example.com/sitemap.xml)
- Max URLs to process
- Whether to extract full content
Example Output
{
  "url": "https://pydantic.dev/docs/",
  "title": "Pydantic Docs - Validation, AI Agents, Logfire Observability",
  "content": "Full extracted article text...",
  "wordCount": 131,
  "metaDescription": "Pydantic documentation...",
  "h1Headings": ["Pydantic Docs"],
  "lastmod": "2025-01-15",
  "extractedAt": "2026-07-04T10:45:00Z"
}
Real-World Use Cases
1. SEO Content Audits
Crawl your entire site and identify pages with:
- Missing or duplicate meta descriptions
- Short content (under 300 words)
- Missing H1 tags
- Stale content (old lastmod dates)
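Once you have the extracted records, these checks are a few lines of Python. The sketch below assumes records shaped like the example output (url, metaDescription, wordCount, h1Headings, lastmod). The function name audit_pages, the issue labels, and the stale_before cutoff are my own illustrative choices, and it assumes lastmod starts with a full YYYY-MM-DD date.

```python
from collections import Counter
from datetime import date
from typing import Dict, List


def audit_pages(
    records: List[dict],
    min_words: int = 300,
    stale_before: date = date(2024, 1, 1),  # hypothetical cutoff, adjust to taste
) -> Dict[str, List[str]]:
    """Return a mapping of URL -> list of audit issues found for that page."""
    # Count non-empty descriptions so duplicates can be flagged.
    desc_counts = Counter(
        r["metaDescription"] for r in records if r.get("metaDescription")
    )

    issues: Dict[str, List[str]] = {}
    for r in records:
        found = []
        desc = r.get("metaDescription")
        if not desc:
            found.append("missing meta description")
        elif desc_counts[desc] > 1:
            found.append("duplicate meta description")
        if r.get("wordCount", 0) < min_words:
            found.append("short content")
        if not r.get("h1Headings"):
            found.append("missing H1")
        lastmod = r.get("lastmod")
        # Assumes lastmod begins with YYYY-MM-DD; a time suffix is ignored.
        if lastmod and date.fromisoformat(lastmod[:10]) < stale_before:
            found.append("stale lastmod")
        if found:
            issues[r["url"]] = found
    return issues


records = [
    {"url": "https://example.com/a", "metaDescription": "Same text",
     "wordCount": 500, "h1Headings": ["A"], "lastmod": "2025-01-15"},
    {"url": "https://example.com/b", "metaDescription": "Same text",
     "wordCount": 131, "h1Headings": [], "lastmod": "2022-06-01"},
    {"url": "https://example.com/c", "metaDescription": "Unique text",
     "wordCount": 800, "h1Headings": ["C"], "lastmod": "2025-03-01"},
]

for url, found in audit_pages(records).items():
    print(url, found)
```

Pages with no issues are left out of the result, so the output doubles as a to-do list.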
2. AI Training Data Collection
Extract clean text from documentation sites for fine-tuning LLMs. The trafilatura extraction removes navigation, ads, and boilerplate — leaving only the main content.
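If you want to turn the extracted records into a training file, one common approach is JSON Lines, with one document per line. The sketch below is only an illustration: the to_jsonl name, the {"text", "source"} layout, and the min_words threshold are assumptions on my part, and your fine-tuning tooling may expect a different schema.

```python
import json
from typing import List


def to_jsonl(records: List[dict], min_words: int = 50) -> str:
    """Serialize records with enough text into JSON Lines (one object per line)."""
    lines = []
    for r in records:
        text = (r.get("content") or "").strip()
        # Skip empty pages and pages too short to be useful as training text.
        if not text or r.get("wordCount", 0) < min_words:
            continue
        lines.append(
            json.dumps({"text": text, "source": r["url"]}, ensure_ascii=False)
        )
    return "\n".join(lines)


records = [
    {"url": "https://example.com/guide", "content": "A long guide...", "wordCount": 900},
    {"url": "https://example.com/stub", "content": "Coming soon", "wordCount": 2},
    {"url": "https://example.com/empty", "content": "", "wordCount": 0},
]

jsonl = to_jsonl(records)
print(jsonl)
```

Keeping the source URL next to each text makes it easier to trace or remove a document later.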
3. Competitor Analysis
Inventory a competitor's entire content strategy — how many pages, how much content per page, what topics they cover.
4. Content Migration
Before migrating a legacy site, extract all content into structured JSON for easy import into a new CMS.
Technical Details
The extractor is built in Python 3.12 and uses:
- trafilatura for main content extraction (purpose-built for isolating article text, whereas BeautifulSoup is a general-purpose HTML parser)
- lxml for sitemap XML parsing
- BeautifulSoup for metadata extraction
- Apify SDK for infrastructure and scaling
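For a feel of what the metadata step involves, here is a rough sketch that pulls the title, meta description, meta keywords, and H1 headings out of an HTML string. To keep it dependency-free it uses the standard library's html.parser as a stand-in for BeautifulSoup, so it is not the extractor's real code. The class and function names are hypothetical, and metaKeywords is an assumed key name (the other keys follow the example output).

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects <title>, selected <meta name=...> tags, and <h1> text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.h1_headings = []
        self._capture = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "meta" and name in ("description", "keywords"):
            self.meta[name] = attrs.get("content") or ""
        elif tag in ("title", "h1"):
            self._capture = tag
            self._buffer = []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._capture:
            # Collapse whitespace so nested inline tags read naturally.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            else:
                self.h1_headings.append(text)
            self._capture = None


def extract_metadata(html: str) -> dict:
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "metaDescription": parser.meta.get("description", ""),
        "metaKeywords": parser.meta.get("keywords", ""),  # assumed key name
        "h1Headings": parser.h1_headings,
    }


html = """<html><head><title>Pydantic Docs</title>
<meta name="description" content="Pydantic documentation">
</head><body><h1>Pydantic <em>Docs</em></h1><p>Body text</p></body></html>"""

print(extract_metadata(html))
```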
It handles both <urlset> (regular sitemaps) and <sitemapindex> (nested sitemaps), following child sitemaps recursively.
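The recursion is easier to see in code. This is a minimal sketch, not the extractor's implementation: it uses the standard library's xml.etree.ElementTree in place of lxml so it runs without extra installs, and the parse_sitemap function and its fetch callback are hypothetical names. The fetch argument is any function that returns the raw bytes for a URL, which here is faked with an in-memory dict instead of real HTTP requests.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Set

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(
    url: str,
    fetch: Callable[[str], bytes],
    seen: Optional[Set[str]] = None,
) -> Iterator[str]:
    """Yield page URLs from a sitemap, following sitemap indexes recursively."""
    seen = set() if seen is None else seen
    if url in seen:  # guard against sitemaps that reference each other
        return
    seen.add(url)

    raw = fetch(url)
    if raw[:2] == b"\x1f\x8b":  # gzip magic bytes, so .xml.gz is handled too
        raw = gzip.decompress(raw)

    root = ET.fromstring(raw)
    if root.tag == NS + "sitemapindex":
        for loc in root.iterfind(f"{NS}sitemap/{NS}loc"):
            if loc.text:
                yield from parse_sitemap(loc.text.strip(), fetch, seen)
    elif root.tag == NS + "urlset":
        for loc in root.iterfind(f"{NS}url/{NS}loc"):
            if loc.text:
                yield loc.text.strip()


INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml.gz</loc></sitemap>
</sitemapindex>"""

POSTS = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2025-01-15</lastmod></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

# In-memory stand-in for the network: URL -> response body.
FAKE_SITE = {
    "https://example.com/sitemap.xml": INDEX,
    "https://example.com/sitemap-posts.xml.gz": gzip.compress(POSTS),
}

pages = list(parse_sitemap("https://example.com/sitemap.xml", FAKE_SITE.__getitem__))
print(pages)  # ['https://example.com/a', 'https://example.com/b']
```

Checking the gzip magic bytes rather than the file extension means a compressed sitemap is still decoded when its URL does not end in .gz.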
Get Started
Try it now on the Apify Store
No registration needed — just paste a sitemap URL and hit run.
What would you use a sitemap extractor for? Let me know in the comments!