Oaida Adrian

Posted on Jul 4 • Originally published at apify.com

How to Extract Clean Content From Any Website Sitemap (For SEO Audits & AI Training)

#webdev #automation #python #seo

Ever needed to inventory every page on a website? Extract clean text content for AI training? Or audit meta tags across an entire domain?

I built a Sitemap Content Extractor that does exactly this — feed it a sitemap.xml URL and it crawls every page, extracting structured content.

What It Does

Parses sitemap indexes — follows nested sitemaps recursively
Handles gzip sitemaps — .xml.gz files work out of the box
Extracts full content — clean article text using trafilatura
Captures metadata — title, meta description, meta keywords, H1 headings
Word counts — for every page
URL filtering — include/exclude patterns via regex
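To make the URL filtering concrete, here is a minimal sketch of include/exclude matching with regex. The function name `filter_urls` and its parameters are hypothetical, chosen for illustration; the actual extractor's input fields may be named and combined differently.

```python
import re


def filter_urls(urls, include=None, exclude=None):
    """Keep URLs that match any include pattern and no exclude pattern.

    Hypothetical helper: with no include patterns, every URL is a candidate.
    """
    include_res = [re.compile(p) for p in (include or [])]
    exclude_res = [re.compile(p) for p in (exclude or [])]
    kept = []
    for url in urls:
        # An include list, when given, acts as an allow-list.
        if include_res and not any(r.search(url) for r in include_res):
            continue
        # Exclude patterns always win over include patterns.
        if any(r.search(url) for r in exclude_res):
            continue
        kept.append(url)
    return kept


urls = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/tag/python",
    "https://example.com/about",
]
print(filter_urls(urls, include=[r"/blog/"], exclude=[r"/tag/"]))
# ['https://example.com/blog/post-1']
```

Letting exclude patterns take priority is a design choice in this sketch: it makes it easy to include a whole section and then carve out noisy subpaths such as tag or pagination pages.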

How to Use It

You can run it directly on the Apify Store, with no setup required.

Just provide:

A sitemap URL (e.g., https://example.com/sitemap.xml)
Max URLs to process
Whether to extract full content

Example Output

{
  "url": "https://pydantic.dev/docs/",
  "title": "Pydantic Docs - Validation, AI Agents, Logfire Observability",
  "content": "Full extracted article text...",
  "wordCount": 131,
  "metaDescription": "Pydantic documentation...",
  "h1Headings": ["Pydantic Docs"],
  "lastmod": "2025-01-15",
  "extractedAt": "2026-07-04T10:45:00Z"
}

Real-World Use Cases

1. SEO Content Audits

Crawl your entire site and identify pages with:

Missing or duplicate meta descriptions
Short content (under 300 words)
Missing H1 tags
Stale content (old lastmod dates)
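These checks can be run over the extractor's JSON output with a few lines of Python. The sketch below assumes records shaped like the example output above (`metaDescription`, `wordCount`, `h1Headings`, `lastmod`). The function name `audit_pages`, the default thresholds, and the sample records are illustrative assumptions, not part of the tool.

```python
from collections import Counter
from datetime import date


def audit_pages(pages, min_words=300, stale_before=date(2024, 1, 1)):
    """Return {url: [issue, ...]} for every page that fails at least one check."""
    # Count each non-empty description so duplicates can be flagged.
    descriptions = Counter(
        p["metaDescription"].strip() for p in pages if p.get("metaDescription")
    )
    report = {}
    for p in pages:
        issues = []
        desc = (p.get("metaDescription") or "").strip()
        if not desc:
            issues.append("missing meta description")
        elif descriptions[desc] > 1:
            issues.append("duplicate meta description")
        if p.get("wordCount", 0) < min_words:
            issues.append("short content")
        if not p.get("h1Headings"):
            issues.append("missing H1")
        lastmod = p.get("lastmod")
        # Only the date part is compared, so "2025-01-15T10:00:00Z" also works.
        if lastmod and date.fromisoformat(lastmod[:10]) < stale_before:
            issues.append("stale content")
        if issues:
            report[p["url"]] = issues
    return report


# Made-up sample records for illustration.
pages = [
    {"url": "https://example.com/a", "metaDescription": "Same", "wordCount": 500,
     "h1Headings": ["A"], "lastmod": "2025-01-15"},
    {"url": "https://example.com/b", "metaDescription": "Same", "wordCount": 120,
     "h1Headings": [], "lastmod": "2022-03-01"},
]
for url, issues in audit_pages(pages).items():
    print(url, issues)
```

Pages without a `lastmod` value are not flagged as stale here, since many sitemaps omit that field.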

2. AI Training Data Collection

Extract clean text from documentation sites for fine-tuning LLMs. trafilatura is designed to strip navigation, ads, and other boilerplate, leaving mostly the main content.
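For a sense of what that extraction step looks like, here is a minimal sketch using trafilatura's `extract` function on an HTML string. The helper names `extract_clean_text` and `count_words` are hypothetical, and counting words by splitting on whitespace is an assumption; the extractor may compute `wordCount` differently.

```python
def count_words(text):
    """Whitespace-based word count (an assumption about how words are counted)."""
    return len(text.split())


def extract_clean_text(html):
    """Return (main_text, word_count) for a single HTML document."""
    # Third-party dependency (pip install trafilatura), imported lazily so
    # the word-count helper stays usable without it.
    import trafilatura

    # extract() returns None when it finds no usable main content.
    text = trafilatura.extract(html) or ""
    return text, count_words(text)
```

Treating a `None` result as an empty string keeps downstream code simple: pages with no extractable content end up with a word count of 0 and can be filtered out before training.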

3. Competitor Analysis

Inventory a competitor's entire content strategy — how many pages, how much content per page, what topics they cover.

4. Content Migration

Before migrating a legacy site, extract all content into structured JSON for easy import into a new CMS.

Technical Details

The extractor is built in Python 3.12 and uses:

trafilatura for main content extraction (it targets article text directly, whereas BeautifulSoup is a general-purpose HTML parser)
lxml for sitemap XML parsing
BeautifulSoup for metadata extraction
Apify SDK for infrastructure and scaling
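To illustrate the metadata step, here is a rough sketch that pulls the title, meta description, meta keywords, and H1 headings out of an HTML string. The extractor uses BeautifulSoup for this; the sketch swaps in Python's standard-library `html.parser` so it runs with no dependencies. The class and function names, and the `metaKeywords` field name, are assumptions for illustration.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects <title>, selected <meta> tags, and <h1> text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.h1_headings = []
        self._current = None  # "title" or "h1" while its text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = (attrs.get("content") or "").strip()
        elif tag in ("title", "h1"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace, including across nested inline tags.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            elif text:
                self.h1_headings.append(text)
            self._current = None


def extract_metadata(html):
    parser = MetadataParser()
    parser.feed(html)
    return {
        "title": parser.title,
        "metaDescription": parser.meta.get("description", ""),
        "metaKeywords": parser.meta.get("keywords", ""),
        "h1Headings": parser.h1_headings,
    }


html = """<html><head><title> Docs  Home </title>
<meta name="description" content="All the docs.">
<meta name="Keywords" content="docs, api"></head>
<body><h1>Docs <em>Home</em></h1></body></html>"""
print(extract_metadata(html))
```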

It handles both <urlset> (regular sitemaps) and <sitemapindex> (nested sitemaps), following child sitemaps recursively.
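The recursion can be sketched roughly as follows. This is not the extractor's actual code: it uses the standard-library `xml.etree.ElementTree` instead of lxml so the example has no dependencies, and the names `iter_sitemap_urls` and `fetch_bytes` are hypothetical. Gzip is detected by its magic bytes, which is one way to make `.xml.gz` files work regardless of response headers.

```python
import gzip
import urllib.request
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def fetch_bytes(url):
    """Download a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def iter_sitemap_urls(sitemap_url, fetch=fetch_bytes, seen=None):
    """Yield (loc, lastmod) for every page, following nested sitemaps."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against indexes that reference each other
        return
    seen.add(sitemap_url)

    data = fetch(sitemap_url)
    if data[:2] == b"\x1f\x8b":  # gzip magic bytes
        data = gzip.decompress(data)
    root = ET.fromstring(data)

    if root.tag.endswith("sitemapindex"):
        # Nested sitemaps: recurse into each child sitemap.
        for loc in root.iterfind("sm:sitemap/sm:loc", NS):
            child = (loc.text or "").strip()
            if child:
                yield from iter_sitemap_urls(child, fetch, seen)
    else:
        # Regular <urlset>: emit each page URL with its optional lastmod.
        for entry in root.iterfind("sm:url", NS):
            loc = entry.findtext("sm:loc", default="", namespaces=NS).strip()
            lastmod = entry.findtext("sm:lastmod", default=None, namespaces=NS)
            if loc:
                yield loc, lastmod
```

Passing the `fetch` function as a parameter keeps the parser testable without network access, and the `seen` set stops the recursion if two sitemap indexes point at each other. Usage would look like `for loc, lastmod in iter_sitemap_urls("https://example.com/sitemap.xml"): ...`.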
