
Oaida Adrian

Originally published at apify.com

How to Extract Clean Content From Any Website Sitemap (For SEO Audits & AI Training)


Ever needed to inventory every page on a website? Extract clean text content for AI training? Or audit meta tags across an entire domain?

I built a Sitemap Content Extractor that does exactly this — feed it a sitemap.xml URL and it crawls every page, extracting structured content.

What It Does

  • Parses sitemap indexes — follows nested sitemaps recursively
  • Handles gzip sitemaps: .xml.gz files work out of the box
  • Extracts full content — clean article text using trafilatura
  • Captures metadata — title, meta description, meta keywords, H1 headings
  • Word counts — for every page
  • URL filtering — include/exclude patterns via regex
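
To make the URL filtering bullet concrete, here is a minimal sketch of include/exclude matching with regular expressions. The function name `filter_urls` and its parameters are hypothetical and chosen for illustration; the extractor's actual input fields may be named and combined differently.

```python
import re


def filter_urls(urls, include=None, exclude=None):
    """Keep URLs that match any include pattern and no exclude pattern.

    Hypothetical helper: with no include patterns, every URL is a candidate.
    """
    include_res = [re.compile(p) for p in (include or [])]
    exclude_res = [re.compile(p) for p in (exclude or [])]
    kept = []
    for url in urls:
        # An include list, when given, acts as an allow-list.
        if include_res and not any(r.search(url) for r in include_res):
            continue
        # Exclude patterns always win over include patterns.
        if any(r.search(url) for r in exclude_res):
            continue
        kept.append(url)
    return kept


urls = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/tag/python",
    "https://example.com/about",
]
selected = filter_urls(urls, include=[r"/blog/"], exclude=[r"/tag/"])
print(selected)
```

Letting exclude patterns take priority is a design choice in this sketch: it makes it easy to include a broad section and then carve out tag or pagination pages.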

How to Use It

You can run it directly on Apify Store — no setup required.

Just provide:

  • A sitemap URL (e.g., https://example.com/sitemap.xml)
  • Max URLs to process
  • Whether to extract full content

Example Output

{
  "url": "https://pydantic.dev/docs/",
  "title": "Pydantic Docs - Validation, AI Agents, Logfire Observability",
  "content": "Full extracted article text...",
  "wordCount": 131,
  "metaDescription": "Pydantic documentation...",
  "h1Headings": ["Pydantic Docs"],
  "lastmod": "2025-01-15",
  "extractedAt": "2026-07-04T10:45:00Z"
}

Real-World Use Cases

1. SEO Content Audits

Crawl your entire site and identify pages with:

  • Missing or duplicate meta descriptions
  • Short content (under 300 words)
  • Missing H1 tags
  • Stale content (old lastmod dates)
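
The checks above can be sketched as a small pass over the extractor's output. This assumes records shaped like the example output (`metaDescription`, `wordCount`, `h1Headings`, `lastmod`); the `audit` function, the issue labels, and the default cutoff date are hypothetical.

```python
from collections import Counter
from datetime import date


def audit(records, min_words=300, stale_before=date(2024, 1, 1)):
    """Return {url: [issues]} for pages that fail any of the four checks."""
    # Count each non-empty description so duplicates can be flagged.
    desc_counts = Counter(
        r["metaDescription"] for r in records if r.get("metaDescription")
    )
    issues = {}
    for r in records:
        found = []
        desc = r.get("metaDescription")
        if not desc:
            found.append("missing meta description")
        elif desc_counts[desc] > 1:
            found.append("duplicate meta description")
        if r.get("wordCount", 0) < min_words:
            found.append("short content")
        if not r.get("h1Headings"):
            found.append("missing H1")
        lastmod = r.get("lastmod")
        # Assumes lastmod starts with a full YYYY-MM-DD date when present.
        if lastmod and date.fromisoformat(lastmod[:10]) < stale_before:
            found.append("stale content")
        if found:
            issues[r["url"]] = found
    return issues
```

Pages without a `lastmod` value are not flagged as stale here, since many sitemaps omit it.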

2. AI Training Data Collection

Extract clean text from documentation sites for fine-tuning LLMs. The trafilatura extraction removes navigation, ads, and boilerplate — leaving only the main content.
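
As a rough illustration of turning the output into a training file, the sketch below converts records into JSON Lines and skips pages with too little text. The `to_jsonl` helper, the `text`/`source` field names, and the `min_words` threshold are assumptions for illustration; use whatever schema your fine-tuning pipeline expects.

```python
import json


def to_jsonl(records, min_words=50):
    """Serialize records with enough content as one JSON object per line."""
    lines = []
    for r in records:
        text = (r.get("content") or "").strip()
        # Skip near-empty pages; they add noise rather than signal.
        if len(text.split()) < min_words:
            continue
        lines.append(
            json.dumps({"text": text, "source": r["url"]}, ensure_ascii=False)
        )
    return "\n".join(lines)
```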

3. Competitor Analysis

Inventory a competitor's entire content strategy — how many pages, how much content per page, what topics they cover.
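
One simple way to get that overview is to group pages by the first segment of their URL path. The sketch below assumes records with the `url` and `wordCount` fields from the example output; `summarize_by_section` and the `(root)` label are hypothetical names.

```python
from collections import defaultdict
from urllib.parse import urlparse


def summarize_by_section(records):
    """Return page count and average word count per top-level path segment."""
    totals = defaultdict(lambda: {"pages": 0, "words": 0})
    for r in records:
        path = urlparse(r["url"]).path.strip("/")
        # "/blog/post-1" belongs to "blog"; the homepage goes to "(root)".
        section = path.split("/")[0] if path else "(root)"
        totals[section]["pages"] += 1
        totals[section]["words"] += r.get("wordCount", 0)
    return {
        section: {"pages": t["pages"], "avgWords": round(t["words"] / t["pages"])}
        for section, t in totals.items()
    }
```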

4. Content Migration

Before migrating a legacy site, extract all content into structured JSON for easy import into a new CMS.

Technical Details

The extractor is built in Python 3.12 and uses:

  • trafilatura for main content extraction (purpose-built for isolating article text, which a general HTML parser like BeautifulSoup does not do on its own)
  • lxml for sitemap XML parsing
  • BeautifulSoup for metadata extraction
  • Apify SDK for infrastructure and scaling

It handles both <urlset> (regular sitemaps) and <sitemapindex> (nested sitemaps), following child sitemaps recursively.
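
The recursion can be sketched roughly as follows. This is not the extractor's actual code: it swaps in the standard library's `xml.etree.ElementTree` for lxml so the example has no dependencies, and `parse_sitemap` plus its injected `fetch` callable (URL in, raw bytes out) are hypothetical names.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(url, fetch, seen=None):
    """Return page entries from a sitemap, following sitemap indexes."""
    seen = set() if seen is None else seen
    if url in seen:  # guard against indexes that reference each other
        return []
    seen.add(url)

    data = fetch(url)
    if data[:2] == b"\x1f\x8b":  # gzip magic bytes, e.g. sitemap.xml.gz
        data = gzip.decompress(data)

    root = ET.fromstring(data)
    entries = []
    if root.tag == NS + "sitemapindex":
        # Each <sitemap><loc> points at a child sitemap; recurse into it.
        for loc in root.iter(NS + "loc"):
            entries.extend(parse_sitemap(loc.text.strip(), fetch, seen))
    elif root.tag == NS + "urlset":
        for node in root.iter(NS + "url"):
            loc = node.findtext(NS + "loc")
            if loc:
                entries.append(
                    {"url": loc.strip(), "lastmod": node.findtext(NS + "lastmod")}
                )
    return entries
```

Passing `fetch` in keeps the parser easy to test offline; for a quick experiment, `lambda u: urllib.request.urlopen(u).read()` would be enough, though a production crawler would also want timeouts and retries.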

Get Started

Try it now on the Apify Store

No registration needed — just paste a sitemap URL and hit run.


What would you use a sitemap extractor for? Let me know in the comments!
