DEV Community: Oaida Adrian

How to Extract Clean Content From Any Website Sitemap (For SEO Audits & AI Training)

Oaida Adrian — Sat, 04 Jul 2026 10:50:51 +0000

Ever needed to inventory every page on a website? Extract clean text content for AI training? Or audit meta tags across an entire domain?

I built a Sitemap Content Extractor that does exactly this — feed it a sitemap.xml URL and it crawls every page, extracting structured content.

What It Does

Parses sitemap indexes — follows nested sitemaps recursively
Handles gzip sitemaps — .xml.gz files work out of the box
Extracts full content — clean article text using trafilatura
Captures metadata — title, meta description, meta keywords, H1 headings
Word counts — for every page
URL filtering — include/exclude patterns via regex
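
To make the include/exclude idea concrete, here is a minimal sketch of regex-based URL filtering. The function name `filter_urls` and the rule that an empty include list means "keep everything" are assumptions for illustration, not the actor's actual behaviour.

```python
import re
from typing import Iterable, List


def filter_urls(urls: Iterable[str], include: List[str], exclude: List[str]) -> List[str]:
    """Keep URLs that match any include pattern and no exclude pattern.

    Assumption: an empty include list means "include everything".
    """
    include_res = [re.compile(p) for p in include]
    exclude_res = [re.compile(p) for p in exclude]
    kept = []
    for url in urls:
        # Skip URLs that match none of the include patterns (when any are given).
        if include_res and not any(r.search(url) for r in include_res):
            continue
        # Skip URLs that match at least one exclude pattern.
        if any(r.search(url) for r in exclude_res):
            continue
        kept.append(url)
    return kept


# Hypothetical URLs, for illustration only.
urls = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/tag/python",
    "https://example.com/about",
]
filtered = filter_urls(urls, include=[r"/blog/"], exclude=[r"/tag/"])
print(filtered)
```

Applying the exclude patterns after the include patterns lets a broad include such as `/blog/` be narrowed without writing one complicated expression.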

How to Use It

You can run it directly on Apify Store — no setup required.

Just provide:

A sitemap URL (e.g., https://example.com/sitemap.xml)
Max URLs to process
Whether to extract full content

Example Output

{
  "url": "https://pydantic.dev/docs/",
  "title": "Pydantic Docs - Validation, AI Agents, Logfire Observability",
  "content": "Full extracted article text...",
  "wordCount": 131,
  "metaDescription": "Pydantic documentation...",
  "h1Headings": ["Pydantic Docs"],
  "lastmod": "2025-01-15",
  "extractedAt": "2026-07-04T10:45:00Z"
}

Real-World Use Cases

1. SEO Content Audits

Crawl your entire site and identify pages with:

Missing or duplicate meta descriptions
Short content (under 300 words)
Missing H1 tags
Stale content (old lastmod dates)
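
The checks above can be sketched as a small pass over the extracted records. The field names follow the example output; the function name `audit_pages`, the stale-date cutoff, and the sample records are assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def audit_pages(pages, min_words=300, stale_before=date(2024, 1, 1)):
    """Return {url: [issue, ...]} for every page that trips at least one check."""
    # Count how often each non-empty meta description appears across the site.
    desc_counts = Counter(
        p["metaDescription"].strip() for p in pages if p.get("metaDescription")
    )
    report = {}
    for page in pages:
        issues = []
        desc = (page.get("metaDescription") or "").strip()
        if not desc:
            issues.append("missing meta description")
        elif desc_counts[desc] > 1:
            issues.append("duplicate meta description")
        if page.get("wordCount", 0) < min_words:
            issues.append("short content")
        if not page.get("h1Headings"):
            issues.append("missing H1")
        lastmod = page.get("lastmod")
        # Only the date part is compared, so full timestamps also work.
        if lastmod and date.fromisoformat(lastmod[:10]) < stale_before:
            issues.append("stale content")
        if issues:
            report[page["url"]] = issues
    return report


# Hypothetical records shaped like the example output.
pages = [
    {"url": "https://example.com/a", "metaDescription": "Same text",
     "wordCount": 850, "h1Headings": ["A"], "lastmod": "2025-01-15"},
    {"url": "https://example.com/b", "metaDescription": "Same text",
     "wordCount": 131, "h1Headings": [], "lastmod": "2021-06-01"},
    {"url": "https://example.com/c", "metaDescription": "Unique",
     "wordCount": 400, "h1Headings": ["C"], "lastmod": "2025-03-02"},
]
report = audit_pages(pages)
for url, issues in report.items():
    print(url, issues)
```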

2. AI Training Data Collection

Extract clean text from documentation sites for fine-tuning LLMs. The trafilatura extraction removes navigation, ads, and boilerplate — leaving only the main content.

3. Competitor Analysis

Inventory a competitor's entire content strategy — how many pages, how much content per page, what topics they cover.

4. Content Migration

Before migrating a legacy site, extract all content into structured JSON for easy import into a new CMS.

Technical Details

The extractor is built in Python 3.12 and uses:

trafilatura for main content extraction (it targets article text directly, where BeautifulSoup alone would need hand-written selectors for each site)
lxml for sitemap XML parsing
BeautifulSoup for metadata extraction
Apify SDK for infrastructure and scaling

It handles both <urlset> (regular sitemaps) and <sitemapindex> (nested sitemaps), following child sitemaps recursively.
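
A rough sketch of that recursion is shown here. It uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra dependencies, and it takes a `fetch` callable so it can be tried against in-memory data. The function name `collect_urls` and the sample documents are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(sitemap_url, fetch, seen=None):
    """Return [{"url": ..., "lastmod": ...}] for a sitemap or sitemap index.

    `fetch` is any callable that maps a URL to raw bytes.
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    raw = fetch(sitemap_url)
    if raw[:2] == b"\x1f\x8b":  # gzip magic bytes, e.g. for .xml.gz files
        raw = gzip.decompress(raw)
    root = ET.fromstring(raw)

    entries = []
    if root.tag.endswith("sitemapindex"):
        # Index file: recurse into every child sitemap.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            child_url = (loc.text or "").strip()
            if child_url:
                entries.extend(collect_urls(child_url, fetch, seen))
    else:
        # Regular <urlset>: collect page URLs and their optional lastmod.
        for node in root.findall("sm:url", NS):
            loc = (node.findtext("sm:loc", default="", namespaces=NS) or "").strip()
            if loc:
                entries.append(
                    {"url": loc, "lastmod": node.findtext("sm:lastmod", namespaces=NS)}
                )
    return entries


# In-memory stand-in for a site: one index pointing at one gzipped child sitemap.
index_xml = (
    b'<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<sitemap><loc>https://example.com/posts.xml.gz</loc></sitemap>"
    b"</sitemapindex>"
)
child_xml = (
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/a</loc><lastmod>2025-01-15</lastmod></url>"
    b"<url><loc>https://example.com/b</loc></url>"
    b"</urlset>"
)
fake_site = {
    "https://example.com/sitemap.xml": index_xml,
    "https://example.com/posts.xml.gz": gzip.compress(child_xml),
}
entries = collect_urls("https://example.com/sitemap.xml", fake_site.__getitem__)
print(entries)
```

Checking the gzip magic bytes instead of the file extension also covers servers that serve compressed sitemaps under a plain `.xml` name.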

Get Started

Try it now on the Apify Store

No registration needed — just paste a sitemap URL and hit run.

What would you use a sitemap extractor for? Let me know in the comments!

Scraping 187,000 Romanian Businesses: Building a B2B Lead Generation Tool

Oaida Adrian — Sat, 04 Jul 2026 10:37:34 +0000

I needed Romanian B2B leads and couldn't find a good scraper for local business directories. So I built one.

The Problem

Most lead generation tools focus on the US and Western European markets. If you're doing business in Romania or Eastern Europe, you're stuck with:

Manual directory browsing
US-centric tools that don't understand local directory structures
Outdated databases with stale contacts

The Solution

I built a Romanian Business Directory Scraper that works with listafirme.ro — one of Romania's largest business registries with 187,000+ companies in Bucharest alone.

What It Extracts

For each company, the scraper pulls:

Company name (Denumire)
CUI — Romanian tax identification number
Trade register number (Nr. Reg. Com.)
Full address — Street, city, county (județ)
CAEN code — Business activity classification
Founding date
VAT status — Plătitor/neplătitor de TVA

Sample Output

{
  "companyName": "BORG DESIGN SRL",
  "cui": "RO14837428",
  "tradeRegister": "J40/8118/2002",
  "address": "Str. Ing. Stefan Hepites 16A",
  "city": "Sectorul 5",
  "county": "Bucuresti",
  "category": "Proiectarea structurii și conținutului website...",
  "foundedDate": "2002-08-26"
}

Coverage

41 counties (județe) supported
187,009 companies in București alone
Pagination handled automatically (3,741 pages for București)
Detail page extraction for full company data
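
As a quick sanity check on those figures, the two numbers are consistent with 50 listings per results page. That page size is inferred from the counts above, not stated anywhere, and the helper name `pages_needed` is hypothetical.

```python
import math

total_companies = 187_009
total_pages = 3_741

# Assumption: every page is full except possibly the last one.
implied_page_size = math.ceil(total_companies / total_pages)


def pages_needed(total_items, page_size):
    """Number of result pages needed to list total_items."""
    return math.ceil(total_items / page_size)


# Listings left over for the final page under the inferred page size.
last_page_items = total_companies - (total_pages - 1) * implied_page_size

print(implied_page_size, pages_needed(total_companies, implied_page_size), last_page_items)
```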

Use Cases

B2B Lead Generation — Build targeted contact lists by industry and region
Market Research — Analyse business density by county or CAEN category
Competitor Analysis — Map competitors in your sector by region
Local SEO — Build citation lists for Romanian businesses

Try It

The tool is on the Apify Store: Romanian Business Directory Scraper

Pricing: $0.01 per business listing extracted. Free tier covers ~500 listings.

Anyone else building tools for the Romanian/Eastern European market? Would love to hear what directories you're working with.

Make Any Website AI-Readable: Generating llms.txt Files with Python

Oaida Adrian — Sat, 04 Jul 2026 10:31:31 +0000

AI assistants like ChatGPT, Claude, and Perplexity are increasingly crawling the web for context. But most websites aren't optimised for AI readability — they're built for human browsers with complex HTML, JavaScript navigation, and boilerplate-heavy layouts.

The llms.txt standard is changing this. It's a simple convention: place an llms.txt file at your site root that gives AI systems clean, structured content they can actually understand.

I built a tool that generates these files automatically for any website.

What is llms.txt?

Think of it as robots.txt, but for LLMs. The convention centres on llms.txt itself, often paired with llms-full.txt; the generator also emits per-page data:

llms.txt — A curated summary of your site with key links
llms-full.txt — Complete site content in clean markdown
Per-page data — Structured JSON with extracted content per URL
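
For a sense of what the summary file looks like, here is a minimal sketch that renders one. The layout (an H1 title, a blockquote summary, then H2 sections with link lists) follows the published llms.txt proposal as I understand it. The function name `render_llms_txt` and the sample site are hypothetical, and this is not the generator's actual output code.

```python
def render_llms_txt(site_name, summary, sections):
    """Render an llms.txt-style summary.

    `sections` maps a heading to a list of (title, url, note) tuples;
    `note` may be an empty string.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            suffix = f": {note}" if note else ""
            lines.append(f"- [{title}]({url}){suffix}")
        lines.append("")
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site, for illustration only.
text = render_llms_txt(
    "Example Docs",
    "Hypothetical documentation site used for illustration.",
    {
        "Docs": [
            ("Quickstart", "https://example.com/quickstart.md", "Install and first steps"),
            ("API", "https://example.com/api.md", ""),
        ]
    },
)
print(text)
```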

The Generator

The llms.txt Generator crawls any website using BFS (Breadth-First Search) and:

Respects configurable crawl depth and URL filters
Extracts clean content via trafilatura (not regex — actual text extraction)
Outputs markdown or plaintext
Handles JavaScript-rendered pages
Produces both summary and full-content files
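
The crawl order can be illustrated with a small breadth-first sketch. It takes a `get_links` callable instead of making HTTP requests, so it runs against an in-memory link graph. The names `bfs_crawl` and `max_depth` and the same-host rule are assumptions; only `maxPages` appears in the actor's documented inputs.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def bfs_crawl(start_url, get_links, max_pages=50, max_depth=2):
    """Visit pages breadth-first, staying on the start URL's host.

    `get_links` is any callable that returns the hrefs found on a page.
    """
    host = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()  # FIFO order is what makes this breadth-first
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for href in get_links(url):
            link = urljoin(url, href).split("#", 1)[0]  # resolve and drop fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical link graph standing in for a real site.
site = {
    "https://example.com/": ["/docs/", "/blog/", "https://other.example.org/"],
    "https://example.com/docs/": ["/docs/install", "/"],
    "https://example.com/blog/": ["/docs/"],
    "https://example.com/docs/install": [],
}
visited = bfs_crawl("https://example.com/", lambda u: site.get(u, []), max_pages=3)
print(visited)
```

Because the queue is first-in, first-out, pages closest to the start URL are processed first, which tends to favour the most prominent pages when `max_pages` is small.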

Why This Matters for SEO

Traditional SEO targets Google's crawler. But a new category is emerging: SEO for AI.

When a user asks ChatGPT "what is [your product]?", the AI draws on its training data and web search results. If your site has a clean llms.txt, the AI can read structured, accurate content instead of parsing your homepage HTML.

Input Parameters

Parameter	Default	Description
`startUrls`	required	Website URLs to crawl
`maxPages`	50	Maximum pages to process
`outputFormat`	markdown	Output format (markdown/plaintext)
`includePatterns`	[]	URL patterns to include
`excludePatterns`	[]	URL patterns to exclude

Example: Documenting a Python Library

I tested it on Pydantic's documentation (docs.pydantic.dev). The crawler:

Started at the root docs page
Followed internal links via BFS
Extracted clean content from each page
Produced a structured dataset with per-page markdown

Result: 2 pages processed, full content extracted with zero boilerplate.

Try It

Live on the Apify Store: llms.txt Generator

Pricing is $0.01 per page processed. Free tier covers ~50 pages.

The llms.txt standard is still emerging, but early adopters will have an advantage as AI-driven search grows. Is your website AI-readable?

I Built an RSS Aggregator That Extracts Full Article Content (Not Just Summaries)

Oaida Adrian — Sat, 04 Jul 2026 10:30:50 +0000

Many RSS feeds carry only a short summary (often a couple of hundred characters), so a feed reader forces you to click through to read the full article. That's useless if you're building news monitoring pipelines, AI training datasets, or content curation tools.

So I built a proper RSS Feed Aggregator that follows each article link and extracts the complete full-text content — clean, structured, and ready to use.

What It Does

Multi-feed ingestion — Point it at multiple RSS/Atom feeds simultaneously
Full-text extraction — Uses trafilatura to extract the actual article content, stripping boilerplate, ads, and navigation
Deduplication — Automatically detects and removes duplicate articles across feeds
Rich metadata — Word counts, authorship, publish dates, images, source tracking
Keyword filtering — Include/exclude articles by keywords

Example Output

Each article comes back as structured JSON:

{
  "title": "The only AI glossary you'll need this year",
  "fullContent": "...3,727 words of clean extracted text...",
  "author": "Kyle Wiggers",
  "publishedDate": "2026-07-04T10:00:00Z",
  "wordCount": 3727,
  "imageUrl": "https://...",
  "sourceFeed": "https://techcrunch.com/feed/",
  "sourceUrl": "https://techcrunch.com/2026/07/04/..."
}

Real-World Use Cases

AI/LLM Training Data — Need clean text without HTML boilerplate? This outputs publication-ready content.
News Monitoring — Aggregate dozens of feeds and get full articles, not snippets.
Content Curation — Pull from multiple sources, deduplicate, filter by keywords.
Research Pipelines — Collect articles on specific topics for analysis.

Try It

The tool is live on the Apify Store: RSS Feed Aggregator & Article Extractor

It uses pay-per-event pricing at $0.01 per article extracted. If you're on Apify's free tier ($5/mo credits), that covers ~500 articles — enough for a solid test run.

Input Parameters

Parameter	Default	Description
`feedUrls`	required	RSS/Atom feed URLs
`maxResults`	50	Maximum articles to extract
`extractContent`	true	Follow links and extract full text
`deduplicate`	true	Remove duplicate articles
`keywordFilter`	[]	Include/exclude keywords

How Full-Text Extraction Works

The actor uses trafilatura, a Python library specifically designed for web text extraction. Unlike basic regex or BeautifulSoup approaches, trafilatura:

Strips navigation, sidebars, footers, and ads
Preserves article structure (paragraphs, headings)
Works on the HTML it is handed, so JavaScript-heavy pages need to be rendered before extraction
Works across 20+ languages

This means you get the actual article text — not the RSS description, not a truncated summary, but the full content as the author wrote it.
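
The follow-the-link step can be sketched as below. The fetcher and extractor are passed in as callables so the example runs offline with stubs. In a real pipeline they could be `trafilatura.fetch_url` and `trafilatura.extract`, but that wiring is an assumption about usage, not the actor's code. The sketch handles RSS 2.0 `<item>` elements only, and `extract_feed_articles` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET


def extract_feed_articles(feed_xml, fetch_html, extract_text, max_results=50):
    """Follow each RSS item's link and return records with the full text."""
    root = ET.fromstring(feed_xml)
    articles = []
    for item in root.iter("item"):
        if len(articles) >= max_results:
            break
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # nothing to follow
        # The extractor may return None when no main content is found.
        text = extract_text(fetch_html(link)) or ""
        articles.append({
            "title": (item.findtext("title") or "").strip(),
            "sourceUrl": link,
            "fullContent": text,
            "wordCount": len(text.split()),
        })
    return articles


# Offline stand-ins for the network fetch and the content extractor.
feed = """<rss version="2.0"><channel><title>Demo feed</title>
<item><title>First post</title><link>https://example.com/1</link>
<description>Short teaser</description></item>
<item><title>Second post</title><link>https://example.com/2</link></item>
</channel></rss>"""
html_pages = {
    "https://example.com/1": "<html><body><article>one two three four</article></body></html>",
    "https://example.com/2": "<html><body><p>five six</p></body></html>",
}


def stub_extract(html):
    """Crude tag stripper, used only so the example runs without dependencies."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


articles = extract_feed_articles(feed, html_pages.get, stub_extract)
print([(a["title"], a["wordCount"]) for a in articles])
```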

If you're working with RSS feeds or news data, give it a try. Happy to add features based on feedback — what would make this useful for your use case?