Web content extraction is the task of isolating the main content of a web page from its surrounding boilerplate — navigation menus, cookie banners, ads, sidebars, footers, and the other 80% of a page that isn't the actual content. If you process web pages at scale, you need it. Search engines use it for indexing. RAG pipelines use it to feed clean context to LLMs. SEO practitioners use it to approximate what Google sees when it evaluates a page.
The open-source ecosystem for this is strong. Trafilatura (Python), Readability (JavaScript), jusText, BoilerPy3 — all solid tools that work well on news articles and blog posts. On articles, the top systems all converge above F1 = 0.90. The problem is largely solved.
But the web is not just articles.
## The Problem: Everything That Isn't an Article
When I was running SEO audits across thousands of competitor pages from search results, I kept hitting the same issue. The extraction tools worked on articles but fell apart on everything else:
- Product pages encode descriptions in JSON-LD structured data rather than the visible DOM. Extractors that only parse what they see in the HTML miss the content entirely.
- Forum pages wrap user posts in CSS classes like `comment` and `reply` — the exact patterns that article extractors have been trained to strip as boilerplate.
- Service pages spread content across 5 to 15 independent `<section>` elements. An extractor that picks a single best node grabs the hero section and throws away the rest.
- Documentation pages embed content alongside sidebar navigation, version pickers, and table-of-contents panels that extractors include as if they were content.
These aren't edge cases. In the WCEB dataset, 47% of pages are non-articles. And the failures are architectural — no amount of parameter tuning within an article-focused extractor can fix them. You need a different extraction strategy for each page type.
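To make the product-page failure mode concrete, here is a minimal sketch (illustrative only, not rs-trafilatura's implementation) of pulling a description out of a JSON-LD block that a DOM-scoring extractor never sees:

```python
import json
import re

# A toy product page: the real description lives only in JSON-LD, while the
# visible DOM contains nothing but a hero banner.
html = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Widget", "description": "A durable widget."}
</script>
</head><body><div class="hero">Buy now!</div></body></html>
"""

def jsonld_blocks(page: str):
    """Yield parsed JSON-LD objects embedded in the page."""
    pattern = r'<script type="application/ld\+json">(.*?)</script>'
    for block in re.findall(pattern, page, re.DOTALL):
        yield json.loads(block)

product = next(b for b in jsonld_blocks(html) if b.get("@type") == "Product")
print(product["description"])  # A durable widget.
```

An extractor that only scores visible DOM nodes would return "Buy now!" here; the structured-data path recovers the actual content.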
## What rs-trafilatura Does Differently
rs-trafilatura is a Rust library that classifies pages into seven types (article, forum, product, collection, listing, documentation, service) and applies type-specific extraction profiles. The classifier is a three-stage pipeline:
- URL heuristics — fast pattern matching on domain and path. URLs containing `/forum/`, `/products/`, or `docs.` subdomains resolve immediately. This handles ~63% of pages.
- HTML signal analysis — JSON-LD `@type` values, Open Graph meta tags, DOM patterns like product grids or code blocks. This catches ~15% more.
- XGBoost ML classifier — 200 trees over 181 features (DOM structure, vocabulary density, link ratios) for the remaining ambiguous pages. 86.6% accuracy.
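The cascade can be sketched like this — function names, rules, and thresholds are illustrative stand-ins, not rs-trafilatura's actual API:

```python
from urllib.parse import urlparse

# Illustrative URL rules; the real library uses a much larger pattern set.
URL_RULES = {
    "/forum/": "forum",
    "/products/": "product",
    "/docs/": "documentation",
}

def ml_classify(page: str) -> str:
    """Placeholder for the trained XGBoost model over 181 features."""
    return "article"

def classify(url: str, page: str) -> str:
    # Stage 1: URL heuristics — cheap, resolves most pages immediately.
    parsed = urlparse(url)
    if parsed.hostname and parsed.hostname.startswith("docs."):
        return "documentation"
    for pattern, page_type in URL_RULES.items():
        if pattern in parsed.path:
            return page_type
    # Stage 2: HTML signals — JSON-LD @type, Open Graph tags, DOM patterns.
    if '"@type": "Product"' in page:
        return "product"
    if 'property="og:type" content="article"' in page:
        return "article"
    # Stage 3: only ambiguous pages pay for the ML classifier.
    return ml_classify(page)

print(classify("https://docs.example.com/guide", ""))  # documentation
```

The point of the ordering is cost: most pages never reach the expensive stage, so the average classification time stays close to a string match.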
Once the page type is known, extraction uses a type-specific profile:
- Forums get `comments_are_content = true` and platform-specific selectors for XenForo, vBulletin, Discourse, phpBB, and others.
- Service pages get multi-candidate content merging — selecting the top-scoring sections and concatenating them.
- Products get JSON-LD structured data fallback when DOM extraction produces poor results.
- Documentation gets framework-specific boilerplate removal for Sphinx, Rustdoc, MDN, and ReadTheDocs.
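A profile is essentially a bundle of per-type switches. The sketch below uses assumed field names modeled on the behaviors listed above; the crate's real types will differ:

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    """Hypothetical extraction profile; fields mirror the behaviors above."""
    comments_are_content: bool = False
    merge_top_candidates: int = 1      # >1 enables multi-candidate merging
    jsonld_fallback: bool = False
    extra_boilerplate_selectors: list = field(default_factory=list)

PROFILES = {
    "forum": Profile(comments_are_content=True),
    "service": Profile(merge_top_candidates=10),
    "product": Profile(jsonld_fallback=True),
    "documentation": Profile(
        extra_boilerplate_selectors=[".sphinxsidebar", ".toc"]
    ),
}

def profile_for(page_type: str) -> Profile:
    # Unknown types fall back to a conservative article-style profile.
    return PROFILES.get(page_type, Profile())

print(profile_for("forum").comments_are_content)  # True
```

The design point is that each switch inverts an assumption article extractors hard-code: comments become content, "pick one best node" becomes "merge many", and the DOM stops being the only source of truth.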
## Extraction Quality Predictor
After extraction, an ML quality predictor estimates how reliable the result is. It's a 27-feature XGBoost regression model that predicts the expected F1 score. Pages scoring below 0.80 are candidates for LLM fallback — you can route them to MinerU-HTML or another neural extractor while keeping the fast heuristic path for the 92% of pages where it works well.
This is the basis for hybrid extraction pipelines: heuristic speed on most pages, neural quality on the hard cases.
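The routing logic is a one-line threshold check. In this sketch, `extract_fast` and `extract_neural` are placeholders standing in for rs-trafilatura (with its quality predictor) and a neural extractor such as MinerU-HTML:

```python
QUALITY_THRESHOLD = 0.80  # predicted-F1 cutoff from the quality model

def extract_fast(html: str) -> tuple[str, float]:
    # Placeholder: heuristic extraction plus a predicted F1 score.
    return ("heuristic text", 0.95 if "article" in html else 0.50)

def extract_neural(html: str) -> str:
    # Placeholder: route to MinerU-HTML or another neural extractor.
    return "neural text"

def extract_hybrid(html: str) -> str:
    text, predicted_f1 = extract_fast(html)
    if predicted_f1 >= QUALITY_THRESHOLD:
        return text                   # fast path: the large majority of pages
    return extract_neural(html)       # slow path: low-confidence pages only
```

Because only low-confidence pages hit the neural model, average latency stays near the heuristic path's while quality on hard pages approaches the neural model's.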
## Performance
Benchmarked on the development split of the Web Content Extraction Benchmark (WCEB) — 1,497 pages from 1,295 domains across 7 page types:
| System | F1 | Speed |
|---|---|---|
| rs-trafilatura | 0.859 | 44 ms/page |
| MinerU-HTML (0.6B) | 0.827 | 1,570 ms/page |
| Trafilatura (Python) | 0.791 | 94 ms/page |
| ReaderLM-v2 (1.5B) | 0.741 | 10,410 ms/page |
On a separate 511-page held-out test set (never used during development), rs-trafilatura achieves F1 = 0.893, confirming that results generalise. With a hybrid pipeline routing low-confidence pages to MinerU-HTML, the held-out F1 reaches 0.910.
On articles, every top system converges around F1 = 0.93 — the differences are marginal. The gap opens on non-article types:
| Page Type | rs-trafilatura | Trafilatura |
|---|---|---|
| Article | 0.932 | 0.926 |
| Documentation | 0.931 | 0.888 |
| Service | 0.843 | 0.763 |
| Forum | 0.792 | 0.585 |
| Collection | 0.713 | 0.553 |
| Listing | 0.704 | 0.589 |
| Product | 0.670 | 0.567 |
Forums: +0.207 over Trafilatura. Collections: +0.160. These aren't marginal improvements — they're the difference between getting most of the content and getting almost none.
## Using rs-trafilatura
### Rust
```rust
use rs_trafilatura::extract;

let result = extract(html)?;
println!("Title: {:?}", result.metadata.title);
println!("Content: {}", result.content_text);
println!("Page type: {:?}", result.metadata.page_type);
println!("Confidence: {:.2}", result.extraction_quality);
```

```toml
[dependencies]
rs-trafilatura = "0.2"
```
### Python
```python
import rs_trafilatura

result = rs_trafilatura.extract(html, url="https://example.com")
print(result.title, result.main_content, result.page_type, result.extraction_quality)
```

```bash
pip install rs-trafilatura
```
The Python package bundles four Rust crates into a single native extension: content extraction, page type classification, HTML cleaning, and Markdown conversion. No subprocess overhead — it's compiled Rust called directly from Python via PyO3.
### Spider-rs Integration
```rust
use spider::website::Website;
use rs_trafilatura::spider_integration::extract_page;

let mut website = Website::new("https://example.com");
website.crawl().await;

for page in website.get_pages().into_iter().flatten() {
    if let Ok(result) = extract_page(&page) {
        println!(
            "[{}] {}",
            result.metadata.page_type.unwrap_or_default(),
            result.metadata.title.unwrap_or_default()
        );
    }
}
```

```toml
[dependencies]
rs-trafilatura = { version = "0.2", features = ["spider"] }
```
## The Benchmark
WCEB is the dataset behind these numbers — 2,008 pages from 1,613 domains, with a dev/test split and ground truth annotations for all seven page types. It's the first benchmark that measures extraction quality across structurally different page types rather than just articles.
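For readers unfamiliar with how extraction F1 is computed: a common formulation (which may differ in detail from WCEB's exact scoring) measures token overlap between the extracted text and the ground-truth annotation:

```python
from collections import Counter

def token_f1(extracted: str, gold: str) -> float:
    """Token-overlap F1; shown only to make the metric concrete."""
    e, g = Counter(extracted.split()), Counter(gold.split())
    overlap = sum((e & g).values())  # tokens shared, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(e.values())  # extracted tokens that are correct
    recall = overlap / sum(g.values())     # gold tokens that were recovered
    return 2 * precision * recall / (precision + recall)

print(token_f1("the quick brown fox", "the quick red fox"))  # 0.75
```

Under this metric, an extractor that returns only a hero section from a ten-section service page is punished on recall, which is exactly the failure the per-type scores above expose.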
- Website: webcontentextraction.org
- Dataset: GitHub · Zenodo · HuggingFace
- Rust crate: crates.io/crates/rs-trafilatura · GitHub
- Python package: pypi.org/project/rs-trafilatura · GitHub
If you're extracting content at scale and need reliability beyond articles, give it a try. If you run it on your own data, I'd love to hear how it performs.