Web content extraction is the task of isolating the main content of a web page from its surrounding boilerplate — navigation menus, cookie banners, ads, sidebars, footers, and the other 80% of a page that isn't the actual content. If you process web pages at scale, you need it. Search engines use it for indexing. RAG pipelines use it to feed clean context to LLMs. SEO practitioners use it to approximate what Google sees when it evaluates a page.
The open-source ecosystem for this is strong. Trafilatura (Python), Readability (JavaScript), jusText, BoilerPy3 — all solid tools that work well on news articles and blog posts. On articles, the top systems all converge above F1 = 0.90. The problem is largely solved.
But the web is not just articles.
## The Problem: Everything That Isn't an Article
When I was running SEO audits across thousands of competitor pages from search results, I kept hitting the same issue. The extraction tools worked on articles but fell apart on everything else:
- Product pages encode descriptions in JSON-LD structured data rather than the visible DOM. Extractors that only parse what they see in the HTML miss the content entirely.
- Forum pages wrap user posts in CSS classes like `comment` and `reply` — the exact patterns that article extractors have been trained to strip as boilerplate.
- Service pages spread content across 5 to 15 independent `<section>` elements. An extractor that picks a single best node grabs the hero section and throws away the rest.
- Documentation pages embed content alongside sidebar navigation, version pickers, and table-of-contents panels that extractors include as if they were content.
These aren't edge cases. In the WCEB dataset, 47% of pages are non-articles. And the failures are architectural — no amount of parameter tuning within an article-focused extractor can fix them. You need a different extraction strategy for each page type.
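To make the product-page failure mode concrete, here is a minimal sketch (illustrative only, not rs-trafilatura's implementation) of pulling a description out of a JSON-LD block that a DOM-scoring extractor never sees:

```python
import json
import re

# A toy product page: the real description lives only in JSON-LD, while the
# visible DOM contains nothing but a hero banner.
html = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Widget", "description": "A durable widget."}
</script>
</head><body><div class="hero">Buy now!</div></body></html>
"""

def jsonld_blocks(page: str):
    """Yield parsed JSON-LD objects embedded in the page."""
    pattern = r'<script type="application/ld\+json">(.*?)</script>'
    for block in re.findall(pattern, page, re.DOTALL):
        yield json.loads(block)

product = next(b for b in jsonld_blocks(html) if b.get("@type") == "Product")
print(product["description"])  # A durable widget.
```

An extractor that only scores visible DOM nodes would return "Buy now!" here; the structured-data path recovers the actual content.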
## What rs-trafilatura Does Differently
rs-trafilatura is a Rust library that classifies pages into seven types (article, forum, product, collection, listing, documentation, service) and applies type-specific extraction profiles. The classifier is a three-stage pipeline:
- URL heuristics — fast pattern matching on domain and path. URLs containing `/forum/`, `/products/`, or `docs.` subdomains resolve immediately. This handles ~63% of pages.
- HTML signal analysis — JSON-LD `@type` values, Open Graph meta tags, DOM patterns like product grids or code blocks. This catches ~15% more.
- XGBoost ML classifier — 200 trees over 181 features (DOM structure, vocabulary density, link ratios) for the remaining ambiguous pages. 86.6% accuracy.
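The cascade can be sketched like this — function names, rules, and thresholds are illustrative stand-ins, not rs-trafilatura's actual API:

```python
from urllib.parse import urlparse

# Illustrative URL rules; the real library uses a much larger pattern set.
URL_RULES = {
    "/forum/": "forum",
    "/products/": "product",
    "/docs/": "documentation",
}

def ml_classify(page: str) -> str:
    """Placeholder for the trained XGBoost model over 181 features."""
    return "article"

def classify(url: str, page: str) -> str:
    # Stage 1: URL heuristics — cheap, resolves most pages immediately.
    parsed = urlparse(url)
    if parsed.hostname and parsed.hostname.startswith("docs."):
        return "documentation"
    for pattern, page_type in URL_RULES.items():
        if pattern in parsed.path:
            return page_type
    # Stage 2: HTML signals — JSON-LD @type, Open Graph tags, DOM patterns.
    if '"@type": "Product"' in page:
        return "product"
    if 'property="og:type" content="article"' in page:
        return "article"
    # Stage 3: only ambiguous pages pay for the ML classifier.
    return ml_classify(page)

print(classify("https://docs.example.com/guide", ""))  # documentation
```

The point of the ordering is cost: most pages never reach the expensive stage, so the average classification time stays close to a string match.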
Once the page type is known, extraction uses a type-specific profile:
- Forums get `comments_are_content = true` and platform-specific selectors for XenForo, vBulletin, Discourse, phpBB, and others.
- Service pages get multi-candidate content merging — selecting the top-scoring sections and concatenating them.
- Products get JSON-LD structured data fallback when DOM extraction produces poor results.
- Documentation gets framework-specific boilerplate removal for Sphinx, Rustdoc, MDN, and ReadTheDocs.
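A profile is essentially a bundle of per-type switches. The sketch below uses assumed field names modeled on the behaviors listed above; the crate's real types will differ:

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    """Hypothetical extraction profile; fields mirror the behaviors above."""
    comments_are_content: bool = False
    merge_top_candidates: int = 1      # >1 enables multi-candidate merging
    jsonld_fallback: bool = False
    extra_boilerplate_selectors: list = field(default_factory=list)

PROFILES = {
    "forum": Profile(comments_are_content=True),
    "service": Profile(merge_top_candidates=10),
    "product": Profile(jsonld_fallback=True),
    "documentation": Profile(
        extra_boilerplate_selectors=[".sphinxsidebar", ".toc"]
    ),
}

def profile_for(page_type: str) -> Profile:
    # Unknown types fall back to a conservative article-style profile.
    return PROFILES.get(page_type, Profile())

print(profile_for("forum").comments_are_content)  # True
```

The design point is that each switch inverts an assumption article extractors hard-code: comments become content, "pick one best node" becomes "merge many", and the DOM stops being the only source of truth.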
## Extraction Quality Predictor
After extraction, an ML quality predictor estimates how reliable the result is. It's a 27-feature XGBoost regression model that predicts the expected F1 score. Pages scoring below 0.80 are candidates for LLM fallback — you can route them to MinerU-HTML or another neural extractor while keeping the fast heuristic path for the 92% of pages where it works well.
This is the basis for hybrid extraction pipelines: heuristic speed on most pages, neural quality on the hard cases.
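The routing logic is a one-line threshold check. In this sketch, `extract_fast` and `extract_neural` are placeholders standing in for rs-trafilatura (with its quality predictor) and a neural extractor such as MinerU-HTML:

```python
QUALITY_THRESHOLD = 0.80  # predicted-F1 cutoff from the quality model

def extract_fast(html: str) -> tuple[str, float]:
    # Placeholder: heuristic extraction plus a predicted F1 score.
    return ("heuristic text", 0.95 if "article" in html else 0.50)

def extract_neural(html: str) -> str:
    # Placeholder: route to MinerU-HTML or another neural extractor.
    return "neural text"

def extract_hybrid(html: str) -> str:
    text, predicted_f1 = extract_fast(html)
    if predicted_f1 >= QUALITY_THRESHOLD:
        return text                   # fast path: the large majority of pages
    return extract_neural(html)       # slow path: low-confidence pages only
```

Because only low-confidence pages hit the neural model, average latency stays near the heuristic path's while quality on hard pages approaches the neural model's.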
## Performance
Benchmarked on the development split of the Web Content Extraction Benchmark (WCEB) — 1,497 pages from 1,295 domains across 7 page types:
| System | F1 | Speed |
|---|---|---|
| rs-trafilatura | 0.859 | 44 ms/page |
| MinerU-HTML (0.6B) | 0.827 | 1,570 ms/page |
| Trafilatura (Python) | 0.791 | 94 ms/page |
| ReaderLM-v2 (1.5B) | 0.741 | 10,410 ms/page |
On a separate 511-page held-out test set (never used during development), rs-trafilatura achieves F1 = 0.893, confirming that results generalise. With a hybrid pipeline routing low-confidence pages to MinerU-HTML, the held-out F1 reaches 0.910.
On articles, every top system converges around F1 = 0.93 — the differences are marginal. The gap opens on non-article types:
| Page Type | rs-trafilatura | Trafilatura |
|---|---|---|
| Article | 0.932 | 0.926 |
| Documentation | 0.931 | 0.888 |
| Service | 0.843 | 0.763 |
| Forum | 0.792 | 0.585 |
| Collection | 0.713 | 0.553 |
| Listing | 0.704 | 0.589 |
| Product | 0.670 | 0.567 |
Forums: +0.207 over Trafilatura. Collections: +0.160. These aren't marginal improvements — they're the difference between getting most of the content and getting almost none.
## Using rs-trafilatura
### Rust
```rust
use rs_trafilatura::extract;

let result = extract(html)?;
println!("Title: {:?}", result.metadata.title);
println!("Content: {}", result.content_text);
println!("Page type: {:?}", result.metadata.page_type);
println!("Confidence: {:.2}", result.extraction_quality);
```

```toml
[dependencies]
rs-trafilatura = "0.2"
```
### Python
```python
import rs_trafilatura

result = rs_trafilatura.extract(html, url="https://example.com")
print(result.title, result.main_content, result.page_type, result.extraction_quality)
```

```bash
pip install rs-trafilatura
```
The Python package bundles four Rust crates into a single native extension: content extraction, page type classification, HTML cleaning, and Markdown conversion. No subprocess overhead — it's compiled Rust called directly from Python via PyO3.
### Spider-rs Integration
```rust
use spider::website::Website;
use rs_trafilatura::spider_integration::extract_page;

let mut website = Website::new("https://example.com");
website.crawl().await;

for page in website.get_pages().into_iter().flatten() {
    if let Ok(result) = extract_page(&page) {
        println!(
            "[{}] {}",
            result.metadata.page_type.unwrap_or_default(),
            result.metadata.title.unwrap_or_default()
        );
    }
}
```

```toml
[dependencies]
rs-trafilatura = { version = "0.2", features = ["spider"] }
```
## The Benchmark
WCEB is the dataset behind these numbers — 2,008 pages from 1,613 domains, with a dev/test split and ground truth annotations for all seven page types. It's the first benchmark that measures extraction quality across structurally different page types rather than just articles.
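For readers unfamiliar with how extraction F1 is computed: a common formulation (which may differ in detail from WCEB's exact scoring) measures token overlap between the extracted text and the ground-truth annotation:

```python
from collections import Counter

def token_f1(extracted: str, gold: str) -> float:
    """Token-overlap F1; shown only to make the metric concrete."""
    e, g = Counter(extracted.split()), Counter(gold.split())
    overlap = sum((e & g).values())  # tokens shared, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(e.values())  # extracted tokens that are correct
    recall = overlap / sum(g.values())     # gold tokens that were recovered
    return 2 * precision * recall / (precision + recall)

print(token_f1("the quick brown fox", "the quick red fox"))  # 0.75
```

Under this metric, an extractor that returns only a hero section from a ten-section service page is punished on recall, which is exactly the failure the per-type scores above expose.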
- Website: webcontentextraction.org
- Dataset: GitHub · Zenodo · HuggingFace
- Rust crate: crates.io/crates/rs-trafilatura · GitHub
- Python package: pypi.org/project/rs-trafilatura · GitHub
If you're extracting content at scale and need reliability beyond articles, give it a try. If you run it on your own data, I'd love to hear how it performs.