spider is a high-performance async web crawler written in Rust. It discovers, fetches, and queues URLs — but content extraction is left to you. rs-trafilatura slots in as the extraction layer, giving you page-type-aware content extraction with quality scoring on every crawled page.
Setup
Add both crates to your Cargo.toml:
```toml
[dependencies]
rs-trafilatura = { version = "0.2", features = ["spider"] }
spider = "2"
tokio = { version = "1", features = ["full"] }
```
The spider feature flag enables rs_trafilatura::spider_integration, which provides convenience functions that accept spider's Page type directly.
Basic: Crawl Then Extract
The simplest approach — crawl a site, then extract content from every page:
```rust
use spider::website::Website;
use rs_trafilatura::spider_integration::extract_page;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    website.crawl().await;

    for page in website.get_pages().into_iter().flatten() {
        match extract_page(&page) {
            Ok(result) => {
                println!("[{}] {} (confidence: {:.2})",
                    result.metadata.page_type.unwrap_or_default(),
                    result.metadata.title.unwrap_or_default(),
                    result.extraction_quality,
                );
                println!("  Content: {} chars", result.content_text.len());
            }
            Err(e) => eprintln!("  Extraction failed: {e}"),
        }
    }
}
```
extract_page takes a &Page and returns Result<ExtractResult>. The page URL is automatically passed to the classifier for page type detection.
Streaming: Extract As Pages Arrive
For large crawls, you don't want to wait until everything is fetched. spider's subscribe channel lets you process pages as they arrive:
```rust
use spider::website::Website;
use rs_trafilatura::spider_integration::extract_page;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    let mut rx = website.subscribe(0).unwrap();

    let handle = tokio::spawn(async move {
        let mut count = 0;
        while let Ok(page) = rx.recv().await {
            if let Ok(result) = extract_page(&page) {
                count += 1;
                println!("[{count}] {} → {} ({:.2})",
                    page.get_url(),
                    result.metadata.page_type.unwrap_or_default(),
                    result.extraction_quality,
                );
            }
        }
        println!("Extracted {count} pages");
    });

    website.crawl().await;
    website.unsubscribe();
    let _ = handle.await;
}
```
Each page is extracted in the spawned task as soon as spider fetches it. Extraction takes ~44ms per page, so it easily keeps up with typical crawl rates.
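Under the hood this hand-off is a plain producer/consumer pipeline: the crawler pushes pages into a channel and a separate task drains it. A stdlib-only sketch of the same shape, with a hypothetical `extract_len` standing in for `extract_page` and strings standing in for pages:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for extract_page: any per-page work.
fn extract_len(html: &str) -> usize {
    html.len()
}

// The "crawler" sends pages into the channel; the consumer thread
// processes each one as soon as it arrives, like the subscribe/recv
// loop above. Returns the number of pages processed.
fn crawl_and_extract(pages: Vec<String>) -> usize {
    let (tx, rx) = mpsc::channel::<String>();

    let consumer = thread::spawn(move || {
        let mut count = 0;
        while let Ok(page) = rx.recv() {
            let _ = extract_len(&page); // extraction happens per page
            count += 1;
        }
        count // recv() errors once the sender is dropped
    });

    for page in pages {
        tx.send(page).unwrap(); // producer side: pages as they are fetched
    }
    drop(tx); // closing the channel ends the consumer loop

    consumer.join().unwrap()
}
```

The same backpressure logic applies in the tokio version: as long as per-page extraction is faster than the crawl rate, the receiver never falls behind.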
Custom Options
Use extract_page_with_options for fine-grained control:
```rust
use rs_trafilatura::{Options, spider_integration::extract_page_with_options};
use rs_trafilatura::page_type::PageType;

let options = Options {
    output_markdown: true,              // Get GFM Markdown output
    include_images: true,               // Extract image metadata
    favor_precision: true,              // Stricter filtering
    page_type: Some(PageType::Product), // Force page type
    ..Options::default()
};

let result = extract_page_with_options(&page, &options)?;

if let Some(md) = &result.content_markdown {
    println!("Markdown:\n{}", md);
}
for img in &result.images {
    println!("Image: {} (hero: {})", img.src, img.is_hero);
}
```
If you provide url in the options, it takes precedence over the page URL for classification. If you don't, the page URL is used automatically.
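That precedence rule is simple to state as a function. A minimal sketch of the behavior described above (the field name `url` on `Options` is taken from the prose; treat the exact shape as an assumption):

```rust
// An explicitly provided URL wins; otherwise fall back to the page URL.
fn effective_url<'a>(options_url: Option<&'a str>, page_url: &'a str) -> &'a str {
    options_url.unwrap_or(page_url)
}
```

This matters when you crawl through a proxy or cache and the fetched URL isn't the canonical one you want the classifier to see.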
Quality-Gated Processing
The extraction quality score lets you filter or flag low-confidence results:
```rust
for page in website.get_pages().into_iter().flatten() {
    let url = page.get_url().to_string();
    let result = extract_page(&page)?;

    if result.extraction_quality < 0.80 {
        eprintln!("⚠ Low confidence on {url}: {:.2}", result.extraction_quality);
        // Log for manual review, or route to a fallback extractor
        continue;
    }

    // Process high-confidence extractions
    save_to_database(&result);
}
```
On the WCEB benchmark, about 8% of pages score below 0.80. These are typically product pages with content in JSON-LD, forums with unusual markup, or service pages with highly distributed content.
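Rather than a single cutoff, you can route by score: save high-confidence pages, queue borderline ones for review, and send the rest to a fallback extractor. A self-contained sketch (the thresholds are illustrative, not part of the library; tune them against your own corpus):

```rust
#[derive(Debug, PartialEq)]
enum Route {
    Accept,   // high confidence: save directly
    Review,   // borderline: flag for manual inspection
    Fallback, // low confidence: try another extractor
}

// Illustrative thresholds; the 0.80 line matches the gate above.
fn route(quality: f64) -> Route {
    if quality >= 0.80 {
        Route::Accept
    } else if quality >= 0.50 {
        Route::Review
    } else {
        Route::Fallback
    }
}
```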
What extract_page Returns
ExtractResult gives you:
| Field | Type | Description |
|---|---|---|
| `content_text` | `String` | Main content as plain text |
| `content_markdown` | `Option<String>` | GFM Markdown (when enabled) |
| `content_html` | `Option<String>` | Extracted content as HTML |
| `metadata.title` | `Option<String>` | Page title |
| `metadata.author` | `Option<String>` | Author name |
| `metadata.date` | `Option<DateTime>` | Publication date |
| `metadata.page_type` | `Option<String>` | Detected page type |
| `extraction_quality` | `f64` | 0.0–1.0 confidence score |
| `images` | `Vec<ImageData>` | Image URLs, alt text, captions |
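For downstream storage, these fields flatten naturally into one record per page. A sketch using a simplified, hypothetical mirror of the fields above (the real `ExtractResult` carries more structure):

```rust
// Simplified mirror of the fields in the table above.
struct Record {
    title: Option<String>,
    page_type: Option<String>,
    quality: f64,
    text_len: usize,
}

// Flatten one extraction into a tab-separated line, e.g. for bulk loading.
fn to_tsv(r: &Record) -> String {
    format!(
        "{}\t{}\t{:.2}\t{}",
        r.title.as_deref().unwrap_or("-"),
        r.page_type.as_deref().unwrap_or("-"),
        r.quality,
        r.text_len,
    )
}
```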
Why Not spider_transformations?
spider ships with its own spider_transformations crate that can convert pages to Markdown or plain text. It works, but it's a basic readability-style extractor without:
- ML page type classification
- Type-specific extraction profiles (forum comment handling, multi-section merge, JSON-LD fallback)
- Extraction quality scoring
- Structured metadata extraction from JSON-LD, Open Graph, and Dublin Core
rs-trafilatura gives you all of these. For article-heavy crawls, spider_transformations is fine. For crawls that hit diverse page types, rs-trafilatura produces substantially better results.
Links
- rs-trafilatura: crates.io/crates/rs-trafilatura · GitHub
- Python package: pypi.org/project/rs-trafilatura
- spider: crates.io/crates/spider
- Benchmark: webcontentextraction.org · GitHub · Zenodo