Murrough Foley
How to Use rs-trafilatura with spider-rs

spider is a high-performance async web crawler written in Rust. It discovers, fetches, and queues URLs — but content extraction is left to you. rs-trafilatura slots in as the extraction layer, giving you page-type-aware content extraction with quality scoring on every crawled page.

Setup

Add both crates to your Cargo.toml:

```toml
[dependencies]
rs-trafilatura = { version = "0.2", features = ["spider"] }
spider = "2"
tokio = { version = "1", features = ["full"] }
```

The spider feature flag enables rs_trafilatura::spider_integration, which provides convenience functions that accept spider's Page type directly.

Basic: Crawl Then Extract

The simplest approach — crawl a site, then extract content from every page:

```rust
use spider::website::Website;
use rs_trafilatura::spider_integration::extract_page;

#[tokio::main]
async fn main() {
    // Crawl the whole site first; fetched pages are buffered in memory.
    let mut website = Website::new("https://example.com");
    website.crawl().await;

    // Then run extraction over every fetched page.
    for page in website.get_pages().into_iter().flatten() {
        match extract_page(&page) {
            Ok(result) => {
                println!(
                    "[{}] {} (confidence: {:.2})",
                    result.metadata.page_type.unwrap_or_default(),
                    result.metadata.title.unwrap_or_default(),
                    result.extraction_quality,
                );
                println!("  Content: {} chars", result.content_text.len());
            }
            Err(e) => eprintln!("  Extraction failed: {e}"),
        }
    }
}
```

extract_page takes a &Page and returns Result<ExtractResult>. The page URL is automatically passed to the classifier for page type detection.

Streaming: Extract As Pages Arrive

For large crawls, you don't want to wait until everything is fetched. spider's subscribe channel lets you process pages as they arrive:

```rust
use spider::website::Website;
use rs_trafilatura::spider_integration::extract_page;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    // Subscribe before crawling; 0 uses spider's default channel capacity.
    let mut rx = website.subscribe(0).unwrap();

    // Extract in a separate task so the crawl is never blocked.
    let handle = tokio::spawn(async move {
        let mut count = 0;
        while let Ok(page) = rx.recv().await {
            if let Ok(result) = extract_page(&page) {
                count += 1;
                println!(
                    "[{count}] {} → {} ({:.2})",
                    page.get_url(),
                    result.metadata.page_type.unwrap_or_default(),
                    result.extraction_quality,
                );
            }
        }
        println!("Extracted {count} pages");
    });

    website.crawl().await;
    // Closing the subscription ends the receive loop above.
    website.unsubscribe();
    let _ = handle.await;
}
```

Each page is extracted in the spawned task as soon as spider fetches it. Extraction takes ~44ms per page, so it easily keeps up with typical crawl rates.

Custom Options

Use extract_page_with_options for fine-grained control:

```rust
use rs_trafilatura::{Options, spider_integration::extract_page_with_options};
use rs_trafilatura::page_type::PageType;

let options = Options {
    output_markdown: true,               // emit GFM Markdown alongside plain text
    include_images: true,                // collect image metadata
    favor_precision: true,               // stricter boilerplate filtering
    page_type: Some(PageType::Product),  // skip classification, force a type
    ..Options::default()
};

let result = extract_page_with_options(&page, &options)?;

if let Some(md) = &result.content_markdown {
    println!("Markdown:\n{md}");
}

for img in &result.images {
    println!("Image: {} (hero: {})", img.src, img.is_hero);
}
```

If you provide `url` in the options, it takes precedence over the page URL for classification; otherwise the page URL is used automatically.
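That precedence rule is just an `Option` fallback. A minimal sketch of the behavior (illustrative only — the function and field names here are mine, not the crate's internals):

```rust
// Models the documented precedence: an explicit URL in the options wins;
// otherwise the crawled page's URL is used for classification.
fn classification_url(options_url: Option<&str>, page_url: &str) -> String {
    options_url.unwrap_or(page_url).to_string()
}

fn main() {
    // No URL in options → the page URL is used.
    println!("{}", classification_url(None, "https://example.com/post"));
    // URL set in options → it takes precedence.
    println!(
        "{}",
        classification_url(Some("https://example.com/canonical"), "https://example.com/post")
    );
}
```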

Quality-Gated Processing

The extraction quality score lets you filter or flag low-confidence results:

```rust
for page in website.get_pages().into_iter().flatten() {
    let url = page.get_url().to_string();
    let result = extract_page(&page)?;

    if result.extraction_quality < 0.80 {
        eprintln!("⚠ Low confidence on {url}: {:.2}", result.extraction_quality);
        // Log for manual review, or route to a fallback extractor.
        continue;
    }

    // Process high-confidence extractions.
    save_to_database(&result);
}
```

On the WCEB benchmark, about 8% of pages score below 0.80. These are typically product pages with content in JSON-LD, forums with unusual markup, or service pages with highly distributed content.
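Rather than a single cutoff, you can route results into tiers. A sketch of one such policy (the enum, thresholds, and variant names are my own, not part of rs-trafilatura):

```rust
// Hypothetical routing policy around the quality score; the 0.80 tier matches
// the article's gate, the 0.50 tier is an illustrative second threshold.
#[derive(Debug, PartialEq)]
enum Route {
    Accept,       // high confidence: store directly
    ManualReview, // borderline: flag for a human
    Fallback,     // very low: try another extractor (e.g. JSON-LD parsing)
}

fn route(quality: f64) -> Route {
    match quality {
        q if q >= 0.80 => Route::Accept,
        q if q >= 0.50 => Route::ManualReview,
        _ => Route::Fallback,
    }
}

fn main() {
    println!("{:?}", route(0.93)); // Accept
    println!("{:?}", route(0.66)); // ManualReview
    println!("{:?}", route(0.31)); // Fallback
}
```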

What extract_page Returns

ExtractResult gives you:

| Field | Type | Description |
| --- | --- | --- |
| `content_text` | `String` | Main content as plain text |
| `content_markdown` | `Option<String>` | GFM Markdown (when enabled) |
| `content_html` | `Option<String>` | Extracted content as HTML |
| `metadata.title` | `Option<String>` | Page title |
| `metadata.author` | `Option<String>` | Author name |
| `metadata.date` | `Option<DateTime>` | Publication date |
| `metadata.page_type` | `Option<String>` | Detected page type |
| `extraction_quality` | `f64` | 0.0–1.0 confidence score |
| `images` | `Vec<ImageData>` | Image URLs, alt text, captions |

Why Not spider_transformations?

spider ships with its own spider_transformations crate that can convert pages to Markdown or plain text. It works, but it's a basic readability-style extractor without:

  • ML page type classification
  • Type-specific extraction profiles (forum comment handling, multi-section merge, JSON-LD fallback)
  • Extraction quality scoring
  • Structured metadata extraction from JSON-LD, Open Graph, and Dublin Core

rs-trafilatura gives you all of these. For article-heavy crawls, spider_transformations is fine. For crawls that hit diverse page types, rs-trafilatura produces substantially better results.
