Murrough Foley
How to Use rs-trafilatura with spider-rs

spider is a high-performance async web crawler written in Rust. It discovers, fetches, and queues URLs — but content extraction is left to you. rs-trafilatura slots in as the extraction layer, giving you page-type-aware content extraction with quality scoring on every crawled page.

Setup

Add both crates to your Cargo.toml:

```toml
[dependencies]
rs-trafilatura = { version = "0.2", features = ["spider"] }
spider = "2"
tokio = { version = "1", features = ["full"] }
```

The spider feature flag enables rs_trafilatura::spider_integration, which provides convenience functions that accept spider's Page type directly.

Basic: Crawl Then Extract

The simplest approach — crawl a site, then extract content from every page:

```rust
use spider::website::Website;
use rs_trafilatura::spider_integration::extract_page;

#[tokio::main]
async fn main() {
    // Crawl the whole site first; fetched pages are buffered in memory.
    let mut website = Website::new("https://example.com");
    website.crawl().await;

    // Then run extraction over every fetched page.
    for page in website.get_pages().into_iter().flatten() {
        match extract_page(&page) {
            Ok(result) => {
                println!(
                    "[{}] {} (confidence: {:.2})",
                    result.metadata.page_type.unwrap_or_default(),
                    result.metadata.title.unwrap_or_default(),
                    result.extraction_quality,
                );
                println!("  Content: {} chars", result.content_text.len());
            }
            Err(e) => eprintln!("  Extraction failed: {e}"),
        }
    }
}
```

extract_page takes a &Page and returns Result<ExtractResult>. The page URL is automatically passed to the classifier for page type detection.

Streaming: Extract As Pages Arrive

For large crawls, you don't want to wait until everything is fetched. spider's subscribe channel lets you process pages as they arrive:

```rust
use spider::website::Website;
use rs_trafilatura::spider_integration::extract_page;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    // Subscribe before crawling; 0 uses spider's default channel capacity.
    let mut rx = website.subscribe(0).unwrap();

    // Extract in a separate task so the crawl is never blocked.
    let handle = tokio::spawn(async move {
        let mut count = 0;
        while let Ok(page) = rx.recv().await {
            if let Ok(result) = extract_page(&page) {
                count += 1;
                println!(
                    "[{count}] {} → {} ({:.2})",
                    page.get_url(),
                    result.metadata.page_type.unwrap_or_default(),
                    result.extraction_quality,
                );
            }
        }
        println!("Extracted {count} pages");
    });

    website.crawl().await;
    // Closing the subscription ends the receive loop above.
    website.unsubscribe();
    let _ = handle.await;
}
```

Each page is extracted in the spawned task as soon as spider fetches it. Extraction takes ~44ms per page, so it easily keeps up with typical crawl rates.

Custom Options

Use extract_page_with_options for fine-grained control:

```rust
use rs_trafilatura::{Options, spider_integration::extract_page_with_options};
use rs_trafilatura::page_type::PageType;

let options = Options {
    output_markdown: true,               // emit GFM Markdown alongside plain text
    include_images: true,                // collect image metadata
    favor_precision: true,               // stricter boilerplate filtering
    page_type: Some(PageType::Product),  // skip classification, force a type
    ..Options::default()
};

let result = extract_page_with_options(&page, &options)?;

if let Some(md) = &result.content_markdown {
    println!("Markdown:\n{md}");
}

for img in &result.images {
    println!("Image: {} (hero: {})", img.src, img.is_hero);
}
```

If you provide `url` in the options, it takes precedence over the page URL for classification; otherwise the page URL is used automatically.
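That precedence rule is just an `Option` fallback. A minimal sketch of the behavior (illustrative only — the function and field names here are mine, not the crate's internals):

```rust
// Models the documented precedence: an explicit URL in the options wins;
// otherwise the crawled page's URL is used for classification.
fn classification_url(options_url: Option<&str>, page_url: &str) -> String {
    options_url.unwrap_or(page_url).to_string()
}

fn main() {
    // No URL in options → the page URL is used.
    println!("{}", classification_url(None, "https://example.com/post"));
    // URL set in options → it takes precedence.
    println!(
        "{}",
        classification_url(Some("https://example.com/canonical"), "https://example.com/post")
    );
}
```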

Quality-Gated Processing

The extraction quality score lets you filter or flag low-confidence results:

```rust
for page in website.get_pages().into_iter().flatten() {
    let url = page.get_url().to_string();
    let result = extract_page(&page)?;

    if result.extraction_quality < 0.80 {
        eprintln!("⚠ Low confidence on {url}: {:.2}", result.extraction_quality);
        // Log for manual review, or route to a fallback extractor.
        continue;
    }

    // Process high-confidence extractions.
    save_to_database(&result);
}
```

On the WCEB benchmark, about 8% of pages score below 0.80. These are typically product pages with content in JSON-LD, forums with unusual markup, or service pages with highly distributed content.
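Rather than a single cutoff, you can route results into tiers. A sketch of one such policy (the enum, thresholds, and variant names are my own, not part of rs-trafilatura):

```rust
// Hypothetical routing policy around the quality score; the 0.80 tier matches
// the article's gate, the 0.50 tier is an illustrative second threshold.
#[derive(Debug, PartialEq)]
enum Route {
    Accept,       // high confidence: store directly
    ManualReview, // borderline: flag for a human
    Fallback,     // very low: try another extractor (e.g. JSON-LD parsing)
}

fn route(quality: f64) -> Route {
    match quality {
        q if q >= 0.80 => Route::Accept,
        q if q >= 0.50 => Route::ManualReview,
        _ => Route::Fallback,
    }
}

fn main() {
    println!("{:?}", route(0.93)); // Accept
    println!("{:?}", route(0.66)); // ManualReview
    println!("{:?}", route(0.31)); // Fallback
}
```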

What extract_page Returns

ExtractResult gives you:

| Field | Type | Description |
| --- | --- | --- |
| `content_text` | `String` | Main content as plain text |
| `content_markdown` | `Option<String>` | GFM Markdown (when enabled) |
| `content_html` | `Option<String>` | Extracted content as HTML |
| `metadata.title` | `Option<String>` | Page title |
| `metadata.author` | `Option<String>` | Author name |
| `metadata.date` | `Option<DateTime>` | Publication date |
| `metadata.page_type` | `Option<String>` | Detected page type |
| `extraction_quality` | `f64` | 0.0–1.0 confidence score |
| `images` | `Vec<ImageData>` | Image URLs, alt text, captions |

Why Not spider_transformations?

spider ships with its own spider_transformations crate that can convert pages to Markdown or plain text. It works, but it's a basic readability-style extractor without:

  • ML page type classification
  • Type-specific extraction profiles (forum comment handling, multi-section merge, JSON-LD fallback)
  • Extraction quality scoring
  • Structured metadata extraction from JSON-LD, Open Graph, and Dublin Core

rs-trafilatura gives you all of these. For article-heavy crawls, spider_transformations is fine. For crawls that hit diverse page types, rs-trafilatura produces substantially better results.
