Murrough Foley

How to Use rs-trafilatura with crawl4ai

crawl4ai is an async web crawler built for producing LLM-friendly output. By default, it converts pages to Markdown using its own scraping pipeline. But if you want page-type-aware content extraction with quality scoring, you can swap in rs-trafilatura as the extraction strategy.

This tutorial shows how to set that up.

Install

```shell
pip install rs-trafilatura crawl4ai
```

If this is your first time with crawl4ai, you also need Playwright browsers:

```shell
python -m playwright install chromium
```

Basic Usage

rs-trafilatura provides RsTrafilaturaStrategy, a drop-in replacement for crawl4ai's built-in extraction strategies:

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from rs_trafilatura.crawl4ai import RsTrafilaturaStrategy

async def main():
    strategy = RsTrafilaturaStrategy()
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)

    data = json.loads(result.extracted_content)
    item = data[0]

    print(f"Title: {item['title']}")
    print(f"Page type: {item['page_type']}")
    print(f"Quality: {item['extraction_quality']}")
    print(f"Content: {item['main_content'][:200]}")

asyncio.run(main())
```

The extracted content is a JSON array with one item containing the extraction result. crawl4ai serialises it automatically — you just json.loads() the extracted_content field.

What You Get Back

Each extraction result is a dict with these fields:

| Field | Description |
| --- | --- |
| `title` | Page title |
| `author` | Author name (if detected) |
| `date` | Publication date (ISO 8601) |
| `main_content` | Clean extracted text |
| `content_markdown` | Markdown output (if enabled) |
| `page_type` | One of `article`, `forum`, `product`, `collection`, `listing`, `documentation`, `service` |
| `extraction_quality` | Confidence score from 0.0 to 1.0 |
| `language` | Detected language |
| `sitename` | Site name |
| `description` | Meta description |
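Since `content_markdown` is only populated when Markdown output is enabled, a tiny accessor (hypothetical, not part of rs-trafilatura) normalises the two cases:

```python
def best_text(item: dict) -> str:
    """Return Markdown when available, otherwise the plain extracted text.

    content_markdown only exists when the strategy was created with
    output_markdown=True, so fall back to main_content.
    """
    return item.get("content_markdown") or item.get("main_content", "")
```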

Enabling Markdown Output

Pass output_markdown=True to get Markdown alongside plain text:

```python
strategy = RsTrafilaturaStrategy(output_markdown=True)
config = CrawlerRunConfig(extraction_strategy=strategy)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=config)

data = json.loads(result.extracted_content)
markdown = data[0]["content_markdown"]
```

This gives you GitHub Flavored Markdown with headings, lists, tables, bold/italic, code blocks, and links preserved.

Precision vs Recall

By default, rs-trafilatura balances precision and recall. You can tip the scale:

```python
# Stricter filtering — less noise, may miss some content
strategy = RsTrafilaturaStrategy(favor_precision=True)

# More inclusive — captures more content, may include some boilerplate
strategy = RsTrafilaturaStrategy(favor_recall=True)
```

Crawling Multiple Pages

crawl4ai handles concurrency. rs-trafilatura runs extraction in a thread per page, so it doesn't block the async crawl loop:

```python
async def main():
    strategy = RsTrafilaturaStrategy(output_markdown=True)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    urls = [
        "https://example.com/blog/post-1",
        "https://example.com/products/widget",
        "https://example.com/docs/getting-started",
        "https://forum.example.com/thread/123",
    ]

    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url, config=config)
            data = json.loads(result.extracted_content)
            item = data[0]
            print(f"[{item['page_type']}] {item['title']} (quality: {item['extraction_quality']:.2f})")

asyncio.run(main())
```

Each page gets classified into its type and extracted with the appropriate profile. A product page gets JSON-LD fallback. A forum thread gets comment-as-content handling. A docs page gets sidebar removal. All automatic.
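If your pipeline treats page types differently downstream, the `page_type` field supports a simple dispatch table. The handler names here are placeholders for your own code, not anything shipped by the library:

```python
def index_product(item: dict) -> str:
    # Placeholder: e.g. push into a product catalogue
    return f"product: {item['title']}"

def index_article(item: dict) -> str:
    # Placeholder: e.g. send to a full-text search index
    return f"article: {item['title']}"

def index_generic(item: dict) -> str:
    # Catch-all for page types we don't special-case
    return f"other: {item['title']}"

HANDLERS = {
    "product": index_product,
    "article": index_article,
}

def dispatch(item: dict) -> str:
    # Route each extraction result by its detected page type
    handler = HANDLERS.get(item.get("page_type"), index_generic)
    return handler(item)
```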

Using the Quality Score for Hybrid Pipelines

The extraction_quality field tells you how confident rs-trafilatura is in its extraction. You can use this to build a hybrid pipeline — fast heuristic extraction for most pages, with LLM fallback for the hard cases:

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def extract_with_fallback(crawler, url, config):
    result = await crawler.arun(url=url, config=config)
    data = json.loads(result.extracted_content)
    item = data[0]

    if item["extraction_quality"] < 0.80:
        # Low confidence — use crawl4ai's built-in LLM extraction as fallback
        llm_config = CrawlerRunConfig(
            extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o-mini")
        )
        result = await crawler.arun(url=url, config=llm_config)
        return result.extracted_content

    return item["main_content"]
```

On the WCEB benchmark, about 8% of pages score below 0.80. Routing just those pages to a neural fallback improves the overall F1 from 0.859 to 0.862 on the development set and from 0.893 to 0.910 on the held-out test set.
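For batch jobs, the same threshold can be applied up front to split a list of results into a fast path and a fallback queue. This is a plain-Python sketch of the routing step only; the 0.80 threshold matches the benchmark figure above:

```python
def split_by_quality(items: list[dict], threshold: float = 0.80):
    """Split extraction results into (accepted, needs_fallback) lists.

    Items at or above the threshold are kept as-is; the rest would be
    re-extracted with a more expensive strategy.
    """
    accepted, needs_fallback = [], []
    for item in items:
        if item.get("extraction_quality", 0.0) >= threshold:
            accepted.append(item)
        else:
            needs_fallback.append(item)
    return accepted, needs_fallback
```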

How It Works Under the Hood

RsTrafilaturaStrategy inherits from crawl4ai's ExtractionStrategy when crawl4ai is installed, so it passes the isinstance() check in CrawlerRunConfig. It sets input_format="html", which tells crawl4ai to pass raw HTML (rather than Markdown) to the strategy and to skip chunking. The extraction itself runs in Rust via PyO3 — no subprocess, no binary to find.
