Murrough Foley

How to Use rs-trafilatura with crawl4ai

crawl4ai is an async web crawler built for producing LLM-friendly output. By default, it converts pages to Markdown using its own scraping pipeline. But if you want page-type-aware content extraction with quality scoring, you can swap in rs-trafilatura as the extraction strategy.

This tutorial shows how to set that up.

Install

```shell
pip install rs-trafilatura crawl4ai
```

If this is your first time with crawl4ai, you also need Playwright browsers:

```shell
python -m playwright install chromium
```

Basic Usage

rs-trafilatura provides RsTrafilaturaStrategy, a drop-in replacement for crawl4ai's built-in extraction strategies:

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from rs_trafilatura.crawl4ai import RsTrafilaturaStrategy

async def main():
    strategy = RsTrafilaturaStrategy()
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)

    data = json.loads(result.extracted_content)
    item = data[0]

    print(f"Title: {item['title']}")
    print(f"Page type: {item['page_type']}")
    print(f"Quality: {item['extraction_quality']}")
    print(f"Content: {item['main_content'][:200]}")

asyncio.run(main())
```

The extracted content is a JSON array with one item containing the extraction result. crawl4ai serialises it automatically — you just json.loads() the extracted_content field.

What You Get Back

Each extraction result is a dict with these fields:

| Field | Description |
| --- | --- |
| `title` | Page title |
| `author` | Author name (if detected) |
| `date` | Publication date (ISO 8601) |
| `main_content` | Clean extracted text |
| `content_markdown` | Markdown output (if enabled) |
| `page_type` | One of `article`, `forum`, `product`, `collection`, `listing`, `documentation`, `service` |
| `extraction_quality` | Confidence score from 0.0 to 1.0 |
| `language` | Detected language |
| `sitename` | Site name |
| `description` | Meta description |
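Since `content_markdown` is only populated when Markdown output is enabled, a tiny accessor (hypothetical, not part of rs-trafilatura) normalises the two cases:

```python
def best_text(item: dict) -> str:
    """Return Markdown when available, otherwise the plain extracted text.

    content_markdown only exists when the strategy was created with
    output_markdown=True, so fall back to main_content.
    """
    return item.get("content_markdown") or item.get("main_content", "")
```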

Enabling Markdown Output

Pass output_markdown=True to get Markdown alongside plain text:

```python
strategy = RsTrafilaturaStrategy(output_markdown=True)
config = CrawlerRunConfig(extraction_strategy=strategy)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=config)

data = json.loads(result.extracted_content)
markdown = data[0]["content_markdown"]
```

This gives you GitHub Flavored Markdown with headings, lists, tables, bold/italic, code blocks, and links preserved.

Precision vs Recall

By default, rs-trafilatura balances precision and recall. You can tip the scale:

```python
# Stricter filtering — less noise, may miss some content
strategy = RsTrafilaturaStrategy(favor_precision=True)

# More inclusive — captures more content, may include some boilerplate
strategy = RsTrafilaturaStrategy(favor_recall=True)
```

Crawling Multiple Pages

crawl4ai handles concurrency. rs-trafilatura runs extraction in a thread per page, so it doesn't block the async crawl loop:

```python
async def main():
    strategy = RsTrafilaturaStrategy(output_markdown=True)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    urls = [
        "https://example.com/blog/post-1",
        "https://example.com/products/widget",
        "https://example.com/docs/getting-started",
        "https://forum.example.com/thread/123",
    ]

    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url, config=config)
            data = json.loads(result.extracted_content)
            item = data[0]
            print(f"[{item['page_type']}] {item['title']} (quality: {item['extraction_quality']:.2f})")

asyncio.run(main())
```

Each page gets classified into its type and extracted with the appropriate profile. A product page gets JSON-LD fallback. A forum thread gets comment-as-content handling. A docs page gets sidebar removal. All automatic.
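If your pipeline treats page types differently downstream, the `page_type` field supports a simple dispatch table. The handler names here are placeholders for your own code, not anything shipped by the library:

```python
def index_product(item: dict) -> str:
    # Placeholder: e.g. push into a product catalogue
    return f"product: {item['title']}"

def index_article(item: dict) -> str:
    # Placeholder: e.g. send to a full-text search index
    return f"article: {item['title']}"

def index_generic(item: dict) -> str:
    # Catch-all for page types we don't special-case
    return f"other: {item['title']}"

HANDLERS = {
    "product": index_product,
    "article": index_article,
}

def dispatch(item: dict) -> str:
    # Route each extraction result by its detected page type
    handler = HANDLERS.get(item.get("page_type"), index_generic)
    return handler(item)
```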

Using the Quality Score for Hybrid Pipelines

The extraction_quality field tells you how confident rs-trafilatura is in its extraction. You can use this to build a hybrid pipeline — fast heuristic extraction for most pages, with LLM fallback for the hard cases:

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def extract_with_fallback(crawler, url, config):
    result = await crawler.arun(url=url, config=config)
    data = json.loads(result.extracted_content)
    item = data[0]

    if item["extraction_quality"] < 0.80:
        # Low confidence — use crawl4ai's built-in LLM extraction as fallback
        llm_config = CrawlerRunConfig(
            extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o-mini")
        )
        result = await crawler.arun(url=url, config=llm_config)
        return result.extracted_content

    return item["main_content"]
```

On the WCEB benchmark, about 8% of pages score below 0.80. Routing just those pages to a neural fallback improves the overall F1 from 0.859 to 0.862 on the development set and from 0.893 to 0.910 on the held-out test set.
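For batch jobs, the same threshold can be applied up front to split a list of results into a fast path and a fallback queue. This is a plain-Python sketch of the routing step only; the 0.80 threshold matches the benchmark figure above:

```python
def split_by_quality(items: list[dict], threshold: float = 0.80):
    """Split extraction results into (accepted, needs_fallback) lists.

    Items at or above the threshold are kept as-is; the rest would be
    re-extracted with a more expensive strategy.
    """
    accepted, needs_fallback = [], []
    for item in items:
        if item.get("extraction_quality", 0.0) >= threshold:
            accepted.append(item)
        else:
            needs_fallback.append(item)
    return accepted, needs_fallback
```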

How It Works Under the Hood

RsTrafilaturaStrategy inherits from crawl4ai's ExtractionStrategy when crawl4ai is installed, so it passes the isinstance() check in CrawlerRunConfig. It sets input_format="html", which tells crawl4ai to pass raw HTML (rather than Markdown) to the strategy and to skip chunking. The extraction itself runs in Rust via PyO3 — no subprocess, no binary to find.
