crawl4ai is an async web crawler built for producing LLM-friendly output. By default, it converts pages to Markdown using its own scraping pipeline. But if you want page-type-aware content extraction with quality scoring, you can swap in rs-trafilatura as the extraction strategy.
This tutorial shows how to set that up.
## Install

```shell
pip install rs-trafilatura crawl4ai
```

If this is your first time with crawl4ai, you also need Playwright browsers:

```shell
python -m playwright install chromium
```
## Basic Usage

rs-trafilatura provides `RsTrafilaturaStrategy`, a drop-in replacement for crawl4ai's built-in extraction strategies:

```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from rs_trafilatura.crawl4ai import RsTrafilaturaStrategy


async def main():
    strategy = RsTrafilaturaStrategy()
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        data = json.loads(result.extracted_content)
        item = data[0]
        print(f"Title: {item['title']}")
        print(f"Page type: {item['page_type']}")
        print(f"Quality: {item['extraction_quality']}")
        print(f"Content: {item['main_content'][:200]}")


asyncio.run(main())
```
The extracted content is a JSON array with one item containing the extraction result. crawl4ai serialises it automatically — you just `json.loads()` the `extracted_content` field.
## What You Get Back

Each extraction result is a dict with these fields:

| Field | Description |
|---|---|
| `title` | Page title |
| `author` | Author name (if detected) |
| `date` | Publication date (ISO 8601) |
| `main_content` | Clean extracted text |
| `content_markdown` | Markdown output (if enabled) |
| `page_type` | `article`, `forum`, `product`, `collection`, `listing`, `documentation`, `service` |
| `extraction_quality` | 0.0–1.0 confidence score |
| `language` | Detected language |
| `sitename` | Site name |
| `description` | Meta description |
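If you want type-checked downstream code, the fields above can be mirrored in a small `TypedDict`. This is a convenience sketch for consumers, not part of the rs-trafilatura API; the field names simply follow the table above:

```python
from typing import TypedDict


class ExtractionResult(TypedDict, total=False):
    """Mirrors the documented result fields; any key may be absent."""
    title: str
    author: str
    date: str                  # ISO 8601
    main_content: str
    content_markdown: str
    page_type: str             # article, forum, product, ...
    extraction_quality: float  # 0.0-1.0 confidence
    language: str
    sitename: str
    description: str


def is_usable(item: ExtractionResult, threshold: float = 0.5) -> bool:
    # Treat a missing quality score as "not usable".
    return item.get("extraction_quality", 0.0) >= threshold
```

A guard like `is_usable` keeps the rest of your pipeline from ingesting low-confidence extractions.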
## Enabling Markdown Output

Pass `output_markdown=True` to get Markdown alongside plain text:

```python
strategy = RsTrafilaturaStrategy(output_markdown=True)
config = CrawlerRunConfig(extraction_strategy=strategy)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=config)
    data = json.loads(result.extracted_content)
    markdown = data[0]["content_markdown"]
```
This gives you GitHub Flavored Markdown with headings, lists, tables, bold/italic, code blocks, and links preserved.
## Precision vs Recall

By default, rs-trafilatura balances precision and recall. You can tip the scale:

```python
# Stricter filtering — less noise, may miss some content
strategy = RsTrafilaturaStrategy(favor_precision=True)

# More inclusive — captures more content, may include some boilerplate
strategy = RsTrafilaturaStrategy(favor_recall=True)
```
## Crawling Multiple Pages

crawl4ai handles the crawling. rs-trafilatura runs each page's extraction in a worker thread, so it doesn't block the async crawl loop:

```python
async def main():
    strategy = RsTrafilaturaStrategy(output_markdown=True)
    config = CrawlerRunConfig(extraction_strategy=strategy)
    urls = [
        "https://example.com/blog/post-1",
        "https://example.com/products/widget",
        "https://example.com/docs/getting-started",
        "https://forum.example.com/thread/123",
    ]
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url, config=config)
            data = json.loads(result.extracted_content)
            item = data[0]
            print(f"[{item['page_type']}] {item['title']} "
                  f"(quality: {item['extraction_quality']:.2f})")
```
Each page gets classified into its type and extracted with the appropriate profile. A product page gets JSON-LD fallback. A forum thread gets comment-as-content handling. A docs page gets sidebar removal. All automatic.
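The `page_type` label also makes a natural dispatch key for whatever comes after extraction. A minimal routing sketch — the handler functions here are hypothetical placeholders, not part of rs-trafilatura or crawl4ai:

```python
def index_article(item: dict) -> tuple:
    # Hypothetical handler: full-text index for prose pages.
    return ("articles", item["main_content"])

def index_product(item: dict) -> tuple:
    # Hypothetical handler: products indexed by title only.
    return ("products", item["title"])

def index_default(item: dict) -> tuple:
    return ("misc", item.get("main_content", ""))

# Map rs-trafilatura page types to downstream handlers.
HANDLERS = {
    "article": index_article,
    "documentation": index_article,  # docs treated like articles here
    "product": index_product,
}

def route(item: dict) -> tuple:
    handler = HANDLERS.get(item["page_type"], index_default)
    return handler(item)
```

Because the classification already happened during extraction, routing is a dictionary lookup rather than another pass over the HTML.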
## Using the Quality Score for Hybrid Pipelines

The `extraction_quality` field tells you how confident rs-trafilatura is in its extraction. You can use this to build a hybrid pipeline — fast heuristic extraction for most pages, with LLM fallback for the hard cases:

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy


async def extract_with_fallback(crawler, url, config):
    result = await crawler.arun(url=url, config=config)
    data = json.loads(result.extracted_content)
    item = data[0]
    if item["extraction_quality"] < 0.80:
        # Low confidence — use crawl4ai's built-in LLM extraction as fallback
        llm_config = CrawlerRunConfig(
            extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o-mini")
        )
        result = await crawler.arun(url=url, config=llm_config)
        return result.extracted_content
    return item["main_content"]
```
On the WCEB benchmark, about 8% of pages score below 0.80. Routing just those pages to a neural fallback improves the overall F1 from 0.859 to 0.862 on the development set and from 0.893 to 0.910 on the held-out test set.
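That routing rate makes it easy to budget the LLM tier. A back-of-the-envelope sketch — the 8% fallback rate comes from the benchmark figure above, while the per-page LLM cost is a made-up placeholder you'd replace with your provider's pricing:

```python
def hybrid_cost(n_pages: int,
                fallback_rate: float = 0.08,
                llm_cost_per_page: float = 0.001) -> tuple[int, float]:
    """Estimate how many pages hit the LLM tier and what they cost.

    fallback_rate: fraction scoring below the quality threshold.
    llm_cost_per_page: placeholder dollar cost per LLM extraction.
    """
    llm_pages = round(n_pages * fallback_rate)
    return llm_pages, llm_pages * llm_cost_per_page


pages, cost = hybrid_cost(100_000)
```

For a 100k-page crawl under these assumptions, only about 8,000 pages ever touch the LLM; the other 92% stay on the cheap heuristic path.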
## How It Works Under the Hood

`RsTrafilaturaStrategy` inherits from crawl4ai's `ExtractionStrategy` when crawl4ai is installed, so it passes the `isinstance()` check in `CrawlerRunConfig`. It sets `input_format="html"`, which tells crawl4ai to pass raw HTML (not Markdown) and to skip chunking. The extraction runs in Rust via PyO3 — no subprocess, no binary to find.
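The thread-per-page behaviour mentioned earlier can be sketched with the stdlib alone. `blocking_extract` below is a stand-in for the synchronous PyO3 call, not rs-trafilatura's actual entry point:

```python
import asyncio


def blocking_extract(html: str) -> dict:
    # Stand-in for the synchronous Rust extraction call.
    return {"main_content": html.upper()}


async def extract_page(html: str) -> dict:
    # Offload the blocking call to a worker thread so the
    # async crawl loop keeps scheduling other pages.
    return await asyncio.to_thread(blocking_extract, html)


async def extract_many(pages: list[str]) -> list[dict]:
    # Pages extract concurrently; the event loop is never blocked.
    return await asyncio.gather(*(extract_page(p) for p in pages))
```

`asyncio.to_thread` is the standard way to keep a CPU-bound or blocking call from stalling an event loop, which is why a synchronous Rust extractor can still play nicely inside crawl4ai's async pipeline.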
## Links
- rs-trafilatura Python package: pypi.org/project/rs-trafilatura · GitHub
- Rust crate: crates.io/crates/rs-trafilatura · GitHub
- crawl4ai: github.com/unclecode/crawl4ai
- Benchmark: webcontentextraction.org · GitHub · Zenodo