Scrapy is the standard Python framework for web scraping. It handles crawling, scheduling, and data pipelines. rs-trafilatura plugs into Scrapy as an item pipeline — your spider yields items with HTML, and the pipeline adds structured extraction results automatically.
## Install

```bash
pip install rs-trafilatura scrapy
```
## Setup

Add the pipeline to your Scrapy project's `settings.py`:

```python
ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
}
```
That's it. Every item that passes through the pipeline with a `body` (bytes) or `html` (string) field gets an `extraction` dict added to it.
## Writing the Spider
Your spider yields items with the response body and URL:
```python
import scrapy


class ContentSpider(scrapy.Spider):
    name = "content"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {
            "url": response.url,
            "body": response.body,  # raw bytes; rs-trafilatura auto-detects encoding
        }
        # Follow links
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, self.parse)
```
The pipeline picks up `body` (bytes) or `html` (string). When it finds one, it runs extraction and adds the results under `item["extraction"]`.
## What the Pipeline Adds

Each processed item gets an `extraction` dict:

```python
{
    "url": "https://example.com/blog/post",
    "body": b"<html>...",
    "extraction": {
        "title": "Blog Post Title",
        "author": "John Doe",
        "date": "2026-01-15T00:00:00+00:00",
        "main_content": "The full extracted text...",
        "content_markdown": "# Blog Post Title\n\nThe full extracted text...",
        "page_type": "article",
        "extraction_quality": 0.95,
        "language": "en",
        "sitename": "Example Blog",
        "description": "A blog post about...",
    },
}
```
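Downstream code can read these fields directly. A minimal sketch (the `summarize` helper is my own illustration, not part of rs-trafilatura):

```python
def summarize(item: dict) -> str:
    """One-line summary built from the extraction dict shown above."""
    ext = item.get("extraction", {})
    title = ext.get("title", "(untitled)")
    page_type = ext.get("page_type", "unknown")
    quality = ext.get("extraction_quality", 0.0)
    return f"{title} [{page_type}, quality {quality:.2f}]"


item = {
    "url": "https://example.com/blog/post",
    "extraction": {
        "title": "Blog Post Title",
        "page_type": "article",
        "extraction_quality": 0.95,
    },
}
print(summarize(item))  # → Blog Post Title [article, quality 0.95]
```

Using `.get()` with defaults keeps the helper safe for items the pipeline skipped.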
## Enabling Markdown Output

Add to `settings.py`:

```python
RS_TRAFILATURA_MARKDOWN = True
```

This populates `item["extraction"]["content_markdown"]` with GitHub Flavored Markdown.
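One way to consume it is a small downstream pipeline that writes each page's Markdown to disk. A hedged sketch (the output directory and URL-hash filename scheme are my own choices, not rs-trafilatura conventions):

```python
import hashlib
from pathlib import Path


class MarkdownWriterPipeline:
    """Write content_markdown to one .md file per item, named by URL hash."""

    def __init__(self, out_dir="markdown_out"):
        self.out_dir = Path(out_dir)

    def open_spider(self, spider):
        self.out_dir.mkdir(parents=True, exist_ok=True)

    def process_item(self, item, spider):
        md = item.get("extraction", {}).get("content_markdown")
        if md:
            # Hash the URL so the filename is filesystem-safe and stable.
            name = hashlib.sha256(item["url"].encode()).hexdigest()[:16]
            (self.out_dir / f"{name}.md").write_text(md, encoding="utf-8")
        return item
```

Register it in `ITEM_PIPELINES` with a priority above 300 so it runs after extraction.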
## Filtering by Page Type
The page type classification lets you route items differently based on what kind of page they are:
```python
class ContentSpider(scrapy.Spider):
    name = "content"
    start_urls = ["https://example.com"]

    custom_settings = {
        "ITEM_PIPELINES": {
            "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
            "myproject.pipelines.PageTypeRouter": 400,
        },
    }

    def parse(self, response):
        yield {"url": response.url, "body": response.body}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, self.parse)
```
```python
# myproject/pipelines.py
# save_product, save_forum_post, save_article, and save_generic are your
# own storage helpers (not shown here).
class PageTypeRouter:
    def process_item(self, item, spider):
        ext = item.get("extraction", {})
        page_type = ext.get("page_type", "article")
        if page_type == "product":
            # Save to products table
            save_product(item)
        elif page_type == "forum":
            # Save to discussions table
            save_forum_post(item)
        elif page_type == "article":
            # Save to articles table
            save_article(item)
        else:
            # Default handling
            save_generic(item)
        return item
```
## Filtering by Extraction Quality

Drop items where extraction quality is low:

```python
from scrapy.exceptions import DropItem


class QualityFilter:
    def process_item(self, item, spider):
        ext = item.get("extraction", {})
        quality = ext.get("extraction_quality", 0)
        if quality < 0.5:
            raise DropItem(
                f"Low extraction quality ({quality:.2f}): {item['url']}"
            )
        return item
```
Add it before the router:

```python
ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
    "myproject.pipelines.QualityFilter": 350,
    "myproject.pipelines.PageTypeRouter": 400,
}
```
## Exporting to JSON Lines

Scrapy's built-in feed exports work out of the box:

```bash
scrapy crawl content -o output.jsonl
```
Each line in `output.jsonl` contains the full item, including the `extraction` dict. You can then process it with any tool that reads JSON Lines.
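For example, a short post-processing script that keeps only high-quality article pages (the file name matches the crawl command above; the 0.8 threshold is arbitrary):

```python
import json


def load_articles(path, min_quality=0.8):
    """Yield extraction dicts for article pages above a quality threshold."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            item = json.loads(line)
            ext = item.get("extraction", {})
            if ext.get("page_type") == "article" and ext.get("extraction_quality", 0) >= min_quality:
                yield ext


# for ext in load_articles("output.jsonl"):
#     print(ext["title"])
```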
## Performance
rs-trafilatura extracts in ~44 ms per page via compiled Rust (PyO3, no subprocess). On a typical Scrapy crawl, extraction adds negligible overhead compared to network latency. The pipeline runs extraction synchronously on the Scrapy reactor thread; each call is CPU-bound, but at ~44 ms it stalls the reactor only briefly, so downloads are not noticeably delayed.

For very high-throughput crawls (1000+ pages/second), consider moving extraction off the reactor thread, for example into a separate worker process fed from the item pipeline.
## Items Without HTML

If an item has neither `body` nor `html`, the pipeline passes it through unchanged:

```python
# This item has no HTML, so the pipeline ignores it
yield {"url": response.url, "custom_data": "something"}
# → no "extraction" key added; the item passes through as-is
```
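If you would rather not yield non-HTML responses (images, PDFs, feeds) at all, you can check the Content-Type header in the spider. A sketch (the `is_html` helper is my own, not part of rs-trafilatura):

```python
HTML_TYPES = {b"text/html", b"application/xhtml+xml"}


def is_html(content_type: bytes) -> bool:
    """True if a raw Content-Type header value denotes an HTML document."""
    return content_type.split(b";")[0].strip().lower() in HTML_TYPES


# In the spider's parse() — Scrapy header values are bytes:
# if is_html(response.headers.get("Content-Type", b"")):
#     yield {"url": response.url, "body": response.body}
```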
## Links
- rs-trafilatura Python package: pypi.org/project/rs-trafilatura · GitHub
- Rust crate: crates.io/crates/rs-trafilatura · GitHub
- Scrapy: scrapy.org
- Benchmark: webcontentextraction.org · GitHub · Zenodo