How to Use rs-trafilatura with Scrapy

Murrough Foley

Scrapy is the standard Python framework for web scraping. It handles crawling, scheduling, and data pipelines. rs-trafilatura plugs into Scrapy as an item pipeline — your spider yields items with HTML, and the pipeline adds structured extraction results automatically.

Install

pip install rs-trafilatura scrapy

Setup

Add the pipeline to your Scrapy project's settings.py:

ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
}

That's it. Every item that passes through the pipeline with a body (bytes) or html (string) field will get an extraction dict added to it.
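Conceptually, the dispatch rule works like this minimal sketch (a re-implementation for illustration, not the library's actual code):

```python
def wants_extraction(item: dict) -> bool:
    # Mirror of the rule described above: the pipeline processes items
    # carrying raw bytes under "body" or a decoded string under "html",
    # and passes everything else through untouched.
    return isinstance(item.get("body"), bytes) or isinstance(item.get("html"), str)

wants_extraction({"url": "https://example.com", "body": b"<html>"})  # → True
wants_extraction({"url": "https://example.com", "custom_data": "x"})  # → False
```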

Writing the Spider

Your spider yields items with the response body and URL:

import scrapy

class ContentSpider(scrapy.Spider):
    name = "content"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {
            "url": response.url,
            "body": response.body,  # raw bytes — rs-trafilatura auto-detects encoding
        }

        # Follow links
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, self.parse)

The pipeline picks up body (bytes) or html (string). When it finds one, it runs extraction and adds the results under item["extraction"].

What the Pipeline Adds

Each processed item gets an extraction dict:

{
    "url": "https://example.com/blog/post",
    "body": b"<html>...",
    "extraction": {
        "title": "Blog Post Title",
        "author": "John Doe",
        "date": "2026-01-15T00:00:00+00:00",
        "main_content": "The full extracted text...",
        "content_markdown": "# Blog Post Title\n\nThe full extracted text...",
        "page_type": "article",
        "extraction_quality": 0.95,
        "language": "en",
        "sitename": "Example Blog",
        "description": "A blog post about...",
    }
}

Enabling Markdown Output

Add to settings.py:

RS_TRAFILATURA_MARKDOWN = True

This populates item["extraction"]["content_markdown"] with GitHub Flavored Markdown.
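A common follow-up is writing that Markdown to disk. Here is a hypothetical downstream pipeline (the class name and filename scheme are my own; only the item fields come from the library). Register it in ITEM_PIPELINES with a priority above 300 so it runs after extraction:

```python
from pathlib import Path


class MarkdownExportPipeline:
    """Hypothetical pipeline: write each item's content_markdown to a .md file."""

    def __init__(self, out_dir="markdown_out"):
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(exist_ok=True)

    def process_item(self, item, spider):
        md = item.get("extraction", {}).get("content_markdown")
        if md:
            # Name the file after the last URL path segment,
            # e.g. ".../blog/post" -> "post.md"; fall back to "index".
            name = item["url"].rstrip("/").rsplit("/", 1)[-1] or "index"
            (self.out_dir / f"{name}.md").write_text(md, encoding="utf-8")
        return item
```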

Filtering by Page Type

The page type classification lets you route items differently based on what kind of page they are:

class ContentSpider(scrapy.Spider):
    name = "content"
    start_urls = ["https://example.com"]

    custom_settings = {
        "ITEM_PIPELINES": {
            "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
            "myproject.pipelines.PageTypeRouter": 400,
        },
    }

    def parse(self, response):
        yield {"url": response.url, "body": response.body}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, self.parse)
# myproject/pipelines.py
# save_product, save_forum_post, save_article, and save_generic are your own
# storage helpers; define them wherever you persist data.
class PageTypeRouter:
    def process_item(self, item, spider):
        ext = item.get("extraction", {})
        page_type = ext.get("page_type", "article")

        if page_type == "product":
            # Save to products table
            save_product(item)
        elif page_type == "forum":
            # Save to discussions table
            save_forum_post(item)
        elif page_type == "article":
            # Save to articles table
            save_article(item)
        else:
            # Default handling
            save_generic(item)

        return item

Filtering by Extraction Quality

Drop items where extraction quality is low:

from scrapy.exceptions import DropItem


class QualityFilter:
    def process_item(self, item, spider):
        ext = item.get("extraction", {})
        quality = ext.get("extraction_quality", 0)

        if quality < 0.5:
            raise DropItem(
                f"Low extraction quality ({quality:.2f}): {item['url']}"
            )

        return item

Add it before the router (lower pipeline numbers run first):

ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
    "myproject.pipelines.QualityFilter": 350,
    "myproject.pipelines.PageTypeRouter": 400,
}

Exporting to JSON Lines

Scrapy's built-in feed exports work out of the box:

scrapy crawl content -o output.jsonl

Each line in output.jsonl will contain the full item including the extraction dict. You can then process it with any tool that reads JSON Lines.
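For example, a small consumer script can pick articles out of the feed. The sample line below is made-up data in the item shape shown earlier (the bytes "body" field is omitted for brevity):

```python
import json

# One made-up line in the shape the crawl above exports.
sample_jsonl = (
    '{"url": "https://example.com/blog/post", '
    '"extraction": {"title": "Blog Post Title", "page_type": "article", '
    '"extraction_quality": 0.95}}'
)

articles = []
for line in sample_jsonl.splitlines():
    item = json.loads(line)
    ext = item.get("extraction", {})
    if ext.get("page_type") == "article":
        articles.append((item["url"], ext["title"]))

print(articles)  # [('https://example.com/blog/post', 'Blog Post Title')]
```

In a real run you would iterate over `open("output.jsonl")` instead of the sample string.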

Performance

rs-trafilatura extracts in ~44ms per page via compiled Rust (PyO3, no subprocess). On a typical Scrapy crawl, that is negligible overhead compared to network latency. The pipeline runs extraction synchronously in the Scrapy reactor thread, so each item briefly blocks the event loop, but at ~44ms per page the pause is small enough that it rarely limits crawl throughput.

For very high-throughput crawls (1000+ pages/second), consider running extraction in a separate worker process instead, for example by having the spider export raw HTML and post-processing the dump, so the reactor thread stays dedicated to networking.

Items Without HTML

If an item doesn't have body or html, the pipeline passes it through unchanged:

# This item has no HTML — pipeline ignores it
yield {"url": response.url, "custom_data": "something"}
# → No "extraction" key added, item passes through as-is
