Scrapy is the standard Python framework for web scraping. It handles crawling, scheduling, and data pipelines. rs-trafilatura plugs into Scrapy as an item pipeline — your spider yields items with HTML, and the pipeline adds structured extraction results automatically.
## Install

```bash
pip install rs-trafilatura scrapy
```
## Setup

Add the pipeline to your Scrapy project's `settings.py`:

```python
ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
}
```
That's it. Every item that passes through the pipeline with a `body` (bytes) or `html` (string) field gets an `extraction` dict added to it.
## Writing the Spider
Your spider yields items with the response body and URL:
```python
import scrapy


class ContentSpider(scrapy.Spider):
    name = "content"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {
            "url": response.url,
            "body": response.body,  # raw bytes; rs-trafilatura auto-detects encoding
        }
        # Follow links
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, self.parse)
```
The pipeline picks up `body` (bytes) or `html` (string). When it finds one, it runs extraction and adds the results under `item["extraction"]`.
## What the Pipeline Adds

Each processed item gets an `extraction` dict:

```python
{
    "url": "https://example.com/blog/post",
    "body": b"<html>...",
    "extraction": {
        "title": "Blog Post Title",
        "author": "John Doe",
        "date": "2026-01-15T00:00:00+00:00",
        "main_content": "The full extracted text...",
        "content_markdown": "# Blog Post Title\n\nThe full extracted text...",
        "page_type": "article",
        "extraction_quality": 0.95,
        "language": "en",
        "sitename": "Example Blog",
        "description": "A blog post about...",
    },
}
```
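Downstream code can read these fields directly. A minimal sketch (the `summarize` helper is my own illustration, not part of rs-trafilatura):

```python
def summarize(item: dict) -> str:
    """One-line summary built from the extraction dict shown above."""
    ext = item.get("extraction", {})
    title = ext.get("title", "(untitled)")
    page_type = ext.get("page_type", "unknown")
    quality = ext.get("extraction_quality", 0.0)
    return f"{title} [{page_type}, quality {quality:.2f}]"


item = {
    "url": "https://example.com/blog/post",
    "extraction": {
        "title": "Blog Post Title",
        "page_type": "article",
        "extraction_quality": 0.95,
    },
}
print(summarize(item))  # → Blog Post Title [article, quality 0.95]
```

Using `.get()` with defaults keeps the helper safe for items the pipeline skipped.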
## Enabling Markdown Output

Add to `settings.py`:

```python
RS_TRAFILATURA_MARKDOWN = True
```

This populates `item["extraction"]["content_markdown"]` with GitHub Flavored Markdown.
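One way to consume it is a small downstream pipeline that writes each page's Markdown to disk. A hedged sketch (the output directory and URL-hash filename scheme are my own choices, not rs-trafilatura conventions):

```python
import hashlib
from pathlib import Path


class MarkdownWriterPipeline:
    """Write content_markdown to one .md file per item, named by URL hash."""

    def __init__(self, out_dir="markdown_out"):
        self.out_dir = Path(out_dir)

    def open_spider(self, spider):
        self.out_dir.mkdir(parents=True, exist_ok=True)

    def process_item(self, item, spider):
        md = item.get("extraction", {}).get("content_markdown")
        if md:
            # Hash the URL so the filename is filesystem-safe and stable.
            name = hashlib.sha256(item["url"].encode()).hexdigest()[:16]
            (self.out_dir / f"{name}.md").write_text(md, encoding="utf-8")
        return item
```

Register it in `ITEM_PIPELINES` with a priority above 300 so it runs after extraction.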
## Filtering by Page Type
The page type classification lets you route items differently based on what kind of page they are:
```python
class ContentSpider(scrapy.Spider):
    name = "content"
    start_urls = ["https://example.com"]

    custom_settings = {
        "ITEM_PIPELINES": {
            "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
            "myproject.pipelines.PageTypeRouter": 400,
        },
    }

    def parse(self, response):
        yield {"url": response.url, "body": response.body}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, self.parse)
```
```python
# myproject/pipelines.py
# save_product, save_forum_post, save_article, and save_generic are your
# own storage helpers (not shown here).
class PageTypeRouter:
    def process_item(self, item, spider):
        ext = item.get("extraction", {})
        page_type = ext.get("page_type", "article")
        if page_type == "product":
            # Save to products table
            save_product(item)
        elif page_type == "forum":
            # Save to discussions table
            save_forum_post(item)
        elif page_type == "article":
            # Save to articles table
            save_article(item)
        else:
            # Default handling
            save_generic(item)
        return item
```
## Filtering by Extraction Quality

Drop items where extraction quality is low:

```python
from scrapy.exceptions import DropItem


class QualityFilter:
    def process_item(self, item, spider):
        ext = item.get("extraction", {})
        quality = ext.get("extraction_quality", 0)
        if quality < 0.5:
            raise DropItem(
                f"Low extraction quality ({quality:.2f}): {item['url']}"
            )
        return item
```
Add it before the router:

```python
ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
    "myproject.pipelines.QualityFilter": 350,
    "myproject.pipelines.PageTypeRouter": 400,
}
```
## Exporting to JSON Lines

Scrapy's built-in feed exports work out of the box:

```bash
scrapy crawl content -o output.jsonl
```
Each line in `output.jsonl` contains the full item, including the `extraction` dict. You can then process it with any tool that reads JSON Lines.
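For example, a short post-processing script that keeps only high-quality article pages (the file name matches the crawl command above; the 0.8 threshold is arbitrary):

```python
import json


def load_articles(path, min_quality=0.8):
    """Yield extraction dicts for article pages above a quality threshold."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            item = json.loads(line)
            ext = item.get("extraction", {})
            if ext.get("page_type") == "article" and ext.get("extraction_quality", 0) >= min_quality:
                yield ext


# for ext in load_articles("output.jsonl"):
#     print(ext["title"])
```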
## Performance
rs-trafilatura extracts in ~44 ms per page via compiled Rust (PyO3, no subprocess). On a typical Scrapy crawl, extraction adds negligible overhead compared to network latency. The pipeline runs extraction synchronously on the Scrapy reactor thread; each call is CPU-bound, but at ~44 ms it stalls the reactor only briefly, so downloads are not noticeably delayed.

For very high-throughput crawls (1000+ pages/second), consider moving extraction off the reactor thread, for example into a separate worker process fed from the item pipeline.
## Items Without HTML

If an item has neither `body` nor `html`, the pipeline passes it through unchanged:

```python
# This item has no HTML, so the pipeline ignores it
yield {"url": response.url, "custom_data": "something"}
# → no "extraction" key added; the item passes through as-is
```
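If you would rather not yield non-HTML responses (images, PDFs, feeds) at all, you can check the Content-Type header in the spider. A sketch (the `is_html` helper is my own, not part of rs-trafilatura):

```python
HTML_TYPES = {b"text/html", b"application/xhtml+xml"}


def is_html(content_type: bytes) -> bool:
    """True if a raw Content-Type header value denotes an HTML document."""
    return content_type.split(b";")[0].strip().lower() in HTML_TYPES


# In the spider's parse() — Scrapy header values are bytes:
# if is_html(response.headers.get("Content-Type", b"")):
#     yield {"url": response.url, "body": response.body}
```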
## Links
- rs-trafilatura Python package: pypi.org/project/rs-trafilatura · GitHub
- Rust crate: crates.io/crates/rs-trafilatura · GitHub
- Scrapy: scrapy.org
- Benchmark: webcontentextraction.org · GitHub · Zenodo