Async Web Scraping with scrapy_cffi

Introduction

scrapy_cffi is a lightweight async-first web scraping framework that follows a Scrapy-style architecture.

It is designed for developers who want a familiar crawling flow, but with full asyncio support, modular utilities, and flexible integration points.

The framework uses curl_cffi as the default HTTP client (a requests-like API with more capabilities), but the request layer is fully decoupled from the engine, making it easy to swap in other HTTP libraries in the future.

Even if you don't need a full crawler, many of the utility libraries can be used independently.

💡 IDE-friendly: The framework emphasizes code completion, type hints, and programmatic settings creation, making development and debugging smoother in modern Python IDEs.


Why scrapy_cffi?

scrapy_cffi was designed with several core principles in mind:

  • API-first & Modular: All spiders, pipelines, and tasks are fully accessible via Python interfaces. CLI is optional, and settings are generated programmatically to support both single and batch spider execution modes.
  • Async Execution: Fully asyncio-based engine allows high concurrency, HTTP + WebSocket support, and smooth integration with async workflows.
  • Scrapy-style Architecture: Spider flow, pipelines, and hooks resemble Scrapy, making it easy for existing Scrapy users to transition.
  • Decoupled Request Layer: By default, curl_cffi is used, but the scheduler and engine are independent of the HTTP client. This allows flexible swapping of request libraries without touching the crawler core.
  • Utility-first: Components like HTTP, WebSocket, media handling, JSON parsing, and database adapters can be used independently, supporting small scripts or full asynchronous crawlers alike.

✨ Features

  • 🕸️ Scrapy-style components: spiders, items, pipelines, interceptors
  • ⚡ Fully asyncio-based engine for high concurrency
  • 🌐 HTTP & WebSocket support with TLS
  • 🔔 Lightweight signal system
  • 🔌 Plug-in ready interceptor & task manager
  • 🗄️ Redis-compatible scheduler (optional)
  • 💾 Built-in adapters for Redis, MySQL, and MongoDB with automatic retry & reconnection

🚀 Quick Start

# Install
pip install scrapy_cffi

# Create a new project
scrapy-cffi startproject myproject
cd myproject

# Generate a spider
scrapy-cffi genspider myspider example.com

# Run your crawler
python runner.py


Note: The CLI command changed from scrapy_cffi (≤0.1.4) to scrapy-cffi (>0.1.4).
Because scrapy_cffi uses programmatic settings creation and an API-first design, spider execution does not depend on the CLI.

Full documentation: docs/

⭐ Star & contribute on GitHub: scrapy_cffi

⚡ Handy Utilities

scrapy_cffi provides several async-first and utility-focused features that make crawling and async task orchestration easier:

Async Crawling

  • Supports both async generators (async def) and Scrapy-style synchronous generators (see the sketch after this list).
  • Fully asyncio-based execution with high concurrency.
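
A minimal sketch of what this looks like, assuming Scrapy-style conventions: the import paths, the Spider base class, and the HttpRequest keyword arguments below are illustrative assumptions, not signatures taken from the docs.

# Hedged sketch of an async-generator spider; import paths and HttpRequest
# arguments are assumptions based on the Scrapy-style API.
from scrapy_cffi.spiders import Spider     # assumed import path
from scrapy_cffi.http import HttpRequest   # assumed import path


class ExampleSpider(Spider):
    name = "example"
    start_urls = ["https://example.com/"]

    async def parse(self, response):
        # Async-generator callback: yield items and follow-up requests as they
        # become ready. A plain Scrapy-style "def parse" generator also works.
        yield {"title": response.css("title::text")}
        yield HttpRequest(url="https://example.com/page/2", callback=self.parse)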

ResultHolder

  • Aggregate multiple request results before generating the next batch.
  • Useful for multi-stage workflows and distributed tasks.

Hooks System

  • Access sessions, scheduler, or other subsystems safely.
  • Supports multi-user cookies and session rotation.

HTTP + WebSocket Requests

  • Send HTTP & WebSocket requests from a single spider (see the sketch after this list).
  • TLS support included.
  • Advanced curl_cffi features: TLS/JA3 fingerprinting, proxy control, unified HTTP/WS API.
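
As a rough illustration of mixing the two in one spider: the start_requests hook, the WebSocketRequest arguments, and the response attributes below are assumptions based on Scrapy conventions, not the documented signatures.

# Hedged sketch: HTTP and WebSocket requests driven by the same spider.
# Import paths, constructor arguments, and response attributes are assumptions.
from scrapy_cffi.spiders import Spider                       # assumed path
from scrapy_cffi.http import HttpRequest, WebSocketRequest   # assumed path


class FeedSpider(Spider):
    name = "feed"

    async def start_requests(self):
        # Ordinary HTTP request handled by the curl_cffi-backed client
        yield HttpRequest(url="https://example.com/api", callback=self.parse_http)
        # WebSocket connection handled by the same engine
        yield WebSocketRequest(url="wss://example.com/live", callback=self.parse_ws)

    async def parse_http(self, response):
        yield {"source": "http", "body": response.text}

    async def parse_ws(self, response):
        yield {"source": "ws", "frame": response.text}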

Request & Response Utilities

  • HttpRequest / WebSocketRequest with optional Protobuf & gRPC encoding.
  • MediaRequest for segmented downloads (videos, large files).
  • HttpResponse selector with .css(), .xpath(), .re() (used in the sketch after this list).
  • Robust JSON extraction:
    • extract_json() for standard JSON.
    • extract_json_strong() for malformed or embedded JSON.
  • Protobuf / gRPC decoding from HTTP or WebSocket responses.
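
As a quick example of combining these in a callback (the scrapy_cffi.utils import path and the response.text attribute are assumptions; the selector methods and JSON helpers are the ones listed above):

# Hedged sketch: selectors plus the JSON helpers on an HttpResponse.
from scrapy_cffi.utils import extract_json, extract_json_strong  # assumed path


async def parse_page(response):
    title = response.css("title::text")            # CSS selector
    prices = response.re(r"\$\d+\.\d{2}")          # regex over the body
    data = extract_json(response.text)             # well-formed JSON payloads
    embedded = extract_json_strong(response.text)  # malformed or JS-embedded JSON
    return {"title": title, "prices": prices, "data": data, "embedded": embedded}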

Database Support

Built-in adapters with automatic retry & reconnection:

  • RedisManager (redis.asyncio.Redis compatible)
  • SQLAlchemyMySQLManager (async SQLAlchemy engine & session; the underlying SQLAlchemy API remains available)
  • MongoDBManager (async Motor client; the native Motor API remains available)

MongoDB & MySQL usage examples:

  • MongoDB
  • MySQL
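
For Redis, since the adapter is described as redis.asyncio.Redis compatible, usage presumably mirrors the regular async Redis client. In the sketch below, the import path, constructor arguments, and key names are assumptions for illustration only.

# Hedged sketch: RedisManager used as a drop-in for redis.asyncio.Redis.
# Import path and constructor are assumptions; get/set assume the usual
# redis.asyncio client methods are exposed.
import asyncio

from scrapy_cffi.databases import RedisManager  # assumed import path


async def main():
    redis = RedisManager(url="redis://localhost:6379/0")  # assumed constructor
    await redis.set("spider:last_run", "ok")
    print(await redis.get("spider:last_run"))


if __name__ == "__main__":
    asyncio.run(main())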

Multi-process RPC with ProcessManager

scrapy_cffi includes a lightweight ProcessManager for quick multi-process RPC registration.

This is ideal for small projects or debugging without relying on MQ/Redis, but not recommended for production.

  • Supports function, class, and object registration for remote calls.
  • Allows starting a server to expose registered methods and a client to connect and call them.
  • Runs each registered callable in a separate process if needed, with optional result retrieval.
  • Works cross-platform, though Ctrl+C handling on Windows is limited by how processes are started there.

Example:
from scrapy_cffi.utils import ProcessManager

# Register methods
def hello(name: str):
    return f"Hello, {name}!"

class Greeter:
    def greet(self, msg: str):
        return f"Greeting: {msg}"

class Counter:
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
        return self.value
    def get(self):
        return self.value

counter = Counter()

# Start server (run_mode=0 below is blocking, so in practice this part runs in its own script or process)
manager = ProcessManager(register_methods={
    "hello": hello,
    "Greeter": Greeter,
    "counter": counter
})
manager.start_server(run_mode=0)  # blocking mode

# Start client (in a separate script/process: connect and call the registered methods)
manager.start_client()
print(manager.hello("World"))
c = manager.counter()
print(c.inc())
g = manager.Greeter()
print(g.greet("Hi"))

Tip: ProcessManager is designed for rapid prototyping and small-scale tasks. For production-grade distributed systems, consider using a full-featured message queue or RPC framework.

scrapy_cffi is currently in development. Its modular and API-first design allows developers to either use it as a full-fledged Scrapy-style framework or pick individual utilities for smaller, async-first scraping tasks. The ultimate goal is high flexibility, independent utilities, and easy extensibility for complex crawling projects.
