Async Web Scraping with scrapy_cffi

Introduction

scrapy_cffi is a lightweight async-first web scraping framework that follows a Scrapy-style architecture.

It is designed for developers who want a familiar crawling flow, but with full asyncio support, modular utilities, and flexible integration points.

The framework uses curl_cffi as the default HTTP client (a requests-like API with more capabilities), but the request layer is fully decoupled from the engine, making it easy to swap in other HTTP libraries in the future.

Even if you don't need a full crawler, many of the utility libraries can be used independently.

💡 IDE-friendly: The framework emphasizes code completion, type hints, and programmatic settings creation, making development and debugging smoother in modern Python IDEs.


Why scrapy_cffi?

scrapy_cffi was designed with several core principles in mind:

  • API-first & Modular: All spiders, pipelines, and tasks are fully accessible via Python interfaces. CLI is optional, and settings are generated programmatically to support both single and batch spider execution modes.
  • Async Execution: Fully asyncio-based engine allows high concurrency, HTTP + WebSocket support, and smooth integration with async workflows.
  • Scrapy-style Architecture: Spider flow, pipelines, and hooks resemble Scrapy, making it easy for existing Scrapy users to transition.
  • Decoupled Request Layer: By default, curl_cffi is used, but the scheduler and engine are independent of the HTTP client. This allows flexible swapping of request libraries without touching the crawler core.
  • Utility-first: Components like HTTP, WebSocket, media handling, JSON parsing, and database adapters can be used independently, supporting small scripts or full asynchronous crawlers alike.

✨ Features

  • 🕸️ Scrapy-style components: spiders, items, pipelines, interceptors
  • ⚡ Fully asyncio-based engine for high concurrency
  • 🌐 HTTP & WebSocket support with TLS
  • 🔔 Lightweight signal system
  • 🔌 Plug-in ready interceptor & task manager
  • 🗄️ Redis-compatible scheduler (optional)
  • 💾 Built-in adapters for Redis, MySQL, and MongoDB with automatic retry & reconnection

🚀 Quick Start

# Install
pip install scrapy_cffi

# Create a new project
scrapy-cffi startproject myproject
cd myproject

# Generate a spider
scrapy-cffi genspider myspider example.com

# Run your crawler
python runner.py


Note: The CLI command changed from scrapy_cffi (≤0.1.4) to scrapy-cffi (>0.1.4).
Because scrapy_cffi uses programmatic settings creation and an API-first design, spider execution does not depend on the CLI.

Full documentation: docs/

⭐ Star & contribute on GitHub: scrapy_cffi

⚡ Handy Utilities

scrapy_cffi provides several async-first and utility-focused features that make crawling and async task orchestration easier:

Async Crawling

  • Supports both async generators (async def) and Scrapy-style synchronous generators (see the sketch after this list).
  • Fully asyncio-based execution with high concurrency.
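
A minimal sketch of what this looks like, assuming Scrapy-style conventions: the import paths, the Spider base class, and the HttpRequest keyword arguments below are illustrative assumptions, not signatures taken from the docs.

# Hedged sketch of an async-generator spider; import paths and HttpRequest
# arguments are assumptions based on the Scrapy-style API.
from scrapy_cffi.spiders import Spider     # assumed import path
from scrapy_cffi.http import HttpRequest   # assumed import path


class ExampleSpider(Spider):
    name = "example"
    start_urls = ["https://example.com/"]

    async def parse(self, response):
        # Async-generator callback: yield items and follow-up requests as they
        # become ready. A plain Scrapy-style "def parse" generator also works.
        yield {"title": response.css("title::text")}
        yield HttpRequest(url="https://example.com/page/2", callback=self.parse)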

ResultHolder

  • Aggregate multiple request results before generating the next batch.
  • Useful for multi-stage workflows and distributed tasks.

Hooks System

  • Access sessions, scheduler, or other subsystems safely.
  • Supports multi-user cookies and session rotation.

HTTP + WebSocket Requests

  • Send HTTP & WebSocket requests from a single spider (see the sketch after this list).
  • TLS support included.
  • Advanced curl_cffi features: TLS/JA3 fingerprinting, proxy control, unified HTTP/WS API.
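
As a rough illustration of mixing the two in one spider: the start_requests hook, the WebSocketRequest arguments, and the response attributes below are assumptions based on Scrapy conventions, not the documented signatures.

# Hedged sketch: HTTP and WebSocket requests driven by the same spider.
# Import paths, constructor arguments, and response attributes are assumptions.
from scrapy_cffi.spiders import Spider                       # assumed path
from scrapy_cffi.http import HttpRequest, WebSocketRequest   # assumed path


class FeedSpider(Spider):
    name = "feed"

    async def start_requests(self):
        # Ordinary HTTP request handled by the curl_cffi-backed client
        yield HttpRequest(url="https://example.com/api", callback=self.parse_http)
        # WebSocket connection handled by the same engine
        yield WebSocketRequest(url="wss://example.com/live", callback=self.parse_ws)

    async def parse_http(self, response):
        yield {"source": "http", "body": response.text}

    async def parse_ws(self, response):
        yield {"source": "ws", "frame": response.text}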

Request & Response Utilities

  • HttpRequest / WebSocketRequest with optional Protobuf & gRPC encoding.
  • MediaRequest for segmented downloads (videos, large files).
  • HttpResponse selector with .css(), .xpath(), .re() (used in the sketch after this list).
  • Robust JSON extraction:
    • extract_json() for standard JSON.
    • extract_json_strong() for malformed or embedded JSON.
  • Protobuf / gRPC decoding from HTTP or WebSocket responses.
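
As a quick example of combining these in a callback (the scrapy_cffi.utils import path and the response.text attribute are assumptions; the selector methods and JSON helpers are the ones listed above):

# Hedged sketch: selectors plus the JSON helpers on an HttpResponse.
from scrapy_cffi.utils import extract_json, extract_json_strong  # assumed path


async def parse_page(response):
    title = response.css("title::text")            # CSS selector
    prices = response.re(r"\$\d+\.\d{2}")          # regex over the body
    data = extract_json(response.text)             # well-formed JSON payloads
    embedded = extract_json_strong(response.text)  # malformed or JS-embedded JSON
    return {"title": title, "prices": prices, "data": data, "embedded": embedded}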

Database Support

Built-in adapters with automatic retry & reconnection:

  • RedisManager (redis.asyncio.Redis compatible)
  • SQLAlchemyMySQLManager (async SQLAlchemy engine & session; the underlying SQLAlchemy API remains available)
  • MongoDBManager (async Motor client; the native Motor API remains available)

MongoDB & MySQL usage examples:

  • MongoDB
  • MySQL
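
For Redis, since the adapter is described as redis.asyncio.Redis compatible, usage presumably mirrors the regular async Redis client. In the sketch below, the import path, constructor arguments, and key names are assumptions for illustration only.

# Hedged sketch: RedisManager used as a drop-in for redis.asyncio.Redis.
# Import path and constructor are assumptions; get/set assume the usual
# redis.asyncio client methods are exposed.
import asyncio

from scrapy_cffi.databases import RedisManager  # assumed import path


async def main():
    redis = RedisManager(url="redis://localhost:6379/0")  # assumed constructor
    await redis.set("spider:last_run", "ok")
    print(await redis.get("spider:last_run"))


if __name__ == "__main__":
    asyncio.run(main())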

Multi-process RPC with ProcessManager

scrapy_cffi includes a lightweight ProcessManager for quick multi-process RPC registration.

This is ideal for small projects or debugging without relying on MQ/Redis, but not recommended for production.

  • Supports function, class, and object registration for remote calls.
  • Allows starting a server to expose registered methods and a client to connect and call them.
  • Runs each registered callable in a separate process if needed, with optional result retrieval.
  • Works cross-platform, though Ctrl+C handling on Windows is limited by how processes are started there.

Example:
from scrapy_cffi.utils import ProcessManager

# Register methods
def hello(name: str):
    return f"Hello, {name}!"

class Greeter:
    def greet(self, msg: str):
        return f"Greeting: {msg}"

class Counter:
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
        return self.value
    def get(self):
        return self.value

counter = Counter()

# Start server (run_mode=0 below is blocking, so in practice this part runs in its own script or process)
manager = ProcessManager(register_methods={
    "hello": hello,
    "Greeter": Greeter,
    "counter": counter
})
manager.start_server(run_mode=0)  # blocking mode

# Start client (in a separate script/process: connect and call the registered methods)
manager.start_client()
print(manager.hello("World"))
c = manager.counter()
print(c.inc())
g = manager.Greeter()
print(g.greet("Hi"))

Tip: ProcessManager is designed for rapid prototyping and small-scale tasks. For production-grade distributed systems, consider using a full-featured message queue or RPC framework.

scrapy_cffi is currently in development. Its modular and API-first design allows developers to either use it as a full-fledged Scrapy-style framework or pick individual utilities for smaller, async-first scraping tasks. The ultimate goal is high flexibility, independent utilities, and easy extensibility for complex crawling projects.
