Introduction
`scrapy_cffi` is a lightweight async-first web scraping framework that follows a Scrapy-style architecture.
It is designed for developers who want a familiar crawling flow, but with full asyncio support, modular utilities, and flexible integration points.
The framework uses `curl_cffi` as the default HTTP client (a `requests`-like API, but more powerful). The request layer is fully decoupled from the engine, allowing easy replacement with other HTTP libraries in the future.
Even if you don't need a full crawler, many of the utility libraries can be used independently.
💡 IDE-friendly: The framework emphasizes code completion, type hints, and programmatic settings creation, making development and debugging smoother in modern Python IDEs.
Why `scrapy_cffi`?
`scrapy_cffi` was designed with several core principles in mind:
- API-first & Modular: All spiders, pipelines, and tasks are fully accessible via Python interfaces. CLI is optional, and settings are generated programmatically to support both single and batch spider execution modes.
- Async Execution: Fully asyncio-based engine allows high concurrency, HTTP + WebSocket support, and smooth integration with async workflows.
- Scrapy-style Architecture: Spider flow, pipelines, and hooks resemble Scrapy, making it easy for existing Scrapy users to transition.
- Decoupled Request Layer: By default, `curl_cffi` is used, but the scheduler and engine are independent of the HTTP client. This allows flexible swapping of request libraries without touching the crawler core.
- Utility-first: Components like HTTP, WebSocket, media handling, JSON parsing, and database adapters can be used independently, supporting small scripts or full asynchronous crawlers alike.
✨ Features
- 🕸️ Scrapy-style components: spiders, items, pipelines, interceptors
- ⚡ Fully asyncio-based engine for high concurrency
- 🌐 HTTP & WebSocket support with TLS
- 🔔 Lightweight signal system
- 🔌 Plug-in ready interceptor & task manager
- 🗄️ Redis-compatible scheduler (optional)
- 💾 Built-in adapters for Redis, MySQL, and MongoDB with automatic retry & reconnection
🚀 Quick Start
```bash
# Install
pip install scrapy_cffi

# Create a new project
scrapy-cffi startproject myproject
cd myproject

# Generate a spider
scrapy-cffi genspider myspider example.com

# Run your crawler
python runner.py
```
Note: The CLI command changed from `scrapy_cffi` (≤0.1.4) to `scrapy-cffi` (>0.1.4).
Because `scrapy_cffi` uses programmatic settings creation and an API-first design, the framework does not rely on the CLI for spider execution.
Full documentation: docs/
⭐ Star & contribute on GitHub: scrapy_cffi
⚡ Handy Utilities
`scrapy_cffi` provides several async-first and utility-focused features that make crawling and async task orchestration easier:
Async Crawling
- Supports both `async def` async generators and Scrapy-style synchronous generators (a minimal sketch follows below).
- Fully asyncio-based execution with high concurrency.
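Here is a minimal sketch of an async spider. The import paths, the `Spider` base class, and the `HttpRequest` constructor arguments are assumptions modeled on the Scrapy-style API described in this article, not the framework's confirmed interface:

```python
# Hypothetical sketch: module paths and signatures are assumptions
# based on the Scrapy-style API described in this article.
from scrapy_cffi.spiders import Spider      # assumed import path
from scrapy_cffi.http import HttpRequest    # assumed import path

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://example.com/"]

    async def parse(self, response):
        # Async generator: yield follow-up requests as they are discovered.
        for href in response.css("a::attr(href)").getall():
            yield HttpRequest(url=response.urljoin(href), callback=self.parse_item)

    async def parse_item(self, response):
        # Items can be plain dicts, Scrapy-style.
        yield {"url": response.url, "title": response.css("title::text").get()}
```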
ResultHolder
- Aggregate multiple request results before generating the next batch.
- Useful for multi-stage workflows and distributed tasks.
Hooks System
- Access sessions, scheduler, or other subsystems safely.
- Supports multi-user cookies and session rotation.
HTTP + WebSocket Requests
- Send HTTP & WebSocket requests in a single Spider.
- TLS support included.
- Advanced `curl_cffi` features: TLS/JA3 fingerprinting, proxy control, unified HTTP/WS API (see the example below).
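The TLS/JA3 fingerprinting and proxy options come from `curl_cffi` itself. The example below uses plain `curl_cffi` (the underlying client, not the framework's request wrapper); the proxy address is a placeholder:

```python
import asyncio
from curl_cffi.requests import AsyncSession

async def main():
    # Impersonate a browser's TLS/JA3 fingerprint via curl_cffi.
    async with AsyncSession(impersonate="chrome") as session:
        resp = await session.get(
            "https://example.com/",
            proxies={"https": "http://127.0.0.1:8080"},  # placeholder proxy
        )
        print(resp.status_code, len(resp.text))

asyncio.run(main())
```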
Request & Response Utilities
- `HttpRequest` / `WebSocketRequest` with optional Protobuf & gRPC encoding.
- `MediaRequest` for segmented downloads (videos, large files).
- `HttpResponse` selector with `.css()`, `.xpath()`, `.re()`.
- Robust JSON extraction (see the sketch after this list):
  - `extract_json()` for standard JSON.
  - `extract_json_strong()` for malformed or embedded JSON.
- Protobuf / gRPC decoding from HTTP or WebSocket responses.
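A hedged sketch of the JSON helpers, assuming they are importable from `scrapy_cffi.utils`; the exact module path and signatures are assumptions based on the list above:

```python
# Hypothetical sketch: import path and signatures are assumptions.
from scrapy_cffi.utils import extract_json, extract_json_strong  # assumed path

clean = '{"user": "alice", "id": 1}'
messy = 'window.__DATA__ = {"user": "bob", "id": 2,};'  # embedded JS, trailing comma

print(extract_json(clean))          # standard, well-formed JSON
print(extract_json_strong(messy))   # tolerant parsing for malformed/embedded JSON
```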
Database Support
Built-in adapters with automatic retry & reconnection:
- RedisManager (`redis.asyncio.Redis` compatible; see the sketch below)
- SQLAlchemyMySQLManager (async SQLAlchemy engine & session, original API supported)
- MongoDBManager (async Motor client, native API supported)
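As a sketch of the database adapters, the snippet below assumes `RedisManager` can be constructed from a connection URL and then used like `redis.asyncio.Redis`; the import path and constructor are assumptions, only the `redis.asyncio` compatibility is stated above:

```python
# Hypothetical sketch: import path and constructor are assumptions;
# the calls after construction follow the redis.asyncio.Redis API.
import asyncio
from scrapy_cffi.db import RedisManager  # assumed import path

async def main():
    redis = RedisManager(url="redis://localhost:6379/0")  # assumed constructor
    await redis.set("spider:last_run", "2024-01-01")
    print(await redis.get("spider:last_run"))

asyncio.run(main())
```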
Multi-process RPC with ProcessManager
`scrapy_cffi` includes a lightweight ProcessManager for quick multi-process RPC registration.
This is ideal for small projects or debugging without relying on MQ/Redis, but not recommended for production.
- Supports function, class, and object registration for remote calls.
- Allows starting a server to expose registered methods and a client to connect and call them.
- Runs each registered callable in a separate process if needed, with optional result retrieval.
- Works cross-platform, but Windows has some Ctrl+C limitations due to process startup.
```python
from scrapy_cffi.utils import ProcessManager

# Register methods
def hello(name: str):
    return f"Hello, {name}!"

class Greeter:
    def greet(self, msg: str):
        return f"Greeting: {msg}"

class Counter:
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

    def get(self):
        return self.value

counter = Counter()

# Start server
manager = ProcessManager(register_methods={
    "hello": hello,
    "Greeter": Greeter,
    "counter": counter,
})
manager.start_server(run_mode=0)  # blocking mode

# Start client
manager.start_client()
print(manager.hello("World"))

c = manager.counter()
print(c.inc())

g = manager.Greeter()
print(g.greet("Hi"))
```
Tip: `ProcessManager` is designed for rapid prototyping and small-scale tasks. For production-grade distributed systems, consider using a full-featured message queue or RPC framework.
`scrapy_cffi` is currently in development. Its modular and API-first design allows developers to either use it as a full-fledged Scrapy-style framework or pick individual utilities for smaller, async-first scraping tasks. The ultimate goal is high flexibility, independent utilities, and easy extensibility for complex crawling projects.