<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: strange Funny</title>
    <description>The latest articles on DEV Community by strange Funny (@strange_funny_ca3065432c5).</description>
    <link>https://dev.to/strange_funny_ca3065432c5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3498828%2F267fc040-50de-47c9-b4b0-a32e9d63bc29.png</url>
      <title>DEV Community: strange Funny</title>
      <link>https://dev.to/strange_funny_ca3065432c5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/strange_funny_ca3065432c5"/>
    <language>en</language>
    <item>
      <title>Async Web Scraping with scrapy_cffi</title>
      <dc:creator>strange Funny</dc:creator>
      <pubDate>Sat, 13 Sep 2025 06:10:40 +0000</pubDate>
      <link>https://dev.to/strange_funny_ca3065432c5/async-web-scraping-with-scrapycffi-1em3</link>
      <guid>https://dev.to/strange_funny_ca3065432c5/async-web-scraping-with-scrapycffi-1em3</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;scrapy_cffi&lt;/code&gt; is a lightweight &lt;strong&gt;async-first web scraping framework&lt;/strong&gt; that follows a &lt;strong&gt;Scrapy-style architecture&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
It is designed for developers who want a familiar crawling flow, but with &lt;strong&gt;full asyncio support&lt;/strong&gt;, modular utilities, and flexible integration points.  &lt;/p&gt;

&lt;p&gt;The framework uses &lt;code&gt;curl_cffi&lt;/code&gt; as the default HTTP client. It offers a &lt;code&gt;requests&lt;/code&gt;-like API with extras such as TLS/JA3 fingerprinting. The request layer is &lt;strong&gt;fully decoupled&lt;/strong&gt; from the engine, so other HTTP libraries can be swapped in later.&lt;/p&gt;

&lt;p&gt;Even if you don't need a full crawler, many of the utility libraries can be used independently.  &lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;IDE-friendly&lt;/strong&gt;: The framework emphasizes code completion, type hints, and programmatic settings creation, making development and debugging smoother in modern Python IDEs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why &lt;code&gt;scrapy_cffi&lt;/code&gt;?
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;scrapy_cffi&lt;/code&gt; was designed with several core principles in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API-first &amp;amp; Modular&lt;/strong&gt;: All spiders, pipelines, and tasks are fully accessible through Python interfaces. The CLI is optional, and settings are generated programmatically to support both single-spider and batch execution modes.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async Execution&lt;/strong&gt;: A fully asyncio-based engine provides high concurrency, HTTP + WebSocket support, and smooth integration with existing async workflows.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scrapy-style Architecture&lt;/strong&gt;: Spider flow, pipelines, and hooks resemble Scrapy, making it easy for existing Scrapy users to transition.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decoupled Request Layer&lt;/strong&gt;: By default, &lt;code&gt;curl_cffi&lt;/code&gt; is used, but the scheduler and engine are independent of the HTTP client. This allows flexible swapping of request libraries without touching the crawler core.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Utility-first&lt;/strong&gt;: Components like HTTP, WebSocket, media handling, JSON parsing, and database adapters can be used independently, supporting small scripts or full asynchronous crawlers alike.&lt;/li&gt;
&lt;/ul&gt;
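
&lt;p&gt;To illustrate what a decoupled request layer can look like, here is a minimal sketch using only the standard library. The names &lt;code&gt;HttpClient&lt;/code&gt;, &lt;code&gt;FakeClient&lt;/code&gt;, and &lt;code&gt;Engine&lt;/code&gt; are hypothetical and are not taken from &lt;code&gt;scrapy_cffi&lt;/code&gt;; the point is only that the engine depends on an interface rather than on a concrete HTTP library.&lt;/p&gt;

```python
import asyncio
from typing import Protocol


class HttpClient(Protocol):
    """Hypothetical interface: anything with an async fetch() qualifies."""

    async def fetch(self, url: str) -> str: ...


class FakeClient:
    """Stand-in client that returns canned bodies instead of doing network I/O."""

    async def fetch(self, url: str) -> str:
        await asyncio.sleep(0)  # yield to the event loop like a real request would
        return f"body of {url}"


class Engine:
    """Toy engine that only knows the HttpClient interface, not a concrete library."""

    def __init__(self, client: HttpClient) -> None:
        self.client = client

    async def crawl(self, urls: list[str]) -> list[str]:
        # Fetch all URLs concurrently; gather() preserves the input order.
        return await asyncio.gather(*(self.client.fetch(u) for u in urls))


if __name__ == "__main__":
    engine = Engine(FakeClient())
    print(asyncio.run(engine.crawl(["https://example.com/a", "https://example.com/b"])))
```

&lt;p&gt;Swapping the HTTP library then means writing another class with the same &lt;code&gt;fetch&lt;/code&gt; method and passing it to the engine.&lt;/p&gt;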




&lt;h2&gt;
  
  
  ✨ Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;🕸️ Scrapy-style components: spiders, items, pipelines, interceptors
&lt;/li&gt;
&lt;li&gt;⚡ Fully asyncio-based engine for high concurrency
&lt;/li&gt;
&lt;li&gt;🌐 HTTP &amp;amp; WebSocket support with TLS
&lt;/li&gt;
&lt;li&gt;🔔 Lightweight signal system
&lt;/li&gt;
&lt;li&gt;🔌 Plug-in ready interceptor &amp;amp; task manager
&lt;/li&gt;
&lt;li&gt;🗄️ Redis-compatible scheduler (optional)
&lt;/li&gt;
&lt;li&gt;💾 Built-in adapters for &lt;strong&gt;Redis, MySQL, and MongoDB&lt;/strong&gt; with automatic retry &amp;amp; reconnection
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🚀 Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy_cffi

&lt;span class="c"&gt;# Create a new project&lt;/span&gt;
scrapy-cffi startproject myproject
&lt;span class="nb"&gt;cd &lt;/span&gt;myproject

&lt;span class="c"&gt;# Generate a spider&lt;/span&gt;
scrapy-cffi genspider myspider example.com

&lt;span class="c"&gt;# Run your crawler&lt;/span&gt;
python runner.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: The CLI command changed from &lt;code&gt;scrapy_cffi&lt;/code&gt; (≤0.1.4) to &lt;code&gt;scrapy-cffi&lt;/code&gt; (&amp;gt;0.1.4).&lt;br&gt;
Because &lt;code&gt;scrapy_cffi&lt;/code&gt; uses programmatic settings creation and API-first design, the framework does not rely on CLI for spider execution.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Full documentation: &lt;a href="https://github.com/aFunnyStrange/scrapy_cffi/tree/main/docs/usage" rel="noopener noreferrer"&gt;&lt;code&gt;docs/&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⭐ Star &amp;amp; contribute on GitHub: &lt;a href="https://github.com/aFunnyStrange/scrapy_cffi" rel="noopener noreferrer"&gt;&lt;code&gt;scrapy_cffi&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚡ Handy Utilities
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;scrapy_cffi&lt;/code&gt; provides several &lt;strong&gt;async-first and utility-focused features&lt;/strong&gt; that make crawling and async task orchestration easier:&lt;/p&gt;

&lt;h3&gt;
  
  
  Async Crawling
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Supports callbacks written either as &lt;code&gt;async def&lt;/code&gt; async generators or as Scrapy-style synchronous generators.&lt;/li&gt;
&lt;li&gt;Fully asyncio-based execution with high concurrency.&lt;/li&gt;
&lt;/ul&gt;
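
&lt;p&gt;One way an engine could accept both generator flavours is to normalize them behind a single async iterator. The sketch below uses only the standard library; &lt;code&gt;parse_sync&lt;/code&gt;, &lt;code&gt;parse_async&lt;/code&gt;, &lt;code&gt;iter_results&lt;/code&gt;, and &lt;code&gt;collect&lt;/code&gt; are hypothetical names, and this is not necessarily how &lt;code&gt;scrapy_cffi&lt;/code&gt; does it internally.&lt;/p&gt;

```python
import asyncio
import inspect
from typing import Any, AsyncIterator


def parse_sync(response: str):
    """Scrapy-style callback written as a plain generator."""
    yield {"kind": "sync", "body": response}


async def parse_async(response: str):
    """Callback written as an async generator; it may await between yields."""
    await asyncio.sleep(0)
    yield {"kind": "async", "body": response}


async def iter_results(result: Any) -> AsyncIterator[Any]:
    """Yield items from either generator flavour through one async interface."""
    if inspect.isasyncgen(result):
        async for item in result:
            yield item
    else:
        for item in result:
            yield item


async def collect(callback, response: str) -> list:
    """Run a callback and gather everything it yields into a list."""
    return [item async for item in iter_results(callback(response))]


if __name__ == "__main__":
    print(asyncio.run(collect(parse_sync, "ok")))
    print(asyncio.run(collect(parse_async, "ok")))
```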

&lt;h3&gt;
  
  
  ResultHolder
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Aggregate multiple request results before generating the next batch.&lt;/li&gt;
&lt;li&gt;Useful for multi-stage workflows and distributed tasks.&lt;/li&gt;
&lt;/ul&gt;
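
&lt;p&gt;The aggregation idea can be sketched with plain &lt;code&gt;asyncio.gather&lt;/code&gt;: wait for every first-stage result, then derive the second-stage requests from the combined data. This does not use the real &lt;code&gt;ResultHolder&lt;/code&gt; API; &lt;code&gt;fetch_page&lt;/code&gt; and &lt;code&gt;build_next_batch&lt;/code&gt; are hypothetical helpers, and the returned ids are made up for the illustration.&lt;/p&gt;

```python
import asyncio


async def fetch_page(page: int) -> list[int]:
    """Hypothetical stage-one request: pretend each listing page yields two item ids."""
    await asyncio.sleep(0)
    return [page * 10, page * 10 + 1]


async def build_next_batch(pages: list[int]) -> list[str]:
    """Aggregate all stage-one results, then build the stage-two URLs."""
    # Wait for every stage-one result before producing any stage-two request.
    results = await asyncio.gather(*(fetch_page(p) for p in pages))
    item_ids = sorted(i for ids in results for i in ids)
    return [f"https://example.com/item/{i}" for i in item_ids]


if __name__ == "__main__":
    for url in asyncio.run(build_next_batch([1, 2])):
        print(url)
```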

&lt;h3&gt;
  
  
  Hooks System
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Access sessions, scheduler, or other subsystems safely.&lt;/li&gt;
&lt;li&gt;Supports multi-user cookies and session rotation.&lt;/li&gt;
&lt;/ul&gt;
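
&lt;p&gt;Session rotation across several users can be as simple as a round-robin over per-user cookie sets. The sketch below is a standalone illustration, assuming cookies are kept as plain dictionaries; &lt;code&gt;SessionPool&lt;/code&gt; and &lt;code&gt;next_session&lt;/code&gt; are hypothetical names, not part of the &lt;code&gt;scrapy_cffi&lt;/code&gt; hooks API.&lt;/p&gt;

```python
from itertools import cycle


class SessionPool:
    """Hypothetical pool that hands out per-user cookie sets in round-robin order."""

    def __init__(self, cookies_by_user: dict[str, dict[str, str]]) -> None:
        if not cookies_by_user:
            raise ValueError("at least one user session is required")
        self._sessions = cookies_by_user
        self._order = cycle(cookies_by_user)  # iterates over user names forever

    def next_session(self) -> tuple[str, dict[str, str]]:
        """Return the next (user, cookies) pair in rotation."""
        user = next(self._order)
        return user, self._sessions[user]


if __name__ == "__main__":
    pool = SessionPool({"alice": {"sid": "a1"}, "bob": {"sid": "b1"}})
    for _ in range(3):
        print(pool.next_session())
```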

&lt;h3&gt;
  
  
  HTTP + WebSocket Requests
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Send HTTP &amp;amp; WebSocket requests in a single Spider.&lt;/li&gt;
&lt;li&gt;TLS support included.&lt;/li&gt;
&lt;li&gt;Advanced &lt;code&gt;curl_cffi&lt;/code&gt; features: TLS/JA3 fingerprinting, proxy control, unified HTTP/WS API.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Request &amp;amp; Response Utilities
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;HttpRequest&lt;/code&gt; / &lt;code&gt;WebSocketRequest&lt;/code&gt; with optional Protobuf &amp;amp; gRPC encoding.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MediaRequest&lt;/code&gt; for segmented downloads (videos, large files).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;HttpResponse&lt;/code&gt; selector with &lt;code&gt;.css()&lt;/code&gt;, &lt;code&gt;.xpath()&lt;/code&gt;, &lt;code&gt;.re()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Robust JSON extraction:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;extract_json()&lt;/code&gt; for standard JSON.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;extract_json_strong()&lt;/code&gt; for malformed or embedded JSON.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Protobuf / gRPC decoding from HTTP or WebSocket responses.&lt;/li&gt;

&lt;/ul&gt;
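
&lt;p&gt;Pulling JSON out of surrounding text, as described for embedded JSON above, can be approximated with the standard library's &lt;code&gt;json.JSONDecoder.raw_decode&lt;/code&gt;. This is only a sketch of the general technique; &lt;code&gt;find_embedded_json&lt;/code&gt; is a hypothetical helper and makes no claim about how &lt;code&gt;extract_json_strong()&lt;/code&gt; is implemented. It also does not repair malformed JSON, it just skips it.&lt;/p&gt;

```python
import json
from typing import Any


def find_embedded_json(text: str) -> list[Any]:
    """Return every JSON object or array that can be decoded out of `text`."""
    decoder = json.JSONDecoder()
    found: list[Any] = []
    pos = 0
    while True:
        # Locate the next character that could start an object or an array.
        starts = [i for i in (text.find("{", pos), text.find("[", pos)) if i != -1]
        if not starts:
            return found
        start = min(starts)
        try:
            # raw_decode parses one value and reports where it ended.
            value, pos = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON here; keep scanning
            continue
        found.append(value)


if __name__ == "__main__":
    page = 'var data = {"id": 1, "tags": ["a", "b"]}; callback([1, 2]);'
    print(find_embedded_json(page))
```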

&lt;h3&gt;
  
  
  Database Support
&lt;/h3&gt;

&lt;p&gt;Built-in adapters with automatic retry &amp;amp; reconnection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RedisManager&lt;/strong&gt; (&lt;code&gt;redis.asyncio.Redis&lt;/code&gt; compatible)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQLAlchemyMySQLManager&lt;/strong&gt; (async SQLAlchemy engine &amp;amp; session, original API supported)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MongoDBManager&lt;/strong&gt; (async Motor client, native API supported)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;MongoDB &amp;amp; MySQL usage examples:&lt;br&gt;
&lt;a href="https://github.com/aFunnyStrange/scrapy_cffi/blob/main/tests/test_mongodb.py" rel="noopener noreferrer"&gt;&lt;code&gt;MongoDB&lt;/code&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/aFunnyStrange/scrapy_cffi/blob/main/tests/test_mysql.py" rel="noopener noreferrer"&gt;&lt;code&gt;MySQL&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
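
&lt;p&gt;The "automatic retry" behaviour mentioned above usually comes down to a loop with backoff around the database call. Here is a minimal, library-independent sketch; &lt;code&gt;with_retry&lt;/code&gt;, its parameters, and the choice of &lt;code&gt;ConnectionError&lt;/code&gt; as the retryable exception are assumptions for illustration, not the adapters' actual interface.&lt;/p&gt;

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retry(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.01,
) -> T:
    """Run `operation`, retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles on each retry: base_delay, 2x, 4x, ...
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

&lt;p&gt;A caller would wrap a single query, for example &lt;code&gt;await with_retry(lambda: run_query())&lt;/code&gt;, where &lt;code&gt;run_query&lt;/code&gt; stands for any coroutine function of your own.&lt;/p&gt;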

&lt;h3&gt;
  
  
  Multi-process RPC with &lt;code&gt;ProcessManager&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;scrapy_cffi&lt;/code&gt; includes a lightweight &lt;strong&gt;ProcessManager&lt;/strong&gt; for quick multi-process RPC registration.&lt;br&gt;&lt;br&gt;
This is ideal for &lt;strong&gt;small projects or debugging&lt;/strong&gt; without relying on MQ/Redis, but &lt;strong&gt;not recommended for production&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports &lt;strong&gt;function, class, and object registration&lt;/strong&gt; for remote calls.&lt;/li&gt;
&lt;li&gt;Allows starting a &lt;strong&gt;server&lt;/strong&gt; to expose registered methods and a &lt;strong&gt;client&lt;/strong&gt; to connect and call them.&lt;/li&gt;
&lt;li&gt;Runs each registered callable in a separate process if needed, with optional result retrieval.&lt;/li&gt;
&lt;li&gt;Works cross-platform, but Windows has some Ctrl+C limitations due to process startup.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy_cffi.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProcessManager&lt;/span&gt;

&lt;span class="c1"&gt;# Register methods
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hello&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Greeter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;greet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Greeting: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;inc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;

&lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Start server
&lt;/span&gt;&lt;span class="n"&gt;manager&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProcessManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;register_methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hello&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Greeter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Greeter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;counter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# blocking mode
&lt;/span&gt;
&lt;span class="c1"&gt;# Start client
&lt;/span&gt;&lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hello&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;World&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inc&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Greeter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;greet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;: &lt;code&gt;ProcessManager&lt;/code&gt; is designed for &lt;strong&gt;rapid prototyping and small-scale tasks&lt;/strong&gt;. For production-grade distributed systems, consider using a full-featured message queue or RPC framework.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;scrapy_cffi&lt;/code&gt; is currently in development. Its &lt;strong&gt;modular and API-first design&lt;/strong&gt; allows developers to either use it as a full-fledged Scrapy-style framework or pick individual utilities for smaller, async-first scraping tasks. The ultimate goal is &lt;strong&gt;high flexibility, independent utilities, and easy extensibility&lt;/strong&gt; for complex crawling projects.&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>asyncio</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
