<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Extract by Zyte</title>
    <description>The latest articles on DEV Community by Extract by Zyte (@extractdata).</description>
    <link>https://dev.to/extractdata</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F11159%2F9b0ab14b-3550-4e5e-b996-02b33c0912fa.jpg</url>
      <title>DEV Community: Extract by Zyte</title>
      <link>https://dev.to/extractdata</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/extractdata"/>
    <language>en</language>
    <item>
      <title>What the Scrapy Maintainer Thinks About AI-Generated Scrapers</title>
      <dc:creator>Neha Setia</dc:creator>
      <pubDate>Sat, 11 Apr 2026 15:13:41 +0000</pubDate>
      <link>https://dev.to/extractdata/what-the-scrapy-maintainer-thinks-about-ai-generated-scrapers-24c0</link>
      <guid>https://dev.to/extractdata/what-the-scrapy-maintainer-thinks-about-ai-generated-scrapers-24c0</guid>
      <description>&lt;p&gt;I sat down with Adrian Chaves, one of the lead &lt;a href="https://www.scrapy.org/" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; maintainers, who also works at Zyte, to ask him the questions I've been chewing on since Zyte launched Web Scraping Copilot: &lt;strong&gt;what happens when an LLM writes your spider(the web scraping code)? What gets easier? What doesn't change?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;His answers surprised me. A few highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;On vibe coding:&lt;/strong&gt; Adrian has thoughts about developers treating scraper generation as a black box, and why Scrapy's design philosophy matters more, not less, when an LLM is writing the code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The bottleneck isn't what you think.&lt;/strong&gt; He argues the hard part of scraping in 2026 isn't writing code. It's reading pages. And that's the part LLMs still struggle with.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What "good design meeting the future halfway" means.&lt;/strong&gt; Why frameworks like Scrapy that were built for humans are turning out to be the best frameworks for AI agents too.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Where LLMs actually help.&lt;/strong&gt; The concrete places where AI makes a scraper developer's life better, and where it just adds complexity.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full conversation is on the Zyte blog, linked below. If you're building scrapers, thinking about adding AI to your extraction pipeline, or just curious what someone who's been maintaining one of the most widely used scraping frameworks for years thinks about all of this, it's worth a read.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.zyte.com/blog/adrian-chaves-web-scraping-copilot-interview/" rel="noopener noreferrer"&gt;Read the full interview on zyte.com&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Happy to discuss in the comments. &lt;br&gt;
&lt;strong&gt;What are you using AI for in your scraping workflow right now, and where have you hit walls?&lt;/strong&gt;&lt;/p&gt;


</description>
      <category>ai</category>
      <category>opensource</category>
      <category>claude</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Stop using Python `requests` for web scraping: there are better &amp; modern libraries instead</title>
      <dc:creator>Ayan Pahwa</dc:creator>
      <pubDate>Thu, 09 Apr 2026 11:09:31 +0000</pubDate>
      <link>https://dev.to/extractdata/stop-using-python-requests-for-web-scraping-there-are-better-modern-libraries-instead-500d</link>
      <guid>https://dev.to/extractdata/stop-using-python-requests-for-web-scraping-there-are-better-modern-libraries-instead-500d</guid>
      <description>&lt;p&gt;While the 'Requests' library remains the default choice for many Python developers due to its reliability and extensive documentation, the Python HTTP landscape has evolved considerably. &lt;/p&gt;

&lt;p&gt;Modern alternatives now offer significant advantages, including built-in asynchronous support, HTTP/2 compatibility, enhanced performance, and up-to-date TLS handling. &lt;/p&gt;

&lt;p&gt;This article introduces and compares three such contemporary clients: HTTPX, curl_cffi, and rnet, detailing their unique features and practical applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with Requests for web scraping
&lt;/h2&gt;

&lt;p&gt;It's worth qualifying Requests' limitations before proceeding: for simple API interactions with well-behaved endpoints, it remains the de facto standard.&lt;/p&gt;

&lt;p&gt;However, a major drawback of the Requests library when it comes to web scraping is its predictable HTTP client fingerprint. This fingerprint, a unique combination of TLS version, cipher suites, HTTP headers, and connection characteristics, is sent with every request, and is well-known and cataloged by anti-bot systems. &lt;/p&gt;

&lt;p&gt;Consequently, if you're interacting with any endpoint, including APIs or services protected by anti-bot vendors, your request can be blocked purely based on &lt;em&gt;how&lt;/em&gt; the &lt;code&gt;requests&lt;/code&gt; library identifies itself. This happens even &lt;em&gt;before&lt;/em&gt; your credentials or payload are scrutinized, highlighting a significant limitation when targeting systems that perform client-side validation.&lt;/p&gt;

&lt;p&gt;In addition to fingerprinting, a major limitation of the &lt;code&gt;requests&lt;/code&gt; library is its lack of native asynchronous support. This is particularly problematic for workloads that involve numerous HTTP requests: without async, the calls execute sequentially, and the program's thread remains blocked for the entire duration of each individual request.&lt;/p&gt;

&lt;p&gt;For straightforward scenarios, the standard &lt;code&gt;requests&lt;/code&gt; API call remains perfectly functional, as demonstrated in a quick example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://jsonplaceholder.typicode.com/posts/1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clean and simple. For a one-off call to a standard REST API, this is fine. The gaps start showing when you need concurrency, HTTP/2, or when the target endpoint does any kind of client validation.&lt;/p&gt;
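&lt;p&gt;Before async-native clients, the standard workaround for this sequential bottleneck was a thread pool. The sketch below is stdlib-only and simulates network latency with &lt;code&gt;time.sleep&lt;/code&gt; instead of making real HTTP calls, so &lt;code&gt;fetch&lt;/code&gt; is a stand-in for a blocking &lt;code&gt;requests.get&lt;/code&gt;:&lt;/p&gt;

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url: str) -> str:
    # Stand-in for a blocking requests.get() call: simulate ~0.2 s of network latency.
    time.sleep(0.2)
    return f"body of {url}"

urls = [f"https://example.com/page/{i}" for i in range(8)]

# Sequential: total time is roughly the sum of all individual request latencies.
start = time.perf_counter()
sequential = [fetch(url) for url in urls]
sequential_time = time.perf_counter() - start

# Thread pool: the blocking calls overlap, so total time approaches
# the latency of the single slowest request.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    threaded = list(pool.map(fetch, urls))
threaded_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

&lt;p&gt;Threads work, but every client below gives you this overlap natively, without managing a pool yourself.&lt;/p&gt;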

&lt;h2&gt;
  
  
  Install the Alternatives
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;httpx       or  uv add https
pip &lt;span class="nb"&gt;install &lt;/span&gt;curl-cffi       or  uv add curl-cffi
pip &lt;span class="nb"&gt;install &lt;/span&gt;rnet        or  uv add rnet &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
                    uv add asyncio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  1. HTTPX
&lt;/h2&gt;

&lt;p&gt;HTTPX is the most direct upgrade from Requests, as its API is nearly identical. If you know Requests, you already know most of HTTPX. What it adds is first-class async support, HTTP/2, and a more modern internal architecture.&lt;/p&gt;

&lt;p&gt;Where it differs from Requests is the explicit use of a &lt;code&gt;Client&lt;/code&gt; context manager (strongly recommended over module-level function calls) and the &lt;code&gt;AsyncClient&lt;/code&gt; for async usage. This gives you connection pooling and proper resource cleanup by default.&lt;/p&gt;

&lt;p&gt;HTTPX is the right starting point if you're looking for a migration that requires minimal code changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example: Sync
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://jsonplaceholder.typicode.com/posts/1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example: Async (calling the Zyte API)
&lt;/h3&gt;

&lt;p&gt;Async is where HTTPX really earns its keep. Here it's used to fire multiple requests to the Zyte API concurrently: each request blocks on the server side until extraction is complete, but your event loop stays free to send others in parallel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ZYTE_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;ENDPOINT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.zyte.com/v1/extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://httpbin.org&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;ENDPOINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;browserHtml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;60.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;browserHtml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;raise_for_status()&lt;/code&gt; raises &lt;code&gt;httpx.HTTPStatusError&lt;/code&gt; on 4xx/5xx responses.
&lt;/li&gt;
&lt;li&gt;HTTP/2 support requires &lt;code&gt;pip install httpx[http2]&lt;/code&gt; and passing &lt;code&gt;http2=True&lt;/code&gt; to the client.
&lt;/li&gt;
&lt;li&gt;The 60-second timeout accounts for the Zyte API's server-side blocking behavior — it holds the connection open until extraction completes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. curl_cffi
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;curl_cffi&lt;/code&gt; wraps &lt;code&gt;libcurl&lt;/code&gt; with Python bindings and adds something HTTPX doesn't have: TLS fingerprint impersonation. It can reproduce the TLS handshake of Chrome, Firefox, Safari, and other browsers. For calls hitting endpoints protected by anti-bot systems, this can be the difference between getting a response and getting a 403.&lt;/p&gt;

&lt;p&gt;The interface closely mirrors Requests, with the addition of the &lt;code&gt;impersonate&lt;/code&gt; parameter. It supports both sync and async usage. For most API calls where fingerprinting isn't a concern, curl_cffi behaves just like Requests; the &lt;code&gt;impersonate&lt;/code&gt; parameter is opt-in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example: Sync
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;curl_cffi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://jsonplaceholder.typicode.com/posts/1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;impersonate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chrome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example: Async (calling the Zyte API)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;curl_cffi.requests&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncSession&lt;/span&gt;

&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ZYTE_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;ENDPOINT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.zyte.com/v1/extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;browserHtml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_zyte_api&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;impersonate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chrome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;ENDPOINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;browserHtml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;call_zyte_api&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;impersonate="chrome"&lt;/code&gt; sends Chrome's TLS fingerprint on every request made through this session.
&lt;/li&gt;
&lt;li&gt;Other supported values include &lt;code&gt;"firefox"&lt;/code&gt;, &lt;code&gt;"safari"&lt;/code&gt;, and version-pinned targets such as &lt;code&gt;"chrome110"&lt;/code&gt;; check the &lt;code&gt;curl-cffi&lt;/code&gt; docs for the full list.
&lt;/li&gt;
&lt;li&gt;The sync interface (&lt;code&gt;from curl_cffi import requests&lt;/code&gt;) is nearly identical to the &lt;code&gt;requests&lt;/code&gt; module, making it the easiest drop-in if you only need sync.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. rnet
&lt;/h2&gt;

&lt;p&gt;rnet is the newest of the three. Like much of the modern Python ecosystem, it's built on a Rust core, making it async-first and performance-oriented. Like curl_cffi, it supports TLS impersonation, but its primary differentiator is throughput: it is designed for high-concurrency workloads where you're firing many requests simultaneously.&lt;/p&gt;

&lt;p&gt;The API surface is different from Requests, so it's not a drop-in replacement. But the patterns are clean and modern, and for async-heavy workloads it's worth the minor adjustment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example: Async with TLS impersonation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rnet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Impersonate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Build a client
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;impersonate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Impersonate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Firefox139&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Use the API you're already familiar with
&lt;/span&gt;    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tls.peet.ws/api/all&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Print the response
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;rnet&lt;/code&gt; is async-first; sync support is limited.
&lt;/li&gt;
&lt;li&gt;Response body methods like &lt;code&gt;.json()&lt;/code&gt; and &lt;code&gt;.text()&lt;/code&gt; are awaitable.
&lt;/li&gt;
&lt;li&gt;The Rust core makes it particularly well-suited for high-throughput concurrent workloads.&lt;/li&gt;
&lt;/ul&gt;
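
&lt;p&gt;One caveat for the high-throughput workloads rnet targets: an unbounded &lt;code&gt;asyncio.gather&lt;/code&gt; over thousands of URLs can exhaust sockets or hammer the target. A common pattern is to cap concurrency with &lt;code&gt;asyncio.Semaphore&lt;/code&gt;. The sketch below is stdlib-only, with a simulated &lt;code&gt;fetch&lt;/code&gt; standing in for a real client call:&lt;/p&gt;

```python
import asyncio

MAX_CONCURRENCY = 3
in_flight = 0
peak = 0

async def fetch(url: str) -> str:
    # Stand-in for an awaited client.get(url); tracks how many calls run at once.
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.05)  # simulated network latency
    in_flight -= 1
    return f"body of {url}"

async def bounded_fetch(sem: asyncio.Semaphore, url: str) -> str:
    # The semaphore caps how many fetches are in flight at any moment.
    async with sem:
        return await fetch(url)

async def main() -> list:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    urls = [f"https://example.com/page/{i}" for i in range(10)]
    return await asyncio.gather(*(bounded_fetch(sem, url) for url in urls))

results = asyncio.run(main())
print(f"{len(results)} responses, peak concurrency: {peak}")
```

&lt;p&gt;The same bounding pattern applies unchanged to HTTPX and curl_cffi async sessions.&lt;/p&gt;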

&lt;h2&gt;
  
  
  Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Requests&lt;/th&gt;
&lt;th&gt;HTTPX&lt;/th&gt;
&lt;th&gt;curl_cffi&lt;/th&gt;
&lt;th&gt;rnet&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sync Support&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;⚠️ Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Async support&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;✅ Yes (primary)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTTP/2&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;✅ With extra dependencies&lt;/td&gt;
&lt;td&gt;✅ Via libcurl&lt;/td&gt;
&lt;td&gt;✅ Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Good–High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TLS fingerprint impersonation&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  When to use which
&lt;/h2&gt;

&lt;p&gt;Use Requests for simple, one-off scripts, internal tooling, or any situation where you're hitting a cooperative API endpoint and don't need concurrency. Nothing wrong with it in that context.&lt;/p&gt;

&lt;p&gt;Use HTTPX when you need async, want the closest migration path from Requests, or need HTTP/2. It's the safest default upgrade for most projects.&lt;/p&gt;

&lt;p&gt;Use curl_cffi when TLS fingerprint control matters: when you're hitting an anti-ban wall, an API with strict client validation, or any service that checks how a client identifies itself at the TLS layer.&lt;/p&gt;

&lt;p&gt;Use rnet when raw async performance is the priority. Its Rust foundation makes it the strongest choice for high-concurrency workloads where you're firing many requests simultaneously and need low overhead.&lt;/p&gt;
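&lt;p&gt;The shape of that workload can be sketched with nothing but the standard library. Here &lt;code&gt;fetch&lt;/code&gt; is a hypothetical stand-in for a real client call (an rnet request in practice), and the semaphore caps how many requests are in flight at once:&lt;/p&gt;

```python
import asyncio

async def fetch(url, semaphore):
    # Stand-in for a real client call (e.g. an rnet request).
    async with semaphore:
        await asyncio.sleep(0.01)  # simulate network latency
        return {"url": url, "status": 200}

async def crawl(urls, max_concurrency=50):
    # Cap in-flight requests so hundreds of URLs don't all open at once.
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [fetch(u, semaphore) for u in urls]
    return await asyncio.gather(*tasks)

urls = [f"https://example.com/page/{i}" for i in range(200)]
results = asyncio.run(crawl(urls))
print(len(results))  # 200
```

&lt;p&gt;The gather-plus-semaphore pattern is where a fast client pays off: the lower the per-request overhead, the higher you can push &lt;code&gt;max_concurrency&lt;/code&gt; before the event loop becomes the bottleneck.&lt;/p&gt;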

&lt;p&gt;The right choice comes down to three factors: how much concurrency you need, how sensitive the target endpoint is to client identification, and how closely the new code needs to resemble your existing &lt;code&gt;requests&lt;/code&gt; implementation.&lt;/p&gt;

</description>
      <category>python</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Writing production-ready Scrapy spiders with opencode</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 08 Apr 2026 09:19:39 +0000</pubDate>
      <link>https://dev.to/extractdata/writing-production-ready-scrapy-spiders-with-opencode-ea2</link>
      <guid>https://dev.to/extractdata/writing-production-ready-scrapy-spiders-with-opencode-ea2</guid>
      <description>&lt;p&gt;AI-enabled code editors can now conjure scraping code on command. But anyone who has used a generic coding agent to build a spider knows what comes next: a plausible-looking file that falls apart the moment it hits a real website. The selectors are fragile, the error handling is missing, and the structure ignores everything Scrapy actually expects from production code.&lt;/p&gt;

&lt;p&gt;The problem is not the AI. It's the prompts, the context, and knowing where to let the agent drive and where to stay in control. This article walks through using &lt;a href="https://opencode.ai" rel="noopener noreferrer"&gt;opencode&lt;/a&gt; to build Scrapy spiders that are actually deployable, covering setup, the prompts that work, and the pitfalls that will burn you if you are not careful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx2ycq5iahc2vpvushw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx2ycq5iahc2vpvushw9.png" alt="building spiders with opencode"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why opencode works well for scraping projects
&lt;/h2&gt;

&lt;p&gt;Most AI coding agents are designed around general-purpose software projects. opencode differs in a few important ways: it is terminal-native, model-agnostic, and designed to operate inside your actual working directory. It reads your project, understands your file structure, and writes code into the files that already exist rather than pasting snippets into a chat window.&lt;/p&gt;

&lt;p&gt;For Scrapy projects specifically, this matters. A spider is not a standalone script. It depends on items, settings, middlewares, pipelines, and page objects. An agent that can see all of those files at once produces far better output than one operating on a blank context.&lt;/p&gt;

&lt;p&gt;opencode also supports custom commands stored as Markdown files. That means you can encode your own Scrapy conventions as reusable prompts and call them every time you start a new spider, without retyping the same context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting set up
&lt;/h2&gt;

&lt;p&gt;Install opencode with the one-liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://opencode.ai/install | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On macOS and Linux, the Homebrew tap gives you the fastest updates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;anomalyco/tap/opencode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Windows, use WSL for the best experience. The &lt;code&gt;choco install opencode&lt;/code&gt; path works, but the terminal experience is noticeably smoother inside a Linux environment.&lt;/p&gt;

&lt;p&gt;Once installed, connect your model provider. The &lt;code&gt;/connect&lt;/code&gt; command in the terminal user interface walks you through it. If you want to avoid managing API keys from multiple providers, opencode Zen gives you a curated set of pre-tested models through a single subscription at &lt;a href="https://opencode.ai/auth" rel="noopener noreferrer"&gt;opencode.ai/auth&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For scraping work, choose a model with a large context window. Spider files, page objects, items, and a sample HTML fixture can easily fill 20,000 tokens before you have written a single prompt. Models with at least 64k context are the practical minimum.&lt;/p&gt;
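&lt;p&gt;A quick way to sanity-check that budget is the rough rule of thumb of about four characters per token for English text. The ratio is an approximation (real tokenizers vary), so treat this as a budget check rather than a count:&lt;/p&gt;

```python
def estimate_tokens(text, chars_per_token=4):
    # Rough heuristic: ~4 characters per token for English-ish text.
    # Real tokenizers vary, so treat this as a budget check, not a count.
    return len(text) // chars_per_token

spider_code = "x" * 8_000      # a medium spider file (placeholder content)
html_fixture = "y" * 60_000    # one saved product page (placeholder content)
budget = 64_000                # smallest practical context window

used = estimate_tokens(spider_code) + estimate_tokens(html_fixture)
print(used, used / budget)  # 17000 0.265625
```

&lt;p&gt;One spider file plus a single HTML fixture already consumes over a quarter of a 64k window before any conversation happens, which is why larger context windows pay for themselves in scraping sessions.&lt;/p&gt;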

&lt;h3&gt;
  
  
  Initialize your Scrapy project first
&lt;/h3&gt;

&lt;p&gt;Before you open opencode, scaffold your Scrapy project as you normally would:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy startproject myproject
&lt;span class="nb"&gt;cd &lt;/span&gt;myproject
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then initialize opencode inside the project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;opencode init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates an &lt;code&gt;AGENTS.md&lt;/code&gt; file. Commit it. opencode reads this file on every session to understand how your project is structured. Fill it with the conventions your project follows: which item classes exist, which middlewares are active, whether you are using &lt;a href="https://scrapy-poet.readthedocs.io/" rel="noopener noreferrer"&gt;scrapy-poet&lt;/a&gt; page objects, and which version of &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; or other HTTP backends you are using. The more context &lt;code&gt;AGENTS.md&lt;/code&gt; carries, the less you repeat yourself in prompts.&lt;/p&gt;

&lt;p&gt;A minimal &lt;code&gt;AGENTS.md&lt;/code&gt; for a Scrapy project looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Project conventions&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Python 3.12, Scrapy 2.12
&lt;span class="p"&gt;-&lt;/span&gt; All spiders use scrapy-poet page objects (never parse in the spider class itself)
&lt;span class="p"&gt;-&lt;/span&gt; Item classes are defined in items.py using dataclasses
&lt;span class="p"&gt;-&lt;/span&gt; Zyte API is configured via scrapy-zyte-api; ZYTE_API_KEY is in .env
&lt;span class="p"&gt;-&lt;/span&gt; Settings live in settings.py; never hardcode values in spider files
&lt;span class="p"&gt;-&lt;/span&gt; All spiders output to JSON Lines via FEEDS setting
&lt;span class="p"&gt;-&lt;/span&gt; Test fixtures live in tests/fixtures/ as .html files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The prompts that actually work
&lt;/h2&gt;

&lt;p&gt;Generic prompts produce generic code. The prompts below are tested patterns that produce Scrapy-idiomatic output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Starting a new spider
&lt;/h3&gt;

&lt;p&gt;The most common mistake is asking opencode to "write a spider for X." That produces a working script, not a Scrapy spider. Be specific about structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Create&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Scrapy&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;books&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;toscrape&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Uses&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;poet&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt; &lt;span class="n"&gt;called&lt;/span&gt; &lt;span class="n"&gt;BookListPage&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;BookDetailPage&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Extracts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;star&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="n"&gt;URL&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Handles&lt;/span&gt; &lt;span class="n"&gt;pagination&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="n"&gt;following&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Stores&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;BookItem&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Does&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;put&lt;/span&gt; &lt;span class="nb"&gt;any&lt;/span&gt; &lt;span class="n"&gt;CSS&lt;/span&gt; &lt;span class="n"&gt;selector&lt;/span&gt; &lt;span class="n"&gt;logic&lt;/span&gt; &lt;span class="n"&gt;inside&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;itself&lt;/span&gt;

&lt;span class="n"&gt;Start&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="n"&gt;objects&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt; &lt;span class="n"&gt;write&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;spiders&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;books&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The explicit constraint against putting selectors in the spider class is important. Without it, the agent will inline everything, which defeats scrapy-poet's purpose and makes the code harder to test.&lt;/p&gt;
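&lt;p&gt;To make that separation concrete, here is a stdlib-only sketch of the idea: all parsing lives in a page-object-like class that can be fed a saved HTML fixture in tests. The class and field names mirror the prompt above, but this is an illustration of the pattern, not the actual scrapy-poet API (real projects would use web-poet page objects and parsel selectors):&lt;/p&gt;

```python
from dataclasses import dataclass
from html.parser import HTMLParser

@dataclass
class BookItem:
    title: str
    price: float

class BookDetailPage(HTMLParser):
    # All parsing lives here, so it can be unit-tested against a saved
    # HTML fixture without running a crawl. (Stdlib-only illustration;
    # a real project would use scrapy-poet and parsel instead.)
    def __init__(self):
        super().__init__()
        self._field = None
        self.title = ""
        self.price_text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._field = "title"
        elif ("class", "price") in attrs:
            self._field = "price"

    def handle_data(self, data):
        if self._field == "title":
            self.title += data
        elif self._field == "price":
            self.price_text += data

    def handle_endtag(self, tag):
        self._field = None

    def to_item(self):
        return BookItem(self.title.strip(),
                        float(self.price_text.strip().lstrip("£")))
```

&lt;p&gt;In a test you instantiate the class, call &lt;code&gt;feed()&lt;/code&gt; with the fixture HTML, and assert on the result of &lt;code&gt;to_item()&lt;/code&gt;; the spider never needs to run.&lt;/p&gt;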

&lt;h3&gt;
  
  
  Asking for resilient selectors
&lt;/h3&gt;

&lt;p&gt;Generated selectors are often too specific. They target a class that is only present on one layout variant, or chain through five levels of nesting that will break on the next site deploy.&lt;/p&gt;

&lt;p&gt;Prompt the agent to justify its selector choices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight css"&gt;&lt;code&gt;&lt;span class="nt"&gt;Write&lt;/span&gt; &lt;span class="nt"&gt;the&lt;/span&gt; &lt;span class="nt"&gt;CSS&lt;/span&gt; &lt;span class="nt"&gt;selectors&lt;/span&gt; &lt;span class="nt"&gt;for&lt;/span&gt; &lt;span class="nt"&gt;BookDetailPage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;For&lt;/span&gt; &lt;span class="nt"&gt;each&lt;/span&gt; &lt;span class="nt"&gt;field&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;explain&lt;/span&gt; &lt;span class="nt"&gt;why&lt;/span&gt; &lt;span class="nt"&gt;you&lt;/span&gt; &lt;span class="nt"&gt;chose&lt;/span&gt;
&lt;span class="nt"&gt;that&lt;/span&gt; &lt;span class="nt"&gt;selector&lt;/span&gt; &lt;span class="nt"&gt;over&lt;/span&gt; &lt;span class="nt"&gt;alternatives&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;Prefer&lt;/span&gt; &lt;span class="nt"&gt;attribute-based&lt;/span&gt; &lt;span class="nt"&gt;selectors&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;like&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;itemprop&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="nt"&gt;or&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;data-&lt;/span&gt;&lt;span class="o"&gt;*])&lt;/span&gt; &lt;span class="nt"&gt;over&lt;/span&gt; &lt;span class="nt"&gt;class&lt;/span&gt; &lt;span class="nt"&gt;names&lt;/span&gt; &lt;span class="nt"&gt;where&lt;/span&gt; &lt;span class="nt"&gt;both&lt;/span&gt; &lt;span class="nt"&gt;options&lt;/span&gt; &lt;span class="nt"&gt;exist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces more defensive selectors and, more importantly, gives you enough reasoning to judge whether to accept them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding error handling
&lt;/h3&gt;

&lt;p&gt;The agent will skip error handling unless you ask for it explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Add&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="n"&gt;handling&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;BookDetailPage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;If&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;warning&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;None &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;do&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;an&lt;/span&gt; &lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;If&lt;/span&gt; &lt;span class="n"&gt;star&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="n"&gt;cannot&lt;/span&gt; &lt;span class="n"&gt;be&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Add&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;around&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;availability&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parsing&lt;/span&gt; &lt;span class="n"&gt;fails&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Never assume the agent will add graceful degradation on its own. It optimizes for the happy path.&lt;/p&gt;
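&lt;p&gt;What that graceful degradation looks like in practice, as a hedged sketch (the field formats and pound-sign prices are assumptions modelled on books.toscrape.com):&lt;/p&gt;

```python
import logging

logger = logging.getLogger("bookspider")

def parse_price(raw):
    # Missing or unparseable price: warn and return None instead of
    # raising, so one bad page doesn't kill the whole crawl.
    if raw is None:
        logger.warning("price missing")
        return None
    try:
        return float(raw.strip().lstrip("£$"))
    except ValueError:
        logger.warning("unparseable price: %r", raw)
        return None

def parse_star_rating(raw):
    # Word-to-number map for class names like "star-rating Three";
    # default to 0 whenever the rating can't be recognised.
    words = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
    last = (raw or "").strip().split(" ")[-1].lower()
    return words.get(last, 0)

print(parse_price("£51.77"), parse_price("N/A"), parse_star_rating("star-rating Three"))
# 51.77 None 3
```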

&lt;h3&gt;
  
  
  Writing tests
&lt;/h3&gt;

&lt;p&gt;opencode is genuinely useful for generating pytest fixtures and test scaffolding. Give it a concrete fixture to work from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Write&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;BookDetailPage&lt;/span&gt; &lt;span class="n"&gt;using&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;HTML&lt;/span&gt; &lt;span class="n"&gt;fixture&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt;
&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;fixtures&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;book_detail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;Test&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;extracted&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;non&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="n"&gt;greater&lt;/span&gt; &lt;span class="n"&gt;than&lt;/span&gt; &lt;span class="n"&gt;zero&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;availability&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;In stock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Out of stock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;star_rating&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;an&lt;/span&gt; &lt;span class="n"&gt;integer&lt;/span&gt; &lt;span class="n"&gt;between&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;

&lt;span class="n"&gt;Use&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt; &lt;span class="n"&gt;parametrize&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;testing&lt;/span&gt; &lt;span class="n"&gt;multiple&lt;/span&gt; &lt;span class="n"&gt;fixture&lt;/span&gt; &lt;span class="n"&gt;variants&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pitfalls to watch for
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The agent assumes the HTML is static
&lt;/h3&gt;

&lt;p&gt;By default, any spider the agent generates will use &lt;code&gt;response.css()&lt;/code&gt; or &lt;code&gt;response.xpath()&lt;/code&gt; on raw HTML. If your target site renders content with JavaScript, those selectors return nothing. Before you run any generated spider, check whether the target page is JavaScript-rendered by viewing source in your browser. If the data you need is absent from the raw HTML, prompt the agent to use &lt;a href="https://www.zyte.com/zyte-api/headless-browser/" rel="noopener noreferrer"&gt;Zyte API's headless browser&lt;/a&gt; or a Playwright download handler instead of a plain HTTP request.&lt;/p&gt;
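&lt;p&gt;The view-source check can be turned into a small helper: fetch the page once without JavaScript, then test whether values you can see in the rendered browser page actually appear in the raw HTML. A sketch, with hypothetical sample values:&lt;/p&gt;

```python
def missing_from_raw_html(raw_html, expected_values):
    # Values visible in the browser but absent from the raw HTML are a
    # strong sign the page is JavaScript-rendered and needs a browser
    # (or an API that renders for you) rather than a plain HTTP request.
    return [value for value in expected_values if value not in raw_html]

raw = '{"app": "loading..."}'  # what a plain GET returned (hypothetical)
missing = missing_from_raw_html(raw, ["£51.77", "In stock", "A Light in the Attic"])
print(missing)  # all three fields are missing, so the page needs rendering
```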

&lt;h3&gt;
  
  
  Selectors written against one page break on others
&lt;/h3&gt;

&lt;p&gt;The agent writes selectors against whatever HTML you give it. If you paste a single product page, it will produce selectors that work on that product page. Run the spider against 10 or 20 URLs from the same site before treating the selectors as reliable.&lt;/p&gt;

&lt;p&gt;Ask the agent to help you validate coverage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;Here are three different product page HTML snippets from the same site (pasted below).
Identify any selectors in BookDetailPage that would fail on snippet 2 or snippet 3,
and suggest more robust alternatives.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Context window exhaustion mid-session
&lt;/h3&gt;

&lt;p&gt;Long sessions that involve large HTML files, multiple spider files, and back-and-forth debugging will eventually exhaust the model's context. When this happens, the agent starts contradicting earlier decisions or forgetting your project conventions.&lt;/p&gt;

&lt;p&gt;The fix is to keep sessions short and focused. One session per spider, or one session per refactor task. Use your &lt;code&gt;AGENTS.md&lt;/code&gt; to carry conventions across sessions rather than re-explaining them in chat.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generated settings override your existing configuration
&lt;/h3&gt;

&lt;p&gt;When the agent writes setup instructions, it often suggests adding settings directly to &lt;code&gt;settings.py&lt;/code&gt;. If you already have a settings file, this can clobber existing values or introduce conflicts. Review every settings change the agent proposes before accepting it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The agent does not know about anti-bot measures
&lt;/h3&gt;

&lt;p&gt;opencode has no knowledge of whether a site actively blocks scrapers. It will happily generate a spider that will be blocked immediately in production. Anti-bot handling, rate limiting, and request fingerprinting are your responsibility to layer in. &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; handles the blocking and fingerprinting side; you still need to configure the integration yourself rather than expecting the agent to know it is necessary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful custom commands for scraping
&lt;/h2&gt;

&lt;p&gt;opencode custom commands let you encode reusable prompts as Markdown files in &lt;code&gt;~/.config/opencode/commands/&lt;/code&gt;. Here are three worth setting up for any Scrapy workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;user:new-spider&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# New scrapy-poet spider&lt;/span&gt;

Create a new Scrapy spider for the URL provided by the user.
&lt;span class="p"&gt;-&lt;/span&gt; Use scrapy-poet page objects (list page + detail page if applicable)
&lt;span class="p"&gt;-&lt;/span&gt; Put all selector logic in page objects, nothing in the spider class
&lt;span class="p"&gt;-&lt;/span&gt; Use item dataclasses from items.py (create new ones if needed)
&lt;span class="p"&gt;-&lt;/span&gt; Include pagination handling
&lt;span class="p"&gt;-&lt;/span&gt; Add logging for missing fields (warning level)
&lt;span class="p"&gt;-&lt;/span&gt; Write page objects first, then the spider

Ask the user for the target URL before starting.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;user:harden-selectors&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Harden selectors&lt;/span&gt;

Review the page objects in the current file. For each CSS or XPath selector:
&lt;span class="p"&gt;1.&lt;/span&gt; Identify whether it targets a class, ID, tag, or attribute
&lt;span class="p"&gt;2.&lt;/span&gt; If it targets a class name, suggest an attribute-based or structural alternative
&lt;span class="p"&gt;3.&lt;/span&gt; Flag any selectors that chain more than three levels deep as fragile

Output a revised version of the file with improved selectors and inline comments
explaining each change.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;user:gen-tests&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Generate pytest tests&lt;/span&gt;

Given a page object file and an HTML fixture provided by the user:
&lt;span class="p"&gt;1.&lt;/span&gt; Write a pytest test file that covers all extracted fields
&lt;span class="p"&gt;2.&lt;/span&gt; Test that required fields are non-null and the correct type
&lt;span class="p"&gt;3.&lt;/span&gt; Test that optional fields handle absence gracefully (None, not exception)
&lt;span class="p"&gt;4.&lt;/span&gt; Use parametrize if multiple fixture variants are present

Ask for the fixture file path before starting.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Where opencode fits in the workflow
&lt;/h2&gt;

&lt;p&gt;Think of opencode as a fast first-draft tool, not an autonomous spider factory. The right workflow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scaffold the project and write &lt;code&gt;AGENTS.md&lt;/code&gt; manually&lt;/li&gt;
&lt;li&gt;Use opencode to generate page objects and the spider skeleton&lt;/li&gt;
&lt;li&gt;Review every selector by hand before trusting it&lt;/li&gt;
&lt;li&gt;Run the spider against a sample of real URLs and inspect the output&lt;/li&gt;
&lt;li&gt;Use opencode to patch failures and write tests&lt;/li&gt;
&lt;li&gt;Handle anti-bot, rate limiting, and deployment yourself&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The agent saves the most time on the repetitive structural work: boilerplate item classes, pagination logic, field extraction scaffolding, and test stubs. The judgment calls around which selectors are robust, whether a site is JavaScript-rendered, and how to handle blocking remain entirely in your hands.&lt;/p&gt;

&lt;p&gt;That division of labor is what makes this approach work at production scale rather than just for prototypes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;Install opencode, initialize it in an existing Scrapy project, and start with the &lt;code&gt;user:new-spider&lt;/code&gt; custom command above. Pick a publicly accessible, static site like &lt;a href="https://books.toscrape.com" rel="noopener noreferrer"&gt;books.toscrape.com&lt;/a&gt; to test the workflow before applying it to a site with more complexity.&lt;/p&gt;

&lt;p&gt;For the JavaScript-rendered sites and anything with active anti-bot measures, pair opencode's code generation with &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; to handle the access layer. You can &lt;a href="https://app.zyte.com/account/signup/zyteapi" rel="noopener noreferrer"&gt;sign up for a free trial&lt;/a&gt; and have a working integration running in minutes. The &lt;a href="https://docs.zyte.com/" rel="noopener noreferrer"&gt;Zyte documentation&lt;/a&gt; covers the scrapy-zyte-api configuration in detail.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>ai</category>
      <category>opencode</category>
      <category>scrapy</category>
    </item>
    <item>
      <title>Small models, big ideas: what Google Gemma and MoE mean for developers</title>
      <dc:creator>Ayan Pahwa</dc:creator>
      <pubDate>Tue, 07 Apr 2026 12:50:03 +0000</pubDate>
      <link>https://dev.to/extractdata/small-models-big-ideas-what-google-gemma-and-moe-mean-for-developers-3038</link>
      <guid>https://dev.to/extractdata/small-models-big-ideas-what-google-gemma-and-moe-mean-for-developers-3038</guid>
      <description>&lt;p&gt;We at zyte-devrel try to stay plugged into what is happening in the AI and developer tooling space, not just because it is interesting, but because a lot of it starts having real implications for how we build and think about web data pipelines. Lately, one development that has had us genuinely curious is Google's new &lt;a href="https://deepmind.google/models/gemma/" rel="noopener noreferrer"&gt;Gemma 4&lt;/a&gt; model family, and specifically the direction it points toward with Mixture of Experts (MoE) architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3g8mot3kvpvdj01sf8s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3g8mot3kvpvdj01sf8s.jpg" alt="IGemma 4 on iPhone" width="800" height="1517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is not a deep tutorial. It is more of a "hey, here is what we have been poking at" update, the kind we would share in a Slack channel or over coffee. If you want to join discussions like this, our &lt;a href="https://discord.com/invite/DwTnbrm83s" rel="noopener noreferrer"&gt;Discord is always a welcoming place&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Gemma 4?
&lt;/h2&gt;

&lt;p&gt;Gemma models are often described as stripped-down versions of Google Gemini. The new Gemma 4 is Google's latest family of open-weight language models, released last week. The lineup covers four sizes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2B&lt;/strong&gt;: ultra-efficient, built for mobile and edge devices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4B&lt;/strong&gt;: enhanced multimodal capabilities, still edge-deployable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;26B&lt;/strong&gt;: sparse model using Mixture of Experts architecture (more on this below)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;31B&lt;/strong&gt;: dense model for more demanding tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All four variants support multimodal input (text and images), over 140 languages, a 128K-256K token context window, and agentic workflows with tool use and JSON output. The 2B and 4B models are specifically designed to run fully offline on modern edge devices like smartphones, with no internet dependency at all.&lt;/p&gt;

&lt;p&gt;According to Google's &lt;a href="https://deepmind.google/models/gemma/" rel="noopener noreferrer"&gt;Gemma 4 model page&lt;/a&gt;, the family ranks third among open-weight models on the LM Arena leaderboard and uses 2.5 times fewer tokens than comparable models for equivalent tasks. &lt;/p&gt;

&lt;p&gt;The Gemma 4 26B especially caught my attention because, unlike the other variants, it is built on the MoE architecture, and that makes a real difference:&lt;/p&gt;

&lt;h2&gt;
  
  
  What is MoE, and why does it matter?
&lt;/h2&gt;

&lt;p&gt;Mixture of Experts (MoE) is one of those ideas that sounds complex but is actually pretty intuitive once you hear the analogy.&lt;/p&gt;

&lt;p&gt;In a traditional dense neural network, every parameter in the model activates for every input. It is like calling your entire company into a meeting every time someone has a question. It works, but it is expensive.&lt;/p&gt;

&lt;p&gt;MoE works differently. Instead of one large model doing everything, you have a set of smaller "expert" sub-networks, each specialized in different patterns, plus a router that looks at each incoming token and decides which one or two experts to activate. Most of the model sits idle at any given moment.&lt;/p&gt;

&lt;p&gt;The result: you get the quality of a much larger model at a fraction of the inference cost.&lt;/p&gt;

&lt;p&gt;The Gemma 4 26B model is a great illustration of this. It has 26 billion total parameters, but during inference it only activates around 3.8 billion of them. You get near-26B quality at roughly 3.8B compute cost. That is the MoE advantage in one number.&lt;/p&gt;
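&lt;p&gt;If you want to see the routing idea in code, here is a toy sketch in Python (an illustration of top-k gating in general, not Gemma's actual implementation; the expert count and gate scores are made up):&lt;/p&gt;

```python
import math

def softmax(scores):
    # Numerically stable softmax over the gate scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_scores, k=2):
    """Pick the top-k experts for one token and renormalize their weights."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weight_sum = sum(probs[i] for i in top)
    return [(i, probs[i] / weight_sum) for i in top]

# 8 experts, but only 2 run for this token; the other 6 stay idle.
chosen = route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
```

&lt;p&gt;Only the chosen experts' parameters are touched per token, which is exactly where the compute savings come from.&lt;/p&gt;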

&lt;p&gt;Other models that take the same approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mixtral 8x7B&lt;/strong&gt;: eight experts, two active per token; it &lt;a href="https://huggingface.co/blog/moe" rel="noopener noreferrer"&gt;outperforms Llama 2 70B&lt;/a&gt; on most benchmarks at far lower inference cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kimi&lt;/strong&gt;: Moonshot AI's model, also MoE-based, has been making similar waves in the open-model space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a deep dive on how MoE works under the hood, the &lt;a href="https://huggingface.co/blog/moe" rel="noopener noreferrer"&gt;Hugging Face guide to mixture of experts&lt;/a&gt; is well worth the read.&lt;/p&gt;

&lt;p&gt;Since the models are free, if you have the right machine you can host them locally using &lt;a href="https://ollama.com/library/gemma4" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; or call them through API services like &lt;a href="https://openrouter.ai/google/gemma-4-26b-a4b-it" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt;. &lt;/p&gt;
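&lt;p&gt;If you go the Ollama route, the local server exposes a small HTTP API. Here is a minimal sketch (the &lt;code&gt;gemma4&lt;/code&gt; model tag is an assumption based on the Ollama library link above; check &lt;code&gt;ollama list&lt;/code&gt; for the tag you actually pulled):&lt;/p&gt;

```python
import json
import urllib.request

def build_payload(prompt, model="gemma4"):
    # "stream": False asks Ollama for one complete JSON response.
    return {"model": model, "prompt": prompt, "stream": False}

def ollama_generate(prompt, model="gemma4", host="http://localhost:11434"):
    """Send one prompt to a locally running Ollama server and return its reply."""
    req = urllib.request.Request(
        host + "/api/generate",
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Needs `ollama serve` running with the model pulled first, e.g.:
# print(ollama_generate("Explain MoE routing in one sentence."))
```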

&lt;p&gt;My preferred way of trying a new model is through &lt;a href="https://claude.ai/login" rel="noopener noreferrer"&gt;Claude&lt;/a&gt;, but Gemma 4 uses a different tool-calling structure, so it is not compatible yet. You can use it with &lt;a href="https://lmstudio.ai/models/gemma-4" rel="noopener noreferrer"&gt;LM Studio&lt;/a&gt;, or skip all that, because you can now&lt;/p&gt;

&lt;h2&gt;
  
  
  Run Gemma 4 offline on an iPhone
&lt;/h2&gt;

&lt;p&gt;Here is the part worth sharing, because it genuinely surprised us.&lt;/p&gt;

&lt;p&gt;Using the &lt;a href="https://apps.apple.com/us/app/google-ai-edge-gallery/id6749645337" rel="noopener noreferrer"&gt;Google Edge AI Gallery app&lt;/a&gt; from the App Store, you can load a Gemma 4 model and run it with airplane mode on. No API calls, no cloud round-trips, no data leaving the device. Just the model running locally on your phone.&lt;/p&gt;

&lt;p&gt;The experience is not going to replace a frontier model for complex reasoning. But that is not the point. For quick classification, summarization, or just experimenting with local inference, the 2B and 4B variants are remarkably capable, with zero API costs and no data leaving your device. And since it is multimodal, you can practically point your phone camera at a paper receipt and ask it to save the details in a spreadsheet. &lt;/p&gt;

&lt;p&gt;If you have not tried running a local large language model (LLM) yet, this is probably the lowest-friction entry point on hardware you already own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why should developers building data pipelines care?
&lt;/h2&gt;

&lt;p&gt;Here is where it connects back to what a lot of us are building.&lt;/p&gt;

&lt;p&gt;When LLMs run on-device or at the edge, the calculus around data pipelines shifts in a few useful directions:&lt;/p&gt;

&lt;p&gt;Tokens are getting expensive, so a model as good as Gemma 4 or Qwen-3.5 being free and open-weight is a welcome development. Over the last couple of weeks, plenty of people have been complaining about exhausting their Claude usage quota or running up huge bills after handing Opus API keys to OpenClaw. Open models can significantly ease both problems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No API round-trips&lt;/strong&gt;: on-device inference eliminates latency from cloud API calls. For classification tasks running inside a scraping pipeline, this is a meaningful difference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data privacy&lt;/strong&gt;: running extraction locally means scraped content never leaves your infrastructure. For regulated industries or sensitive datasets, that is a significant advantage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost at scale&lt;/strong&gt;: if you are doing high-volume classification — is this a product page? is this content in the target language? — running a small local model beats paying per-token at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge preprocessing&lt;/strong&gt;: a small LLM can filter and classify pages before they ever reach a more expensive cloud model for deeper analysis, and I am personally looking forward to running them on SBCs like a Raspberry Pi.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open weights&lt;/strong&gt;: people often confuse open-weight models with open-source models, and the lines can be blurry. What is clear is that Gemma 4 is available under the Apache 2.0 license, which allows building and selling products on top of it, and open weights let you fine-tune it for your own use case or application. &lt;/li&gt;
&lt;/ul&gt;
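&lt;p&gt;To make the edge-preprocessing point concrete, here is a sketch of a pipeline stage (&lt;code&gt;classify_page&lt;/code&gt; is a stub standing in for a call to a small local model; the labels and pages are invented):&lt;/p&gt;

```python
def classify_page(html):
    """Stub for a small on-device LLM call; returns a coarse page label."""
    # A real version would prompt a local 2B/4B model with the page text.
    return "product" if "price" in html.lower() else "other"

def filter_for_cloud(pages):
    """Forward only product pages to the expensive cloud model."""
    return [url for url, html in pages.items() if classify_page(html) == "product"]

pages = {
    "https://example.com/p/1": "Widget. Price: $9. Add to cart.",
    "https://example.com/about": "About us. Our story since 2010.",
}
to_cloud = filter_for_cloud(pages)
```

&lt;p&gt;The cheap local pass pays for itself whenever most pages get filtered out before the per-token billing starts.&lt;/p&gt;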

&lt;p&gt;Here's me playing with it on my iPhone 16, completely offline:&lt;br&gt;
  &lt;/p&gt;
&lt;div&gt;
    &lt;iframe src="https://www.youtube.com/embed/V88iUYQd4BU"&gt;
    &lt;/iframe&gt;
  &lt;/div&gt;


&lt;h2&gt;
  
  
  Just checking in
&lt;/h2&gt;

&lt;p&gt;We do not have grand proclamations here. This is a space that is moving fast, and we are learning alongside everyone else.&lt;/p&gt;

&lt;p&gt;If you have been experimenting with local LLMs in your scraping or data extraction workflows, we would genuinely love to hear about it. Drop a comment below, find us on the &lt;a href="https://discord.com/invite/DwTnbrm83s" rel="noopener noreferrer"&gt;Zyte Discord&lt;/a&gt;, or read more on the &lt;a href="https://www.zyte.com/blog/" rel="noopener noreferrer"&gt;Zyte blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want to try this yourself, here are three good starting points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://apps.apple.com/us/app/google-ai-edge-gallery/id6749645337" rel="noopener noreferrer"&gt;Google Edge Gallery&lt;/a&gt;: available on the App Store and Playstore, runs Gemma 4 locally on iOS&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/google/gemma" rel="noopener noreferrer"&gt;Gemma models on Hugging Face&lt;/a&gt;: for running on desktop or server&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://deepmind.google/models/gemma/" rel="noopener noreferrer"&gt;Google's Gemma 4 model page&lt;/a&gt;: full family overview, benchmarks, and architecture details&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>google</category>
      <category>llm</category>
      <category>news</category>
    </item>
    <item>
      <title>Build Scrapy spiders in 23.54 seconds with this free Claude skill</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Mon, 30 Mar 2026 17:50:39 +0000</pubDate>
      <link>https://dev.to/extractdata/build-scrapy-spiders-in-2354-seconds-with-this-free-claude-skill-50em</link>
      <guid>https://dev.to/extractdata/build-scrapy-spiders-in-2354-seconds-with-this-free-claude-skill-50em</guid>
      <description>&lt;p&gt;I built a Claude skill that generates &lt;a href="https://docs.scrapy.org/en/latest" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; spiders in under 30 seconds — ready to run, ready to extract good data. In this post I'll walk through what I built, the design decisions behind it, and where I think it can go next.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;The skill takes a single input: a category or product listing URL. From there, Claude generates a complete, runnable Scrapy spider as a single Python script. No project setup, no configuration files, no boilerplate to write. Just a script you can run immediately.&lt;/p&gt;

&lt;p&gt;Here's what that looks like in practice. I opened Claude Code in an empty folder with dependencies installed, activated the skill, and said: "Create a spider for this site" — and pasted a URL.&lt;/p&gt;

&lt;p&gt;Within seconds, the script was generated. I ran it, watched the products roll in, piped the output through &lt;code&gt;jq&lt;/code&gt;, and had clean structured product data. Start to finish: under a minute.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/2pQD412kJIw"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a single-file script, not a full Scrapy project?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.scrapy.org/en/latest/topics/practices.html" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; is usually a full project — multiple files, lots of moving parts, a proper setup process. Running it from a script instead is generally discouraged for production work, but for this use case it's actually the right call.&lt;/p&gt;

&lt;p&gt;The goal here is what I'd call pump-and-dump scraping: give Claude a URL, get a spider, run it for a couple of days, move on. It's not designed to scrape millions of products every day for years. For that kind of scale you need proper infrastructure, robust monitoring, and serious logging. This isn't that — and that's intentional.&lt;/p&gt;

&lt;p&gt;What you do get, even in the single-file approach, is almost everything Scrapy offers: middleware, automatic retries, and concurrency handling. You'd have to build all of that yourself with a plain &lt;code&gt;requests&lt;/code&gt; script. Scrapy gives it to you for free, even when running from a script.&lt;/p&gt;

&lt;h2&gt;
  
  
  The key design decision: AI extraction
&lt;/h2&gt;

&lt;p&gt;The other major call I made was to lean entirely on &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;'s AI extraction rather than generating CSS or XPath selectors.&lt;/p&gt;

&lt;p&gt;Specifically, the skill uses two extraction types chained together: &lt;code&gt;productNavigation&lt;/code&gt; on the category or listing page, which returns product URLs and the next page link, and &lt;code&gt;product&lt;/code&gt; on each product URL, which returns structured product data including name, price, availability, brand, SKU (stock keeping unit), description, images, and more.&lt;/p&gt;
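&lt;p&gt;The chaining is easy to sketch in plain Python. Here &lt;code&gt;zyte_extract&lt;/code&gt; is a stub standing in for the real Zyte API calls (which the skill issues through scrapy-zyte-api), but the &lt;code&gt;productNavigation&lt;/code&gt; to &lt;code&gt;product&lt;/code&gt; flow is the same:&lt;/p&gt;

```python
def zyte_extract(url, kind):
    """Stub for a Zyte API call, returning canned data shaped like the real responses."""
    if kind == "productNavigation":
        return {"items": [{"url": url + "/p1"}, {"url": url + "/p2"}], "nextPage": None}
    return {"name": "Widget", "url": url, "price": "9.99"}

def crawl(listing_url):
    products = []
    page = listing_url
    while page:
        nav = zyte_extract(page, "productNavigation")
        for item in nav["items"]:
            products.append(zyte_extract(item["url"], "product"))
        page = nav["nextPage"]  # falsy once pagination runs out
    return products

items = crawl("https://example.com/category")
```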

&lt;p&gt;This means the spider doesn't need to know anything about the structure of the site it's crawling. There are no selectors to generate, no schema to define, no user confirmation step. The AI on &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte's&lt;/a&gt; end handles all of that. It does cost slightly more than a raw HTTP request, but given how little time it takes to go from URL to working spider, the trade-off makes sense.&lt;/p&gt;

&lt;p&gt;I've hardcoded &lt;code&gt;httpResponseBody&lt;/code&gt; as the extraction source — it's faster and more cost-efficient than browser rendering. If a site is JavaScript-heavy and you're not getting the data you need, you can switch to &lt;code&gt;browserHtml&lt;/code&gt; with a one-line change. The spider logs a warning to remind you of this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The use case is deliberately narrow
&lt;/h2&gt;

&lt;p&gt;This skill is designed for e-commerce sites, and only e-commerce sites. That's not a limitation I stumbled into — it's a feature.&lt;/p&gt;

&lt;p&gt;Because the scope is narrow, the spider structure is simple and predictable: category pages with pagination, product links, and detail pages. &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;'s &lt;code&gt;productNavigation&lt;/code&gt; and &lt;code&gt;product&lt;/code&gt; extraction types handle this reliably. Widening the scope to arbitrary crawling would require a lot more of Scrapy's machinery and would quickly exceed what makes sense for a lightweight script like this.&lt;/p&gt;

&lt;p&gt;What it doesn't do: deep subcategory crawling, link discovery, or full-site crawls. If a page renders all its products without pagination, that works fine — the next page just returns nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logging and output
&lt;/h2&gt;

&lt;p&gt;I replaced &lt;a href="https://docs.scrapy.org/en/latest/topics/practices.html" rel="noopener noreferrer"&gt;Scrapy's&lt;/a&gt; default logging with Rich logging, which gives cleaner terminal output. Scrapy's logs are verbose in ways that aren't useful when you're running a short-lived script — I wanted something concise enough that if something went wrong, it would be obvious at a glance.&lt;/p&gt;

&lt;p&gt;Output goes to a &lt;code&gt;.jsonl&lt;/code&gt; file named after the spider, alongside a plain &lt;code&gt;.log&lt;/code&gt; file. Both are derived from the spider name, which is itself derived from the domain. Run &lt;code&gt;example_com.py&lt;/code&gt;, get &lt;code&gt;example_com.jsonl&lt;/code&gt; and &lt;code&gt;example_com.log&lt;/code&gt;.&lt;/p&gt;
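&lt;p&gt;The naming convention is simple enough to express as a helper (this exact function is hypothetical, but it matches the &lt;code&gt;example_com&lt;/code&gt; pattern described above):&lt;/p&gt;

```python
import re
from urllib.parse import urlparse

def spider_name(url):
    """Derive a filesystem-safe spider name from a URL's domain."""
    domain = urlparse(url).netloc.removeprefix("www.")
    return re.sub(r"[^a-z0-9]+", "_", domain.lower()).strip("_")

name = spider_name("https://www.example.com/category/shoes")
# Output files would then be name + ".jsonl" and name + ".log"
```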

&lt;h2&gt;
  
  
  Where this goes next
&lt;/h2&gt;

&lt;p&gt;The immediate next step I have in mind is selector-based extraction as an alternative path — useful for sites where the AI extraction isn't quite right, or where you want more control over exactly what gets pulled.&lt;/p&gt;

&lt;p&gt;The longer-term vision is running this fully agentically. URLs get submitted somewhere — a queue, a database table, a form — an agent picks them up, builds the spider, and maybe runs a quick validation. The spider then goes into a pool to be run on a schedule, and data lands in a database rather than a flat file. Give Claude access to a virtual private server (VPS) via terminal and most of this is achievable without much extra infrastructure. The skill is already the hard part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Download the skill
&lt;/h2&gt;

&lt;p&gt;The skill is free to download and use. It's a single &lt;code&gt;.skill&lt;/code&gt; file you can install directly into Claude Code. You'll need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy scrapy-zyte-api rich
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ZYTE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.zyte.com/blog/scrapy-in-2026-modern-async-crawling/" rel="noopener noreferrer"&gt;Scrapy 2.13 &lt;/a&gt;or above is required for &lt;code&gt;AsyncCrawlerProcess&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The link to the repo and the skill download are in the video description, and &lt;a href="https://github.com/zytelabs/claude-webscraping-skills" rel="noopener noreferrer"&gt;here&lt;/a&gt;. If you've built something similar, or have thoughts on the design decisions — especially around the extraction approach or the logging setup — I'd love to hear it in the comments. GitHub links to your own scrapers are very welcome too.&lt;/p&gt;

&lt;p&gt;If you're interested in more agentic scraping patterns, I also built a Claude skill that helps spiders recover from excessive bans — you can watch that video &lt;a href="https://youtu.be/bF24BLZWlOk" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>scrapy</category>
      <category>claudeskills</category>
      <category>webscraping</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Built a Self-Healing Web Scraper to Auto-Solve 403s</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Mon, 16 Mar 2026 11:09:31 +0000</pubDate>
      <link>https://dev.to/extractdata/i-built-a-self-healing-web-scraper-to-auto-solve-403s-1jg1</link>
      <guid>https://dev.to/extractdata/i-built-a-self-healing-web-scraper-to-auto-solve-403s-1jg1</guid>
      <description>&lt;p&gt;Web scraping has a recurring enemy: the 403. Sites add bot detection, anti-scraping tools update their challenges, and scrapers that worked fine last week start silently failing. The usual fix is manual — check the logs, diagnose the cause, update the config, redeploy. I wanted to see if an agent could handle that loop instead.&lt;/p&gt;

&lt;p&gt;So I built a self-healing scraper. After each crawl, a Claude-powered agent reads the failure logs, probes the broken domains with escalating fetch strategies, and rewrites the config automatically. By the next run, it's already fixed itself.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/bF24BLZWlOk"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;The project has two parts: a scraper and a self-healing agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  The scraper
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;main.py&lt;/code&gt; is a straightforward Python scraper driven entirely by a &lt;code&gt;config.json&lt;/code&gt; file. Each domain entry tells the scraper which URLs to fetch and how to fetch them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"books"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"zyte"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"browser_html"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"urls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"https://www.bookstocsrape.co.uk/products/..."&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are three fetch modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct&lt;/strong&gt; — a plain &lt;code&gt;requests.get()&lt;/code&gt;. Fast, free, works for sites that don't block bots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; (&lt;code&gt;httpResponseBody&lt;/code&gt;)&lt;/strong&gt; — routes the request through Zyte's residential proxy network. Good for sites that block datacenter IPs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; (&lt;code&gt;browserHtml&lt;/code&gt;)&lt;/strong&gt; — spins up a real browser via Zyte, executes JavaScript, and returns the fully-rendered DOM. Required for sites using JS-based bot challenges.&lt;/li&gt;
&lt;/ul&gt;
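&lt;p&gt;As a sketch, the three modes map to different request shapes. The Zyte API fields (&lt;code&gt;httpResponseBody&lt;/code&gt;, &lt;code&gt;browserHtml&lt;/code&gt;) are real; the dispatcher itself is a simplified stand-in for what &lt;code&gt;main.py&lt;/code&gt; does:&lt;/p&gt;

```python
def build_fetch_plan(entry):
    """Turn one config.json entry into a fetch strategy."""
    if not entry.get("zyte"):
        return {"mode": "direct"}  # plain requests.get()
    if entry.get("browser_html"):
        return {"mode": "zyte", "payload": {"browserHtml": True}}
    return {"mode": "zyte", "payload": {"httpResponseBody": True}}

plan = build_fetch_plan({"id": "books", "zyte": True, "browser_html": False})
```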

&lt;p&gt;Every request is logged to &lt;code&gt;scraper.log&lt;/code&gt; in the same format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-03-14 09:12:01 url=https://... domain_id=scan status=200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a request throws any exception, it's recorded as a 403. That keeps the log clean and gives the agent a consistent signal to act on.&lt;/p&gt;

&lt;h3&gt;
  
  
  The self-healing agent
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;agent.py&lt;/code&gt; is a Claude-powered agent that runs after each crawl. It uses the &lt;a href="https://github.com/anthropics/claude-agent-sdk-python" rel="noopener noreferrer"&gt;Claude Agent SDK&lt;/a&gt; and has access to three tools: &lt;code&gt;Read&lt;/code&gt;, &lt;code&gt;Bash&lt;/code&gt;, and &lt;code&gt;Edit&lt;/code&gt; — enough to operate completely autonomously.&lt;/p&gt;

&lt;p&gt;The agent works through a staged process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Read the log&lt;/strong&gt; — finds every domain that returned a 403&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-reference the config&lt;/strong&gt; — skips domains already configured to use Zyte&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage 1 probe&lt;/strong&gt; — uses the &lt;code&gt;zyte-api&lt;/code&gt; CLI to fetch one URL per failing domain with &lt;code&gt;httpResponseBody&lt;/code&gt;, then inspects the page &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Challenge detection&lt;/strong&gt; — if the title contains phrases like &lt;em&gt;"Just a moment"&lt;/em&gt;, &lt;em&gt;"Checking your browser"&lt;/em&gt;, or &lt;em&gt;"Verifying you are human"&lt;/em&gt;, the page is flagged as a bot challenge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage 2 probe&lt;/strong&gt; — challenge pages are re-probed using &lt;code&gt;browserHtml&lt;/code&gt;, which runs a real browser to bypass JS-based detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Config update&lt;/strong&gt; — the agent edits &lt;code&gt;config.json&lt;/code&gt; directly, setting &lt;code&gt;zyte: true&lt;/code&gt; and/or &lt;code&gt;browserHtml: true&lt;/code&gt; for domains that now work&lt;/li&gt;
&lt;/ol&gt;
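&lt;p&gt;The title check in step 4 is simple enough to express directly (the phrase list mirrors the ones above; the real agent reasons about it loosely rather than running fixed code):&lt;/p&gt;

```python
CHALLENGE_PHRASES = (
    "just a moment",
    "checking your browser",
    "verifying you are human",
)

def looks_like_challenge(title):
    """Flag page titles that match known bot-challenge interstitials."""
    lowered = title.lower()
    return any(phrase in lowered for phrase in CHALLENGE_PHRASES)

flagged = looks_like_challenge("Just a moment...")
```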

&lt;p&gt;The next crawl automatically uses the right fetch strategy. No manual intervention needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Config-driven, not code-driven
&lt;/h3&gt;

&lt;p&gt;Everything lives in &lt;code&gt;config.json&lt;/code&gt;. Adding a new domain is a one-liner, and the scraper doesn't need to know anything about individual sites — it just reads the config and follows instructions. The agent writes to the same file, so the loop closes itself naturally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Graduated fetch strategy
&lt;/h3&gt;

&lt;p&gt;Not every site needs an expensive browser render. By escalating from direct to &lt;code&gt;httpResponseBody&lt;/code&gt; to &lt;a href="https://www.zyte.com/zyte-api/headless-browser/" rel="noopener noreferrer"&gt;&lt;code&gt;browserHtml&lt;/code&gt;&lt;/a&gt; only when necessary, I keep costs manageable. Browser renders are slower and consume more API credits — reserving them for sites that actually need them makes a meaningful difference at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Letting the agent handle the heuristics
&lt;/h3&gt;

&lt;p&gt;The challenge detection logic — matching titles against known bot-detection phrases — is exactly the kind of fuzzy heuristic that's tedious to maintain as code but natural for a language model to reason about. Claude also handles edge cases gracefully: if the &lt;code&gt;zyte-api&lt;/code&gt; CLI isn't installed, if the log is empty, if a domain is already correctly configured. A rule-based script would need explicit handling for every one of those scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  The limitations
&lt;/h2&gt;

&lt;p&gt;It's worth being honest about where this approach falls short.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's reactive, not proactive.&lt;/strong&gt; The agent only runs after a failed crawl. If a site starts blocking mid-run, those URLs fail silently until the next cycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Title-based detection is fragile.&lt;/strong&gt; Most bot-challenge pages say &lt;em&gt;"Just a moment…"&lt;/em&gt; — but a legitimate site could theoretically use that phrase. A false positive would cause the scraper to wastefully use browser rendering where it isn't needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One URL per domain.&lt;/strong&gt; The agent probes only the first failing URL for each domain. Different URL patterns on the same domain can have different bot-detection behaviour, which this doesn't account for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No rollback.&lt;/strong&gt; Once the config is updated, there's no way to detect if a Zyte setting later stops working and revert it automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost opacity.&lt;/strong&gt; The scraper logs HTTP status codes, not &lt;a href="https://www.zyte.com/pricing/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; credit consumption. There's no visibility into what each domain actually costs to fetch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I'd take it next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Smarter challenge detection.&lt;/strong&gt; Rather than keyword-matching on the title, the agent could read the full page HTML and make a more nuanced call — is this a product page, a login wall, or a soft block with a CAPTCHA? Each requires a different response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proactive monitoring.&lt;/strong&gt; A lightweight probe running daily against each configured domain, independent of the main crawl, would let the agent update the config &lt;em&gt;before&lt;/em&gt; a full scrape run hits a known-bad configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-URL config.&lt;/strong&gt; Right now &lt;code&gt;zyte&lt;/code&gt; and &lt;code&gt;browser_html&lt;/code&gt; are set at the domain level. Some sites serve static product pages on one path and JS-rendered category pages on another — granular per-URL settings would handle that cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured data extraction.&lt;/strong&gt; Right now &lt;code&gt;parse_page&lt;/code&gt; only pulls the page title. The natural next step is structured product extraction — price, availability, name, images — either via CSS selectors in the config or Zyte's &lt;code&gt;product&lt;/code&gt; extraction type, which uses ML models to parse product data from any page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent parallelism.&lt;/strong&gt; The self-healing loop is currently a single agent. As the config grows, a coordinator could spawn one subagent per failing domain, each running its own probe pipeline concurrently. The Claude Agent SDK supports subagents natively, so this would be a relatively small change.&lt;/p&gt;

&lt;p&gt;The core idea is simple: a scraper that observes its own failures and reconfigures itself. What I found interesting about building it wasn't the scraping itself — it was seeing how little scaffolding the agent actually needed. Three tools, a clear task, and it handles the diagnostic work that would otherwise fall to me.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>agents</category>
      <category>ai</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>How I get Claude to build HTML parsing code the way I want it</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 11 Mar 2026 07:13:06 +0000</pubDate>
      <link>https://dev.to/extractdata/html-parsing-with-claude-skills-extracting-structured-data-from-raw-html-1ai0</link>
      <guid>https://dev.to/extractdata/html-parsing-with-claude-skills-extracting-structured-data-from-raw-html-1ai0</guid>
      <description>&lt;p&gt;Getting HTML off a page is only the first step. Once you have it, the real work begins: pulling out the data that actually matters — product names, prices, ratings, specifications — in a clean, structured format you can actually do something with.&lt;/p&gt;

&lt;p&gt;That's what the parser skill is for. If you haven't read the introduction to skills in our fetcher post, it's worth a quick look first. But the short version is this: a skill is a &lt;code&gt;SKILL.md&lt;/code&gt; file that gives Claude precise, reusable instructions for using a specific tool. The parser skill is one of three that together form a complete web scraping pipeline.&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/zytelabs" rel="noopener noreferrer"&gt;
        zytelabs
      &lt;/a&gt; / &lt;a href="https://github.com/zytelabs/claude-webscraping-skills" rel="noopener noreferrer"&gt;
        claude-webscraping-skills
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;claude-webscraping-skills&lt;/h1&gt;

&lt;/div&gt;
&lt;p&gt;A collection of claude skills and other tools to assist your web-scraping needs.&lt;/p&gt;
&lt;p&gt;video explanations:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://youtu.be/HH0Q9OfKLu0" rel="nofollow noopener noreferrer"&gt;https://youtu.be/HH0Q9OfKLu0&lt;/a&gt;
&lt;a href="https://youtu.be/P2HhnFRXm-I" rel="nofollow noopener noreferrer"&gt;https://youtu.be/P2HhnFRXm-I&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Other reading:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/" rel="nofollow noopener noreferrer"&gt;https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/&lt;/a&gt;
&lt;a href="https://www.zyte.com/blog/supercharging-web-scraping-with-claude-skills/" rel="nofollow noopener noreferrer"&gt;https://www.zyte.com/blog/supercharging-web-scraping-with-claude-skills/&lt;/a&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h4 class="heading-element"&gt;Other claude tools for web scraping&lt;/h4&gt;

&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://github.com/apscrapes/zyte-fetch-page-content-mcp-server" rel="noopener noreferrer"&gt;zyte-fetch-page-content-mcp-server&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;A Model Context Protocol (MCP) server that runs locally using Docker Desktop's MCP toolkit and helps you extract clean, LLM-friendly content from any webpage using the Zyte API. Perfect for AI assistants that need to read and understand web content. by &lt;a href="https://github.com/apscrapes" rel="noopener noreferrer"&gt;Ayan Pahwa&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ol start="2"&gt;
&lt;li&gt;&lt;a href="https://joshuaodmark.com/blog/improve-claude-code-webfetch-with-zyte-api" rel="nofollow noopener noreferrer"&gt;Improve Claude Code WebFetch with Zyte API&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;When Claude encounters a WebFetch failure, it reads the CLAUDE.md instructions and makes a curl request to the Zyte API endpoint. The API returns base64-encoded HTML, which Claude decodes and processes just like it would with a normal WebFetch response. by &lt;a href="https://joshuaodmark.com/" rel="nofollow noopener noreferrer"&gt;Joshua Odmark&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ol start="3"&gt;
&lt;li&gt;&lt;a href="https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/" rel="nofollow noopener noreferrer"&gt;Claude skills vs MCP vs Web Scraping CoPilot&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/zytelabs/claude-webscraping-skills" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;





&lt;h2&gt;
  
  
  What is a skill?
&lt;/h2&gt;

&lt;p&gt;A skill is a small markdown file that tells Claude how to use a specific script or tool — what it does, when to use it, and step-by-step how to run it. Claude reads the file and follows the instructions as part of a broader workflow, with no manual intervention required.&lt;/p&gt;

&lt;p&gt;Skills are composable by design. The fetcher skill hands raw HTML to the parser skill, which hands structured JSON to the compare skill. Each one does one job well, and they're built to work together.&lt;/p&gt;

&lt;p&gt;The parser skill's front matter sets out its purpose immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;parser&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extracts&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;structured&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;product&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;raw&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;HTML.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Tries&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;JSON-LD&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;via"&lt;/span&gt;
&lt;span class="s"&gt;Extruct first, falls back to CSS selectors via Parsel.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Two methods, one fallback. That single description line captures the entire logic of the skill.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/HH0Q9OfKLu0"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;
&lt;h2&gt;
  
  
  What the parser skill does
&lt;/h2&gt;

&lt;p&gt;The parser skill takes raw HTML as input and returns a structured JSON object. It uses two extraction methods in sequence, trying the more reliable one first and falling back to the more flexible one if needed.&lt;/p&gt;

&lt;p&gt;The primary method uses Extruct to find JSON-LD data embedded in the page. JSON-LD is a structured data format that many modern sites include in their HTML specifically to make their content machine-readable — it's used for search engine optimisation and data portability. When it's present, Extruct can read it cleanly and reliably, with no need to write or maintain selectors.&lt;/p&gt;

&lt;p&gt;If no usable JSON-LD is found, the skill falls back to Parsel, which uses CSS selectors to locate data heuristically across the page. This is more flexible but inherently tied to the page's visual structure, which can change.&lt;/p&gt;
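
&lt;p&gt;The skill's &lt;code&gt;parser.py&lt;/code&gt; isn't reproduced here, but the extraction order is easy to sketch. This standard-library-only version stands in for the real Extruct and Parsel calls; the function name and output shape are illustrative, not the skill's actual implementation:&lt;/p&gt;

```python
# Dependency-free sketch of the parser skill's extraction order:
# prefer embedded JSON-LD, fall back to heuristic parsing. The real
# skill uses Extruct and Parsel instead of html.parser.
import json
from html.parser import HTMLParser

class JSONLDCollector(HTMLParser):
    """Collects the text of script tags typed application/ld+json."""
    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks.append(data)

def parse_product(html):
    """Return product data plus the extraction method that produced it."""
    collector = JSONLDCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue
        if data.get("@type") == "Product":
            return {"method": "extruct", "data": data}
    # No usable JSON-LD: the real skill falls back to Parsel CSS selectors.
    return {"method": "parsel", "data": {}}
```

&lt;p&gt;The real skill's Parsel fallback generates CSS selectors heuristically; the empty fallback above just marks where that logic would run.&lt;/p&gt;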
&lt;h2&gt;
  
  
  When to use it
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## When to use&lt;/span&gt;
Use this skill when you have raw HTML and need to extract structured data from
it — product details, prices, specs, ratings, or any page content.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In practice, that means the parser skill is almost always the second step in a pipeline — running immediately after the fetcher skill has retrieved your HTML. It works with any page type, and handles the most common product fields out of the box.&lt;/p&gt;
&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Instructions&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Save the HTML to a temporary file &lt;span class="sb"&gt;`page.html`&lt;/span&gt;
&lt;span class="p"&gt;2.&lt;/span&gt; Run &lt;span class="sb"&gt;`parser.py`&lt;/span&gt; against it:
   python parser.py page.html
&lt;span class="p"&gt;
3.&lt;/span&gt; The script outputs a JSON object. Check the &lt;span class="sb"&gt;`method`&lt;/span&gt; field:
&lt;span class="p"&gt;   -&lt;/span&gt; "extruct" — clean structured data was found, use it directly
&lt;span class="p"&gt;   -&lt;/span&gt; "parsel" — fell back to CSS selectors, review fields for completeness
&lt;span class="p"&gt;4.&lt;/span&gt; If key fields are missing from the Parsel output, ask the user which fields
   they need and re-run with --fields:
   python parser.py page.html --fields "price,rating,brand"
&lt;span class="p"&gt;
5.&lt;/span&gt; Return the parsed JSON to the conversation for use in the Compare skill.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;method&lt;/code&gt; field in the output is particularly useful. It tells you immediately how the data was extracted and how much trust to place in it. An &lt;code&gt;"extruct"&lt;/code&gt; result is clean and stable. A &lt;code&gt;"parsel"&lt;/code&gt; result is worth reviewing, especially if you're working with an unusual page layout.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--fields&lt;/code&gt; flag is a practical escape hatch. Rather than requiring you to dig into the script when key data is missing, it lets you specify exactly what you need and re-run — a much more efficient loop.&lt;/p&gt;
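
&lt;p&gt;For concreteness, a fallback run might return something like this. The &lt;code&gt;method&lt;/code&gt; field is documented by the skill; the other field names and values here are illustrative:&lt;/p&gt;

```json
{
  "method": "parsel",
  "data": {
    "name": "Example Widget",
    "price": "9.99",
    "rating": null
  }
}
```

&lt;p&gt;A null field like &lt;code&gt;rating&lt;/code&gt; in a &lt;code&gt;"parsel"&lt;/code&gt; result is exactly the cue to re-run with &lt;code&gt;--fields&lt;/code&gt;.&lt;/p&gt;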
&lt;h2&gt;
  
  
  Why prefer Extruct?
&lt;/h2&gt;

&lt;p&gt;The notes section of the skill file makes this explicit:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Notes&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Always prefer the Extruct path — it is more stable and requires no maintenance
&lt;span class="p"&gt;-&lt;/span&gt; Parsel selectors are generated heuristically and may need adjustment for
  unusual page layouts
&lt;span class="p"&gt;-&lt;/span&gt; Run once per page; pass all outputs together into the Compare skill
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Parsel selectors break when sites redesign. JSON-LD, by contrast, is structured data the site publishes independently of its visual layout. A site can completely overhaul its design and its JSON-LD will often remain untouched. That stability is worth prioritising wherever possible.&lt;/p&gt;
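
&lt;p&gt;To make that concrete: a product page's JSON-LD lives in a script tag of type &lt;code&gt;application/ld+json&lt;/code&gt; and looks something like this (an illustrative schema.org snippet, not taken from any specific site):&lt;/p&gt;

```json
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Widget",
  "brand": { "@type": "Brand", "name": "Acme" },
  "offers": {
    "@type": "Offer",
    "price": "9.99",
    "priceCurrency": "USD"
  }
}
```

&lt;p&gt;Nothing in that block depends on the page's CSS classes or markup structure, which is why it tends to survive redesigns that break selectors.&lt;/p&gt;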
&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;Once you've run the parser skill across all your target pages, you have a set of structured JSON objects ready to compare. That's where the compare skill picks up — generating tables, summaries, and side-by-side analysis from the extracted data.&lt;/p&gt;
&lt;h2&gt;
  
  
  Do you need a skill?
&lt;/h2&gt;

&lt;p&gt;The parser skill works well when the data you need maps cleanly onto fields that Extruct or Parsel can find — product names, prices, ratings, and similar structured attributes that sites commonly expose through JSON-LD or consistent HTML patterns. For that category of task, the skill is fast to apply and requires no custom code.&lt;/p&gt;

&lt;p&gt;See our post comparing Skills, MCP, and Web Scraping Copilot (our VS Code extension):&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;zyte.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;But not every extraction problem fits that mould. If you're working with pages that don't include JSON-LD and have highly irregular layouts, Parsel's heuristic selectors may return incomplete or inconsistent results, and you'll spend time debugging field by field. In those cases, a purpose-built extraction script using Parsel or BeautifulSoup directly — with selectors you've written and tested against the specific target — will be more reliable.&lt;/p&gt;

&lt;p&gt;For larger-scale or more complex extraction work, Zyte API's automatic extraction capabilities go further still. Rather than relying on selectors at all, automatic extraction uses AI to identify and return structured data from a page without requiring you to specify fields or maintain selector logic. If you're extracting data from many different site structures, or you need extraction to keep working through site redesigns without manual intervention, that's a more robust foundation than a skill-based approach. The parser skill is best understood as a practical middle ground: fast to use, good enough for a wide range of common cases, and easy to slot into a pipeline — but not a replacement for extraction tooling built for scale or resilience.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>skills</category>
      <category>webscraping</category>
      <category>zyte</category>
    </item>
    <item>
      <title>I gave Claude access to a web scraping API</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 11 Mar 2026 07:10:43 +0000</pubDate>
      <link>https://dev.to/extractdata/the-fetcher-skill-reliable-html-scraping-with-automatic-fallback-5f02</link>
      <guid>https://dev.to/extractdata/the-fetcher-skill-reliable-html-scraping-with-automatic-fallback-5f02</guid>
      <description>&lt;p&gt;If you've worked with Claude for any length of time, you've probably noticed it can do a lot more than answer questions. With the right setup, it can take actions — running scripts, processing files, working through multi-step workflows autonomously. Skills are what make that possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a skill?
&lt;/h2&gt;

&lt;p&gt;A skill is a small, self-contained instruction set that tells Claude how to use a specific tool or script to accomplish a well-defined task. Technically, it's a markdown file — a &lt;code&gt;SKILL.md&lt;/code&gt; — that describes what a tool does, when to reach for it, and exactly how to run it. Claude reads that file and follows the instructions as part of a larger workflow.&lt;/p&gt;

&lt;p&gt;Skills are designed to be composable. Each one does one thing well, and they're built to hand off to each other. The fetcher skill retrieves HTML. The parser skill extracts data from it. The compare skill turns multiple parsed outputs into a structured comparison. Together, they form a complete scraping pipeline — and Claude orchestrates the whole thing.&lt;/p&gt;

&lt;p&gt;See our skills here: &lt;a href="https://github.com/zytelabs/claude-webscraping-skills" rel="noopener noreferrer"&gt;https://github.com/zytelabs/claude-webscraping-skills&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The skill format looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fetcher&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fetches&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;raw&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;HTML&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;using&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;httpx,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;automatic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fallback&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Zyte&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;if&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;blocked."&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That front matter is what Claude uses to match the right skill to the right task. The description is deliberately precise: it tells Claude not just what the skill does, but how it does it, so Claude can reason about whether it's the right tool for the job.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/HH0Q9OfKLu0"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  What the fetcher skill does
&lt;/h2&gt;

&lt;p&gt;The fetcher skill's job is exactly what it sounds like: given a URL, fetch the raw HTML and return it. It uses httpx as its primary HTTP client — a modern, performant Python library well suited to scraping workloads.&lt;/p&gt;

&lt;p&gt;What makes it more than a simple wrapper is the fallback logic. A significant number of sites actively block automated requests. Without a fallback, a blocked request just fails, and you're left manually diagnosing why. The fetcher skill handles this automatically. If a request comes back with a &lt;code&gt;BLOCKED&lt;/code&gt; status, it retries via Zyte API, which provides built-in unblocking. Most of the time, you get your HTML without ever needing to intervene.&lt;/p&gt;
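
&lt;p&gt;The decision flow is simple enough to sketch. Everything below is illustrative: &lt;code&gt;fetch_direct&lt;/code&gt; and &lt;code&gt;fetch_via_zyte&lt;/code&gt; stand in for the skill's real httpx and Zyte API calls, and the status-code heuristic is an assumption rather than the skill's actual block detection:&lt;/p&gt;

```python
# Sketch of the fetcher skill's two-step flow. The callables stand in
# for real HTTP clients; each returns a (status_code, html) pair.

BLOCKED_STATUSES = {403, 429, 503}  # assumed heuristic, not the skill's

def is_blocked(status_code):
    """Treat common anti-bot responses as a block worth retrying."""
    return status_code in BLOCKED_STATUSES

def fetch_with_fallback(url, fetch_direct, fetch_via_zyte):
    """Try a plain HTTP fetch first; retry through Zyte API if blocked."""
    status, html = fetch_direct(url)
    if not is_blocked(status):
        return {"url": url, "html": html, "via": "direct"}
    status, html = fetch_via_zyte(url)  # the --zyte path
    if not is_blocked(status):
        return {"url": url, "html": html, "via": "zyte"}
    # Both attempts failed: surface the failure rather than drop the URL.
    return {"url": url, "html": None, "via": "failed"}
```

&lt;p&gt;The key design point is the last branch: a failed URL is reported, never silently skipped.&lt;/p&gt;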

&lt;h2&gt;
  
  
  When to use it
&lt;/h2&gt;

&lt;p&gt;The skill's &lt;code&gt;SKILL.md&lt;/code&gt; is explicit about this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## When to use&lt;/span&gt;
Use this skill when the user provides one or more URLs and asks you to fetch,
retrieve, scrape, or get the HTML or page content.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, that means any time you're starting a scraping or data extraction task and you have a URL to work from. It's the entry point for the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;The instructions in the skill file are straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Instructions&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Run &lt;span class="sb"&gt;`fetcher.py`&lt;/span&gt; with the URL as an argument:
   python fetcher.py &lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;
2.&lt;/span&gt; If the script returns a successful HTML response, return the HTML to the
   conversation for use in the next step.
&lt;span class="p"&gt;3.&lt;/span&gt; If the script returns a &lt;span class="sb"&gt;`BLOCKED`&lt;/span&gt; status, re-run with the &lt;span class="sb"&gt;`--zyte`&lt;/span&gt; flag:
   python fetcher.py &lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt; --zyte
&lt;span class="p"&gt;
4.&lt;/span&gt; Inform the user if a URL could not be fetched after both attempts.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two-step process keeps things efficient. httpx is fast and lightweight, so it handles the majority of requests without needing to route through Zyte API. The fallback only kicks in when it's needed. If both attempts fail, Claude surfaces that to you clearly rather than silently moving on.&lt;/p&gt;

&lt;p&gt;For multiple URLs, the script runs once per URL — there's no batching — so Claude loops through a list sequentially.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transparency about failure
&lt;/h2&gt;

&lt;p&gt;One detail worth highlighting is the final instruction: inform the user if a URL could not be fetched after both attempts. This might seem obvious, but it reflects a design principle worth being explicit about. A skill that silently drops failed URLs would produce incomplete data downstream, and you might not notice until you're looking at a comparison table with missing rows. Surfacing failures immediately keeps the pipeline honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;The fetcher skill's output is raw HTML — exactly what the parser skill expects as its input. The two are designed to be used in sequence. Once you have the HTML, the parser skill takes over, extracting structured data through JSON-LD or CSS selectors depending on what the page contains.&lt;/p&gt;

&lt;p&gt;That handoff is documented in the skill's notes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Notes&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; For multiple URLs, run the script once per URL
&lt;span class="p"&gt;-&lt;/span&gt; Pass the raw HTML output into the Parser skill for extraction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline continues from there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do you need a skill?
&lt;/h2&gt;

&lt;p&gt;Skills are a good fit when you have a well-defined, repeatable task that benefits from consistent behaviour across many runs. Fetching HTML from a URL is a clear example: the inputs and outputs are predictable, the fallback logic is always the same, and packaging that into a skill means Claude applies it reliably without you having to re-explain the process each time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/" rel="noopener noreferrer"&gt;Read our breakdown of Skills vs MCP vs Web Scraping Copilot (our VS Code extension)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That said, skills aren't always the right tool. If you only need to fetch a handful of pages once, asking Claude to write a quick httpx script directly may be faster and more flexible. Similarly, if your target sites have unusual behaviour — rate limiting, JavaScript rendering, login walls, or multi-step navigation — a bespoke Scrapy spider built with Zyte API gives you far more control than a general-purpose fetch wrapper. Scrapy's middleware architecture, item pipelines, and scheduling make it better suited to large-scale or complex crawls where you need precise control over every aspect of the request cycle.&lt;/p&gt;

&lt;p&gt;The fetcher skill sits in the middle: more structured than an ad hoc script, less complex than a full Scrapy project. It's the right choice when you want Claude to handle straightforward retrieval as part of a larger automated workflow, without the overhead of setting up and maintaining a dedicated spider.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>skills</category>
      <category>webscraping</category>
      <category>zyte</category>
    </item>
    <item>
      <title>I built a Claude Code skill that screenshots any website (and it handles anti-bot sites too)</title>
      <dc:creator>Ayan Pahwa</dc:creator>
      <pubDate>Fri, 06 Mar 2026 14:40:32 +0000</pubDate>
      <link>https://dev.to/extractdata/i-built-a-claude-code-skill-that-screenshots-any-website-and-it-handles-anti-bot-sites-too-2m4b</link>
      <guid>https://dev.to/extractdata/i-built-a-claude-code-skill-that-screenshots-any-website-and-it-handles-anti-bot-sites-too-2m4b</guid>
      <description>&lt;p&gt;TLDR;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Automate screenshot capture for any URL with JavaScript rendering and anti-ban protection — straight from your AI assistant.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/P2HhnFRXm-I"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;Taking a screenshot of a webpage sounds trivial, until you need to do it at scale. Modern websites throw every obstacle imaginable in your way: JavaScript-rendered content that only appears after a React bundle loads, bot-detection systems that serve blank pages to automated headless browsers, geo-blocked content, and CAPTCHAs that appear the moment traffic patterns look non-human. For a handful of URLs you can get away with Puppeteer or Playwright. For hundreds or thousands? You need infrastructure built for the job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnca4lwviajb7vavqwio5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnca4lwviajb7vavqwio5.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Zyte API was designed specifically for this problem. It handles JavaScript rendering, anti-bot fingerprinting, rotating proxies, and headless browser management so you don't have to. And what better way to use it than straight from the LLM that's supplying the URLs? That's why I created the zyte-screenshots Claude Skill, which triggers the entire workflow (API call, base64 decode, PNG saved to your filesystem) just by chatting with Claude.&lt;/p&gt;

&lt;p&gt;In this tutorial, we'll walk through exactly how the skill works, how to set it up, and how to use it to capture production-quality screenshots of any URL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Use the Zyte API for Screenshots?
&lt;/h2&gt;

&lt;p&gt;Before diving into the skill itself, it's worth understanding what makes the Zyte API uniquely suited to screenshot capture at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Full JavaScript Rendering
&lt;/h3&gt;

&lt;p&gt;Single-page applications built with React, Vue, Angular, or Next.js don't serve their content in the raw HTML response; they render it client-side after the page loads. Tools that capture the raw HTTP response will get a blank shell. Zyte's screenshot endpoint fires a real headless browser, waits for the DOM to fully settle, then captures the final rendered state.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Anti-Bot and Anti-Ban Protection
&lt;/h3&gt;

&lt;p&gt;Enterprise-grade sites use fingerprinting libraries to detect automation. They check TLS fingerprints, browser headers, canvas rendering patterns, mouse movement entropy, and dozens of other signals. Zyte's infrastructure is battle-tested to pass these checks, so your screenshots won't return an "Access Denied" page.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Scale Without Infrastructure
&lt;/h3&gt;

&lt;p&gt;Managing a fleet of headless browser instances, proxy rotation, retries, and residential IP pools is a serious engineering investment. Zyte abstracts all of this into a single API call.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. One API, Any URL
&lt;/h3&gt;

&lt;p&gt;Whether the target is a static HTML page, a JS-heavy SPA, a behind-login dashboard (with session cookies), or a geo-restricted site, the same API call structure works. The skill you're about to install uses this endpoint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.zyte.com/account/signup/zyteapi" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9a7qsf6b03h1rrfzw5g.png" alt="Zyte API signup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is the zyte-screenshots Claude Skill?
&lt;/h2&gt;

&lt;p&gt;Claude Skills are reusable instruction packages that extend Claude's capabilities with domain-specific workflows. The &lt;strong&gt;zyte-screenshots&lt;/strong&gt; skill teaches Claude how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accept a URL from the user in natural language&lt;/li&gt;
&lt;li&gt;Read the ZYTE_API_KEY environment variable&lt;/li&gt;
&lt;li&gt;Construct and execute the correct curl command against &lt;a href="https://api.zyte.com/v1/extract" rel="noopener noreferrer"&gt;https://api.zyte.com/v1/extract&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Pipe the JSON response through jq and base64 --decode to produce a PNG file&lt;/li&gt;
&lt;li&gt;Derive a clean filename from the URL (e.g. &lt;a href="https://quotes.toscrape.com" rel="noopener noreferrer"&gt;https://quotes.toscrape.com&lt;/a&gt; becomes quotes.toscrape.png)&lt;/li&gt;
&lt;li&gt;Report the exact file path and describe what's visible in the screenshot in one sentence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, this means you can open Claude, say &lt;strong&gt;"screenshot &lt;a href="https://example.com" rel="noopener noreferrer"&gt;https://example.com&lt;/a&gt;"&lt;/strong&gt;, and have a pixel-perfect PNG on your filesystem in seconds, no browser, no script, no Puppeteer config.&lt;/p&gt;
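
&lt;p&gt;The filename rule below is one plausible reading of the example above (hostname, minus any leading &lt;code&gt;www.&lt;/code&gt; and the trailing TLD label); the repository defines the skill's actual logic:&lt;/p&gt;

```python
# Illustrative filename derivation matching the documented example:
# https://quotes.toscrape.com becomes quotes.toscrape.png
from urllib.parse import urlparse

def derive_filename(url):
    host = urlparse(url).hostname or "page"
    if host.startswith("www."):
        host = host[4:]          # drop a leading www.
    labels = host.split(".")
    if len(labels) == 1:         # bare hostname, keep as-is
        return host + ".png"
    return ".".join(labels[:-1]) + ".png"   # drop the TLD label
```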

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before installing the skill, make sure you have the following:&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;curl&lt;/strong&gt;: Pre-installed on macOS and most Linux distributions. On Windows, use WSL or Git Bash.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;jq&lt;/strong&gt;: A lightweight JSON processor. Install via &lt;code&gt;brew install jq&lt;/code&gt; (macOS) or &lt;code&gt;sudo apt install jq&lt;/code&gt; (Ubuntu/Debian).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;base64&lt;/strong&gt;: Standard on all Unix-like systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude desktop app&lt;/strong&gt; with Skills support enabled.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A Zyte API Key
&lt;/h3&gt;

&lt;p&gt;Sign up at &lt;a href="https://www.zyte.com/" rel="noopener noreferrer"&gt;zyte.com&lt;/a&gt; and navigate to your API credentials. The free tier includes enough credits to get started with testing. Copy your API key; you'll set it as an environment variable.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Pro tip:&lt;/strong&gt; Set your ZYTE_API_KEY in your shell profile (~/.zshrc or ~/.bashrc) so it's always available: &lt;code&gt;export ZYTE_API_KEY="your_key_here"&lt;/code&gt;. Alternatively, pass it along with your prompt.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Installing the Skill
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Download the Skill from GitHub
&lt;/h3&gt;

&lt;p&gt;The skill is open source and available at &lt;a href="https://github.com/apscrapes/claude-zyte-screenshots" rel="noopener noreferrer"&gt;github.com/apscrapes/claude-zyte-screenshots&lt;/a&gt;. Download the latest release ZIP from the repository's Releases page, or clone it directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/apscrapes/claude-zyte-screenshots.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Import into Claude
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Open the Claude desktop app or go to Claude.ai&lt;/li&gt;
&lt;li&gt;Navigate to Settings → Skills&lt;/li&gt;
&lt;li&gt;Click Import Skill&lt;/li&gt;
&lt;li&gt;Select the SKILL.md file from the downloaded ZIP&lt;/li&gt;
&lt;li&gt;Claude will confirm the skill has been installed&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 3: Set Your API Key
&lt;/h3&gt;

&lt;p&gt;In your terminal, export your Zyte API key, or pass it along with your prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ZYTE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_zyte_api_key_here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make this permanent across terminal sessions, add the export line to your &lt;code&gt;~/.zshrc&lt;/code&gt; or &lt;code&gt;~/.bashrc&lt;/code&gt; file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using the Skill: Step-by-Step
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Basic Screenshot
&lt;/h3&gt;

&lt;p&gt;Once the skill is installed and your API key is set, using it is as simple as asking Claude:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Screenshot https://quotes.toscrape.com using zyte-screenshots skill
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Verify ZYTE_API_KEY is available in the environment&lt;/li&gt;
&lt;li&gt;Derive the output filename: quotes.toscrape.png&lt;/li&gt;
&lt;li&gt;Execute the following command:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://api.zyte.com/v1/extract &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ZYTE_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;: &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "url": "https://quotes.toscrape.com",
    "screenshot": true
  }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
| jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.screenshot'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
| &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;--decode&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; quotes.toscrape.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Verify the file is non-empty&lt;/li&gt;
&lt;li&gt;Open and inspect the image&lt;/li&gt;
&lt;li&gt;Return the file location and a one-line visual description&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example output from Claude:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Screenshot captured!

📁 Location: /home/you/quotes.toscrape.png

🖼️  What's in it: A clean light-themed page listing inspirational quotes
   with author attributions and tag filters in the sidebar.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Under the Hood: How the API Call Works
&lt;/h2&gt;

&lt;p&gt;Let's break down the exact curl command the skill executes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://api.zyte.com/v1/extract &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ZYTE_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;: &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "url": "https://target-site.com",
    "screenshot": true
  }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
| jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.screenshot'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
| &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;--decode&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; output.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;curl -s&lt;/code&gt;&lt;/strong&gt; — Silent mode; suppresses progress output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;-u "$ZYTE_API_KEY":&lt;/code&gt;&lt;/strong&gt; — HTTP Basic Auth. Zyte uses the API key as the username with an empty password.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;-H "Content-Type: application/json"&lt;/code&gt;&lt;/strong&gt; — Tells the API to expect a JSON body.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;-d '{...}'&lt;/code&gt;&lt;/strong&gt; — The JSON request body. Setting &lt;code&gt;screenshot: true&lt;/code&gt; instructs Zyte to return a base64-encoded PNG of the fully rendered page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;| jq -r '.screenshot'&lt;/code&gt;&lt;/strong&gt; — Extracts the raw base64 string from the JSON response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;| base64 --decode&lt;/code&gt;&lt;/strong&gt; — Decodes the base64 string into binary PNG data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;&amp;gt; output.png&lt;/code&gt;&lt;/strong&gt; — Writes the binary data to a PNG file.&lt;/p&gt;

&lt;p&gt;The Zyte API handles everything in between — spinning up a headless Chromium instance, loading the page with real browser fingerprints, waiting for JavaScript execution to complete, and rendering the final DOM to a pixel buffer.&lt;/p&gt;
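
&lt;p&gt;The &lt;code&gt;jq&lt;/code&gt; and &lt;code&gt;base64&lt;/code&gt; tail of that pipeline translates directly to Python. A minimal sketch, where &lt;code&gt;mock_response&lt;/code&gt; stands in for the real Zyte API JSON body:&lt;/p&gt;

```python
import base64
import json

# Stand-in for the JSON body the Zyte API returns; the real
# "screenshot" field carries a much longer base64 string.
mock_response = json.dumps({
    "url": "https://target-site.com",
    "screenshot": base64.b64encode(b"\x89PNG\r\n\x1a\n").decode(),
})

# Equivalent of: jq -r '.screenshot' | base64 --decode > output.png
data = json.loads(mock_response)
png_bytes = base64.b64decode(data["screenshot"])

with open("output.png", "wb") as f:
    f.write(png_bytes)

print(png_bytes[:4])
```

&lt;p&gt;In a real run, the body returned by the curl call would be fed to &lt;code&gt;json.loads&lt;/code&gt; instead of the mock.&lt;/p&gt;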

&lt;p&gt;This was a fun weekend project I put together. Let me know your thoughts on our Discord, and feel free to play around with it. I’d also love to hear about any useful Claude Skills or MCP servers you build, so say hi there too.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tags: web scraping • Zyte API • screenshots at scale • JavaScript rendering • anti-bot • Claude AI • Claude Skills • automation • headless browser • site APIs&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>webscraping</category>
      <category>python</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why your Python request gets 403 Forbidden</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 04 Mar 2026 09:11:29 +0000</pubDate>
      <link>https://dev.to/extractdata/why-your-python-request-gets-403-forbidden-4h3p</link>
      <guid>https://dev.to/extractdata/why-your-python-request-gets-403-forbidden-4h3p</guid>
      <description>&lt;p&gt;If you’ve had your HTTP request blocked despite using correct headers, cookies, and clean IPs, there’s a chance you are running into one of the simplest forms of blocking, and one of the most confusing for beginners.&lt;/p&gt;

&lt;p&gt;Chances are, you will recognise the problem. You found the hidden API, and your request works perfectly in Postman... but it fails instantly within your Python code.&lt;/p&gt;

&lt;p&gt;It’s called TLS fingerprinting. But the good news is, you can solve it. In fact, when I showed this to some developers at &lt;a href="https://www.extractsummit.io/" rel="noopener noreferrer"&gt;Extract Summit&lt;/a&gt;, they couldn’t believe how straightforward it was to fix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gl264gftdhxminz531h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gl264gftdhxminz531h.png" alt="bruno request" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;CAPTION: “I copied the request -&amp;gt; matching headers, cookies and IP, but it still failed?”&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Your TLS fingerprint
&lt;/h2&gt;

&lt;p&gt;Let’s start with a question: how do servers know you’ve moved from Postman to making the request in Python? What do they see that you can’t? The key is your TLS fingerprint.&lt;/p&gt;

&lt;p&gt;To use an analogy: We’ve effectively written a different name on a sticker and stuck it to our t-shirt, hoping to get past the bouncer at a bar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your nametag (headers)&lt;/strong&gt; says "Chrome."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But your t-shirt logo (TLS handshake)&lt;/strong&gt; very obviously says "Python."&lt;/p&gt;

&lt;p&gt;It’s a dead giveaway. This mismatch is spotted immediately. We need to change our t-shirt to match the nametag.&lt;/p&gt;

&lt;p&gt;To understand &lt;em&gt;how&lt;/em&gt; they spot the “logo”, we need to look at the initial &lt;strong&gt;“Client Hello”&lt;/strong&gt; packet. There are three key pieces of information exchanged here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cipher suites:&lt;/strong&gt; The encryption methods the client supports.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLS extensions:&lt;/strong&gt; Extra features (like specific elliptic curves).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key exchange algorithms:&lt;/strong&gt; How they agree on a password.&lt;/li&gt;
&lt;/ol&gt;
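
&lt;p&gt;You can inspect the first of these lists for yourself: Python’s standard library exposes the cipher suites its OpenSSL build will offer. A quick sketch (the exact output varies by Python and OpenSSL version):&lt;/p&gt;

```python
import ssl

# Cipher suites this Python/OpenSSL build would advertise in its
# Client Hello: one of the main inputs to a TLS fingerprint.
ctx = ssl.create_default_context()
cipher_names = [c["name"] for c in ctx.get_ciphers()]

print(len(cipher_names), "cipher suites offered")
print(cipher_names[:3])
```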

&lt;p&gt;These details differ between clients because Python’s &lt;code&gt;requests&lt;/code&gt; library uses &lt;strong&gt;OpenSSL&lt;/strong&gt;, while Chrome uses Google's &lt;strong&gt;BoringSSL&lt;/strong&gt;. While they share some underlying logic, their signatures are notably different. And that’s the problem.&lt;/p&gt;
&lt;h3&gt;
  
  
  OpenSSL vs. BoringSSL
&lt;/h3&gt;

&lt;p&gt;The root cause of this mismatch lies in the underlying libraries.&lt;/p&gt;

&lt;p&gt;Python’s &lt;code&gt;requests&lt;/code&gt; library relies on &lt;strong&gt;OpenSSL&lt;/strong&gt;, the standard cryptographic library found on almost every Linux server. It is robust, predictable, and remarkably consistent.&lt;/p&gt;

&lt;p&gt;Chrome, however, uses &lt;strong&gt;BoringSSL&lt;/strong&gt;, Google’s own fork of OpenSSL. BoringSSL is designed specifically for the chaotic nature of the web and it behaves very differently.&lt;/p&gt;

&lt;p&gt;The biggest giveaway between the two is a mechanism called &lt;strong&gt;GREASE&lt;/strong&gt; (Generate Random Extensions And Sustain Extensibility).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="s2"&gt;"TLS_GREASE (0xFAFA)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="err"&gt;....&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Chrome (BoringSSL) intentionally inserts random, garbage values into the TLS handshake - specifically, in the cipher suites and extensions lists. It does this to "grease the joints" of the internet, ensuring that servers don't crash when they encounter unknown future parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is one of the key differences:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chrome:&lt;/strong&gt; Always includes these random GREASE values (e.g., &lt;code&gt;0x0a0a&lt;/code&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python (OpenSSL):&lt;/strong&gt; &lt;em&gt;Never&lt;/em&gt; includes them. It only sends valid, known ciphers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, when an anti-bot system sees a handshake claiming to be "Chrome 120" but lacking these random GREASE values, it knows instantly that it is dealing with a script. It’s not just that your shirt has the wrong logo; it’s that your shirt is &lt;em&gt;too clean&lt;/em&gt;.&lt;/p&gt;
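
&lt;p&gt;GREASE values follow a fixed pattern (RFC 8701): both bytes of the value are identical and the low nibble is &lt;code&gt;0xA&lt;/code&gt;, giving &lt;code&gt;0x0a0a&lt;/code&gt;, &lt;code&gt;0x1a1a&lt;/code&gt;, up to &lt;code&gt;0xfafa&lt;/code&gt;. A sketch of the check an anti-bot system could apply to an offered cipher list (the sample lists below are made up for illustration):&lt;/p&gt;

```python
def is_grease(value: int) -> bool:
    """True if value matches the GREASE pattern 0xXaXa (RFC 8701)."""
    return (value >> 8) == (value & 0xFF) and (value & 0x0F) == 0x0A

# Hypothetical offered cipher IDs: a Chrome-like list leads with a
# GREASE value, a plain OpenSSL list never contains one.
chrome_like = [0xFAFA, 0x1301, 0x1302]
openssl_like = [0x1301, 0x1302, 0x1303]

print(any(is_grease(v) for v in chrome_like))   # True
print(any(is_grease(v) for v in openssl_like))  # False
```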

&lt;h2&gt;
  
  
  JA3 hash
&lt;/h2&gt;

&lt;p&gt;Anti-bot companies take all that handshake data and combine it into a single string called a &lt;strong&gt;JA3 fingerprint&lt;/strong&gt;.&lt;/p&gt;
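
&lt;p&gt;The construction itself is simple: JA3 joins five Client Hello fields (TLS version, ciphers, extensions, elliptic curves, point formats) into one comma-separated string and takes its MD5. A sketch with abbreviated, made-up field values:&lt;/p&gt;

```python
import hashlib

# JA3 = MD5("TLSVersion,Ciphers,Extensions,EllipticCurves,PointFormats")
# Field values below are abbreviated and made up for illustration.
fields = ["771", "4866-4867-4865", "0-11-10", "29-23", "0"]
ja3_string = ",".join(fields)
ja3_hash = hashlib.md5(ja3_string.encode()).hexdigest()

print(ja3_string)
print(ja3_hash)  # same handshake shape yields the same hash, every time
```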

&lt;p&gt;Salesforce invented this years ago to detect malware, but it found its way into our industry as a simple, effective way to fingerprint TLS clients. Security vendors have built databases of these fingerprints.&lt;/p&gt;

&lt;p&gt;It is relatively straightforward to identify and block &lt;em&gt;any&lt;/em&gt; request coming from Python’s go-to &lt;code&gt;requests&lt;/code&gt; library, because its JA3 hash is static and well-known.&lt;/p&gt;

&lt;p&gt;The code snippet below yields the JSON response shown after it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_ja3_info&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tls.peet.ws/api/clean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note the empty akamai_hash: &lt;code&gt;requests&lt;/code&gt; speaks only HTTP/1.1, so there is no HTTP/2 fingerprint to hash&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ja3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="s2"&gt;"771,4866-4867-4865-49196-49200-49195-49199-52393-52392-49188-49192-49187-49191-159-158-107-103-255,0-11-10-16-22-2
3-49-13-43-45-51-21,29-23-30-25-24-256-257-258-259-260,0-1-2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ja3_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a48c0d5f95b1ef98f560f324fd275da1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ja4"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"t13d1812h1_85036bcba153_375ca2c5e164"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ja4_r"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="s2"&gt;"t13d1812h1_0067,006b,009e,009f,00ff,1301,1302,1303,c023,c024,c027,c028,c02b,c02c,c02f,c030,cca8,cca9_000a,000b,000
d,0016,0017,002b,002d,0031,0033_0403,0503,0603,0807,0808,0809,080a,080b,0804,0805,0806,0401,0501,0601,0303,0301,030
2,0402,0502,0602"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"akamai"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"akamai_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"peetprint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="s2"&gt;"772-771|1.1|29-23-30-25-24-256-257-258-259-260|1027-1283-1539-2055-2056-2057-2058-2059-2052-2053-2054-1025-1281-15
37-771-769-770-1026-1282-1538|1||4866-4867-4865-49196-49200-49195-49199-52393-52392-49188-49192-49187-49191-159-158
-107-103-255|0-10-11-13-16-21-22-23-43-45-49-51"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"peetprint_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"76017c4a71b7a055fb2a9a5f70f05112"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Putting the above JA3 hash into ja3.zone clearly shows this is a python3 request, using urllib3:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmrqcpccqc08rcm5wz34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmrqcpccqc08rcm5wz34.png" alt="JA3 Zone" width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s the solution?
&lt;/h2&gt;

&lt;p&gt;As mentioned, simply changing headers and IP addresses won’t make a difference, as these are not part of the TLS handshake. We need to change the ciphers and extensions to look more like what a browser would send.&lt;/p&gt;

&lt;p&gt;The best way to achieve this in Python is to swap &lt;code&gt;requests&lt;/code&gt; for a modern, TLS-friendly library like &lt;strong&gt;curl_cffi&lt;/strong&gt; or &lt;strong&gt;rnet&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here is how easy it is to switch to &lt;strong&gt;curl_cffi&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;curl_cffi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="c1"&gt;# note the impersonate argument &amp;amp; import above
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_ja3_info&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tls.peet.ws/api/clean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;impersonate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chrome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running the same check again now returns a browser-like fingerprint, including an Akamai HTTP/2 hash:&lt;/p&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"akamai_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"52d84b11737d980aef856699f885ca86"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytpcxmljsdlf7nq02rq3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytpcxmljsdlf7nq02rq3.png" alt="Our new hash" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;CAPTION: Note - I searched via the akamai_hash here as the fingerprint from the JA3 hash wasn’t in this particular database.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;By adding that &lt;code&gt;impersonate&lt;/code&gt; parameter, you are effectively putting on the correct t-shirt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Make &lt;code&gt;curl_cffi&lt;/code&gt; or &lt;code&gt;rnet&lt;/code&gt; your default HTTP library in Python. This should be your first port of call before spinning up a full headless browser.&lt;/p&gt;

&lt;p&gt;A simple change (which brings benefits like async capabilities) means you don’t fall foul of TLS fingerprinting. &lt;code&gt;curl_cffi&lt;/code&gt; even has a &lt;code&gt;requests&lt;/code&gt;-like API, meaning it's often a drop-in replacement.&lt;/p&gt;

</description>
      <category>api</category>
      <category>networking</category>
      <category>python</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Hybrid scraping: The architecture for the modern web</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 25 Feb 2026 17:05:36 +0000</pubDate>
      <link>https://dev.to/extractdata/hybrid-scraping-the-architecture-for-the-modern-web-4p9h</link>
      <guid>https://dev.to/extractdata/hybrid-scraping-the-architecture-for-the-modern-web-4p9h</guid>
      <description>&lt;p&gt;If you scrape the modern web, you probably know the pain of the &lt;strong&gt;JavaScript challenge&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Before you can access any data, the website forces your browser to execute a snippet of JavaScript code. It calculates a result, sends it back to an endpoint for verification, and often captures extensive fingerprinting data in the process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl32q8pbypgq6l22kner6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl32q8pbypgq6l22kner6.png" alt="browser checks" width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you pass this test, the server assigns you a &lt;strong&gt;session cookie&lt;/strong&gt;. This cookie acts as your "access pass." It tells the website, &lt;em&gt;"This user has passed the challenge,"&lt;/em&gt; so you don’t have to re-run the JavaScript test on every single page load.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9ot3jm7249g98cuwbno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9ot3jm7249g98cuwbno.png" alt="devtool shows storage token" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For web scrapers, this mechanism creates a massive inefficiency.&lt;/p&gt;

&lt;p&gt;It &lt;em&gt;looks&lt;/em&gt; like you are forced to use a headless browser (like Puppeteer or Playwright) for every single request just to handle that initial check. But browsers are heavy: they are slow, and they consume massive amounts of RAM and bandwidth.&lt;/p&gt;

&lt;p&gt;Running a browser for thousands of requests can quickly become an infrastructure nightmare. You end up paying for CPU cycles just to render a page when all you wanted was the JSON payload.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution: Hybrid scraping
&lt;/h3&gt;

&lt;p&gt;The answer to this problem is a technique I’ve started calling &lt;strong&gt;hybrid scraping&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This involves using the browser &lt;em&gt;only&lt;/em&gt; to open the initial request, grab the cookie, and create a session. Once you have them, you extract that session data and hand it over to a standard, lightweight HTTP client.&lt;/p&gt;

&lt;p&gt;This architecture gives you the &lt;strong&gt;access&lt;/strong&gt; of a browser with the &lt;strong&gt;speed and efficiency&lt;/strong&gt; of a script.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing this in Python
&lt;/h2&gt;

&lt;p&gt;To build this in Python, we need two specific packages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A browser:&lt;/strong&gt; We will use &lt;strong&gt;ZenDriver&lt;/strong&gt;, a modern wrapper for headless Chrome that handles the "undetected" configuration for us.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP client:&lt;/strong&gt; We will use &lt;a href="https://github.com/0x676e67/rnet" rel="noopener noreferrer"&gt;&lt;strong&gt;rnet&lt;/strong&gt;&lt;/a&gt;, a Rust-based HTTP client for Python.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;But why &lt;a href="https://github.com/0x676e67/rnet" rel="noopener noreferrer"&gt;rnet&lt;/a&gt;?&lt;/em&gt; Well, within the initial TLS handshake where the client/server “hello” is sent, the information traded here can be fingerprinted, taking in things like the TLS version and the ciphers available for encryption. This can be hashed into a fingerprint and profiled.&lt;/p&gt;

&lt;p&gt;Python’s &lt;a href="https://github.com/psf/requests" rel="noopener noreferrer"&gt;requests&lt;/a&gt; package, which is built on &lt;a href="https://github.com/urllib3/urllib3" rel="noopener noreferrer"&gt;urllib3&lt;/a&gt;, has a very distinctive TLS fingerprint, containing ciphers (amongst other things) that aren’t seen in a browser. This makes it very easy to spot. Both &lt;a href="https://github.com/0x676e67/rnet" rel="noopener noreferrer"&gt;rnet&lt;/a&gt; and alternatives such as &lt;a href="https://github.com/lexiforest/curl_cffi" rel="noopener noreferrer"&gt;curl_cffi&lt;/a&gt; are able to send a TLS fingerprint similar to that of a browser. This reduces the chances of our request being blocked.&lt;/p&gt;

&lt;p&gt;Here is how we assemble the pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Load the page (The handshake)
&lt;/h3&gt;

&lt;p&gt;First, we define our browser logic. Notice that we are not trying to parse HTML here. Our only goal is to visit the site, pass the initial JavaScript challenge, and extract the session cookies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;zendriver&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;zd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cookies&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Use ZenDriver to launch a browser, navigate to the page, 
    and retrieve the cookies.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;zd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Hit the homepage to trigger the check
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Wait briefly for the JS challenge to complete
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

    &lt;span class="c1"&gt;# Extract the cookies
&lt;/span&gt;    &lt;span class="n"&gt;requests_style_cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;requests_style_cookies&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What’s happening here:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We launch the browser, visit the site, and wait just one second for the JS challenge to run. Once we have the cookies, we call &lt;code&gt;browser.stop()&lt;/code&gt;. This is the most important line: we do not want a browser instance wasting resources when we don’t need it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Use the cookies
&lt;/h3&gt;

&lt;p&gt;Now that we have the "access pass," we can switch to our lightweight HTTP client. We take those cookies and inject them into the rnet client headers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rnet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Emulation&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;http_request_rnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Make a fast request using RNet with the borrowed cookies.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;referer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Format the browser cookies into a simple HTTP header string
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookie_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;cookie_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cookie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookie_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# We use Emulation.Chrome142 to change the TLS Fingerprint.
&lt;/span&gt;    &lt;span class="c1"&gt;# This is site dependent - but worth using
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emulation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Emulation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chrome142&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com/api/products?page=1&amp;amp;limit=8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What’s happening here:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;We convert the browser's cookie format into a standard &lt;code&gt;Cookie&lt;/code&gt; header string. Note the &lt;em&gt;Emulation.Chrome142&lt;/em&gt; parameter. We are layering two techniques here: hybrid scraping (reusing real browser cookies) and TLS fingerprint emulation (making a modern HTTP client look like Chrome on the wire). This double-layer approach covers both the cookie check and the handshake check.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Note: Many HTTP clients have a cookie jar that you could also use; for this example, sending the header directly worked perfectly).&lt;/em&gt;&lt;/p&gt;
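&lt;p&gt;&lt;em&gt;As a side note, the cookie-to-header conversion can be factored into a small helper. The sketch below is illustrative rather than part of the original script; it handles both shapes the full script's comments mention (dict-style cookies and objects with &lt;/em&gt;&lt;code&gt;.name&lt;/code&gt;&lt;em&gt;/&lt;/em&gt;&lt;code&gt;.value&lt;/code&gt;&lt;em&gt; attributes):&lt;/em&gt;&lt;/p&gt;

```python
# Illustrative helper (not from the original script): build a Cookie
# header value from cookies that may be dicts or attribute objects.
def build_cookie_header(cookies):
    parts = []
    for c in cookies:
        if isinstance(c, dict):
            name, value = c["name"], c["value"]
        else:
            name, value = c.name, c.value
        parts.append(f"{name}={value}")
    # Cookie header pairs are joined with "; " per RFC 6265
    return "; ".join(parts)
```

&lt;p&gt;&lt;em&gt;Either way zendriver hands cookies back, the same header string comes out.&lt;/em&gt;&lt;/p&gt;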

&lt;h3&gt;
  
  
  Step 3: Run the code
&lt;/h3&gt;

&lt;p&gt;Finally, we tie it together. For this demo, we use a simple argparse flag to show the difference with and without the cookie.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="c1"&gt;# The Decision Logic
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;use_cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_cookies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Run the heavy browser
&lt;/span&gt;
    &lt;span class="c1"&gt;# Always run the fast HTTP client
&lt;/span&gt;    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;http_request_rnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status Code:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response Body:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request blocked&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Get the complete script
&lt;/h2&gt;

&lt;p&gt;Want to run this yourself? We’ve put the full, copy-pasteable script (including the argument parsers and imports) in the block below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv init
uv add zendriver rnet rich
&lt;span class="c"&gt;# linux/mac&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="c"&gt;# windows&lt;/span&gt;
.venv&lt;span class="se"&gt;\S&lt;/span&gt;cripts&lt;span class="se"&gt;\a&lt;/span&gt;ctivate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;zendriver&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;zd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rnet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Emulation&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rich&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="k"&gt;print&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;http_request_rnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Make an HTTP GET request using rnet with the provided cookies. Cookies are sent in the headers. Note for this site we need the referer too.
    Return the Response Object.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;referer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookie_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Adjust based on the actual structure of the cookie object from zendriver
&lt;/span&gt;            &lt;span class="c1"&gt;# If it's a dict: cookie['name'], cookie['value']
&lt;/span&gt;            &lt;span class="c1"&gt;# If it's an object: cookie.name, cookie.value
&lt;/span&gt;            &lt;span class="n"&gt;cookie_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cookie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookie_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emulation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Emulation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chrome142&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com/api/products?page=1&amp;amp;limit=8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cookies&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Use zendriver to launch a browser, navigate to a page, and retrieve cookies.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;zd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;requests_style_cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;requests_style_cookies&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;use_cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_cookies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;http_request_rnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status Code:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response Body:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Make HTTP request with optional browser cookies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--cookies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Set to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to launch browser and get cookies, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to skip (default: false)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
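&lt;p&gt;&lt;em&gt;One detail worth calling out in the script above: the &lt;/em&gt;&lt;code&gt;--cookies&lt;/code&gt;&lt;em&gt; flag is parsed with a lambda rather than &lt;/em&gt;&lt;code&gt;type=bool&lt;/code&gt;&lt;em&gt;. That is deliberate: in Python any non-empty string is truthy, so &lt;/em&gt;&lt;code&gt;bool("false")&lt;/code&gt;&lt;em&gt; is &lt;/em&gt;&lt;code&gt;True&lt;/code&gt;&lt;em&gt; and the naive version would always launch the browser:&lt;/em&gt;&lt;/p&gt;

```python
# Why the script parses --cookies with a lambda rather than type=bool.
parse_flag = lambda x: x.lower() == "true"

print(parse_flag("true"))    # case-insensitive match against "true"
print(parse_flag("false"))   # correctly falsy
print(bool("false"))         # the type=bool trap: non-empty string is truthy
```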



&lt;h2&gt;
  
  
  Pros and Cons of Hybrid Scraping
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reduces RAM usage massively compared to pure browser scraping.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Higher complexity:&lt;/strong&gt; You must manage two libraries (zendriver and rnet) and the glue code.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTP requests complete in milliseconds. Browsers take seconds.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;State management:&lt;/strong&gt; You need logic to handle cookie expiry. If the cookie dies, you must "wake up" the browser.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You get the verification of a real browser without the drag.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Maintenance:&lt;/strong&gt; You are debugging two points of failure: the browser's ability to solve the challenge, and the client's ability to fetch data.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
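&lt;p&gt;&lt;em&gt;The state-management con above can be tamed with a small retry wrapper. This is a sketch under assumptions: &lt;/em&gt;&lt;code&gt;fetch&lt;/code&gt;&lt;em&gt; and &lt;/em&gt;&lt;code&gt;refresh_cookies&lt;/code&gt;&lt;em&gt; stand in for the script's &lt;/em&gt;&lt;code&gt;http_request_rnet&lt;/code&gt;&lt;em&gt; and &lt;/em&gt;&lt;code&gt;get_cookies&lt;/code&gt;&lt;em&gt;, and a 200 status is treated as success:&lt;/em&gt;&lt;/p&gt;

```python
import asyncio

async def fetch_with_refresh(fetch, refresh_cookies, cookies=None, max_refreshes=1):
    """Try the fast HTTP path first; if blocked, "wake up" the browser
    to refresh cookies (at most max_refreshes times) and retry."""
    for attempt in range(max_refreshes + 1):
        resp = await fetch(cookies)
        if resp["status"] == 200:
            return resp
        if attempt != max_refreshes:
            cookies = await refresh_cookies()  # the expensive browser step
    return resp
```

&lt;p&gt;&lt;em&gt;The design choice here is that the browser only runs when the cheap path fails, which is the whole point of the hybrid approach.&lt;/em&gt;&lt;/p&gt;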

&lt;h3&gt;
  
  
  &lt;strong&gt;Final thoughts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For smaller jobs, it might be easier to just use the browser; the benefits won’t necessarily outweigh the extra complexity required.&lt;/p&gt;

&lt;p&gt;But for production pipelines, &lt;strong&gt;this approach is the standard.&lt;/strong&gt; It treats the browser as a luxury resource: used only when strictly necessary to unlock the door, so the HTTP client can do the real work. It’s this session and state management that allows you to scrape harder-to-access sites effectively and efficiently.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If building this orchestration layer yourself feels like too much overhead, this is exactly what the &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;&lt;strong&gt;Zyte API&lt;/strong&gt;&lt;/a&gt; handles internally. We manage the browser/HTTP switching logic automatically, so you just make a single request and get the data.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>javascript</category>
      <category>webdev</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Raspberry Pi &amp; E-ink scrapes &amp; displays the price of Gold today</title>
      <dc:creator>Ayan Pahwa</dc:creator>
      <pubDate>Tue, 24 Feb 2026 08:39:31 +0000</pubDate>
      <link>https://dev.to/extractdata/how-i-trade-gold-using-e-ink-live-data-and-an-old-raspberry-pi-42ag</link>
      <guid>https://dev.to/extractdata/how-i-trade-gold-using-e-ink-live-data-and-an-old-raspberry-pi-42ag</guid>
      <description>&lt;p&gt;They say that “data is the new oil”, but there’s another hot commodity that’s setting markets alight - precious metals.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtbqu72m01w8auk29wfd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtbqu72m01w8auk29wfd.jpg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the last 12 months, the &lt;a href="https://uk.finance.yahoo.com/quote/GC%3DF/" rel="noopener noreferrer"&gt;value of gold&lt;/a&gt; has surged about 75%, while &lt;a href="https://www.gold.co.uk/silver-price/" rel="noopener noreferrer"&gt;silver has boomed&lt;/a&gt; more than 200%. That’s why I, like a growing number of others, now trade in the metal markets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdanyckktjsbtvn1e2b3p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdanyckktjsbtvn1e2b3p.png" alt=" " width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These days, it is possible to buy &lt;em&gt;digital&lt;/em&gt; versions of precious metals. But I think of myself as a &lt;em&gt;collector&lt;/em&gt; - I like to buy &lt;em&gt;real&lt;/em&gt;, solid coins or bullion whenever I get a chance.&lt;/p&gt;

&lt;p&gt;In the last two years, I have acquired a small collection of gold bullion and silver coins, which have appreciated healthily. But I am not planning to sell and book a profit just yet. In fact, I want to buy more, especially when there’s a dip in the price.&lt;/p&gt;

&lt;p&gt;There’s just one problem with this hobby: retail prices of physical gold and silver bullion differ significantly from the exchanges’ spot prices, and keeping track of them manually is cumbersome, especially with a full-time job.&lt;/p&gt;

&lt;p&gt;To take advantage of the dips and price arbitrage, I need to &lt;em&gt;automate&lt;/em&gt; my decisions. To buy gold old-style, I need a key resource from the modern trading toolset - data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Turn data into gold
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="http://gjc.org.in/" rel="noopener noreferrer"&gt;All-India Gem And Jewellery Domestic Council&lt;/a&gt; (GJC), a national trade federation for the promotion and growth of trade in gems and jewellery across, is the go-to site listing latest retail rates for gold and silver.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgjjdai9hw0oyihva84n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgjjdai9hw0oyihva84n.png" alt=" " width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alas, it doesn’t offer an API to access that data. But fear not - with web scraping skills and &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;, I can extract these prices quickly and regularly.&lt;/p&gt;

&lt;p&gt;And I can do it using some of the tech I love to tinker with.&lt;/p&gt;

&lt;p&gt;I call it &lt;a href="https://github.com/apscrapes/ExtractToInk" rel="noopener noreferrer"&gt;ExtractToInk&lt;/a&gt; - a custom project that pulls the latest prices on a two-inch, 250x122 e-ink display powered by a retired Raspberry Pi (total cost under US$50).&lt;/p&gt;

&lt;p&gt;This is the story of how I power my quest for rapid riches using cheap old hardware and the &lt;a href="https://www.zyte.com/blog/zyte-leads-proxyway-2025-web-scraping-api-report/" rel="noopener noreferrer"&gt;world’s best web scraping engine&lt;/a&gt; - and how you can, too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mining for data
&lt;/h2&gt;

&lt;p&gt;Like many modern sites, GJC’s includes both JavaScript-rendered HTML and protection mechanisms, technologies that can break brittle traditional scraping solutions.&lt;/p&gt;

&lt;p&gt;This project connects all the dots:&lt;/p&gt;

&lt;p&gt;Web → Extract → Parse → Render → Physical display&lt;/p&gt;

&lt;h3&gt;
  
  
  Tech stack
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Hardware&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raspberry Pi (tested on a Pi Zero 2 W, but it should run on any Raspberry Pi board)
&lt;/li&gt;
&lt;li&gt;Pimoroni Inky pHAT (Black, SSD1608)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Software&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3
&lt;/li&gt;
&lt;li&gt;Zyte API: to get rendered HTML
&lt;/li&gt;
&lt;li&gt;BeautifulSoup: to parse HTML
&lt;/li&gt;
&lt;li&gt;Pillow and Inky Python libraries: for e-ink display stuff&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let’s get building.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Prepare hardware
&lt;/h2&gt;

&lt;p&gt;Setup your Raspberry Pi. In my case, I am using Raspberry Pi OS &lt;a href="https://www.raspberrypi.com/documentation/computers/getting-started.html" rel="noopener noreferrer"&gt;booted from the SD card&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furl2uiive3xweo4opha6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furl2uiive3xweo4opha6.png" alt=" " width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Depending on which display you use, it will most likely connect to the Pi over the I2C or SPI bus, so enable the relevant interface by entering:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo raspi-config&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now attach your e-ink display and do a quick reboot.&lt;/p&gt;

&lt;p&gt;You might need to install libraries to use your e-ink display.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkd77cowp2x5kj1hmvupf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkd77cowp2x5kj1hmvupf.png" alt=" " width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Fetching rendered HTML with Zyte API
&lt;/h2&gt;

&lt;p&gt;The source site, GJC, renders prices dynamically, using JavaScript - something which can make plain HTTP requests unreliable.&lt;/p&gt;

&lt;p&gt;No problem. By accessing the page through Zyte API, we can set &lt;code&gt;browserHTML&lt;/code&gt; mode to return the page content as though rendered in an actual browser.&lt;/p&gt;

&lt;p&gt;Instead of fighting JavaScript, we let Zyte handle it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;html = requests.post(`  
    `"https://api.zyte.com/v1/extract",`  
    `auth=(ZYTE_API_KEY, ""),`  
    `json={"url": URL, "browserHtml": True},`  
`).json()["browserHtml"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: there is no Selenium here, and no headless browser to babysit. This is much more reliable for production-style scraping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Parsing with CSS selectors
&lt;/h2&gt;

&lt;p&gt;Once we have clean HTML, parsing becomes straightforward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkwbp7ffd758ci8uu01r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkwbp7ffd758ci8uu01r.png" alt=" " width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Gold prices
&lt;/h3&gt;

&lt;p&gt;Let’s locate the actual prices in the page mark-up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for row in soup.select(".gold_rate table tr"):`  
    `label = row.select_one("td strong")`  
    `values = row.select("td strong")`

    `if not label or len(values) &amp;lt; 2:`  
        `continue`

    `text = label.get_text(strip=True)`  
    `priceText = values[1].get_text()`

    `if "Standard Rate Buying" in text:`  
        `goldBuying = re.search(r"\d[\d,]*", priceText).group(0)`

    `if "Standard Rate Selling" in text:`  
        `goldSelling = re.search(r"\d[\d,]*", priceText).group(0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’re deliberately using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CSS selectors (easy to find from your browser’s DevTools).
&lt;/li&gt;
&lt;li&gt;Minimal regular expressions (only for numeric extraction).
&lt;/li&gt;
&lt;li&gt;Defensive checks to avoid brittle parsing.&lt;/li&gt;
&lt;/ul&gt;
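&lt;p&gt;&lt;em&gt;The regex deserves a quick illustration. Given a price cell such as "Rs. 75,250 / 10gm" (an invented sample, since the live markup varies), the pattern anchors on the first digit and then greedily takes digits and commas:&lt;/em&gt;&lt;/p&gt;

```python
import re

# \d[\d,]* : one digit, then any run of digits or thousands commas
price_text = "Rs. 75,250 / 10gm"   # invented sample cell text
match = re.search(r"\d[\d,]*", price_text)
print(match.group(0))              # first numeric run, commas intact
```

&lt;p&gt;&lt;em&gt;Strip the commas with &lt;/em&gt;&lt;code&gt;.replace(",", "")&lt;/code&gt;&lt;em&gt; before casting to a number if you want to do arithmetic on the result.&lt;/em&gt;&lt;/p&gt;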

&lt;h3&gt;
  
  
  Silver prices
&lt;/h3&gt;

&lt;p&gt;Silver appears outside the main table, so we filter it carefully:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for strong in soup.select("p &amp;gt; strong"):`  
    `text = strong.get_text(" ", strip=True)`

    `if "Standard Rate Selling" in text and not strong.find_parent("table"):`  
        `silver = re.search(r"\d[\d,]*", text).group(0)`  
        `break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Rendering for e-ink
&lt;/h2&gt;

&lt;p&gt;For this project, I did not want to pipe data into a web dashboard on a computer monitor.&lt;/p&gt;

&lt;p&gt;E-ink is always-on, low power, distraction-free and perfect for “ambient information” like this.&lt;/p&gt;

&lt;p&gt;So, it’s a great fit for data like prices, weather, status indicators and system health.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwp8mf7r0lr5iibk3h8b5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwp8mf7r0lr5iibk3h8b5.png" alt=" " width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But e-ink displays are not normal screens.&lt;/p&gt;

&lt;p&gt;They are typically black-and-white, have high contrast and are slow to refresh.&lt;/p&gt;

&lt;p&gt;What’s more, no two e-ink displays are made the same way. Every vendor has different support packages so, whichever you end up using, make sure to read the documentation and change the code accordingly.&lt;/p&gt;

&lt;p&gt;In my case, I am using the &lt;a href="https://learn.pimoroni.com/article/getting-started-with-inky-phat" rel="noopener noreferrer"&gt;Pimoroni Inky pHAT&lt;/a&gt;. The supplied Python library has great built-in examples to get you up and running quickly. I used its helper functions to render text on the display; for example, the built-in draw.text() function comes in handy:&lt;/p&gt;

&lt;h3&gt;
  
  
  Draw silver selling price
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    draw.text((x, y), f"Silver : {silverPrice}", fill=(0, 0, 0), font=fontBig)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
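&lt;p&gt;Stripped of the hardware, the rendering step is plain Pillow drawing. Here is a minimal sketch at the Inky pHAT’s 212x104 resolution; the price value, coordinates and font are placeholders, and the commented-out calls stand in for the actual display push described in Pimoroni’s docs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from PIL import Image, ImageDraw, ImageFont

silverPrice = "1,02,500"  # placeholder for the scraped value

# Compose the frame off-screen with Pillow (Inky pHAT is 212x104)
img = Image.new("RGB", (212, 104), (255, 255, 255))
draw = ImageDraw.Draw(img)
fontBig = ImageFont.load_default()
draw.text((10, 40), f"Silver : {silverPrice}", fill=(0, 0, 0), font=fontBig)

# On the real device you would then push the frame, roughly:
#   inky.set_image(img.convert("P"))
#   inky.show()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;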



&lt;h2&gt;
  
  
  Taking it further
&lt;/h2&gt;

&lt;p&gt;I built this project to use web data thoughtfully: connecting it to the physical world and building a pipeline that feels calm, reliable, and purposeful. When I am at my desk, the display shows me the current prices so I can buy new coins if I see a price drop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdf1n493o7cimspvfjrjw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdf1n493o7cimspvfjrjw.jpg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I could extend this further to place automatic orders on the website and secure a coin at my desired strike price.&lt;/p&gt;

&lt;p&gt;If you want to take this further, you could also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run it via &lt;code&gt;cron&lt;/code&gt; on a schedule that matches your source: the website I am targeting only refreshes prices twice a day, so my cron job runs every 12 hours, but if you need fresher data, you can scrape a site with more real-time updates.
&lt;/li&gt;
&lt;li&gt;Add more commodities or currencies.
&lt;/li&gt;
&lt;li&gt;Turn it into a &lt;code&gt;systemd&lt;/code&gt; service so it runs at boot.
&lt;/li&gt;
&lt;li&gt;Swap e-ink for another output (PDF, LED, dashboard).&lt;/li&gt;
&lt;/ul&gt;
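&lt;p&gt;For the &lt;code&gt;cron&lt;/code&gt; route, a twice-daily entry looks like the sketch below; the interpreter and script paths are hypothetical and depend on where you put the code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# m  h    dom mon dow  command  (runs at 00:00 and 12:00)
0    */12 *   *   *    /usr/bin/python3 /home/pi/ExtractToInk/scraper.py
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;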

&lt;p&gt;If you’re exploring Zyte API, or looking for real-world scraping examples beyond CSVs and JSON files, this project is a great place to start.&lt;/p&gt;

&lt;p&gt;You can get my code in the &lt;a href="https://github.com/apscrapes/ExtractToInk" rel="noopener noreferrer"&gt;ExtractToInk GitHub repository&lt;/a&gt; now.&lt;/p&gt;

</description>
      <category>python</category>
      <category>raspberrypi</category>
      <category>webdev</category>
      <category>sideprojects</category>
    </item>
  </channel>
</rss>
