<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Zyte</title>
    <description>The latest articles on DEV Community by Zyte (@zyte).</description>
    <link>https://dev.to/zyte</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F9835%2Fe20b30c7-ba8b-497a-9fd7-0203f288b459.png</url>
      <title>DEV Community: Zyte</title>
      <link>https://dev.to/zyte</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zyte"/>
    <language>en</language>
    <item>
      <title>Stop Scraping HTML - There's a better way.</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Tue, 16 Dec 2025 19:16:43 +0000</pubDate>
      <link>https://dev.to/zyte/stop-scraping-html-theres-a-better-way-34nl</link>
      <guid>https://dev.to/zyte/stop-scraping-html-theres-a-better-way-34nl</guid>
      <description>&lt;p&gt;&lt;strong&gt;The "API-First" Reverse Engineering Method&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the most common mistakes I see developers make is firing up their code editor too early. They open VS Code, &lt;code&gt;pip install requests beautifulsoup4&lt;/code&gt;, and immediately start trying to parse &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; tags.&lt;/p&gt;

&lt;p&gt;If you are scraping a modern e-commerce site or Single Page Application (SPA), this is the wrong approach. It’s brittle, it’s slow, and it breaks the moment the site updates its CSS.&lt;/p&gt;

&lt;p&gt;The secret to scalable scraping isn't better parsing; it's finding the API that the website uses to populate itself. Here is the exact workflow I use to turn a complex parsing job into a clean, reliable JSON pipeline.&lt;/p&gt;
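To make the contrast concrete: once you find that API, there is nothing to parse, only structure to read. The payload below is invented for illustration:

```python
import json

# A response body like the one the site's own API returns
# (payload invented for illustration)
payload = '{"products": [{"sku": "A1", "name": "Widget", "price": 9.99}]}'

data = json.loads(payload)
for product in data["products"]:
    # No CSS selectors, no brittle div-diving -- just keys
    print(product["sku"], product["price"])
```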




&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/7nHqyTbK5K0"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: The Discovery (XHR Filtering)
&lt;/h2&gt;

&lt;p&gt;Modern websites are rarely static. They typically use a "Frontend/Backend" architecture where the browser loads a skeleton page and then fetches the actual data via a background API call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your goal is to find and use that call directly.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Open Developer Tools:&lt;/strong&gt; Right-click and inspect the page, then navigate to the &lt;strong&gt;Network&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Filter the Noise:&lt;/strong&gt; Click the &lt;strong&gt;Fetch/XHR&lt;/strong&gt; filter. We don't care about CSS, images, or fonts. We only care about data.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Trigger the Request:&lt;/strong&gt; Refresh the page. Watch the waterfall.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa36wcl032q8u5iy02pmk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa36wcl032q8u5iy02pmk.png" alt="Find the request"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If nothing of note appears here, try different pages: trigger pagination, load more content, and click buttons while watching what new requests show up.&lt;/p&gt;

&lt;p&gt;You are looking for requests that return JSON. They are often named intuitively, like &lt;code&gt;graphql&lt;/code&gt;, &lt;code&gt;search&lt;/code&gt;, &lt;code&gt;products&lt;/code&gt;, or &lt;code&gt;api&lt;/code&gt;. When you click "Preview" on these requests, you won't see HTML; you will see a structured object containing every piece of data you need—prices, descriptions, SKU numbers—already parsed and clean.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; Once you find a candidate URL, test it immediately in the browser console or URL bar. Try changing query parameters like &lt;code&gt;page=1&lt;/code&gt; to &lt;code&gt;page=2&lt;/code&gt;. If the JSON response changes to show the next page of products, you have found your "Golden Endpoint."&lt;/p&gt;
&lt;/blockquote&gt;
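You can script the pro tip above instead of editing the URL bar by hand. This sketch only rewrites a query parameter; the candidate URL and parameter names are placeholders:

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

def with_param(url: str, key: str, value: str) -> str:
    """Return a copy of `url` with one query parameter replaced."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query[key] = [value]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

# Hypothetical candidate endpoint spotted in the Network tab
candidate = "https://example.com/api/search?q=shoes&page=1&limit=10"
print(with_param(candidate, "page", "2"))
# -> https://example.com/api/search?q=shoes&page=2&limit=10
```

If the JSON response changes when you request the rewritten URL, you have found your "Golden Endpoint."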




&lt;h2&gt;
  
  
  Phase 2: The "Clean Room" Isolation
&lt;/h2&gt;

&lt;p&gt;Finding the endpoint is only step one. Now you need to determine the minimum viable request required to access it programmatically.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Copy as cURL:&lt;/strong&gt; Right-click the request in Chrome DevTools and select &lt;em&gt;Copy as cURL&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Import to a Client:&lt;/strong&gt; Open an API client like Bruno, Postman, or Insomnia. Import the cURL command.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The Baseline Test:&lt;/strong&gt; Hit "Send." It should work perfectly because you are sending everything—every cookie, every header, and the exact session token your browser just generated.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v89ljhkswdp3t3m0qdx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v89ljhkswdp3t3m0qdx.png" alt="Add the request to Bruno, Postman, or similar"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Load-Bearing" Header Game
&lt;/h3&gt;

&lt;p&gt;Efficient scrapers don't send 2KB of headers. You need to strip this down. Start unchecking headers one by one and resending the request:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Remove the Cookie header:&lt;/strong&gt; Does it break? (Usually, yes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove the Referer:&lt;/strong&gt; Does it break? (Often, yes—sites check this to ensure the request came from their own frontend).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove the User-Agent:&lt;/strong&gt; Does it break?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check the Parameters:&lt;/strong&gt; Can you change &lt;code&gt;limit=10&lt;/code&gt; to &lt;code&gt;limit=100&lt;/code&gt; to get more data in one shot?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Eventually, you will be left with the "skeleton key": the absolute minimum headers required to get a &lt;code&gt;200 OK&lt;/code&gt;. Usually, this consists of a User-Agent, a Referer, and a specific Auth Token or Session Cookie.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfy16tpr6rwpsgbfvklv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfy16tpr6rwpsgbfvklv.png" alt="Headers in Bruno"&gt;&lt;/a&gt;&lt;/p&gt;
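The unchecking loop can be automated. This sketch assumes you supply a `works` callback that actually replays the request and reports whether it still returns a `200 OK`; here it is simulated with a stand-in:

```python
def minimize_headers(headers: dict, works) -> dict:
    """Drop headers one at a time, keeping only the load-bearing ones.
    `works(headers)` should replay the request and return True on 200 OK."""
    minimal = dict(headers)
    for name in list(minimal):
        trial = {k: v for k, v in minimal.items() if k != name}
        if works(trial):      # still 200 OK without this header?
            minimal = trial   # then it was not load-bearing
    return minimal

# Stand-in for a real replay: this fake server only checks two headers
fake_server = lambda h: "Cookie" in h and "Referer" in h

full = {"User-Agent": "Mozilla/5.0", "Accept": "*/*",
        "Referer": "https://example.com/", "Cookie": "session=abc"}
print(minimize_headers(full, fake_server))
# Only Referer and Cookie survive -- the "skeleton key"
```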




&lt;h2&gt;
  
  
  Phase 3: The Infrastructure Trap (The "Bonded" Token)
&lt;/h2&gt;

&lt;p&gt;This is where most developers hit a wall. You take your cleaned-up request, put it into a Python script, and... &lt;code&gt;403 Forbidden&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Why? After all, you have the right URL and the right headers.&lt;/p&gt;

&lt;p&gt;In my analysis of modern scraping targets, I found that API endpoints are increasingly performing a &lt;strong&gt;Cryptographic Binding&lt;/strong&gt; check.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The IP Link:&lt;/strong&gt; The Auth Token/Cookie you copied from your browser was generated for that specific IP address. When you run your script (likely on a server, VPN, or different proxy), the site sees a mismatch between the token's origin IP and your current request IP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Expiry Clock:&lt;/strong&gt; These tokens are ephemeral. They are designed to expire; you will need to investigate how quickly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are just looping through a list of URLs with a static token, you will burn out your access almost immediately.&lt;/p&gt;
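In code, the two failure modes translate into a session object that records both bindings. The field names and the five-minute TTL are assumptions; real lifetimes have to be measured per target:

```python
import time

TOKEN_TTL = 5 * 60  # assumed lifetime in seconds; measure this per site

class ScrapeSession:
    """Tracks the two things a token is bound to: an IP and a clock."""
    def __init__(self, token: str, ip: str):
        self.token = token
        self.ip = ip                    # requests must go out via this IP
        self.created_at = time.time()

    def is_stale(self) -> bool:
        return time.time() - self.created_at > TOKEN_TTL

session = ScrapeSession("eyJhbGci...", "203.0.113.7")  # placeholder values
print(session.is_stale())  # -> False (just created)
```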




&lt;h2&gt;
  
  
  Phase 4: Architecting the Solution
&lt;/h2&gt;

&lt;p&gt;To make this work at scale, you cannot simply write a script. You need to build a &lt;strong&gt;Hybrid Architecture&lt;/strong&gt; that manages state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjfrdjef30hc4d11gpuk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjfrdjef30hc4d11gpuk.png" alt="Architecture Image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You need to engineer a system that takes the above into account and monitors the session lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Storage Unit:&lt;/strong&gt; You need a database (like Redis) to store a "Session Object." This object must contain:

&lt;ul&gt;
&lt;li&gt;The Auth Token (Cookie).&lt;/li&gt;
&lt;li&gt;The IP Address used to generate it.&lt;/li&gt;
&lt;li&gt;The Creation Time.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;The Browser Worker:&lt;/strong&gt; You need a headless browser (&lt;code&gt;Nodriver&lt;/code&gt;/&lt;code&gt;Camoufox&lt;/code&gt;) to visit the site, execute the JavaScript, generate the token, and save it to your Storage Unit.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;The HTTP Worker:&lt;/strong&gt; Your actual scraper. It doesn't browse; it pulls the Token + IP combination from storage and hits the API directly.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;The Rotation Logic:&lt;/strong&gt; You need logic that checks the token age.

&lt;ul&gt;
&lt;li&gt;Is the token older than 5 minutes? Stop.&lt;/li&gt;
&lt;li&gt;Spin up the Browser Worker.&lt;/li&gt;
&lt;li&gt;Generate a new Token.&lt;/li&gt;
&lt;li&gt;Update the Storage Unit.&lt;/li&gt;
&lt;li&gt;Resume scraping.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
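Here is one way the pieces above could fit together, with an in-memory dict standing in for the Redis Storage Unit and a stub standing in for the Browser Worker (all names and the TTL are illustrative):

```python
import time
from typing import Callable, Optional, Tuple

TOKEN_TTL = 5 * 60  # assumed token lifetime; measure per target

class SessionStore:
    """In-memory stand-in for the Redis 'Storage Unit'."""
    def __init__(self):
        self._session: Optional[dict] = None

    def save(self, token: str, ip: str) -> None:
        self._session = {"token": token, "ip": ip, "created_at": time.time()}

    def load(self) -> Optional[dict]:
        return self._session

def get_session(store: SessionStore,
                refresh: Callable[[], Tuple[str, str]]) -> dict:
    """Rotation Logic: reuse the stored session while fresh, otherwise
    call `refresh()` (the Browser Worker) to mint a new token."""
    session = store.load()
    if session is None or time.time() - session["created_at"] > TOKEN_TTL:
        token, ip = refresh()   # headless browser generates a fresh token
        store.save(token, ip)
        session = store.load()
    return session
```

The HTTP Worker then calls `get_session(...)` before every batch of requests and never has to know how tokens are made.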




&lt;h2&gt;
  
  
  The Hidden Overhead
&lt;/h2&gt;

&lt;p&gt;Suddenly, your simple scraping job requires a Proxy Management System (to ensure the Browser and HTTP worker share the same IP), a Browser Management System (to handle the heavy lifting of token generation), and a State Manager.&lt;/p&gt;

&lt;p&gt;This is why "just scraping the API" is harder than it looks. The code to fetch the data is minimal—often just one function. But the infrastructure required to maintain the identity that grants access to that data is massive.&lt;/p&gt;
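That "one function" might look like this minimal sketch; the endpoint, parameters, and header values are all placeholders:

```python
import urllib.request

def build_api_request(session: dict, page: int) -> urllib.request.Request:
    """The HTTP Worker's one function: a direct API hit reusing the
    stored session. Endpoint and header names are hypothetical."""
    return urllib.request.Request(
        f"https://example.com/api/products?page={page}",
        headers={
            "User-Agent": "Mozilla/5.0",
            "Referer": "https://example.com/",
            "Cookie": f"session={session['token']}",
        },
    )

req = build_api_request({"token": "abc123"}, page=1)
# To execute (from the same IP the token was minted on):
#   import json
#   data = json.load(urllib.request.urlopen(req))
print(req.get_full_url())  # -> https://example.com/api/products?page=1
```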

&lt;p&gt;At &lt;strong&gt;&lt;a href="https://www.zyte.com/zyte-api/?utm_source=devto&amp;amp;utm_medium=post_cta&amp;amp;utm_campaign=zyte_api" rel="noopener noreferrer"&gt;Zyte&lt;/a&gt;&lt;/strong&gt;, we abstract this entire architecture. Our API handles the browser fingerprinting, the IP, and the session rotation automatically. You simply send us the URL, and we handle the "Hybrid" complexity in the background, delivering you the clean JSON response without the infrastructure headache.&lt;/p&gt;

&lt;p&gt;Want more? &lt;a href="https://www.zyte.com/join-community/?utm_source=devto&amp;amp;utm_medium=post_cta&amp;amp;utm_campaign=joincommunity" rel="noopener noreferrer"&gt;Join our community&lt;/a&gt;&lt;/p&gt;

</description>
      <category>api</category>
      <category>tutorial</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The Modern Scrapy Developer's Guide (Part 3): Auto-Generating Page Objects with Web Scraping Co-pilot</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Tue, 16 Dec 2025 18:55:04 +0000</pubDate>
      <link>https://dev.to/zyte/the-modern-scrapy-developers-guide-part-3-auto-generating-page-objects-with-web-scraping-59fj</link>
      <guid>https://dev.to/zyte/the-modern-scrapy-developers-guide-part-3-auto-generating-page-objects-with-web-scraping-59fj</guid>
      <description>&lt;p&gt;Welcome to Part 3 of our Modern Scrapy series.&lt;/p&gt;

&lt;p&gt;In Part 2, we refactored our spider to the &lt;code&gt;scrapy-poet&lt;/code&gt; architecture by hand. That refactor was a huge improvement, but it was still a lot of &lt;em&gt;manual&lt;/em&gt; work. We had to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Manually create our &lt;code&gt;BookItem&lt;/code&gt; and &lt;code&gt;BookListPage&lt;/code&gt; schemas.&lt;/li&gt;
&lt;li&gt;Manually create the &lt;code&gt;bookstoscrape_com.py&lt;/code&gt; Page Object file.&lt;/li&gt;
&lt;li&gt;Manually use &lt;code&gt;scrapy shell&lt;/code&gt; to find all the CSS selectors.&lt;/li&gt;
&lt;li&gt;Manually write all the &lt;code&gt;@field&lt;/code&gt; parsers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What if you could do all of that in about 30 seconds?&lt;/p&gt;

&lt;p&gt;In this guide, we'll show you how to use the &lt;strong&gt;Web Scraping Co-pilot&lt;/strong&gt; (our VS Code extension) to &lt;strong&gt;automatically write 100% of your Items, Page Objects, and even your unit tests.&lt;/strong&gt; We'll take our simple spider from Part 1 and upgrade it to the professional &lt;code&gt;scrapy-poet&lt;/code&gt; architecture from Part 2, but this time, the AI will do all the heavy lifting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites &amp;amp; Setup
&lt;/h2&gt;

&lt;p&gt;This tutorial assumes you have:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Completed Part 1 of this series.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual Studio Code&lt;/strong&gt; installed.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Web Scraping Co-pilot&lt;/strong&gt; extension (which we'll install now).&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 1: Installing Web Scraping Co-pilot
&lt;/h2&gt;

&lt;p&gt;Inside VS Code, go to the "Extensions" tab and search for &lt;code&gt;Web Scraping Co-pilot&lt;/code&gt; (published by Zyte).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfdxrrer9xtx4by0evfk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfdxrrer9xtx4by0evfk.png" alt="Web Scraping Copilot" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once installed, you'll see a new icon in your sidebar. Open it, and it will automatically detect your Scrapy project. It may ask to install a few dependencies like &lt;code&gt;pytest&lt;/code&gt;—allow it to do so. This setup process ensures your environment is ready for AI-powered generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Auto-Generating our &lt;code&gt;BookItem&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Let's start with the spider from Part 1. Our goal is to create a Page Object for our &lt;code&gt;BookItem&lt;/code&gt; and add &lt;em&gt;even more fields&lt;/em&gt; than we did in Part 2.&lt;/p&gt;

&lt;p&gt;In the Co-pilot chat window:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select "Web Scraping."&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Write a prompt like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Create a page object for the item BookItem using the sample URL &lt;a href="https://books.toscrape.com/catalogue/the-host_979/index.html" rel="noopener noreferrer"&gt;https://books.toscrape.com/catalogue/the-host_979/index.html&lt;/a&gt;"&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The co-pilot will now:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Check your project:&lt;/strong&gt; It will confirm you have &lt;code&gt;scrapy-poet&lt;/code&gt; and &lt;code&gt;pytest&lt;/code&gt; (and will offer to install them if you don't).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add &lt;code&gt;scrapy-poet&lt;/code&gt; settings:&lt;/strong&gt; It will automatically add the &lt;code&gt;ADDONS&lt;/code&gt; and &lt;code&gt;SCRAPY_POET_DISCOVER&lt;/code&gt; settings to your &lt;code&gt;settings.py&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create your &lt;code&gt;items.py&lt;/code&gt;:&lt;/strong&gt; It will create a new &lt;code&gt;BookItem&lt;/code&gt; class, but this time it will &lt;em&gt;intelligently add all the fields it can find on the page&lt;/em&gt;.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/items.py (Auto-Generated!)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;

&lt;span class="nd"&gt;@attrs.define&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The structured data we extract from a book *detail* page.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;-- New!
&lt;/span&gt;    &lt;span class="n"&gt;number_of_reviews&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="c1"&gt;# &amp;lt;-- New!
&lt;/span&gt;    &lt;span class="n"&gt;upc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;             &lt;span class="c1"&gt;# &amp;lt;-- New!
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
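For reference, the settings additions from step 2 typically look like this. This is a sketch based on scrapy-poet's documented add-on setup; the priority value and package name may differ in your project:

```python
# tutorial/settings.py -- the lines the Co-pilot adds (sketch)

ADDONS = {
    # Enables scrapy-poet's dependency injection for spiders
    "scrapy_poet.Addon": 300,
}

# Package(s) scanned for @handle_urls-decorated Page Objects
SCRAPY_POET_DISCOVER = [
    "tutorial.pages",
]
```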



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create Fixtures:&lt;/strong&gt; It creates a &lt;code&gt;fixtures&lt;/code&gt; folder with the saved HTML and expected JSON output for testing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write the Page Object:&lt;/strong&gt; It creates the &lt;code&gt;tutorial/pages/bookstoscrape_com.py&lt;/code&gt; file and writes the &lt;em&gt;entire&lt;/em&gt; Page Object, complete with all parsing logic and selectors, for &lt;em&gt;all&lt;/em&gt; the new fields.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/pages/bookstoscrape_com.py (Auto-Generated!)
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;web_poet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;returns&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tutorial.items&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BookItem&lt;/span&gt;

&lt;span class="nd"&gt;@handle_urls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[books.toscrape.com/catalogue](https://books.toscrape.com/catalogue)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@returns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookDetailPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This Page Object handles parsing data from book detail pages.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;price&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p.price_color::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;

    &lt;span class="c1"&gt;# All of this was written for us!
&lt;/span&gt;    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p.availability::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;number_of_reviews&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table tr:last-child td::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table tr:first-child td::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In 30 seconds, the Co-pilot has done everything we did manually in Part 2, but &lt;em&gt;better&lt;/em&gt;—it even added more fields.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Running the AI-Generated Tests
&lt;/h2&gt;

&lt;p&gt;The best part? The Co-pilot &lt;em&gt;also&lt;/em&gt; wrote unit tests for you. It created a &lt;code&gt;tests&lt;/code&gt; folder with &lt;code&gt;test_bookstoscrape_com.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can just click "Run Tests" in the Co-pilot UI (or run &lt;code&gt;pytest&lt;/code&gt; in your terminal).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pytest
================ test session starts ================
...
tests/test_bookstoscrape_com.py::test_book_detail[book_0] PASSED
tests/test_bookstoscrape_com.py::test_book_detail[book_1] PASSED
...
================ 8 tests passed in 0.10s ================

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your parsing logic is now fully tested, and you didn't write a single line of test code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Refactoring the Spider (The Easy Way)
&lt;/h2&gt;

&lt;p&gt;Now, we just update our &lt;code&gt;tutorial/spiders/books.py&lt;/code&gt; to use this new architecture, just like in Part 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# tutorial/spiders/books.py

import scrapy
# Import our new, auto-generated Item class
from tutorial.items import BookItem

class BooksSpider(scrapy.Spider):
    name = "books"
    # ... (rest of spider from Part 1) ...

    async def parse_listpage(self, response):
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()
        for url in product_urls:
            # We just tell Scrapy to call parse_book
            yield response.follow(url, callback=self.parse_book)

        next_page_url = response.css("li.next a::attr(href)").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # We ask for the BookItem, and scrapy-poet does the rest!
    async def parse_book(self, response, book: BookItem):
        yield book

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5: Auto-Generating our &lt;code&gt;BookListPage&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;We can repeat the exact same process for our list page to finish the refactor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt the Co-pilot:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Create a page object for the list item BookListPage using the sample URL &lt;a href="https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html" rel="noopener noreferrer"&gt;https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html&lt;/a&gt;"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Co-pilot will create the &lt;code&gt;BookListPage&lt;/code&gt; item in &lt;code&gt;items.py&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;It will create the &lt;code&gt;BookListPageObject&lt;/code&gt; in &lt;code&gt;bookstoscrape_com.py&lt;/code&gt; with the parsers for &lt;code&gt;book_urls&lt;/code&gt; and &lt;code&gt;next_page_url&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;It will write and pass the tests.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now we can update our spider one last time to be &lt;em&gt;fully&lt;/em&gt; architected.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# tutorial/spiders/books.py (FINAL VERSION)

import scrapy
from tutorial.items import BookItem, BookListPage # Import both

class BooksSpider(scrapy.Spider):
    # ... (name, allowed_domains, url) ...

    async def start(self):
        yield scrapy.Request(self.url, callback=self.parse_listpage)

    # We now ask for the BookListPage item!
    async def parse_listpage(self, response, page: BookListPage):

        # All parsing logic is GONE from the spider.
        for url in page.book_urls:
            yield response.follow(url, callback=self.parse_book)

        if page.next_page_url:
            yield response.follow(page.next_page_url, callback=self.parse_listpage)

    async def parse_book(self, response, book: BookItem):
        yield book

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our spider is now just a "crawler." It has zero parsing logic. All the hard work of finding selectors and writing parsers was automated by the Co-pilot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: The "Hybrid Developer"
&lt;/h2&gt;

&lt;p&gt;The Web Scraping Co-pilot doesn't replace you. It &lt;em&gt;accelerates&lt;/em&gt; you. It automates the 90% of work that is "grunt work" (finding selectors, writing boilerplate, creating tests) so you can focus on the 10% of work that matters: &lt;strong&gt;crawling logic, strategy, and handling complex sites.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is how we, as the maintainers of Scrapy, build spiders professionally.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What's Next? Join the Community.&lt;br&gt;
💬 &lt;a href="https://discord.com/invite/extract-data-community-993441606642446397" rel="noopener noreferrer"&gt;TALK&lt;/a&gt;: Stuck on this Scrapy code? Ask the maintainers and 5k+ devs in our Discord.&lt;br&gt;
▶️ &lt;a href="https://www.youtube.com/@zytedata" rel="noopener noreferrer"&gt;WATCH&lt;/a&gt;: This post was based on our video! Watch the full walkthrough on our YouTube channel.&lt;br&gt;
📩 &lt;a href="https://www.zyte.com/join-community/" rel="noopener noreferrer"&gt;READ&lt;/a&gt;: Want more guides like this? Get the Extract newsletter so you don't miss the next one.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>tutorial</category>
      <category>scrapy</category>
      <category>programming</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>The Modern Scrapy Developer's Guide (Part 2): Page Objects with scrapy-poet</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Tue, 16 Dec 2025 18:49:10 +0000</pubDate>
      <link>https://dev.to/zyte/the-modern-scrapy-developers-guide-part-2-page-objects-with-scrapy-poet-5b6l</link>
      <guid>https://dev.to/zyte/the-modern-scrapy-developers-guide-part-2-page-objects-with-scrapy-poet-5b6l</guid>
      <description>&lt;p&gt;Welcome to Part 2 of our Modern Scrapy series. In Part 1, we built a working spider that crawls and scrapes an entire category. But if you look at our code, it's already getting messy. Our &lt;code&gt;parse_listpage&lt;/code&gt; and &lt;code&gt;parse_book&lt;/code&gt; functions are mixing two different jobs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Crawling Logic:&lt;/strong&gt; Finding the next page and following links.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parsing Logic:&lt;/strong&gt; Finding the data (name, price) with CSS selectors.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What happens when a selector changes? Or when you want to test your parsing logic? You have to run the whole spider. This is slow, hard to maintain, and difficult to test.&lt;/p&gt;

&lt;p&gt;In this guide, we'll fix this by refactoring our spider to a professional, modern standard using &lt;strong&gt;Scrapy Items&lt;/strong&gt; and &lt;strong&gt;Page Objects&lt;/strong&gt; (via &lt;code&gt;scrapy-poet&lt;/code&gt;). We will completely separate our crawling logic from our parsing logic. This will make our code cleaner, infinitely easier to test, and scalable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We'll Build
&lt;/h2&gt;

&lt;p&gt;We will refactor our spider from Part 1. The spider itself will &lt;em&gt;only&lt;/em&gt; handle crawling (following links). All the parsing logic will be moved into dedicated "Page Object" classes. &lt;code&gt;scrapy-poet&lt;/code&gt; will automatically inject the correct, parsed item into our spider.&lt;/p&gt;

&lt;p&gt;Look at how clean our spider's &lt;code&gt;parse_book&lt;/code&gt; function becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The NEW parse_book function
# Where did the parsing logic go?! (Hint: scrapy-poet)

    async def parse_book(self, response, book: BookItem):
        # 'book' is a BookItem, magically injected and parsed
        # by scrapy-poet before this function is even called.
        # We just yield it.
        yield book

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;This tutorial builds directly on &lt;a href="https://www.zyte.com/learn/the-modern-scrapy-developers-guide/" rel="noopener noreferrer"&gt;Part 1: Building Your First Crawling Spider&lt;/a&gt;. Please complete that guide first, as we will be modifying the spider we built there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: The "Why" (Separation of Concerns)
&lt;/h2&gt;

&lt;p&gt;Our current spider is a monolith. The &lt;code&gt;BooksSpider&lt;/code&gt; class knows &lt;em&gt;how to crawl&lt;/em&gt; (find next page links, find product links) and &lt;em&gt;how to parse&lt;/em&gt; (extract &lt;code&gt;h1&lt;/code&gt; tags, extract &lt;code&gt;p.price_color&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;This is bad. If we want to reuse our parsing logic, or test it without re-crawling the web, we can't.&lt;/p&gt;

&lt;p&gt;The "Page Object" pattern solves this.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Spider's Job:&lt;/strong&gt; Crawling. Its &lt;em&gt;only&lt;/em&gt; job is to navigate from page to page and yield &lt;code&gt;Requests&lt;/code&gt; or &lt;code&gt;Items&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Page Object's Job:&lt;/strong&gt; Parsing. Its &lt;em&gt;only&lt;/em&gt; job is to take a &lt;code&gt;response&lt;/code&gt; and extract structured data from it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;scrapy-poet&lt;/code&gt; is a library that automatically connects our spider to the correct Page Object.&lt;/p&gt;
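&lt;p&gt;Under the hood, this is plain dependency injection: &lt;code&gt;scrapy-poet&lt;/code&gt; inspects your callback's type hints, builds the Page Object that matches the URL, and passes it in as an argument. Here is a rough, stdlib-only sketch of that idea (the names and registry are illustrative, not the real scrapy-poet internals):&lt;/p&gt;

```python
import inspect

# Hypothetical stand-ins for a downloaded response and a parsed page.
class FakeResponse:
    def __init__(self, url):
        self.url = url

class BookListPage:
    def __init__(self, response):
        self.next_page_url = response.url + "page-2.html"

# Map "annotation -> builder", roughly what scrapy-poet's injector keeps.
REGISTRY = {BookListPage: lambda resp: BookListPage(resp)}

def call_with_injection(callback, response):
    """Inspect the callback's annotations and inject what it asks for."""
    kwargs = {}
    for name, param in inspect.signature(callback).parameters.items():
        if param.annotation in REGISTRY:
            kwargs[name] = REGISTRY[param.annotation](response)
    return callback(response, **kwargs)

def parse_listpage(response, page: BookListPage):
    return page.next_page_url

result = call_with_injection(parse_listpage, FakeResponse("https://example.com/"))
print(result)  # https://example.com/page-2.html
```

&lt;p&gt;The real library also picks the Page Object by URL pattern and builds it from the downloaded response; the sketch only shows the injection half.&lt;/p&gt;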

&lt;h2&gt;
  
  
  Step 2: Create Our "Schema" (Scrapy Items)
&lt;/h2&gt;

&lt;p&gt;First, let's define the data we're scraping. Instead of messy dictionaries, we'll define typed item classes with &lt;code&gt;attrs&lt;/code&gt;, a fantastic library for this (it comes along as a dependency of &lt;code&gt;scrapy-poet&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Open &lt;code&gt;tutorial/items.py&lt;/code&gt; and add two classes: one for our book data and one for our list page data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/items.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;

&lt;span class="nd"&gt;@attrs.define&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The structured data we extract from a book *detail* page.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="nd"&gt;@attrs.define&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookListPage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The data and links we extract from a *list* page.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;book_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;
    &lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is our "schema." It makes our code type-safe and easier to read.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Install and Configure &lt;code&gt;scrapy-poet&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;scrapy-poet&lt;/code&gt; is a separate package we need to install.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install scrapy-poet
uv add scrapy-poet
# or: pip install scrapy-poet

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we must enable it in &lt;code&gt;tutorial/settings.py&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/settings.py
&lt;/span&gt;
&lt;span class="c1"&gt;# Add this to enable the scrapy-poet add-on
&lt;/span&gt;&lt;span class="n"&gt;ADDONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scrapy_poet.Addon&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Add this to tell scrapy-poet where to find our Page Objects
# 'tutorial.pages' means a folder named 'pages' in our 'tutorial' module
&lt;/span&gt;&lt;span class="n"&gt;SCRAPY_POET_DISCOVER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tutorial.pages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Create Page Objects for Parsing
&lt;/h2&gt;

&lt;p&gt;Now for the magic. Let's create the &lt;code&gt;tutorial/pages&lt;/code&gt; module.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt; &lt;span class="n"&gt;tutorial&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;
&lt;span class="n"&gt;touch&lt;/span&gt; &lt;span class="n"&gt;tutorial&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside this new folder, create a file named &lt;code&gt;bookstoscrape_com.py&lt;/code&gt;. This file will hold all the parsing logic for &lt;code&gt;bookstoscrape.com&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is the most complex part, but it's a "set it and forget it" pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/pages/bookstoscrape_com.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;web_poet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;returns&lt;/span&gt;

&lt;span class="c1"&gt;# Import our Item schemas
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tutorial.items&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BookListPage&lt;/span&gt;

&lt;span class="c1"&gt;# This class handles all book DETAIL pages
&lt;/span&gt;&lt;span class="nd"&gt;@handle_urls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[books.toscrape.com/catalogue](https://books.toscrape.com/catalogue)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@returns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookDetailPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This Page Object handles parsing data from book detail pages.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# The @field decorator tells scrapy-poet: "run this function
&lt;/span&gt;    &lt;span class="c1"&gt;# and put the result into the 'name' field of the BookItem."
&lt;/span&gt;    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# This is our parsing logic from Part 1
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;price&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# This is our parsing logic from Part 1
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p.price_color::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;

&lt;span class="c1"&gt;# This class handles all book LIST pages (categories)
&lt;/span&gt;&lt;span class="nd"&gt;@handle_urls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[books.toscrape.com/catalogue/category](https://books.toscrape.com/catalogue/category)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@returns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BookListPage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookListPageObject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This Page Object handles parsing data from category/list pages.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;book_urls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# This is our parsing logic from Part 1
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article.product_pod h3 a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# This is our parsing logic from Part 1
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;li.next a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look at that! All our messy &lt;code&gt;response.css()&lt;/code&gt; calls are now neatly organized in their own classes, completely separate from our spider. The &lt;code&gt;@handle_urls&lt;/code&gt; decorator tells &lt;code&gt;scrapy-poet&lt;/code&gt; which Page Object to use for which URL.&lt;/p&gt;
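&lt;p&gt;To demystify &lt;code&gt;@field&lt;/code&gt; a little: conceptually, the library calls each &lt;code&gt;@field&lt;/code&gt; method and uses the results to fill the matching attributes of the item the class &lt;code&gt;@returns&lt;/code&gt;. A simplified, stdlib-only sketch of that idea (illustrative names, not the real web-poet internals, which also cache fields and support async ones):&lt;/p&gt;

```python
from dataclasses import dataclass, fields

@dataclass
class BookItem:
    name: str
    price: str

class FakeBookDetailPage:
    """Stand-in page object: one method per item field."""
    def __init__(self, data):
        self._data = data

    def name(self):
        return self._data["h1::text"]

    def price(self):
        return self._data["p.price_color::text"]

def to_item(page, item_cls):
    # Call the method matching each item field, like web-poet's to_item()
    return item_cls(**{f.name: getattr(page, f.name)() for f in fields(item_cls)})

page = FakeBookDetailPage({"h1::text": "A Light in the Attic",
                           "p.price_color::text": "£51.77"})
book = to_item(page, BookItem)
print(book.price)  # £51.77
```

&lt;p&gt;This is also why Page Objects are so easy to unit-test: you can hand them a canned response (or canned data, as above) and check every field without crawling anything.&lt;/p&gt;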

&lt;h2&gt;
  
  
  Step 5: Refactor the Spider (The Payoff)
&lt;/h2&gt;

&lt;p&gt;Now, let's go back to &lt;code&gt;tutorial/spiders/books.py&lt;/code&gt; and refactor it. It becomes &lt;em&gt;much&lt;/em&gt; simpler.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/spiders/books.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;
&lt;span class="c1"&gt;# Import our new Item classes
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tutorial.items&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BookListPage&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BooksSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;books&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toscrape.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# We still start the same way
&lt;/span&gt;        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# The 'page: BookListPage' is new.
&lt;/span&gt;    &lt;span class="c1"&gt;# We ask for the BookListPage item, and scrapy-poet injects it.
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BookListPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

        &lt;span class="c1"&gt;# 1. Get the parsed book URLs from the Page Object
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;book_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# We follow each URL, but our callback no longer
&lt;/span&gt;            &lt;span class="c1"&gt;# needs to do any work!
&lt;/span&gt;            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. Get the next page URL from the Page Object
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# The 'book: BookItem' is new.
&lt;/span&gt;    &lt;span class="c1"&gt;# We ask for the BookItem, and scrapy-poet injects it.
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Our parsing logic is GONE.
&lt;/span&gt;        &lt;span class="c1"&gt;# The 'book' variable is already a fully-populated
&lt;/span&gt;        &lt;span class="c1"&gt;# BookItem, parsed by our BookDetailPage Page Object.
&lt;/span&gt;
        &lt;span class="c1"&gt;# We just yield it.
&lt;/span&gt;        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;book&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our spider is now &lt;em&gt;only&lt;/em&gt; responsible for crawling. All parsing is handled by &lt;code&gt;scrapy-poet&lt;/code&gt; and our Page Objects. This code is clean, testable, and incredibly easy to read.&lt;/p&gt;

&lt;p&gt;When you run &lt;code&gt;scrapy crawl books -o books.json&lt;/code&gt;, the output will be identical to Part 1, but your architecture is now 100x better.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Hard Part": Why This Still Breaks
&lt;/h3&gt;

&lt;p&gt;We've built a professional, well-architected Scrapy spider. But we've just made a cleaner version of a spider that will still fail on a real-world site.&lt;/p&gt;

&lt;p&gt;This architecture is beautiful, but it doesn't solve the "real" problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;❌ IP Blocks:&lt;/strong&gt; You're still hitting the site from one IP. You will be blocked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;❌ CAPTCHAs:&lt;/strong&gt; You have no way to avoid captchas, and your spider will fail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;❌ JavaScript:&lt;/strong&gt; If the prices were loaded by JS, our &lt;code&gt;response.css()&lt;/code&gt; selectors would find &lt;em&gt;nothing&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We've just organized our failing code.&lt;/p&gt;

&lt;p&gt;The "Easy Way": Zyte API as a Universal Page Object&lt;br&gt;
scrapy-poet is a great way to organise your scrapy code, making your projects easier to build, collaborate and maintain. However, it doesn't change the fact we are not doing anything to avoid web scraping bans.&lt;/p&gt;

&lt;p&gt;To route our Scrapy project through Zyte API, install &lt;code&gt;scrapy-zyte-api&lt;/code&gt; and add the settings below, using the API key from your Zyte account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# add scrapy-zyte-api python library
uv add scrapy-zyte-api
# settings.py
ZYTE_API_KEY = "YOUR_API_KEY"

ADDONS = {
    "scrapy_zyte_api.Addon": 500,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the power of combining a great architecture (Scrapy) with a powerful service (Zyte API).&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion &amp;amp; Next Steps
&lt;/h2&gt;

&lt;p&gt;Today you elevated your spider from a simple script to a professional-grade crawler. You learned the "Separation of Concerns" principle, defined data with &lt;code&gt;Items&lt;/code&gt;, and separated parsing logic with &lt;code&gt;scrapy-poet&lt;/code&gt;'s Page Objects.&lt;/p&gt;

&lt;p&gt;This is the modern way to build robust, testable, and scalable Scrapy spiders.&lt;/p&gt;

&lt;p&gt;What's Next? Join the Community.&lt;br&gt;
💬 TALK: Stuck on this Scrapy code? Ask the maintainers and 5k+ devs in our Discord.&lt;br&gt;
▶️ WATCH: This post was based on our video! Watch the full walkthrough on our YouTube channel.&lt;br&gt;
📩 READ: Want more? Get the Extract newsletter so you don't miss the next part of this series.&lt;/p&gt;

&lt;p&gt;And if you're ready to skip the "Hard Part" entirely, get your free API key and try the "Easy Way."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hubs.li/Q03YmnDF0" rel="noopener noreferrer"&gt;&lt;strong&gt;Start Your Free Zyte Trial&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>tutorial</category>
      <category>scrapy</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>The Modern Scrapy Developer's Guide (Part 1): Building Your First Spider</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Tue, 16 Dec 2025 18:41:29 +0000</pubDate>
      <link>https://dev.to/zyte/the-modern-scrapy-developers-guide-part-1-building-your-first-spider-4gc2</link>
      <guid>https://dev.to/zyte/the-modern-scrapy-developers-guide-part-1-building-your-first-spider-4gc2</guid>
      <description>&lt;p&gt;Scrapy can feel daunting. It's a massive, powerful framework, and the documentation can be overwhelming for a newcomer. Where do you even begin?&lt;/p&gt;

&lt;p&gt;In this definitive guide, we will walk you through, step-by-step, how to build a real, multi-page crawling spider. You will go from an empty folder to a clean JSON file of structured data in about 15 minutes. We'll use modern, &lt;code&gt;async&lt;/code&gt;/&lt;code&gt;await&lt;/code&gt; Python and cover project setup, finding selectors, following links (crawling), and saving your data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We'll Build
&lt;/h2&gt;

&lt;p&gt;We will build a Scrapy spider that crawls the "Fantasy" category on &lt;a href="https://books.toscrape.com/" rel="noopener noreferrer"&gt;books.toscrape.com&lt;/a&gt;, follows the "Next" button to crawl every page in that category, follows the link for every book, and scrapes the name, price, and URL from all 48 books, saving the result to a clean &lt;code&gt;books.json&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Here's a preview of our final spider code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The final spider we'll build
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BooksSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;books&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toscrape.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html](https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;product_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article.product_pod h3 a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;product_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;next_page_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;li.next a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p.price_color::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Prerequisites &amp;amp; Setup
&lt;/h2&gt;

&lt;p&gt;Before we start, you'll need Python 3.x installed. We'll also be using a virtual environment to keep our dependencies clean. You can use standard &lt;code&gt;pip&lt;/code&gt; or a modern package manager like &lt;code&gt;uv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;First, let's create a project folder and activate a virtual environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a new folder&lt;/span&gt;
&lt;span class="nb"&gt;mkdir &lt;/span&gt;scrapy_project
&lt;span class="nb"&gt;cd &lt;/span&gt;scrapy_project

&lt;span class="c"&gt;# Option 1: Using standard pip + venv&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate  &lt;span class="c"&gt;# On Windows, use: .venv\Scripts\activate&lt;/span&gt;

&lt;span class="c"&gt;# Option 2: Using uv (a fast, modern alternative)&lt;/span&gt;
uv init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's install Scrapy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Option 1: Using pip&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy

&lt;span class="c"&gt;# Option 2: Using uv&lt;/span&gt;
uv add scrapy
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: Initialize Your Project
&lt;/h2&gt;

&lt;p&gt;With Scrapy installed, we can use its built-in command-line tools to generate our project boilerplate.&lt;/p&gt;

&lt;p&gt;First, create the project itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The 'scrapy startproject' command creates the project structure&lt;/span&gt;
&lt;span class="c"&gt;# The '.' tells it to use the current folder&lt;/span&gt;
scrapy startproject tutorial &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see a &lt;code&gt;tutorial&lt;/code&gt; folder and a &lt;code&gt;scrapy.cfg&lt;/code&gt; file appear. This folder contains all your project's logic.&lt;/p&gt;

&lt;p&gt;Next, we'll generate our first spider.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The 'genspider' command creates a new spider file&lt;/span&gt;
&lt;span class="c"&gt;# Usage: scrapy genspider &amp;lt;spider_name&amp;gt; &amp;lt;allowed_domain&amp;gt;&lt;/span&gt;
scrapy genspider books toscrape.com

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you look in &lt;code&gt;tutorial/spiders/&lt;/code&gt;, you'll now see &lt;code&gt;books.py&lt;/code&gt;. This is where we'll write our code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Configure Your Settings
&lt;/h2&gt;

&lt;p&gt;Before we write our spider, let's quickly adjust two settings in &lt;code&gt;tutorial/settings.py&lt;/code&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;ROBOTSTXT_OBEY&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By default, Scrapy respects robots.txt files. This is a good practice, but our test site (toscrape.com) doesn't have one, which can cause a 404 error in our logs. We'll turn it off for this tutorial.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/settings.py
&lt;/span&gt;
&lt;span class="c1"&gt;# Find this line and change it to False
&lt;/span&gt;&lt;span class="n"&gt;ROBOTSTXT_OBEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Concurrency&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Scrapy is polite by default and runs slowly. Since toscrape.com is a test site built for scraping, we can speed it up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/settings.py
&lt;/span&gt;
&lt;span class="c1"&gt;# Uncomment or add these lines
&lt;/span&gt;&lt;span class="n"&gt;CONCURRENT_REQUESTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;
&lt;span class="n"&gt;DOWNLOAD_DELAY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt; These settings are for this test site only. When scraping in the wild, you must be mindful of your target site and use respectful &lt;code&gt;DOWNLOAD_DELAY&lt;/code&gt; and &lt;code&gt;CONCURRENT_REQUESTS&lt;/code&gt; values.&lt;/p&gt;
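&lt;p&gt;For real-world targets, a gentler configuration might look like the sketch below. The exact numbers are illustrative, not tuned for any particular site; the setting names (including the AutoThrottle extension's) are standard Scrapy settings.&lt;/p&gt;

```python
# tutorial/settings.py -- illustrative "polite" values for real-world sites

# Keep robots.txt enabled outside of test sites
ROBOTSTXT_OBEY = True

# Fewer simultaneous requests, with a delay between them
CONCURRENT_REQUESTS = 4
DOWNLOAD_DELAY = 1.0

# AutoThrottle adjusts the delay dynamically based on server response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
```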

&lt;h2&gt;
  
  
  Step 3: Finding Our Selectors (with &lt;code&gt;scrapy shell&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;To scrape a site, we need to tell Scrapy &lt;em&gt;what&lt;/em&gt; data to get. We do this with CSS selectors. The &lt;code&gt;scrapy shell&lt;/code&gt; is the best tool for this.&lt;/p&gt;

&lt;p&gt;Let's launch the shell on our target category page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy shell https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will download the page and give you an interactive shell with a &lt;code&gt;response&lt;/code&gt; object.&lt;/p&gt;

&lt;p&gt;You can even type &lt;code&gt;view(response)&lt;/code&gt; to open the page in your browser exactly as Scrapy sees it!&lt;/p&gt;

&lt;p&gt;Let's find the data we need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find all Book Links:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By inspecting the page, we see each book is in an &lt;code&gt;article.product_pod&lt;/code&gt;. The link is inside an &lt;code&gt;h3&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# In scrapy shell:
&amp;gt;&amp;gt;&amp;gt; response.css("article.product_pod h3 a::attr(href)").getall()
[
  '../../../../the-host_979/index.html',
  '../../../../the-hunted_978/index.html',
  ...
]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;getall()&lt;/code&gt; gives us a clean list of all the URLs.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Find the "Next" Page Link:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At the bottom, we find the "Next" button in an &lt;code&gt;li.next&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# In scrapy shell:
&amp;gt;&amp;gt;&amp;gt; response.css("li.next a::attr(href)").get()
'page-2.html'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;code&gt;get()&lt;/code&gt; gives us the single link we need for pagination.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Find the Book Data (on a product page):&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, let's open a shell on a product page to find the selectors for our data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Exit the shell and open a new one:
scrapy shell https://books.toscrape.com/catalogue/the-host_979/index.html

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# In scrapy shell:
&amp;gt;&amp;gt;&amp;gt; response.css("h1::text").get()
'The Host'

&amp;gt;&amp;gt;&amp;gt; response.css("p.price_color::text").get()
'£25.82'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Perfect. We now have all the selectors we need.&lt;/p&gt;
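&lt;p&gt;Note that the shell returns raw strings like &lt;code&gt;'£25.82'&lt;/code&gt;. Our spider will yield them as-is, but if you later need numeric prices, a small helper like the hypothetical one below (not part of the tutorial spider) can normalize them:&lt;/p&gt;

```python
# Hypothetical helper (not part of the tutorial spider): turn a raw
# scraped price string like '£25.82' into a float for downstream use.
def parse_price(raw: str) -> float:
    # Keep only digits and the decimal point, dropping the currency symbol
    cleaned = "".join(ch for ch in raw if ch.isdigit() or ch == ".")
    return float(cleaned)

print(parse_price("£25.82"))  # 25.82
```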

&lt;h2&gt;
  
  
  Step 4: Building the Spider (Crawling &amp;amp; Parsing)
&lt;/h2&gt;

&lt;p&gt;Now, let's open &lt;code&gt;tutorial/spiders/books.py&lt;/code&gt; and write our spider. This is the clean, final version of the code.&lt;/p&gt;

&lt;p&gt;Delete the boilerplate in &lt;code&gt;books.py&lt;/code&gt; and replace it with this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/spiders/books.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BooksSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;books&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toscrape.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# This is our starting URL (the first page of the Fantasy category)
&lt;/span&gt;    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# This is the modern, async version of 'start_requests'
&lt;/span&gt;    &lt;span class="c1"&gt;# It's called once when the spider starts.
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# We yield our first request, sending the response to 'parse_listpage'
&lt;/span&gt;        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# This function handles the *category page*
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

        &lt;span class="c1"&gt;# 1. Get all product URLs using the selector we found
&lt;/span&gt;        &lt;span class="n"&gt;product_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article.product_pod h3 a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. For each product URL, follow it and send the response to 'parse_book'
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;product_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Find the 'Next' page URL
&lt;/span&gt;        &lt;span class="n"&gt;next_page_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;li.next a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# 4. If a 'Next' page exists, follow it and send the response
&lt;/span&gt;        &lt;span class="c1"&gt;#    back to *this same function*
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# This function handles the *product page*
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

        &lt;span class="c1"&gt;# We yield a dictionary of the data we want
&lt;/span&gt;        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p.price_color::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code is clean and efficient. &lt;code&gt;response.follow&lt;/code&gt; is smart enough to handle the relative URLs (like &lt;code&gt;page-2.html&lt;/code&gt;) for us.&lt;/p&gt;
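&lt;p&gt;Under the hood, this is standard URL resolution. You can see the same behavior with Python's built-in &lt;code&gt;urllib.parse.urljoin&lt;/code&gt;, which is conceptually what resolving a relative link against the current page does:&lt;/p&gt;

```python
from urllib.parse import urljoin

# The category page we crawl from
base = "https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"

# The relative 'Next' link we extracted in the shell resolves against it
print(urljoin(base, "page-2.html"))
# https://books.toscrape.com/catalogue/category/books/fantasy_19/page-2.html
```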

&lt;h2&gt;
  
  
  Step 5: Running The Spider &amp;amp; Saving Data
&lt;/h2&gt;

&lt;p&gt;We're ready to run. Go to your terminal (at the project root) and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy crawl books
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see Scrapy start up, and in the logs, you'll see all 48 items being scraped!&lt;/p&gt;

&lt;p&gt;But we want to &lt;em&gt;save&lt;/em&gt; this data. Scrapy's built-in "Feed Exports" make this easy. We just use the &lt;code&gt;-o&lt;/code&gt; (output) flag. Note that &lt;code&gt;-o&lt;/code&gt; appends to an existing file; use &lt;code&gt;-O&lt;/code&gt; if you want to overwrite it instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy crawl books -o books.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will run the spider again, but this time, you'll see a new &lt;code&gt;books.json&lt;/code&gt; file in your project root, containing all 48 items, perfectly structured.&lt;/p&gt;
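&lt;p&gt;A quick way to sanity-check the export is to load it back with Python's &lt;code&gt;json&lt;/code&gt; module. The snippet below parses a record shaped like the exporter's output (the values are illustrative; in practice you'd &lt;code&gt;open("books.json")&lt;/code&gt; instead of using an inline string):&lt;/p&gt;

```python
import json

# One record shaped like the feed export's output (values illustrative)
sample = '''[
  {"name": "The Host",
   "price": "£25.82",
   "url": "https://books.toscrape.com/catalogue/the-host_979/index.html"}
]'''

books = json.loads(sample)
for book in books:
    print(f"{book['name']}: {book['price']}")
```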

&lt;h2&gt;
  
  
  Conclusion &amp;amp; Next Steps
&lt;/h2&gt;

&lt;p&gt;Today you built a powerful, modern, async Scrapy crawler. You learned how to set up a project, find selectors, follow links, and handle pagination.&lt;/p&gt;

&lt;p&gt;This is just the starting point.&lt;/p&gt;

&lt;blockquote&gt;
&lt;h3&gt;
  
  
  What's Next? Join the Community.
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;💬 TALK:&lt;/strong&gt; Stuck on this Scrapy code? &lt;a href="https://discord.com/invite/extract-data-community-993441606642446397" rel="noopener noreferrer"&gt;Ask the maintainers and 5k+ devs in our Discord.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;▶️ WATCH:&lt;/strong&gt; This post was based on our video! &lt;a href="https://www.youtube.com/@zytedata" rel="noopener noreferrer"&gt;&lt;strong&gt;Watch the full walkthrough on our YouTube channel.&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📩 READ:&lt;/strong&gt; Want more? In Part 2, we'll cover Scrapy Items and Pipelines. &lt;a href="https://www.zyte.com/join-community/" rel="noopener noreferrer"&gt;Get the Extract newsletter so you don't miss it.&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

</description>
      <category>programming</category>
      <category>webscraping</category>
      <category>scrapy</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
