Welcome to Part 3 of our Modern Scrapy series.
The refactor we did in Part 2 was a huge improvement, but it still required a lot of manual work. We had to:
- Manually create our BookItem and BookListPage schemas.
- Manually create the bookstoscrape_com.py Page Object file.
- Manually use scrapy shell to find all the CSS selectors.
- Manually write all the @field parsers.
What if you could do all of that in about 30 seconds?
In this guide, we'll show you how to use the Web Scraping Co-pilot (our VS Code extension) to automatically write 100% of your Items, Page Objects, and even your unit tests. We'll take our simple spider from Part 1 and upgrade it to the professional scrapy-poet architecture from Part 2, but this time, the AI will do all the heavy lifting.
Prerequisites & Setup
This tutorial assumes you have:
- Completed Part 1 (see above)
- Visual Studio Code installed.
- The Web Scraping Co-pilot extension (which we'll install now).
Step 1: Installing Web Scraping Co-pilot
Inside VS Code, go to the "Extensions" tab and search for Web Scraping Co-pilot (published by Zyte).
Once installed, you'll see a new icon in your sidebar. Open it, and it will automatically detect your Scrapy project. It may ask to install a few dependencies like pytest—allow it to do so. This setup process ensures your environment is ready for AI-powered generation.
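If you'd rather install those dependencies yourself (or the automatic install fails), you can add them with pip. The exact package list below is our assumption based on what this tutorial uses:
$ pip install scrapy scrapy-poet pytest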
Step 2: Auto-Generating our BookItem
Let's start with the spider from Part 1. Our goal is to create a Page Object for our BookItem and add even more fields than we did in Part 2.
In the Co-pilot chat window:
- Select "Web Scraping."
- Write a prompt like this:
"Create a page object for the item BookItem using the sample URL https://books.toscrape.com/catalogue/the-host_979/index.html"
The Co-pilot will now:
- Check your project: It will confirm you have scrapy-poet and pytest (and will offer to install them if you don't).
- Add scrapy-poet settings: It will automatically add the ADDONS and SCRAPY_POET_DISCOVER settings to your settings.py file (a sketch of those settings follows the generated code below).
- Create your items.py: It will create a new BookItem class, but this time it will intelligently add all the fields it can find on the page.
# tutorial/items.py (Auto-Generated!)
import attrs


@attrs.define
class BookItem:
    """
    The structured data we extract from a book *detail* page.
    """

    name: str
    price: str
    url: str
    availability: str       # <-- New!
    number_of_reviews: int  # <-- New!
    upc: str                # <-- New!
- Create Fixtures: It creates a fixtures folder with the saved HTML and expected JSON output for testing.
- Write the Page Object: It creates the tutorial/pages/bookstoscrape_com.py file and writes the entire Page Object, complete with all parsing logic and selectors, for all the new fields.
# tutorial/pages/bookstoscrape_com.py (Auto-Generated!)
from web_poet import WebPage, handle_urls, field, returns

from tutorial.items import BookItem


@handle_urls("books.toscrape.com/catalogue")
@returns(BookItem)
class BookDetailPage(WebPage):
    """
    This Page Object handles parsing data from book detail pages.
    """

    @field
    def name(self) -> str:
        return self.response.css("h1::text").get()

    @field
    def price(self) -> str:
        return self.response.css("p.price_color::text").get()

    @field
    def url(self) -> str:
        return self.response.url

    # All of this was written for us!
    @field
    def availability(self) -> str:
        return self.response.css("p.availability::text").getall()[1].strip()

    @field
    def number_of_reviews(self) -> int:
        return int(self.response.css("table tr:last-child td::text").get())

    @field
    def upc(self) -> str:
        return self.response.css("table tr:first-child td::text").get()
In 30 seconds, the Co-pilot has done everything we did manually in Part 2, but better—it even added more fields.
Step 3: Running the AI-Generated Tests
The best part? The Co-pilot also wrote unit tests for you. It created a tests folder with test_bookstoscrape_com.py.
You can just click "Run Tests" in the Co-pilot UI (or run pytest in your terminal).
$ pytest
================ test session starts ================
...
tests/test_bookstoscrape_com.py::test_book_detail[book_0] PASSED
tests/test_bookstoscrape_com.py::test_book_detail[book_1] PASSED
...
================ 8 tests passed in 0.10s ================
Your parsing logic is now fully tested, and you didn't write a single line of test code.
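If you're curious what those tests actually do, the idea is simple: load the saved HTML fixture, run the Page Object, and compare the result against the expected JSON. Here's a minimal sketch of that pattern; the fixture file names and the structure of the generated test file are illustrative and will differ from what the Co-pilot writes.

# tests/test_bookstoscrape_com.py (illustrative sketch, not the generated file)
import asyncio
import json
from pathlib import Path

import attrs
from web_poet import HttpResponse

from tutorial.pages.bookstoscrape_com import BookDetailPage

# Hypothetical fixture layout; the Co-pilot's actual layout may differ.
FIXTURE_DIR = Path(__file__).parent.parent / "fixtures" / "book_0"


def test_book_detail():
    # Rebuild the response from the HTML saved in the fixture.
    response = HttpResponse(
        url="https://books.toscrape.com/catalogue/the-host_979/index.html",
        body=(FIXTURE_DIR / "page.html").read_bytes(),
    )
    page = BookDetailPage(response=response)

    # to_item() is async, so drive it to completion here.
    item = asyncio.run(page.to_item())

    # Compare against the expected output captured with the fixture.
    expected = json.loads((FIXTURE_DIR / "expected.json").read_text())
    assert attrs.asdict(item) == expected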
Step 4: Refactoring the Spider (The Easy Way)
Now, we just update our tutorial/spiders/books.py to use this new architecture, just like in Part 2.
# tutorial/spiders/books.py
import scrapy

# Import our new, auto-generated Item class
from tutorial.items import BookItem


class BooksSpider(scrapy.Spider):
    name = "books"

    # ... (rest of spider from Part 1) ...

    async def parse_listpage(self, response):
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()
        for url in product_urls:
            # We just tell Scrapy to call parse_book
            yield response.follow(url, callback=self.parse_book)

        next_page_url = response.css("li.next a::attr(href)").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # We ask for the BookItem, and scrapy-poet does the rest!
    async def parse_book(self, response, book: BookItem):
        yield book
Step 5: Auto-Generating our BookListPage
We can repeat the exact same process for our list page to finish the refactor.
Prompt the Co-pilot:
"Create a page object for the list item BookListPage using the sample URL https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"
Result:
- The Co-pilot will create the BookListPage item in items.py.
- It will create the BookListPage Page Object in bookstoscrape_com.py with the parsers for book_urls and next_page_url (a sketch of what this looks like follows below).
- It will write and pass the tests.
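The generated code follows the same pattern as the detail page. Here is a rough sketch of what to expect, reusing the selectors from our Part 1 spider; the class names and exact output from the Co-pilot may differ.

# A sketch of the generated additions; the Co-pilot's actual output may differ.

# --- tutorial/items.py ---
import attrs


@attrs.define
class BookListPage:
    """
    The URLs we extract from a book *list* page.
    """

    book_urls: list
    next_page_url: str


# --- tutorial/pages/bookstoscrape_com.py ---
from web_poet import WebPage, handle_urls, field, returns


@handle_urls("books.toscrape.com")
@returns(BookListPage)
class BookListPageObject(WebPage):
    @field
    def book_urls(self) -> list:
        # Same selector our Part 1 spider used for product links.
        return self.response.css("article.product_pod h3 a::attr(href)").getall()

    @field
    def next_page_url(self) -> str:
        return self.response.css("li.next a::attr(href)").get()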
Now we can update our spider one last time to be fully architected.
# tutorial/spiders/books.py (FINAL VERSION)
import scrapy

from tutorial.items import BookItem, BookListPage  # Import both


class BooksSpider(scrapy.Spider):
    # ... (name, allowed_domains, url) ...

    async def start(self):
        yield scrapy.Request(self.url, callback=self.parse_listpage)

    # We now ask for the BookListPage item!
    async def parse_listpage(self, response, page: BookListPage):
        # All parsing logic is GONE from the spider.
        for url in page.book_urls:
            yield response.follow(url, callback=self.parse_book)

        if page.next_page_url:
            yield response.follow(page.next_page_url, callback=self.parse_listpage)

    async def parse_book(self, response, book: BookItem):
        yield book
Our spider is now just a "crawler." It has zero parsing logic. All the hard work of finding selectors and writing parsers was automated by the Co-pilot.
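To confirm the refactor works end to end, run the spider as usual (the output file name here is just an example):
$ scrapy crawl books -O books.json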
Conclusion: The "Hybrid Developer"
The Web Scraping Co-pilot doesn't replace you. It accelerates you. It automates the 90% of work that is "grunt work" (finding selectors, writing boilerplate, creating tests) so you can focus on the 10% of work that matters: crawling logic, strategy, and handling complex sites.
This is how we, as the maintainers of Scrapy, build spiders professionally.
What's Next? Join the Community.
💬 TALK: Stuck on this Scrapy code? Ask the maintainers and 5k+ devs in our Discord.
▶️ WATCH: This post was based on our video! Watch the full walkthrough on our YouTube channel.
📩 READ: Want more? Get the Extract newsletter so you don't miss the next part of this series.
