John Rooney for Zyte

Originally published at zyte.com

The Modern Scrapy Developer's Guide (Part 3): Auto-Generating Page Objects with Web Scraping Co-pilot

Welcome to Part 3 of our Modern Scrapy series.

In Part 2, we refactored our simple spider into the scrapy-poet architecture with Items and Page Objects. That refactor was a huge improvement, but it was still a lot of manual work. We had to:

  1. Manually create our BookItem and BookListPage schemas.
  2. Manually create the bookstoscrape_com.py Page Object file.
  3. Manually use scrapy shell to find all the CSS selectors.
  4. Manually write all the @field parsers.

What if you could do all of that in about 30 seconds?

In this guide, we'll show you how to use the Web Scraping Co-pilot (our VS Code extension) to automatically write 100% of your Items, Page Objects, and even your unit tests. We'll take our simple spider from Part 1 and upgrade it to the professional scrapy-poet architecture from Part 2, but this time, the AI will do all the heavy lifting.

Prerequisites & Setup

This tutorial assumes you have:

  1. Completed Part 1 of this series.
  2. Visual Studio Code installed.
  3. The Web Scraping Co-pilot extension (which we'll install now).

Step 1: Installing Web Scraping Co-pilot

Inside VS Code, go to the "Extensions" tab and search for Web Scraping Co-pilot (published by Zyte).

[Screenshot: the Web Scraping Co-pilot extension in the VS Code marketplace]

Once installed, you'll see a new icon in your sidebar. Open it, and it will automatically detect your Scrapy project. It may ask to install a few dependencies like pytest—allow it to do so. This setup process ensures your environment is ready for AI-powered generation.

Step 2: Auto-Generating our BookItem

Let's start with the spider from Part 1. Our goal is to create a Page Object for our BookItem and add even more fields than we did in Part 2.

In the Co-pilot chat window:

  1. Select "Web Scraping."
  2. Write a prompt like this:

    "Create a page object for the item BookItem using the sample URL https://books.toscrape.com/catalogue/the-host_979/index.html"

The Co-pilot will now:

  1. Check your project: It will confirm you have scrapy-poet and pytest (and will offer to install them if you don't).
  2. Add scrapy-poet settings: It will automatically add the ADDONS and SCRAPY_POET_DISCOVER settings to your settings.py file (see the settings sketch just after this list).
  3. Create your items.py: It will create a new BookItem class, but this time it will intelligently add all the fields it can find on the page.
# tutorial/items.py (Auto-Generated!)
import attrs

@attrs.define
class BookItem:
    """
    The structured data we extract from a book *detail* page.
    """
    name: str
    price: str
    url: str
    availability: str       # <-- New!
    number_of_reviews: int  # <-- New!
    upc: str                # <-- New!

  4. Create Fixtures: It creates a fixtures folder with the saved HTML and expected JSON output for testing.
  5. Write the Page Object: It creates the tutorial/pages/bookstoscrape_com.py file and writes the entire Page Object, complete with all parsing logic and selectors, for all the new fields.
# tutorial/pages/bookstoscrape_com.py (Auto-Generated!)

from web_poet import WebPage, handle_urls, field, returns
from tutorial.items import BookItem

@handle_urls("[books.toscrape.com/catalogue](https://books.toscrape.com/catalogue)")
@returns(BookItem)
class BookDetailPage(WebPage):
    """
    This Page Object handles parsing data from book detail pages.
    """

    @field
    def name(self) -> str:
        return self.response.css("h1::text").get()

    @field
    def price(self) -> str:
        return self.response.css("p.price_color::text").get()

    @field
    def url(self) -> str:
        return self.response.url

    # All of this was written for us!
    @field
    def availability(self) -> str:
        return self.response.css("p.availability::text").getall()[1].strip()

    @field
    def number_of_reviews(self) -> int:
        return int(self.response.css("table tr:last-child td::text").get())

    @field
    def upc(self) -> str:
        return self.response.css("table tr:first-child td::text").get()

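For reference, the scrapy-poet settings the Co-pilot adds in step 2 look roughly like this (a sketch; the addon priority and the module path are assumptions based on our project layout):

# tutorial/settings.py (sketch of the Co-pilot's additions)

# Enable scrapy-poet through its Scrapy add-on.
ADDONS = {
    "scrapy_poet.Addon": 300,
}

# Tell scrapy-poet which modules to scan for @handle_urls Page Objects.
SCRAPY_POET_DISCOVER = ["tutorial.pages"]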

In 30 seconds, the Co-pilot has done everything we did manually in Part 2, but better—it even added more fields.

Step 3: Running the AI-Generated Tests

The best part? The Co-pilot also wrote unit tests for you. It created a tests folder with test_bookstoscrape_com.py.

You can just click "Run Tests" in the Co-pilot UI (or run pytest in your terminal).

$ pytest
================ test session starts ================
...
tests/test_bookstoscrape_com.py::test_book_detail[book_0] PASSED
tests/test_bookstoscrape_com.py::test_book_detail[book_1] PASSED
...
================ 8 passed in 0.10s ================


Your parsing logic is now fully tested, and you didn't write a single line of test code.
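
If you're curious what such a test checks under the hood, here is a hand-rolled equivalent (a sketch, not the Co-pilot's generated code; the fixture path and the expected title are assumptions based on our sample URL):

# tests/test_bookdetail_manual.py (hypothetical hand-rolled equivalent)

import asyncio
from pathlib import Path

from web_poet import HttpResponse

from tutorial.pages.bookstoscrape_com import BookDetailPage


def test_book_detail_fields():
    # Load the saved page source (fixture path is an assumption).
    html = Path("fixtures/the-host.html").read_bytes()
    response = HttpResponse(
        url="https://books.toscrape.com/catalogue/the-host_979/index.html",
        body=html,
        encoding="utf-8",
    )
    page = BookDetailPage(response=response)
    # to_item() is async, so we drive it with asyncio.run().
    item = asyncio.run(page.to_item())
    assert item.name == "The Host"  # expected value is an assumption
    assert item.upc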

Step 4: Refactoring the Spider (The Easy Way)

Now, we just update our tutorial/spiders/books.py to use this new architecture, just like in Part 2.

# tutorial/spiders/books.py

import scrapy
# Import our new, auto-generated Item class
from tutorial.items import BookItem

class BooksSpider(scrapy.Spider):
    name = "books"
    # ... (rest of spider from Part 1) ...

    async def parse_listpage(self, response):
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()
        for url in product_urls:
            # We just tell Scrapy to call parse_book
            yield response.follow(url, callback=self.parse_book)

        next_page_url = response.css("li.next a::attr(href)").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # We ask for the BookItem, and scrapy-poet does the rest!
    async def parse_book(self, response, book: BookItem):
        yield book

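One optional refinement while we're here: parse_book never reads response, and scrapy-poet ships a DummyResponse annotation for exactly that case. A sketch, assuming the rest of the spider is unchanged:

# tutorial/spiders/books.py (optional tweak, sketched)

import scrapy
from scrapy_poet import DummyResponse

from tutorial.items import BookItem


class BooksSpider(scrapy.Spider):
    name = "books"
    # ... (rest of spider as above) ...

    # DummyResponse signals that this callback never uses the response,
    # which lets scrapy-poet skip the plain download when the required
    # inputs can be obtained without it.
    async def parse_book(self, response: DummyResponse, book: BookItem):
        yield book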

Step 5: Auto-Generating our BookListPage

We can repeat the exact same process for our list page to finish the refactor.

Prompt the Co-pilot:

"Create a page object for the list item BookListPage using the sample URL https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"

Result:

  1. The Co-pilot will create the BookListPage item in items.py.
  2. It will create the BookListPageObject in bookstoscrape_com.py with the parsers for book_urls and next_page_url (both sketched after this list).
  3. It will write and pass the tests.
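
For reference, the generated code might look roughly like this (a sketch based on the selectors we used by hand in Step 4; the Co-pilot's actual output can differ):

# tutorial/items.py (sketch of the auto-generated BookListPage item)
import attrs

@attrs.define
class BookListPage:
    book_urls: list[str]
    next_page_url: str | None


# tutorial/pages/bookstoscrape_com.py (sketch of the matching Page Object)
from web_poet import WebPage, handle_urls, field, returns
from tutorial.items import BookListPage

@handle_urls("books.toscrape.com/catalogue")
@returns(BookListPage)
class BookListPageObject(WebPage):

    @field
    def book_urls(self) -> list[str]:
        return self.response.css("article.product_pod h3 a::attr(href)").getall()

    @field
    def next_page_url(self) -> str | None:
        return self.response.css("li.next a::attr(href)").get()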

Now we can update our spider one last time to be fully architected.

# tutorial/spiders/books.py (FINAL VERSION)

import scrapy
from tutorial.items import BookItem, BookListPage # Import both

class BooksSpider(scrapy.Spider):
    # ... (name, allowed_domains, url) ...

    async def start(self):
        yield scrapy.Request(self.url, callback=self.parse_listpage)

    # We now ask for the BookListPage item!
    async def parse_listpage(self, response, page: BookListPage):

        # All parsing logic is GONE from the spider.
        for url in page.book_urls:
            yield response.follow(url, callback=self.parse_book)

        if page.next_page_url:
            yield response.follow(page.next_page_url, callback=self.parse_listpage)

    async def parse_book(self, response, book: BookItem):
        yield book


Our spider is now just a "crawler." It has zero parsing logic. All the hard work of finding selectors and writing parsers was automated by the Co-pilot.
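
To see the finished architecture in action, run the spider as usual; scrapy-poet injects the Page Objects transparently (the output filename is just an example):

$ scrapy crawl books -O books.json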

Conclusion: The "Hybrid Developer"

The Web Scraping Co-pilot doesn't replace you. It accelerates you. It automates the 90% of work that is "grunt work" (finding selectors, writing boilerplate, creating tests) so you can focus on the 10% of work that matters: crawling logic, strategy, and handling complex sites.

This is how we, as the maintainers of Scrapy, build spiders professionally.

What's Next? Join the Community.

💬 TALK: Stuck on this Scrapy code? Ask the maintainers and 5k+ devs in our Discord.
▶️ WATCH: This post was based on our video! Watch the full walkthrough on our YouTube channel.
📩 READ: Want more? Get the Extract newsletter so you don't miss the next part of this series.
