John Rooney for Zyte

Originally published at zyte.com

The Modern Scrapy Developer's Guide (Part 2): Page Objects with scrapy-poet

Welcome to Part 2 of our Modern Scrapy series. In Part 1, we built a working spider that crawls and scrapes an entire category. But if you look at our code, it's already getting messy. Our parse_listpage and parse_book functions are mixing two different jobs:

  1. Crawling Logic: Finding the next page and following links.
  2. Parsing Logic: Finding the data (name, price) with CSS selectors.

What happens when a selector changes? Or when you want to test your parsing logic? You have to run the whole spider. This is slow, hard to maintain, and difficult to test.

In this guide, we'll fix this by refactoring our spider to a professional, modern standard using Scrapy Items and Page Objects (via scrapy-poet). We will completely separate our crawling logic from our parsing logic. This will make our code cleaner, infinitely easier to test, and scalable.

What We'll Build

We will refactor our spider from Part 1. The spider itself will only handle crawling (following links). All the parsing logic will be moved into dedicated "Page Object" classes. scrapy-poet will automatically inject the correct, parsed item into our spider.

Look at how clean our spider's parse_book function becomes:

# The NEW parse_book function
# Where did the parsing logic go?! (Hint: scrapy-poet)

    async def parse_book(self, response, book: BookItem):
        # 'book' is a BookItem, magically injected and parsed
        # by scrapy-poet before this function is even called.
        # We just yield it.
        yield book


Prerequisites

This tutorial builds directly on Part 1: Building Your First Crawling Spider. Please complete that guide first, as we will be modifying the spider we built there.

Step 1: The "Why" (Separation of Concerns)

Our current spider is a monolith. The BooksSpider class knows how to crawl (find next page links, find product links) and how to parse (extract h1 tags, extract p.price_color).

This is bad. If we want to reuse our parsing logic, or test it without re-crawling the web, we can't.

The "Page Object" pattern solves this.

  • The Spider's Job: Crawling. Its only job is to navigate from page to page and yield Requests or Items.
  • The Page Object's Job: Parsing. Its only job is to take a response and extract structured data from it.

scrapy-poet is a library that automatically connects our spider to the correct Page Object.

Step 2: Create Our "Schema" (Scrapy Items)

First, let's define the data we're scraping. Instead of passing around messy dictionaries, we'll define item classes with attrs, a fantastic library for declaring structured data. Scrapy supports attrs-based items natively (through itemadapter), and attrs is installed alongside scrapy-poet.

Open tutorial/items.py and add two classes: one for our book data and one for our list page data.

# tutorial/items.py

import attrs

@attrs.define
class BookItem:
    """
    The structured data we extract from a book *detail* page.
    """
    name: str
    price: str
    url: str

@attrs.define
class BookListPage:
    """
    The data and links we extract from a *list* page.
    """
    book_urls: list
    next_page_url: str | None


This is our "schema." It makes our code type-safe and easier to read.
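
To see what this buys us, here's a quick illustrative snippet (the values are our own examples, not project code) showing how an attrs item behaves:

# Illustrative only -- try it in a Python shell from the project root
import attrs

from tutorial.items import BookItem

book = BookItem(
    name="A Sample Book",
    price="£51.77",
    url="https://books.toscrape.com/catalogue/a-sample-book_1/index.html",
)

print(book.name)  # A Sample Book

# attrs enforces the schema: forgetting a required field raises a TypeError
# instead of silently producing an incomplete dict.
# BookItem(name="Oops")  # TypeError

# And you can still get a plain dict whenever you need one:
print(attrs.asdict(book))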

Step 3: Install and Configure scrapy-poet

scrapy-poet is a separate package we need to install.

# Install scrapy-poet
uv add scrapy-poet
# or: pip install scrapy-poet


Now, we must enable it in tutorial/settings.py.

# tutorial/settings.py

# Add this to enable the scrapy-poet add-on
ADDONS = {
    'scrapy_poet.Addon': 300,
}

# Add this to tell scrapy-poet where to find our Page Objects
# 'tutorial.pages' means a folder named 'pages' in our 'tutorial' module
SCRAPY_POET_DISCOVER = ['tutorial.pages']
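
If you want to sanity-check the configuration, Scrapy's built-in settings command prints a setting's resolved value (the exact output format can vary between Scrapy versions):

# Run from the project directory
scrapy settings --get SCRAPY_POET_DISCOVER
# Expected to print something like: ['tutorial.pages']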


Step 4: Create Page Objects for Parsing

Now for the magic. Let's create the tutorial/pages module.

mkdir tutorial/pages
touch tutorial/pages/__init__.py


Inside this new folder, create a file named bookstoscrape_com.py. This file will hold all the parsing logic for bookstoscrape.com.

This is the most complex part, but it's a "set it and forget it" pattern.

# tutorial/pages/bookstoscrape_com.py

from web_poet import Returns, WebPage, field, handle_urls

# Import our Item schemas
from tutorial.items import BookItem, BookListPage

# This class handles all book DETAIL pages
@handle_urls("[books.toscrape.com/catalogue](https://books.toscrape.com/catalogue)")
@returns(BookItem)
class BookDetailPage(WebPage):
    """
    This Page Object handles parsing data from book detail pages.
    """

    # The @field decorator tells scrapy-poet: "run this function
    # and put the result into the 'name' field of the BookItem."
    @field
    def name(self) -> str:
        # This is our parsing logic from Part 1
        return self.response.css("h1::text").get()

    @field
    def price(self) -> str:
        # This is our parsing logic from Part 1
        return self.response.css("p.price_color::text").get()

    @field
    def url(self) -> str:
        return str(self.response.url)

# This class handles all book LIST pages (categories)
@handle_urls("[books.toscrape.com/catalogue/category](https://books.toscrape.com/catalogue/category)")
@returns(BookListPage)
class BookListPageObject(WebPage):
    """
    This Page Object handles parsing data from category/list pages.
    """

    @field
    def book_urls(self) -> list:
        # This is our parsing logic from Part 1
        return self.response.css("article.product_pod h3 a::attr(href)").getall()

    @field
    def next_page_url(self) -> str | None:
        # This is our parsing logic from Part 1
        return self.response.css("li.next a::attr(href)").get()


Look at that! All our messy response.css() calls are now neatly organized in their own classes, completely separate from our spider. The @handle_urls decorator tells scrapy-poet which Page Object to use for which URL.
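
This separation is exactly what makes the parsing logic testable in isolation. As a rough sketch (the file name, fixture HTML, and values below are our own, not part of the tutorial), you can hand a Page Object a web_poet.HttpResponse built from a saved or hard-coded page and check its output without ever starting a spider:

# test_pages.py -- a minimal, illustrative test sketch
import asyncio

from web_poet import HttpResponse

from tutorial.pages.bookstoscrape_com import BookDetailPage

# A tiny HTML fixture standing in for a real book detail page.
SAMPLE_HTML = """
<html><head><meta charset="utf-8"></head><body>
  <h1>A Sample Book</h1>
  <p class="price_color">£51.77</p>
</body></html>
""".encode("utf-8")

def test_book_detail_page():
    response = HttpResponse(
        url="https://books.toscrape.com/catalogue/a-sample-book_1/index.html",
        body=SAMPLE_HTML,
        encoding="utf-8",
    )
    page = BookDetailPage(response=response)

    # to_item() assembles a BookItem from the @field methods.
    item = asyncio.run(page.to_item())
    assert item.name == "A Sample Book"
    assert item.price == "£51.77"

if __name__ == "__main__":
    test_book_detail_page()
    print("BookDetailPage parses the fixture as expected.")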

Step 5: Refactor the Spider (The Payoff)

Now, let's go back to tutorial/spiders/books.py and refactor it. It becomes much simpler.

# tutorial/spiders/books.py

import scrapy
# Import our new Item classes
from tutorial.items import BookItem, BookListPage

class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["toscrape.com"]
    url: str = "https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"

    async def start(self):
        # We still start the same way
        yield scrapy.Request(self.url, callback=self.parse_listpage)

    # The 'page: BookListPage' is new.
    # We ask for the BookListPage item, and scrapy-poet injects it.
    async def parse_listpage(self, response, page: BookListPage):

        # 1. Get the parsed book URLs from the Page Object
        for url in page.book_urls:
            # We follow each URL, but our callback no longer
            # needs to do any work!
            yield response.follow(url, callback=self.parse_book)

        # 2. Get the next page URL from the Page Object
        if page.next_page_url:
            yield response.follow(page.next_page_url, callback=self.parse_listpage)

    # The 'book: BookItem' is new.
    # We ask for the BookItem, and scrapy-poet injects it.
    async def parse_book(self, response, book: BookItem):
        # Our parsing logic is GONE.
        # The 'book' variable is already a fully-populated
        # BookItem, parsed by our BookDetailPage Page Object.

        # We just yield it.
        yield book


Our spider is now only responsible for crawling. All parsing is handled by scrapy-poet and our Page Objects. This code is clean, testable, and incredibly easy to read.
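
One design note: in this guide the callbacks ask scrapy-poet for the finished item, but you can just as well ask for the Page Object itself, which is handy when you want to inspect or tweak individual fields before yielding. A sketch of that alternative callback (you'd import BookDetailPage from tutorial.pages.bookstoscrape_com):

    # Alternative (sketch): request the Page Object instead of the item.
    async def parse_book(self, response, page: BookDetailPage):
        # Individual fields are available here (page.name, page.price, ...)
        # and to_item() builds the same BookItem our Page Object declares.
        yield await page.to_item()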

When you run scrapy crawl books -o books.json, the output will be identical to Part 1, but your architecture is now 100x better.

The "Hard Part": Why This Still Breaks

We've built a professional, well-architected Scrapy spider. But we've just made a cleaner version of a spider that will still fail on a real-world site.

This architecture is beautiful, but it doesn't solve the "real" problems:

  • ❌ IP Blocks: You're still hitting the site from one IP. You will be blocked.
  • ❌ CAPTCHAs: You have no way to avoid captchas, and your spider will fail.
  • ❌ JavaScript: If the prices were loaded by JS, our response.css() selectors would find nothing.

We've just organized our failing code.

The "Easy Way": Zyte API as a Universal Page Object
scrapy-poet is a great way to organise your scrapy code, making your projects easier to build, collaborate and maintain. However, it doesn't change the fact we are not doing anything to avoid web scraping bans.

So we can add the below settings using our Zyte API account to run our scrapy project through Zyte API.

# Install the scrapy-zyte-api library
uv add scrapy-zyte-api
# or: pip install scrapy-zyte-api

# tutorial/settings.py

ZYTE_API_KEY = "YOUR_API_KEY"

# Extend the ADDONS dict from Step 3 (rather than redefining it),
# so the scrapy-poet add-on stays enabled alongside Zyte API.
ADDONS = {
    "scrapy_poet.Addon": 300,
    "scrapy_zyte_api.Addon": 500,
}

This is the power of combining a great architecture (Scrapy) with a powerful service (Zyte API).

Conclusion & Next Steps

Today you elevated your spider from a simple script to a professional-grade crawler. You learned the "Separation of Concerns" principle, defined data with Items, and separated parsing logic with scrapy-poet's Page Objects.

This is the modern way to build robust, testable, and scalable Scrapy spiders.

What's Next? Join the Community.
💬 TALK: Stuck on this Scrapy code? Ask the maintainers and 5k+ devs in our Discord.
▶️ WATCH: This post was based on our video! Watch the full walkthrough on our YouTube channel.
📩 READ: Want more? Get the Extract newsletter so you don't miss the next part of this series.

And if you're ready to skip the "Hard Part" entirely, get your free API key and try the "Easy Way."

Start Your Free Zyte Trial
