If you were building a web app, you wouldn't cram your database queries and business logic into your API routes. That would be a maintenance nightmare. So why do we accept this in our Scrapy projects? We build massive, unwieldy spiders where crawling logic and parsing logic are tangled together in huge parse methods.
There’s a better way. It’s time to introduce a clean separation of concerns to your spiders.
In this guide, I'll introduce you to Scrapy-Poet, the official Scrapy integration of the web-poet library. It allows you to use a powerful architectural pattern called Page Objects, which separates your parsing logic from your spider's crawling duties. The result? Cleaner, more maintainable, and highly testable code.
A Glimpse of the Future: The Page Object Pattern
Let's look at the difference. A traditional spider is often a long file with complex parse_item methods full of CSS and XPath selectors.
The Scrapy-Poet way is different. Your spider becomes a clean, concise crawling manager. Its only jobs are to manage requests, follow links, and hand off the response to the correct Page Object.
Look how clean this spider is:
# products.py (The Spider)
import scrapy

from ..pages import ItemListPage, ProductPage  # adjust the import path to your project layout


class ProductsSpider(scrapy.Spider):
    name = "products"

    def start_requests(self):
        yield scrapy.Request(self.url, callback=self.parse_list)

    def parse_list(self, response, page: ItemListPage):
        """The list Page Object supplies the product URLs; the spider just follows them."""
        yield from response.follow_all(urls=page.product_urls, callback=self.parse_product)

    def parse_product(self, response, page: ProductPage):
        """The product Page Object extracts the final item."""
        yield page.to_item()
Notice what's missing? There are no selectors in the spider! The parse methods simply declare which Page Object (ItemListPage or ProductPage) they expect, and Scrapy-Poet injects it, fully parsed. The spider's job is now pure navigation.
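One note on the spider sketch above: because start_requests reads self.url, it assumes the start URL is supplied as a spider argument, e.g. scrapy crawl products -a url="https://example.com/products" (that URL is just a placeholder).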
Meet Your New Best Friend: The Page Object
So where did all the parsing logic go? It moved into Page Objects. A Page Object is a simple Python class dedicated to understanding and extracting data from a single type of web page.
The List Page Object
Here’s the Page Object for our product listing page. Its only job is to find all the product URLs.
# pages/list.py (The List Page Object)
from typing import List

from web_poet import WebPage, field


class ItemListPage(WebPage):
    """A page object for parsing product listing pages."""

    @field
    def product_urls(self) -> List[str]:
        """Extracts all product URLs from the page."""
        return self.css('.product a::attr(href)').getall()
It’s a simple class that inherits from WebPage and has one method, product_urls, decorated with @field. This method contains the selector and logic to get the links. That's it.
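One detail worth calling out: @field exposes the decorated method as an attribute, which is why the spider reads page.product_urls without calling it. Here's a tiny self-contained demo of that behavior; the HTML, URL, and import path are assumptions for illustration only:

# demo_field.py (illustrative only; not part of the project)
from web_poet import HttpResponse

from your_project.pages.list import ItemListPage  # adjust to your package name

html = b'<div class="product"><a href="/p/1">Product One</a></div>'
page = ItemListPage(response=HttpResponse(url="https://example.com/", body=html))
print(page.product_urls)  # attribute access, no parentheses -> ['/p/1']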
The Product Page Object
The detail page is more complex, but the principle is the same. Each piece of data we want gets its own method, cleanly mapping item fields to selectors.
# pages/product.py (The Product Page Object)
from web_poet import WebPage, field

from ..items import Product  # the item that defines the data shape (see items.py)


class ProductPage(WebPage):
    """A page object for parsing product detail pages."""

    # This method defines the final, structured item to be returned
    def to_item(self) -> Product:
        return Product(
            url=self.url,
            name=self.name,
            price=self.price,
            # ... other fields
        )

    @field
    def name(self) -> str:
        return self.css('h1.product-title::text').get()

    @field
    def price(self) -> float:
        price_str = self.css('.price::text').get()
        return self._clean_price(price_str)  # You can call helper methods

    def _clean_price(self, price_str: str) -> float:
        # ... (data cleaning logic here)
        ...
All the logic for finding, extracting, and cleaning product data is now neatly organized in one place. If a selector breaks, you know exactly which file to open.
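For completeness: the Product item that to_item() returns lives in items.py. The post doesn't show its definition, but a minimal sketch using the attrs library (a plain dataclass works just as well) only needs fields matching the @field methods above; the real item likely has more fields:

# items.py (a minimal sketch; extend with whatever fields you actually scrape)
import attrs


@attrs.define
class Product:
    url: str
    name: str
    price: float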
The Incredible Benefits (Why You'll Never Go Back)
Adopting this pattern isn't just about tidiness; it unlocks professional-grade benefits.
1. Radically Improved Maintainability
When a website changes its layout (and it will), you no longer have to hunt through a giant spider file.
- Price selector changed? Open pages/product.py and fix the price method.
- Need to add a new field? Add it to your Item and then add a new @field method in the corresponding Page Object (see the sketch after this list).

Your spider remains completely untouched. This isolation makes maintenance fast and painless.
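As a concrete example of that second bullet, here's what adding a hypothetical sku field to ProductPage could look like; the field name and selector are made up, and the existing name/price fields are omitted for brevity:

# pages/product.py: a trimmed-down sketch showing only the new method
from web_poet import WebPage, field


class ProductPage(WebPage):  # existing name/price fields omitted here
    @field
    def sku(self) -> str:
        return self.css('.sku::text').get()

# Then add `sku` to the Product item and pass sku=self.sku in to_item();
# the spider itself stays untouched.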
2. Finally, Real Testability 🧪
This is the game-changer. You can now write unit tests for your parsing logic without ever running the spider. Using a framework like pytest, you can feed saved HTML files directly to your Page Objects and assert that they extract the correct data.
# tests/test_product_page.py
from web_poet import HttpResponse
from your_project.pages.product import ProductPage  # adjust to your package name


def test_product_page_parsing():
    # Load a saved HTML file as a fixture
    html_content = open('fixtures/product.html', 'rb').read()
    # WebPage subclasses are built from an HttpResponse
    response = HttpResponse(url="https://example.com/product", body=html_content)
    page = ProductPage(response=response)
    # Assert that your selectors work as expected
    assert page.name == "Awesome Product"
    assert page.price == 99.99
This means you can validate your selectors in milliseconds, making your spiders incredibly robust and reliable.
3. Supercharged Team Collaboration 🤝
This pattern establishes a clear, repeatable structure for your projects. When a new developer joins the team, the architecture is self-explanatory:
- items.py defines the data shape.
- The pages/ directory contains all parsing logic.
- Spiders in the spiders/ directory handle only crawling.

This consistency makes it easy for anyone to contribute effectively right away; a typical project layout is sketched below.
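Here's roughly what the resulting project tree looks like (the exact file names are assumptions based on the snippets above):

your_project/
├── items.py          # item definitions (e.g. Product)
├── settings.py       # scrapy-poet addon + SCRAPY_POET_DISCOVER
├── pages/
│   ├── __init__.py
│   ├── list.py       # ItemListPage
│   └── product.py    # ProductPage
└── spiders/
    └── products.py   # ProductsSpider (crawling only)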
How to Get Started (It's Easier Than You Think)
Integrating Scrapy-Poet into your project is straightforward.
- Install scrapy-poet:

  pip install scrapy-poet

- Activate it in settings.py:

  # settings.py
  ADDONS = {
      'scrapy_poet.Addon': 300,
  }

- Tell it where to find your Page Objects:

  # settings.py
  # This points to the package where you'll store your Page Objects.
  SCRAPY_POET_DISCOVER = ["your_project.pages"]
Create a pages directory in your project, make it a Python module by adding an __init__.py file, and start building your Page Objects. That’s all it takes.
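If you want spiders to import every Page Object from one place (as the spider sketch above does with from ..pages import ...), that __init__.py can simply re-export them; a minimal sketch, assuming the module names used earlier:

# pages/__init__.py (optional re-exports; module names assumed from the snippets above)
from .list import ItemListPage
from .product import ProductPage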
This is the exact pattern we use internally at Zyte to build and maintain our spiders at scale, and it’s highly recommended by the Scrapy maintainers themselves. It makes your code more structured, more testable, and ultimately, more professional.
Say goodbye to monolithic spiders and hello to a cleaner, more powerful way of scraping.
Full Code Project
git clone https://github.com/johnatzyte/scrapy-poet-demo