John Rooney for Zyte

Stop Writing Messy Spiders. The Professional Way with Scrapy-Poet

If you were building a web app, you wouldn't cram your database queries and business logic into your API routes. That would be a maintenance nightmare. So why do we accept this in our Scrapy projects? We build massive, unwieldy spiders where crawling logic and parsing logic are tangled together in huge parse methods.

There’s a better way. It’s time to introduce a clean separation of concerns to your spiders.

In this guide, I’ll introduce you to Scrapy-Poet, the official integration of the web-poet library. It allows you to use a powerful architectural pattern called Page Objects, which separates your parsing logic from your spider's crawling duties. The result? Cleaner, more maintainable, and highly testable code.


A Glimpse of the Future: The Page Object Pattern

Let's look at the difference. A traditional spider is often a long file with complex parse_item methods full of CSS and XPath selectors.

The Scrapy-Poet way is different. Your spider becomes a clean, concise crawling manager. Its only jobs are to manage requests, follow links, and hand off the response to the correct Page Object.

Look how clean this spider is:

# products.py (The Spider)

import scrapy

from your_project.pages import ItemListPage, ProductPage  # adjust to your package name


class ProductsSpider(scrapy.Spider):
    name = "products"

    def start_requests(self):
        # self.url is a spider argument, e.g. scrapy crawl products -a url=...
        yield scrapy.Request(self.url, callback=self.parse_list)

    def parse_list(self, response, page: ItemListPage):
        """The list Page Object supplies the product URLs; the spider just follows them."""
        yield from response.follow_all(urls=page.product_urls, callback=self.parse_product)

    def parse_product(self, response, page: ProductPage):
        """The product Page Object extracts the final item."""
        yield page.to_item()

Notice what's missing? There are no selectors in the spider! The parse methods simply declare which Page Object (ItemListPage or ProductPage) they expect, and Scrapy-Poet injects it, fully parsed. The spider's job is now pure navigation.


Meet Your New Best Friend: The Page Object

So where did all the parsing logic go? It moved into Page Objects. A Page Object is a simple Python class dedicated to understanding and extracting data from a single type of web page.

The List Page Object

Here’s the Page Object for our product listing page. Its only job is to find all the product URLs.

# pages/list.py (The List Page Object)

from typing import List

from web_poet import WebPage, field


class ItemListPage(WebPage):
    """A page object for parsing product listing pages."""

    @field
    def product_urls(self) -> List[str]:
        """Extracts all product URLs from the page."""
        return self.css('.product a::attr(href)').getall()

It’s a simple class that inherits from WebPage and has one method, product_urls, decorated with @field. This method contains the selector and logic to get the links. That's it.
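If you want to see the @field in action outside of a crawl, here's a minimal sketch that builds the page object by hand, assuming web-poet's HttpResponse, an illustrative URL, and a tiny HTML snippet:

from web_poet import HttpResponse

from your_project.pages import ItemListPage  # adjust to your package name

# Construct the page object directly, much like scrapy-poet does from a download
response = HttpResponse(
    "https://example.com/products",
    body=b'<div class="product"><a href="/p/1">One</a></div>',
)
page = ItemListPage(response=response)

print(page.product_urls)  # ['/p/1'] -- the @field method runs on attribute access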

The Product Page Object

The detail page is more complex, but the principle is the same. Each piece of data we want gets its own method, cleanly mapping item fields to selectors.

# pages/product.py (The Product Page Object)

from web_poet import WebPage, field

from your_project.items import Product  # your item class; adjust to your package name


class ProductPage(WebPage):
    """A page object for parsing product detail pages."""

    # This method defines the final, structured item to be returned
    def to_item(self) -> Product:
        return Product(
            url=self.url,
            name=self.name,
            price=self.price,
            # ... other fields
        )

    @field
    def name(self) -> str:
        return self.css('h1.product-title::text').get()

    @field
    def price(self) -> float:
        price_str = self.css('.price::text').get()
        return self._clean_price(price_str)  # You can call helper methods

    def _clean_price(self, price_str: str) -> float:
        # ... (data cleaning logic here)
        ...

All the logic for finding, extracting, and cleaning product data is now neatly organized in one place. If a selector breaks, you know exactly which file to open.
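So what does Product itself look like? It isn't shown above, and it can be any item type Scrapy's itemadapter understands (a dataclass, an attrs class, a scrapy.Item, or a plain dict). Here's one possible shape using attrs, with illustrative field names:

# items.py (one possible shape for the Product item; field names are illustrative)

import attrs


@attrs.define
class Product:
    url: str
    name: str
    price: float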


The Incredible Benefits (Why You'll Never Go Back)

Adopting this pattern isn't just about tidiness; it unlocks professional-grade benefits.

1. Radically Improved Maintainability

When a website changes its layout (and it will), you no longer have to hunt through a giant spider file.

  • Price selector changed? Open pages/product.py and fix the price method.
  • Need to add a new field? Add it to your Item and then add a new @field method in the corresponding Page Object. Your spider remains completely untouched. This isolation makes maintenance fast and painless.

2. Finally, Real Testability 🧪

This is the game-changer. You can now write unit tests for your parsing logic without ever running the spider. Using a framework like pytest, you can feed saved HTML files directly to your Page Objects and assert that they extract the correct data.

# tests/test_product_page.py

from web_poet import HttpResponse

from your_project.pages import ProductPage  # adjust to your package name


def test_product_page_parsing():
    # Load a saved HTML file as a fixture
    html_content = open('fixtures/product.html', 'rb').read()

    # Build the page object directly from the saved HTML -- no crawling needed
    response = HttpResponse("https://example.com/product", body=html_content)
    page = ProductPage(response=response)

    # Assert that your selectors work as expected
    assert page.name == "Awesome Product"
    assert page.price == 99.99

This means you can validate your selectors in milliseconds, making your spiders incredibly robust and reliable.

3. Supercharged Team Collaboration 🤝

This pattern establishes a clear, repeatable structure for your projects. When a new developer joins the team, the architecture is self-explanatory:

  • items.py defines the data shape.
  • The pages/ directory contains all parsing logic.
  • Spiders in the spiders/ directory handle only crawling.

This consistency makes it easy for anyone to contribute effectively right away.
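Put together, a project following this pattern might be laid out like this (module names are illustrative):

your_project/
├── settings.py       # ADDONS and SCRAPY_POET_DISCOVER live here
├── items.py          # the data shape (e.g. Product)
├── pages/
│   ├── __init__.py
│   ├── list.py       # ItemListPage
│   └── product.py    # ProductPage
└── spiders/
    ├── __init__.py
    └── products.py   # ProductsSpider (crawling only)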


How to Get Started (It's Easier Than You Think)

Integrating Scrapy-Poet into your project is straightforward.

  1. Install scrapy-poet:

    pip install scrapy-poet

  2. Activate it in settings.py:

    # settings.py

    ADDONS = {
        'scrapy_poet.Addon': 300,
    }

  3. Tell it where to find your Page Objects:

    # settings.py

    # This points to the package where you'll store your page objects.
    SCRAPY_POET_DISCOVER = ["your_project.pages"]

Create a pages directory in your project, make it a Python package by adding an __init__.py file, and start building your Page Objects. That’s all it takes.
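As a minimal sketch (assuming the one-module-per-page-type layout used above), the package can simply re-export your Page Objects:

# pages/__init__.py

from .list import ItemListPage
from .product import ProductPage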

This is the exact pattern we use internally at Zyte to build and maintain our spiders at scale, and it’s highly recommended by the Scrapy maintainers themselves. It makes your code more structured, more testable, and ultimately, more professional.

Say goodbye to monolithic spiders and hello to a cleaner, more powerful way of scraping.

Full Code Project

git clone https://github.com/johnatzyte/scrapy-poet-demo
