
Jerry A. Henley

Stop Silent Failures: Using Python Dataclasses to Catch Stale CSS Selectors

The nightmare of every web scraping developer is the silent failure. Your scraper runs perfectly, the logs show a 200 OK status for every request, and your database is growing. But when you check the data, you realize that for the last three days, every product price has been recorded as None or $0.00.

What happened? The target website changed a single <div> class, and your CSS selector stopped working. Because your code was designed to handle missing elements gracefully, it didn't crash; it just started collecting empty data.
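The pattern that enables this is easy to reproduce. Below is a minimal sketch of "graceful" extraction, with the DOM simulated by a dict and the selector names purely illustrative (they are not taken from the actual scraper):

```python
def extract_price(dom, selector="span.product-price"):
    """Return the price text if the selector matches, else None."""
    try:
        return dom[selector]   # stand-in for driver.find_element(...)
    except KeyError:           # stand-in for NoSuchElementException
        return None            # "graceful" handling: no crash, no signal

old_dom = {"span.product-price": "$129.00"}
new_dom = {"span.price-value": "$129.00"}   # the site renamed the class

assert extract_price(old_dom) == "$129.00"  # works today
assert extract_price(new_dom) is None       # silently empty tomorrow
```

Nothing in the logs distinguishes the second call from a product that genuinely has no price.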

We can solve this by transforming data containers from passive storage bins into active validation layers. This guide explores how to use Python dataclasses to detect DOM changes in real-time, using the CrateandbarrelScrapers repository as a practical case study.

Prerequisites

To follow along, you should have:

  • Intermediate knowledge of Python.
  • Experience with web scraping libraries like Selenium or Playwright.
  • A solid understanding of CSS selectors.

The Anatomy of a Passive Dataclass

Let's look at a common implementation found in the repository's Selenium product scraper: python/selenium/product_data/scraper/crateandbarrel_scraper_product_data_v1.py.

The scraper defines a ScrapedData class to hold the extracted information:

from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class ScrapedData:
    aggregateRating: Dict[str, Any] = field(default_factory=dict)
    availability: str = "out_of_stock"
    brand: str = "Crate & Barrel"
    name: str = ""
    price: Optional[float] = None
    productId: str = ""
    url: str = ""
    # ... other fields

This is a passive dataclass. It’s excellent for organization, but it has a major flaw: it permits silent failures. If the extract_data function fails to find the product name or price due to a DOM change, the dataclass is instantiated with default values like "" or None.

The pipeline then saves this empty object to your JSONL file without raising an alarm. This assumes the DOM is static, which is a dangerous assumption in production scraping.
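To make this concrete, here is a condensed version of the passive dataclass showing how an "empty" record sails straight into JSONL output (the field list is trimmed for brevity):

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class ScrapedData:
    name: str = ""
    price: Optional[float] = None
    productId: str = ""

record = ScrapedData()             # every selector failed; no error raised
line = json.dumps(asdict(record))
print(line)  # {"name": "", "price": null, "productId": ""}
```

The record is syntactically valid JSON, so nothing downstream complains.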

Strategy 1: Implementing Mandatory Field Validation

The first step to robust scraping is defining which fields are critical. For a site like Crate & Barrel, a product without a productId or a name is useless.

The __post_init__ method in Python dataclasses runs validation logic immediately after the object is created. If a critical field is missing, we can raise a custom exception.

from dataclasses import dataclass
from typing import Optional

class DataValidationError(Exception):
    """Raised when scraped data fails business logic validation."""
    pass

@dataclass
class ScrapedData:
    name: str = ""
    productId: str = ""
    price: Optional[float] = None
    availability: str = "out_of_stock"

    def __post_init__(self):
        self.validate()

    def validate(self):
        critical_fields = {
            "name": self.name,
            "productId": self.productId,
        }

        missing = [name for name, value in critical_fields.items() if not value]

        if missing:
            raise DataValidationError(f"Missing critical fields: {', '.join(missing)}")

With this added, the moment your CSS selector for h1.product-name fails, the scraper raises a DataValidationError. This transforms a silent data quality issue into a trackable error.
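To see the failure mode change, here is a condensed, self-contained version of the class (the product name and ID are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

class DataValidationError(Exception):
    pass

@dataclass
class ScrapedData:
    name: str = ""
    productId: str = ""
    price: Optional[float] = None

    def __post_init__(self):
        missing = [f for f in ("name", "productId") if not getattr(self, f)]
        if missing:
            raise DataValidationError(f"Missing critical fields: {', '.join(missing)}")

# A healthy extraction constructs normally.
product = ScrapedData(name="Aspen Sofa", productId="123456")

# A stale name selector now fails loudly at construction time.
try:
    ScrapedData(productId="123456")
except DataValidationError as e:
    print(e)  # Missing critical fields: name
```

Because the check runs in `__post_init__`, an invalid object can never exist, let alone reach the pipeline.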

Strategy 2: Context-Aware Logic

DOM changes are often subtle. A site might change the price layout for "In Stock" items but keep it the same for "Clearance" items. We can use business logic to detect when a specific selector is stale.

At Crate & Barrel, if a product is listed as "In Stock," it must have a price greater than zero. If the price is missing but the product is available, your price selector is almost certainly broken.

    def validate(self):
        # ... previous validation ...

        # Context-aware check: Price vs Availability
        if self.availability == "in_stock":
            if self.price is None or self.price <= 0:
                raise DataValidationError(
                    f"Product {self.productId} is 'in_stock' but price is {self.price}. "
                    "Selector likely broken."
                )

This logic catches the scenario where the scraper successfully finds the page and the availability but fails to extract the specific price value.
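A condensed, runnable sketch of this rule (the product ID is illustrative):

```python
from dataclasses import dataclass
from typing import Optional

class DataValidationError(Exception):
    pass

@dataclass
class ScrapedData:
    productId: str = ""
    price: Optional[float] = None
    availability: str = "out_of_stock"

    def __post_init__(self):
        # In-stock items must carry a positive price; otherwise the
        # price selector has almost certainly gone stale.
        if self.availability == "in_stock" and (self.price is None or self.price <= 0):
            raise DataValidationError(
                f"Product {self.productId} is 'in_stock' but price is {self.price}."
            )

# Out-of-stock items may legitimately lack a price.
clearance = ScrapedData(productId="123456", availability="out_of_stock")

# An in-stock item with no price trips the check.
try:
    ScrapedData(productId="123456", availability="in_stock")
except DataValidationError as e:
    print(e)
```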

Strategy 3: The "Error Budget" Pattern

While we want to catch errors, a single malformed product page shouldn't crash a 10,000-item scrape. In production, use an Error Budget.

We can modify the DataPipeline class from the Crate & Barrel repo to track the failure rate. If only 1% of products fail validation, it might be a data edge case. If 15% fail, the website has likely updated its layout.

import logging

logger = logging.getLogger(__name__)

class DataPipeline:
    def __init__(self, jsonl_filename="output.jsonl", failure_threshold=0.05):
        self.jsonl_filename = jsonl_filename
        self.total_attempted = 0
        self.failure_count = 0
        self.threshold = failure_threshold

    def add_data(self, scraped_data: ScrapedData):
        self.total_attempted += 1
        try:
            scraped_data.validate()
            # ... save to file logic ...
        except DataValidationError as e:
            self.failure_count += 1
            logger.error(f"Validation failed: {e}")
            self.check_error_budget()

    def check_error_budget(self):
        if self.total_attempted > 10:  # wait for a minimum sample size
            failure_rate = self.failure_count / self.total_attempted
            if failure_rate > self.threshold:
                # Trigger a critical alert
                raise SystemExit(f"CRITICAL: Failure rate {failure_rate:.2%} exceeds budget! DOM change detected.")
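Here is a self-contained sketch of the budget tripping. The file-saving logic is omitted, and the 10% threshold and item counts are illustrative:

```python
from dataclasses import dataclass

class DataValidationError(Exception):
    pass

@dataclass
class ScrapedData:
    name: str = ""

    def validate(self):
        if not self.name:
            raise DataValidationError("Missing name")

class DataPipeline:
    def __init__(self, failure_threshold=0.10):
        self.total_attempted = 0
        self.failure_count = 0
        self.threshold = failure_threshold

    def add_data(self, item):
        self.total_attempted += 1
        try:
            item.validate()
        except DataValidationError:
            self.failure_count += 1
            self.check_error_budget()

    def check_error_budget(self):
        if self.total_attempted > 10:  # wait for a minimum sample size
            rate = self.failure_count / self.total_attempted
            if rate > self.threshold:
                raise SystemExit(f"CRITICAL: failure rate {rate:.2%} exceeds budget")

pipeline = DataPipeline(failure_threshold=0.10)

# 20 healthy pages, then the site's layout "changes" and names vanish.
for _ in range(20):
    pipeline.add_data(ScrapedData(name="Aspen Sofa"))
try:
    for _ in range(20):
        pipeline.add_data(ScrapedData())  # empty name: selector broke
except SystemExit as e:
    print(e)  # CRITICAL: failure rate 13.04% exceeds budget
```

The first two failures stay within budget (1/21, then 2/22); the third pushes the rate past 10% and halts the run.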

Implementation Guide: Integrating into the Scraper

To implement this in the existing Crate & Barrel Selenium scraper, wrap the extraction call to handle the new validation logic.

Before (Passive)

# In the original loop
data = extract_data(driver, url)
pipeline.add_data(data) # Data added even if empty

After (Active Validation)

# Modified loop in the scraper
try:
    data = extract_data(driver, url)
    if data:
        # validate() is called inside add_data or via __post_init__
        pipeline.add_data(data) 
except DataValidationError as e:
    logger.warning(f"Skipping {url} due to DOM change: {e}")
except Exception as e:
    logger.error(f"Hard failure on {url}: {e}")

This change ensures that your output.jsonl file contains only validated data, while your logs provide an early warning system for site updates.

Performance and Trade-offs

Using dataclasses for validation is lightweight and built into the Python standard library. However, as projects grow, consider these trade-offs:

  1. Complexity: For complex nested structures, manual validation in __post_init__ can become verbose.
  2. Maintenance: You must update validation logic whenever business requirements change, such as if the site starts listing "Price on Request" items.

For larger projects, Pydantic is a strong alternative. It offers a more powerful version of this pattern with automatic type coercion and advanced validation, though it does add a third-party dependency.

To Wrap Up

Detecting DOM changes should happen during extraction, not after the data is already in your database. By moving validation into the dataclass layer, you catch issues immediately.

Key Takeaways:

  • Transform Passive Containers: Use __post_init__ to turn dataclasses into active validators.
  • Enforce Critical Fields: Identify fields that must exist, such as ID or Name, and raise explicit errors if they are missing.
  • Apply Business Logic: Use relationships between data, like Price vs. Availability, to detect stale selectors.
  • Monitor the Failure Rate: Use an "Error Budget" to distinguish between a single bad page and a site-wide layout change.

To see these patterns in a production context, explore the implementations in the Crateandbarrel Scrapers repository. If you prefer not to hunt for selectors manually, the ScrapeOps AI Scraper Generator can generate Python code for you based on a URL, following the same structure used in the Crate & Barrel examples.
