You’ve likely experienced the "Monday Morning Surprise." You check your database after a weekend of automated scraping only to find thousands of new rows where the `price` column is empty, the `product_name` is "None," and the `stock_count` is zero.

The script didn't crash, your proxies worked perfectly, and the status codes were all 200 OK. But because the website owner changed a single CSS class from `.price-value` to `.item-price`, your scraper spent 48 hours collecting digital garbage.
In the world of Scraper Reliability Engineering (SRE), this is a silent failure. While traditional error handling focuses on network stability, Pydantic allows you to treat scraped data like a strict API contract. This approach detects layout changes the second they happen, ensuring your data pipeline remains untainted.
## Why try/except Isn't Enough
Most developers write defensive scrapers, wrapping extraction logic in try/except blocks to prevent the process from crashing when a single element is missing. While this keeps the script running, it often leads to silent data corruption.
Consider this common pattern:
```python
def extract_product(element):
    try:
        return {
            "name": element.query_selector(".title").inner_text(),
            "price": float(element.query_selector(".price").inner_text().replace("$", ""))
        }
    except Exception:
        return None  # The silent killer
```
If the `.price` selector fails, this function returns `None`. The main loop sees the `None`, skips the entry, and continues. You’ve lost data and won't know why until you manually inspect the logs days later. You need to move away from "just don't crash" and toward "fail fast when the contract is broken."
## Step 1: Defining the Data Contract with Pydantic
To build a reliable scraper, you must first define what "good data" looks like. Pydantic is a data validation library that uses Python type hints to enforce schemas. Instead of working with raw dictionaries, you map scraped data to a BaseModel.
Using Product Hunt as an example, we can define a model for the product name, upvote count, and primary tags.
```python
from pydantic import BaseModel, Field
from typing import List

class ProductHuntItem(BaseModel):
    name: str = Field(min_length=1)
    upvotes: int = Field(ge=0)  # Must be an integer >= 0
    tags: List[str]
    rank: int
```
By defining this model, you’ve created a data contract. If you try to pass a string like "1.2k" into `upvotes`, which expects an `int`, Pydantic will immediately raise a `ValidationError`.
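A quick sanity check makes the contract concrete (assuming Pydantic v2; the sketch repeats the Step 1 model so it runs standalone). Note that Pydantic's default "lax" mode still coerces a clean numeric string like `"42"` into an `int`, so only genuinely malformed input fails:

```python
from pydantic import BaseModel, Field, ValidationError
from typing import List

class ProductHuntItem(BaseModel):
    name: str = Field(min_length=1)
    upvotes: int = Field(ge=0)
    tags: List[str]
    rank: int

# A clean numeric string is coerced to int by Pydantic's lax mode
item = ProductHuntItem(name="Acme", upvotes="42", tags=["ai"], rank=1)
print(item.upvotes)  # 42

# "1.2k" cannot be parsed as an integer, so the contract fails loudly
try:
    ProductHuntItem(name="Acme", upvotes="1.2k", tags=["ai"], rank=1)
except ValidationError as e:
    print(e.errors()[0]["type"])  # int_parsing
```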
## Step 2: Integrating Validation with Playwright
When integrating this into a scraper using Playwright, the goal is to extract raw data into a dictionary and immediately attempt to instantiate the Pydantic model.
```python
import asyncio
from playwright.async_api import async_playwright
from pydantic import ValidationError

# Assumes ProductHuntItem from Step 1 is defined above

async def scrape_product_hunt():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://www.producthunt.com/")

        products = await page.query_selector_all('[data-test="post-item"]')
        valid_items = []

        for index, product in enumerate(products[:10]):
            try:
                # 1. Extract raw data (query_selector is async, so await
                # the element handle before reading its text)
                name_el = await product.query_selector('[data-test="post-name"]')
                votes_el = await product.query_selector('[data-test="vote-button"]')
                tag_el = await product.query_selector('[data-test="post-tag"]')
                raw_data = {
                    "name": await name_el.inner_text(),
                    "upvotes": await votes_el.inner_text(),
                    "tags": [await tag_el.inner_text()],
                    "rank": index + 1
                }
                # 2. Validate with Pydantic
                item = ProductHuntItem(**raw_data)
                valid_items.append(item)
            except ValidationError as e:
                print(f"Layout Change Detected at item {index}: {e.json()}")
            except Exception as e:
                print(f"Generic extraction error: {e}")

        await browser.close()
        return valid_items

if __name__ == "__main__":
    asyncio.run(scrape_product_hunt())
```
In this workflow, you extract data as a "dirty" dictionary first. When you call `ProductHuntItem(**raw_data)`, Pydantic performs the type checking. If the website changes its structure, such as the vote button containing text like "Free" instead of a number, the script catches it immediately.
## Step 3: Catching the Drift
Imagine Product Hunt updates their UI. The upvote count, which used to be a simple integer in the HTML, is now formatted as "2,450".
A standard Python `int()` conversion would fail, and a basic scraper might just store `None`. Pydantic provides a clear error trace instead:
```json
{
  "type": "int_parsing",
  "loc": ["upvotes"],
  "msg": "Input should be a valid integer, unable to parse string as an integer",
  "input": "2,450"
}
```
This is shift-left testing for data extraction. You catch the error at the point of ingestion rather than weeks later during analysis.
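That trace is machine-readable, so it can feed structured logging instead of a stack dump. A minimal sketch of an ingestion boundary (the `ingest` helper is an illustration, not part of the original scraper; the model is repeated so the snippet runs standalone):

```python
import json
from typing import List, Optional
from pydantic import BaseModel, Field, ValidationError

class ProductHuntItem(BaseModel):
    name: str = Field(min_length=1)
    upvotes: int = Field(ge=0)
    tags: List[str]
    rank: int

def ingest(raw: dict) -> Optional[ProductHuntItem]:
    """Validate at the point of ingestion; emit a structured trace on failure."""
    try:
        return ProductHuntItem(**raw)
    except ValidationError as e:
        # e.errors() yields records like the JSON above, ready for a
        # logging or alerting pipeline
        for err in e.errors():
            print(json.dumps({
                "type": err["type"],
                "loc": list(err["loc"]),
                "input": err["input"],
            }))
        return None

ingest({"name": "Acme", "upvotes": "2,450", "tags": ["ai"], "rank": 3})
```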
## Step 4: Advanced Cleaning with Validators
Websites are messy. Sometimes a layout change doesn't mean the data is gone, only that the format has shifted. Pydantic's `@field_validator` allows you to fix minor issues automatically before validation.
This updated model handles commas in numbers and "k" suffixes:
```python
from pydantic import BaseModel, field_validator
from typing import Any

class ProductHuntItem(BaseModel):
    name: str
    upvotes: int

    @field_validator('upvotes', mode='before')
    @classmethod
    def clean_upvotes(cls, v: Any) -> int:
        if isinstance(v, str):
            v = v.replace(',', '').strip().lower()
            if 'k' in v:
                return int(float(v.replace('k', '')) * 1000)
            return int(v)
        return v
```
By using `mode='before'`, you intercept the raw value before Pydantic tries to force it into an integer. This makes the scraper resilient to cosmetic changes while maintaining a strict floor for data quality. If the input is "Add to Cart," the `int()` call inside the validator raises, the field fails validation, and you are alerted to a total layout shift.
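Exercising the validator confirms both behaviors: cosmetic noise is repaired, while nonsense still fails (a sanity check assuming Pydantic v2; the Step 4 model is repeated so the snippet runs standalone):

```python
from pydantic import BaseModel, field_validator, ValidationError
from typing import Any

class ProductHuntItem(BaseModel):
    name: str
    upvotes: int

    @field_validator('upvotes', mode='before')
    @classmethod
    def clean_upvotes(cls, v: Any) -> int:
        if isinstance(v, str):
            v = v.replace(',', '').strip().lower()
            if 'k' in v:
                return int(float(v.replace('k', '')) * 1000)
            return int(v)
        return v

# Cosmetic formatting is normalized before type checking
print(ProductHuntItem(name="Acme", upvotes="2,450").upvotes)  # 2450
print(ProductHuntItem(name="Acme", upvotes="1.2k").upvotes)   # 1200

# Genuinely wrong data still fails the contract
try:
    ProductHuntItem(name="Acme", upvotes="Add to Cart")
except ValidationError:
    print("layout shift detected")
```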
## Step 5: The Error Budget
In production web scraping, 100% success is rarely possible. A single item might fail validation because of a unique ad placement or a weird edge case. You don't want to wake up an engineer for one failed row.
Instead, implement an error budget. Track the ratio of validation failures to successful extractions.
```python
class ScrapeSession:
    def __init__(self):
        self.success_count = 0
        self.failure_count = 0
        self.threshold = 0.15  # 15% error budget

    def log_result(self, success: bool):
        if success:
            self.success_count += 1
        else:
            self.failure_count += 1

    def check_reliability(self):
        total = self.success_count + self.failure_count
        if total == 0:
            return
        error_rate = self.failure_count / total
        if error_rate > self.threshold:
            self.trigger_alert(error_rate)

    def trigger_alert(self, rate):
        print(f"CRITICAL: Scraper reliability dropped to {1 - rate:.2%}. Layout change likely.")

# Usage inside the extraction loop (raw_data comes from the scraper)
session = ScrapeSession()
try:
    ProductHuntItem(**raw_data)
    session.log_result(True)
except ValidationError:
    session.log_result(False)
session.check_reliability()
```
If 2% of items fail, it is likely noise. If 20% fail, the website has likely changed its DOM structure and your error budget is spent. This is the moment to trigger a Slack notification or a PagerDuty alert.
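The threshold behavior is easy to verify. A sketch of the same budget logic, extended with an `alerted` flag (an addition not in the original) so the trip point is observable:

```python
class ScrapeSession:
    def __init__(self, threshold: float = 0.15):
        self.success_count = 0
        self.failure_count = 0
        self.threshold = threshold  # 15% error budget by default
        self.alerted = False

    def log_result(self, success: bool):
        if success:
            self.success_count += 1
        else:
            self.failure_count += 1

    def check_reliability(self):
        total = self.success_count + self.failure_count
        if total == 0:
            return
        error_rate = self.failure_count / total
        if error_rate > self.threshold:
            self.alerted = True
            print(f"CRITICAL: reliability at {1 - error_rate:.2%}. Layout change likely.")

# 2% failures: noise, stays inside the budget
quiet = ScrapeSession()
for i in range(100):
    quiet.log_result(i != 0)  # one failure out of 100
quiet.check_reliability()
print(quiet.alerted)  # False

# 20% failures: budget spent, alert fires
noisy = ScrapeSession()
for i in range(100):
    noisy.log_result(i % 5 != 0)  # every fifth item fails
noisy.check_reliability()
print(noisy.alerted)  # True
```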
## Summary
Moving from basic scripts to Scraper Reliability Engineering requires a mindset shift. By treating scraped data as a typed contract rather than a loose collection of strings, you protect downstream applications from garbage data.
- **Fail Fast:** Use Pydantic to catch layout changes during extraction.
- **Define Contracts:** Create models that represent the ideal state of your data.
- **Automate Cleaning:** Use Pydantic validators to handle predictable noise like currency symbols.
- **Monitor Ratios:** Use an error budget to distinguish between minor glitches and systemic layout changes.
Implementing these patterns ensures you spend less time cleaning broken databases and more time building features with data you can trust.