Jonathan D. Fisher

Stop Breaking Your Pipeline: Using Schema Validation to Clean Scraped Zappos Data

Web scraping is often described as the process of turning the "wild west" of the internet into structured data. However, anyone who has managed a production data pipeline knows that "structured" is a relative term. HTML is inherently chaotic. A price might be a string like "$120.00" on one page, "120" on another, or missing entirely on a third.

If your scraper simply dumps these raw strings into a database, your downstream applications—whether they are price trackers, AI models, or analytics dashboards—will eventually crash. The solution is Schema-First Extraction: an approach to scraping that enforces strict data types at the moment of collection.

We can explore how to implement this using the Zappos.com-Scrapers repository as a blueprint. This guide looks at using Python dataclasses and Node.js helper functions to ensure your data is clean, consistent, and pipeline-ready.

Prerequisites

To follow the code examples, you should have:

  • A basic understanding of Python (specifically dataclasses) or Node.js.
  • The Zappos.com-Scrapers repository cloned locally.
  • Playwright installed in your environment.

Phase 1: The Contract – Analyzing the ScrapedData Dataclass

In a high-quality scraper, the data structure isn't an afterthought. It is the contract that the scraper must fulfill. In the Zappos repository, this contract is defined using Python’s @dataclass.

The implementation in python/playwright/product_data/scraper/zappos_scraper_product_data_v1.py looks like this:

from dataclasses import dataclass, field
from typing import Dict, Any, Optional, List

@dataclass
class ScrapedData:
    aggregateRating: Dict[str, Any] = field(default_factory=dict)
    availability: str = "in_stock"
    brand: str = ""
    category: str = ""
    currency: str = "USD"
    description: str = ""
    features: List[str] = field(default_factory=list)
    images: List[Dict[str, Any]] = field(default_factory=list)
    name: str = ""
    preDiscountPrice: Optional[float] = None
    price: float = 0.0
    productId: str = ""
    url: str = ""

Why Explicit Types Matter

By using ScrapedData, we move away from generic, unpredictable dictionaries.

  • price: float = 0.0: This ensures that if a price is missing, we get a consistent numeric fallback rather than a NoneType error during a calculation.
  • List[str]: Explicitly typing lists tells your IDE and your pipeline exactly what to expect.
  • Optional[float]: This is vital for fields like preDiscountPrice. Not every item is on sale. Optional allows us to distinguish between a price of zero and a price that simply doesn't exist.
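To see the contract in action, here is a trimmed-down version of the dataclass (re-declared so the snippet runs standalone) showing how a partially populated instance still serializes with consistent types:

```python
from dataclasses import dataclass, field, asdict
from typing import Any, Dict, List, Optional

@dataclass
class ScrapedData:
    # Trimmed to the fields relevant here; mirrors the repository's schema
    preDiscountPrice: Optional[float] = None
    price: float = 0.0
    features: List[str] = field(default_factory=list)

item = ScrapedData(price=89.95)  # not on sale, no features found on the page
print(asdict(item))
# {'preDiscountPrice': None, 'price': 89.95, 'features': []}
```

Note the distinction the types make explicit: `preDiscountPrice` is `None` (the item is not on sale), while `price` would be `0.0` only as a fallback for missing data.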

Phase 2: Enforcing Types – The Extraction Logic

Defining a schema is only half the battle. The second half is the "enforcer" logic, the code that bridges the gap between a messy HTML string and your strict types.

In the Zappos scraper, helper functions act as validators. Consider this parse_price logic:

import re

def parse_price(price_str: str) -> float:
    if not price_str:
        return 0.0
    # Remove commas for large numbers like 1,200.00
    cleaned = price_str.replace(",", "")
    # Use regex to extract only the numeric part, ignoring currency symbols
    match = re.search(r'[\d,]+\.?\d*', cleaned)
    if match:
        try:
            return float(match.group())
        except ValueError:
            return 0.0
    return 0.0

The Strategy

This function handles three common "dirty data" scenarios:

  1. Currency Symbols: It strips symbols like $ using regex.
  2. Formatting: It removes thousands-separator commas.
  3. Missing Data: It returns a default 0.0 instead of raising an exception that would crash the entire scraping loop.

When building a scraper, use these cleaning utilities rather than accepting raw inner text.
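For example, the helper normalizes the messy price variants mentioned in the introduction (the function is repeated here so the snippet runs standalone):

```python
import re

def parse_price(price_str: str) -> float:
    # Same logic as the helper above, repeated so the snippet is self-contained
    if not price_str:
        return 0.0
    cleaned = price_str.replace(",", "")
    match = re.search(r'[\d,]+\.?\d*', cleaned)
    return float(match.group()) if match else 0.0

print(parse_price("$1,200.00"))  # 1200.0
print(parse_price("120"))        # 120.0
print(parse_price(""))           # 0.0
```

All three inputs land on a float, so downstream arithmetic never has to branch on the raw HTML format.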

Phase 3: Handling Nulls and Defaults Safely

One of the most frequent causes of pipeline failure is the "None" (null) value. If your database expects an array but receives null, the import fails.

The Zappos repository uses Python's field(default_factory=list) to solve this. This ensures that even if no features or images are found on the page, the resulting JSON contains [] instead of null.

# From python/playwright/product_data/scraper/zappos_scraper_product_data_v1.py

features: List[str] = field(default_factory=list)
images: List[Dict[str, Any]] = field(default_factory=list)

By using a default_factory, every instance of ScrapedData starts with a fresh, empty list. This maintains structural integrity. Your downstream code can always run for image in data['images'] without checking if the key exists or if it's null.
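A minimal sketch of why this matters at the serialization boundary (field names follow the repository's schema; the class is re-declared with only the collection fields to keep it short):

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, Dict, List

@dataclass
class ScrapedData:
    # Only the collection fields, to keep the example focused
    features: List[str] = field(default_factory=list)
    images: List[Dict[str, Any]] = field(default_factory=list)

payload = json.dumps(asdict(ScrapedData()))
print(payload)  # {"features": [], "images": []}

# Downstream code can iterate without null checks
for image in asdict(ScrapedData())["images"]:
    pass  # never raises, even when nothing was scraped
```

A database or API consumer receives `[]` here, never `null`, so array-typed columns and loops keep working on empty pages.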

Phase 4: Node.js Comparison – Type Safety in JavaScript

While JavaScript lacks native dataclasses, the Zappos repository achieves the same discipline in its Node.js implementation.

The Node scraper uses a functional approach to mimic type safety in node/playwright/product_data/scraper/zappos_scraper_product_data_v1.js:

const parsePrice = (priceText) => {
    if (!priceText) return 0.0;
    // Strip commas and extract the float
    const match = priceText.replace(/,/g, '').match(/([\d,]+\.?\d*)/);
    return match ? parseFloat(match[1]) : 0.0;
};

// Usage inside the extraction logic
const outputData = {
    price: parsePrice($('.price-selector').text()),
    availability: "in_stock", // Default value
    features: [] // Initialized as empty array
};

Python offers better developer tooling through type hints, while Node.js requires more runtime discipline. However, by using a centralized parsePrice function, the Zappos repository ensures that the final JSON output is identical regardless of the language used.

Phase 5: Prompting for Strict Code

This repository was generated using the ScrapeOps AI Scraper Generator. When using AI to build scrapers, don't just ask it to "scrape Zappos." To get production-grade results, your prompt should include the schema requirements.

Example of a Schema-First Prompt:

"Extract product data from Zappos. Use the following JSON schema. Constraints: Prices must be floats (remove currency symbols), lists like 'features' must always return an empty array if no data is found, and 'availability' must be mapped to the string 'in_stock' or 'out_of_stock'."

Providing the schema as the primary requirement forces the generator to create the helper functions (parse_price, clean_float) shown in the Zappos repository. This moves the complexity from the data processing stage to the extraction stage, where it belongs.

To Wrap Up

Strict schema validation is the difference between a script and a data product. By enforcing types at the edge of your network—inside the scraper itself—you prevent technical debt from accumulating in your databases.

The Zappos.com-Scrapers repository demonstrates these principles:

  • Use Dataclasses to define a clear contract for your data.
  • Implement Helper Functions like parse_price to handle HTML inconsistencies.
  • Default to Empty Collections instead of nulls to keep pipelines running smoothly.
  • Ensure Language Agnosticism so your parsing logic produces identical JSON whether you use Python or Node.js.

If you're starting a new project, use the ScrapeOps AI Scraper Generator to build the base extraction logic, then add Pydantic for production-grade data validation.
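As a sketch of that Pydantic step, the same contract could be expressed as a model with a coercing validator, so that raw strings are normalized and genuinely bad values fail loudly at extraction time (the model and validator names here are illustrative, not from the repository):

```python
import re
from typing import List, Optional
from pydantic import BaseModel, field_validator

class Product(BaseModel):
    name: str = ""
    price: float = 0.0
    preDiscountPrice: Optional[float] = None
    features: List[str] = []

    @field_validator("price", mode="before")
    @classmethod
    def coerce_price(cls, v):
        # Accept raw strings like "$120.00" and normalize them to floats
        if isinstance(v, str):
            match = re.search(r'\d+\.?\d*', v.replace(",", ""))
            return float(match.group()) if match else 0.0
        return v

item = Product(name="Sneaker", price="$1,299.99")
print(item.price)  # 1299.99
```

The advantage over plain dataclasses is that validation runs automatically on construction, so a schema violation surfaces in the scraper rather than three stages later in the pipeline.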
