Web scraping is often described as the process of turning the "wild west" of the internet into structured data. However, anyone who has managed a production data pipeline knows that "structured" is a relative term. HTML is inherently chaotic. A price might be a string like "$120.00" on one page, "120" on another, or missing entirely on a third.
If your scraper simply dumps these raw strings into a database, your downstream applications—whether they are price trackers, AI models, or analytics dashboards—will eventually crash. The solution is Schema-First Extraction: an approach to scraping that enforces strict data types at the moment of collection.
We can explore how to implement this using the Zappos.com-Scrapers repository as a blueprint. This guide looks at using Python dataclasses and Node.js helper functions to ensure your data is clean, consistent, and pipeline-ready.
Prerequisites
To follow the code examples, you should have:
- A basic understanding of Python (specifically `dataclasses`) or Node.js.
- The Zappos.com-Scrapers repository cloned locally.
- Playwright installed in your environment.
Phase 1: The Contract – Analyzing the ScrapedData Dataclass
In a high-quality scraper, the data structure isn't an afterthought. It is the contract that the scraper must fulfill. In the Zappos repository, this contract is defined using Python’s @dataclass.
The implementation in python/playwright/product_data/scraper/zappos_scraper_product_data_v1.py looks like this:
```python
from dataclasses import dataclass, field
from typing import Dict, Any, Optional, List

@dataclass
class ScrapedData:
    aggregateRating: Dict[str, Any] = field(default_factory=dict)
    availability: str = "in_stock"
    brand: str = ""
    category: str = ""
    currency: str = "USD"
    description: str = ""
    features: List[str] = field(default_factory=list)
    images: List[Dict[str, Any]] = field(default_factory=list)
    name: str = ""
    preDiscountPrice: Optional[float] = None
    price: float = 0.0
    productId: str = ""
    url: str = ""
```
Why Explicit Types Matter
By using ScrapedData, we move away from generic, unpredictable dictionaries.
- `price: float = 0.0`: This ensures that if a price is missing, we get a consistent numeric fallback rather than a `NoneType` error during a calculation.
- `List[str]`: Explicitly typing lists tells your IDE and your pipeline exactly what to expect.
- `Optional[float]`: This is vital for fields like `preDiscountPrice`. Not every item is on sale. `Optional` allows us to distinguish between a price of zero and a price that simply doesn't exist.
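To see these defaults in action, here is a minimal sketch using a trimmed-down version of the `ScrapedData` contract (only three fields, for illustration):

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional

# Trimmed version of the ScrapedData contract — just the fields
# whose defaults we want to inspect, not the full repository class.
@dataclass
class ScrapedData:
    preDiscountPrice: Optional[float] = None  # "no sale" vs. a price of zero
    price: float = 0.0                        # consistent numeric fallback
    features: List[str] = field(default_factory=list)

item = ScrapedData(price=120.0)
print(asdict(item))
# {'preDiscountPrice': None, 'price': 120.0, 'features': []}
```

Note how the unset sale price stays `None` (genuinely absent) while the unset `price` is a usable `0.0` — exactly the distinction `Optional` buys you.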
Phase 2: Enforcing Types – The Extraction Logic
Defining a schema is only half the battle. The second half is the "enforcer" logic, the code that bridges the gap between a messy HTML string and your strict types.
In the Zappos scraper, helper functions act as validators. Consider this parse_price logic:
```python
import re

def parse_price(price_str: str) -> float:
    if not price_str:
        return 0.0
    # Remove commas for large numbers like 1,200.00
    cleaned = price_str.replace(",", "")
    # Use regex to extract only the numeric part, ignoring currency symbols
    match = re.search(r'[\d,]+\.?\d*', cleaned)
    if match:
        try:
            return float(match.group())
        except ValueError:
            return 0.0
    return 0.0
```
The Strategy
This function handles three common "dirty data" scenarios:
- Currency Symbols: It strips `$` or `€` using regex.
- Formatting: It removes thousands-separator commas.
- Missing Data: It returns a default `0.0` instead of raising an exception that would crash the entire scraping loop.
When building a scraper, use these cleaning utilities rather than accepting raw inner text.
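For instance, running the `parse_price` helper shown above against typical raw strings collapses every variant to a clean float (sample inputs are illustrative, not from the repository):

```python
import re

def parse_price(price_str: str) -> float:
    if not price_str:
        return 0.0
    # Remove commas for large numbers like 1,200.00
    cleaned = price_str.replace(",", "")
    # Extract only the numeric part, ignoring currency symbols
    match = re.search(r'[\d,]+\.?\d*', cleaned)
    if match:
        try:
            return float(match.group())
        except ValueError:
            return 0.0
    return 0.0

print(parse_price("$1,299.95"))  # 1299.95
print(parse_price("€120"))       # 120.0
print(parse_price("Sold out"))   # 0.0 (no digits found)
print(parse_price(""))           # 0.0 (missing data)
```

All four inputs return a `float`, so the `price: float` contract in `ScrapedData` is never violated.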
Phase 3: Handling Nulls and Defaults Safely
One of the most frequent causes of pipeline failure is the "None" (null) value. If your database expects an array but receives null, the import fails.
The Zappos repository uses Python's field(default_factory=list) to solve this. This ensures that even if no features or images are found on the page, the resulting JSON contains [] instead of null.
```python
# From python/playwright/product_data/scraper/zappos_scraper_product_data_v1.py
features: List[str] = field(default_factory=list)
images: List[Dict[str, Any]] = field(default_factory=list)
```
By using a default_factory, every instance of ScrapedData starts with a fresh, empty list. This maintains structural integrity. Your downstream code can always run for image in data['images'] without checking if the key exists or if it's null.
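A minimal sketch of why this matters at serialization time (again using a trimmed `ScrapedData` for illustration):

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, Dict, List

@dataclass
class ScrapedData:
    features: List[str] = field(default_factory=list)
    images: List[Dict[str, Any]] = field(default_factory=list)

# Even with nothing scraped, the JSON contains arrays, never null:
empty = ScrapedData()
print(json.dumps(asdict(empty)))
# {"features": [], "images": []}

# So iteration is always safe, no null checks required:
for image in asdict(empty)["images"]:
    pass  # loop body simply never runs on an empty page
```

A database column or downstream consumer expecting an array always gets one.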
Phase 4: Node.js Comparison – Type Safety in JavaScript
While JavaScript lacks native dataclasses, the Zappos repository achieves the same discipline in its Node.js implementation.
The Node scraper uses a functional approach to mimic type safety in node/playwright/product_data/scraper/zappos_scraper_product_data_v1.js:
```javascript
const parsePrice = (priceText) => {
  if (!priceText) return 0.0;
  // Strip commas and extract the float
  const match = priceText.replace(/,/g, '').match(/([\d,]+\.?\d*)/);
  return match ? parseFloat(match[1]) : 0.0;
};

// Usage inside the extraction logic
const outputData = {
  price: parsePrice($('.price-selector').text()),
  availability: "in_stock", // Default value
  features: [] // Initialized as empty array
};
```
Python offers better developer tooling through type hints, while Node.js requires more runtime discipline. However, by using a centralized parsePrice function, the Zappos repository ensures that the final JSON output is identical regardless of the language used.
Phase 5: Prompting for Strict Code
This repository was generated using the ScrapeOps AI Scraper Generator. When using AI to build scrapers, don't just ask it to "scrape Zappos." To get production-grade results, your prompt should include the schema requirements.
Example of a Schema-First Prompt:
"Extract product data from Zappos. Use the following JSON schema. Constraints: Prices must be floats (remove currency symbols), lists like 'features' must always return an empty array if no data is found, and 'availability' must be mapped to the string 'in_stock' or 'out_of_stock'."
Providing the schema as the primary requirement forces the generator to create the helper functions (parse_price, clean_float) shown in the Zappos repository. This moves the complexity from the data processing stage to the extraction stage, where it belongs.
To Wrap Up
Strict schema validation is the difference between a script and a data product. By enforcing types at the edge of your network—inside the scraper itself—you prevent technical debt from accumulating in your databases.
The Zappos.com-Scrapers repository demonstrates these principles:
- Use Dataclasses to define a clear contract for your data.
- Implement Helper Functions like `parse_price` to handle HTML inconsistencies.
- Default to Empty Collections instead of nulls to keep pipelines running smoothly.
- Ensure Language Agnosticism so your parsing logic produces identical JSON whether you use Python or Node.js.
If you're starting a new project, use the ScrapeOps AI Scraper Generator to build the base extraction logic, then add Pydantic for production-grade data validation.
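As a hypothetical sketch of what that Pydantic layer might look like (the `Product` model and its validator are assumptions mirroring the `ScrapedData` contract, not code from the repository):

```python
from typing import List, Optional
from pydantic import BaseModel, field_validator

class Product(BaseModel):
    name: str = ""
    price: float = 0.0
    preDiscountPrice: Optional[float] = None
    features: List[str] = []

    # Hypothetical business rule: reject negative prices loudly
    # instead of letting them flow downstream.
    @field_validator("price")
    @classmethod
    def price_non_negative(cls, v: float) -> float:
        if v < 0:
            raise ValueError("price must be non-negative")
        return v

# Pydantic coerces the string "120" to 120.0; invalid data raises
# a ValidationError instead of silently corrupting the pipeline.
print(Product(name="Shoe", price="120").price)  # 120.0
```

Where the dataclass defines the shape, Pydantic adds runtime validation and coercion at the pipeline boundary.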