Imagine running a large-scale scraper for days, only to discover that 40% of your dataset has a price of 0 or a null product name. The script didn't crash, your logs show "200 OK," and your database is growing—but the data is worthless.
In web scraping, this is a soft break. Unlike a hard crash, where a changed HTML selector throws an error and stops the script, a soft break happens when extraction logic fails silently, returning empty strings or default values instead of real data.
This guide explains how to prevent these silent failures by adding strict schema validation to the Nike.in Scrapers repository. We’ll use Zod for Node.js and Pydantic for Python to transform fragile scrapers into reliable data pipelines.
Prerequisites
To follow along, you'll need:
- A basic understanding of Node.js or Python.
- The Nike.in Scrapers repository cloned locally.
- Node.js (v16+) or Python (3.8+) installed.
The Anatomy of a Soft Break
The current implementation in the repository is vulnerable because it lacks strict checks. If we look at the Node.js Puppeteer scraper at node/puppeteer/product_data/scraper/nike_scraper_product_data_v1.js, we see a common pattern in the extractData function:
const parsePrice = (priceText) => {
  const match = priceText.match(/[\d,.]+/);
  if (!match) return 0; // Defaulting to 0 if regex fails
  const clean = match[0].replace(/,/g, '');
  return parseFloat(clean) || 0;
};
If Nike changes their price HTML class from .css-nqh3vz to something else, priceText becomes an empty string. The regex fails, and the function returns 0.
To your database, 0 is a valid number. To your business logic, a Nike shoe costing $0 is a critical error. Without validation, your JSONL output ends up looking like this:
{"name": "Nike Air Max", "price": 0, "url": "https://nike.in/...", "availability": "in_stock"}
{"name": "", "price": 12995, "url": "https://nike.in/...", "availability": "in_stock"}
This data is valid JSON, but it's invalid information.
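To make the damage concrete, here is a minimal sanity check in plain Python (no libraries; the sample lines mirror the JSONL above) that flags exactly these two records:

```python
import json

# Sample records mirroring the JSONL output shown above
lines = [
    '{"name": "Nike Air Max", "price": 0, "url": "https://nike.in/...", "availability": "in_stock"}',
    '{"name": "", "price": 12995, "url": "https://nike.in/...", "availability": "in_stock"}',
]

def audit(line):
    """Return a list of problems with one JSONL record."""
    record = json.loads(line)
    problems = []
    if not record.get("name"):
        problems.append("empty name")
    if not record.get("price"):  # catches 0, None, and a missing key
        problems.append("price is 0 or missing")
    return problems

for line in lines:
    print(audit(line))
```

Both lines parse as valid JSON without error, yet each one fails a basic business rule — which is precisely why the check has to be explicit.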
Strategy: Validation at the Pipeline Level
Avoid cluttering extraction logic with hundreds of if/else checks. Instead, treat the DataPipeline class as a gatekeeper.
The repository architecture uses a DataPipeline to handle deduplication and file writing. By inserting a validation step here, we ensure that no garbage data ever touches persistent storage.
The Flow:
Scraper -> Extractor -> Validator (New) -> Writer
If the data fails validation, we log the error and drop the record. If too many records fail, we can trigger a "Circuit Breaker" to stop the scraper entirely.
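The flow above can be sketched in plain Python before reaching for a library; the rules inside `validate` are illustrative stand-ins for a real schema, not code from the repo:

```python
def validate(record):
    """Return a list of rule violations; an empty list means the record passes the gate."""
    errors = []
    if not record.get("name"):
        errors.append("name is empty")
    price = record.get("price")
    if not isinstance(price, (int, float)) or price <= 0:
        errors.append("price must be a number greater than 0")
    return errors

class DataPipeline:
    """Gatekeeper: only validated records reach the writer."""
    def __init__(self):
        self.written = []  # stands in for the JSONL writer
        self.dropped = 0

    def add_data(self, record):
        problems = validate(record)
        if problems:
            self.dropped += 1  # log and drop instead of writing garbage
            return
        self.written.append(record)

pipeline = DataPipeline()
pipeline.add_data({"name": "Nike Air Max", "price": 12995})
pipeline.add_data({"name": "Nike Air Max", "price": 0})  # soft break: dropped
print(len(pipeline.written), pipeline.dropped)  # 1 1
```

The extraction logic stays untouched; all the quality rules live in one place, which is the pattern Zod and Pydantic formalize below.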
Implementation 1: Node.js with Zod
Zod is a schema declaration and validation library. It’s ideal for Node.js scrapers because it allows us to define the "Product" structure in a single, reusable schema.
1. Install Zod
Navigate to your Node.js project folder and install the package:
npm install zod
2. Define the Schema
In node/puppeteer/product_data/scraper/nike_scraper_product_data_v1.js, define a schema that mirrors the expected output. This ensures the price is a positive number and the productId is not empty.
const { z } = require('zod');

const ProductSchema = z.object({
  productId: z.string().min(1, "Product ID is missing"),
  name: z.string().min(3, "Product name is too short"),
  price: z.number().positive("Price must be greater than 0"),
  currency: z.string().length(3),
  url: z.string().url(),
  availability: z.enum(["in_stock", "out_of_stock"])
});
3. Update the DataPipeline
Modify the addData method in the DataPipeline class to use this schema.
const fs = require('fs');

class DataPipeline {
  constructor(outputFile = CONFIG.outputFile) {
    this.itemsSeen = new Set();
    this.outputFile = outputFile;
    this.errorCount = 0;
  }

  // Dedupe on productId using the itemsSeen set
  isDuplicate(data) {
    if (this.itemsSeen.has(data.productId)) return true;
    this.itemsSeen.add(data.productId);
    return false;
  }

  async addData(scrapedData) {
    // Validate the data before it can reach persistent storage
    const validation = ProductSchema.safeParse(scrapedData);

    if (!validation.success) {
      this.errorCount++;
      console.error(`❌ Validation Failed for ${scrapedData.url}:`, validation.error.format());
      return; // Skip writing this item
    }

    if (!this.isDuplicate(validation.data)) {
      const jsonLine = JSON.stringify(validation.data) + '\n';
      await fs.promises.appendFile(this.outputFile, jsonLine, 'utf8');
      console.log('✅ Saved valid item:', validation.data.name);
    }
  }
}
Implementation 2: Python with Pydantic
For Python developers, Pydantic is the standard choice. The Python scrapers in the repo currently use standard dataclasses. We can upgrade these to Pydantic BaseModels for automatic validation.
1. Define the Pydantic Model
Replace the existing @dataclass with a Pydantic model. Pydantic handles type coercion automatically, such as converting a string "12995" to a float 12995.0.
from pydantic import BaseModel, Field, field_validator

class ScrapedData(BaseModel):
    productId: str = Field(min_length=1)
    name: str = Field(min_length=1)
    price: float = Field(gt=0)
    currency: str = Field(default="INR")
    url: str
    availability: str

    @field_validator('availability')
    @classmethod
    def validate_availability(cls, v: str) -> str:
        if v not in ['in_stock', 'out_of_stock']:
            return 'out_of_stock'
        return v
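As a quick check of that coercion behavior, here is a small standalone example (the model is repeated so the snippet runs on its own; the sample record — its productId, URL, and values — is invented for illustration):

```python
from pydantic import BaseModel, Field, field_validator

# Model from the section above, repeated so this snippet is self-contained
class ScrapedData(BaseModel):
    productId: str = Field(min_length=1)
    name: str = Field(min_length=1)
    price: float = Field(gt=0)
    currency: str = Field(default="INR")
    url: str
    availability: str

    @field_validator('availability')
    @classmethod
    def validate_availability(cls, v: str) -> str:
        return v if v in ['in_stock', 'out_of_stock'] else 'out_of_stock'

# A hypothetical raw record straight from the extractor
item = ScrapedData(
    productId="CW2288-111",
    name="Nike Air Force 1",
    price="12995",           # string in, float out: Pydantic coerces it
    url="https://nike.in/t/air-force-1",
    availability="unknown",  # normalized by the validator
)
print(item.price, item.availability)  # 12995.0 out_of_stock
```

A record with `price="0"` or a missing `productId` would instead raise a `ValidationError`, which the pipeline below catches.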
2. Integrate into the Pipeline
Update the add_data method to catch ValidationError.
from pydantic import ValidationError

class DataPipeline:
    def add_data(self, raw_data: dict):
        try:
            # Trigger validation logic
            validated_data = ScrapedData(**raw_data)
            data_dict = validated_data.model_dump()

            if not self.is_duplicate(data_dict):
                with open(self.jsonl_filename, mode="a", encoding="UTF-8") as f:
                    f.write(json.dumps(data_dict) + "\n")
                logger.info(f"Saved: {validated_data.name}")
        except ValidationError as e:
            logger.error(f"❌ Data Validation Error: {e}")
Error Budgets: The Circuit Breaker Pattern
Validation filters out single bad records, but what if the entire site layout changes? If 100% of your requests fail validation, you should stop the scraper to save on proxy costs and compute resources.
Implement a circuit breaker by tracking the failure rate. If the rate exceeds a specific threshold, such as 50% of the last several items, kill the process.
// Inside DataPipeline.addData (Node.js)
this.totalProcessed++;

if (!validation.success) {
  this.errorCount++;
  const failureRate = this.errorCount / this.totalProcessed;

  if (this.totalProcessed > 10 && failureRate > 0.5) {
    console.error("🚨 CRITICAL: Failure rate above 50%. Site layout likely changed. Aborting.");
    process.exit(1);
  }
  return;
}
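For the Python pipeline, a sliding-window version keeps the check honest over long runs: the cumulative counter never forgets an early burst of failures, while a window of the last N results does. This is a sketch; the window size and threshold are illustrative:

```python
from collections import deque

class CircuitBreaker:
    """Trip when the failure rate over the last `window` results exceeds `threshold`."""
    def __init__(self, window=50, threshold=0.5):
        self.results = deque(maxlen=window)  # True = validation failure
        self.threshold = threshold

    def record(self, failed: bool) -> bool:
        """Record one result; return True if the scraper should abort."""
        self.results.append(failed)
        if len(self.results) < self.results.maxlen:
            return False  # not enough data to judge yet
        return sum(self.results) / len(self.results) > self.threshold

breaker = CircuitBreaker(window=10, threshold=0.5)
tripped = False
for failed in [False] * 4 + [True] * 6:  # 60% of the last 10 items failed
    if breaker.record(failed):
        tripped = True
print(tripped)  # True
```

In the real pipeline, `record(...)` returning `True` would be the signal to log a critical error and raise, mirroring the `process.exit(1)` call in the Node.js snippet.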
Closing the Loop: Regenerating Scrapers
When validation triggers a failure alert or trips the circuit breaker, your extraction logic is likely outdated.
Instead of manually hunting for new CSS selectors, use the ScrapeOps AI Scraper Generator. You can feed the failing Nike URL into the generator to receive updated selectors.
- Validate: Catch the break immediately with Zod or Pydantic.
- Detect: Trigger an alert when the failure rate spikes.
- Regenerate: Use ScrapeOps to get new selectors and resume the pipeline.
Summary
Strict schema validation is what separates a simple script from a professional data pipeline. By moving validation into the DataPipeline class, you ensure high-quality data without overcomplicating your extraction logic.
- Silent failures are more dangerous than crashes because they corrupt your dataset over time.
- Zod and Pydantic enforce data types and business rules at runtime.
- Circuit Breakers protect resources by stopping scrapers when site layouts change significantly.
- Verify everything: Never let data hit your database without validation.
To practice, try cloning the Nike.in-Scrapers repository and adding Zod validation to the product_search scrapers to ensure every result contains a valid price.