Imagine running a large-scale scraper for days, only to discover that 40% of your dataset has a price of 0 or a null product name. The script didn't crash, your logs show "200 OK," and your database is growing—but the data is worthless.
In web scraping, this is a soft break. Unlike a hard crash, where a changed HTML selector throws an error and stops the script, a soft break happens when extraction logic fails silently, returning empty strings or default values instead of real data.
This guide explains how to prevent these silent failures by adding strict schema validation to the Nike.in Scrapers repository. We’ll use Zod for Node.js and Pydantic for Python to transform fragile scrapers into reliable data pipelines.
Prerequisites
To follow along, you'll need:
- A basic understanding of Node.js or Python.
- The Nike.in Scrapers repository cloned locally.
- Node.js (v16+) or Python (3.8+) installed.
The Anatomy of a Soft Break
The current implementation in the repository is vulnerable because it lacks strict checks. If we look at the Node.js Puppeteer scraper at node/puppeteer/product_data/scraper/nike_scraper_product_data_v1.js, we see a common pattern in the extractData function:
const parsePrice = (priceText) => {
  const match = priceText.match(/[\d,.]+/);
  if (!match) return 0; // Defaulting to 0 if regex fails
  const clean = match[0].replace(/,/g, '');
  return parseFloat(clean) || 0;
};
If Nike changes their price HTML class from .css-nqh3vz to something else, priceText becomes an empty string. The regex fails, and the function returns 0.
To your database, 0 is a valid number. To your business logic, a Nike shoe costing $0 is a critical error. Without validation, your JSONL output ends up looking like this:
{"name": "Nike Air Max", "price": 0, "url": "https://nike.in/...", "availability": "in_stock"}
{"name": "", "price": 12995, "url": "https://nike.in/...", "availability": "in_stock"}
This data is valid JSON, but it's invalid information.
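To make the damage concrete, here is a minimal sanity check in plain Python (no libraries; the sample lines mirror the JSONL above) that flags exactly these two records:

```python
import json

# Sample records mirroring the JSONL output shown above
lines = [
    '{"name": "Nike Air Max", "price": 0, "url": "https://nike.in/...", "availability": "in_stock"}',
    '{"name": "", "price": 12995, "url": "https://nike.in/...", "availability": "in_stock"}',
]

def audit(line):
    """Return a list of problems with one JSONL record."""
    record = json.loads(line)
    problems = []
    if not record.get("name"):
        problems.append("empty name")
    if not record.get("price"):  # catches 0, None, and a missing key
        problems.append("price is 0 or missing")
    return problems

for line in lines:
    print(audit(line))
```

Both lines parse as valid JSON without error, yet each one fails a basic business rule — which is precisely why the check has to be explicit.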
Strategy: Validation at the Pipeline Level
Avoid cluttering extraction logic with hundreds of if/else checks. Instead, treat the DataPipeline class as a gatekeeper.
The repository architecture uses a DataPipeline to handle deduplication and file writing. By inserting a validation step here, we ensure that no garbage data ever touches persistent storage.
The Flow:
Scraper -> Extractor -> Validator (New) -> Writer
If the data fails validation, we log the error and drop the record. If too many records fail, we can trigger a "Circuit Breaker" to stop the scraper entirely.
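The flow above can be sketched in plain Python before reaching for a library; the rules inside `validate` are illustrative stand-ins for a real schema, not code from the repo:

```python
def validate(record):
    """Return a list of rule violations; an empty list means the record passes the gate."""
    errors = []
    if not record.get("name"):
        errors.append("name is empty")
    price = record.get("price")
    if not isinstance(price, (int, float)) or price <= 0:
        errors.append("price must be a number greater than 0")
    return errors

class DataPipeline:
    """Gatekeeper: only validated records reach the writer."""
    def __init__(self):
        self.written = []  # stands in for the JSONL writer
        self.dropped = 0

    def add_data(self, record):
        problems = validate(record)
        if problems:
            self.dropped += 1  # log and drop instead of writing garbage
            return
        self.written.append(record)

pipeline = DataPipeline()
pipeline.add_data({"name": "Nike Air Max", "price": 12995})
pipeline.add_data({"name": "Nike Air Max", "price": 0})  # soft break: dropped
print(len(pipeline.written), pipeline.dropped)  # 1 1
```

The extraction logic stays untouched; all the quality rules live in one place, which is the pattern Zod and Pydantic formalize below.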
Implementation 1: Node.js with Zod
Zod is a schema declaration and validation library. It’s ideal for Node.js scrapers because it allows us to define the "Product" structure in a single, reusable schema.
1. Install Zod
Navigate to your Node.js project folder and install the package:
npm install zod
2. Define the Schema
In node/puppeteer/product_data/scraper/nike_scraper_product_data_v1.js, define a schema that mirrors the expected output. This ensures the price is a positive number and the productId is not empty.
const { z } = require('zod');

const ProductSchema = z.object({
  productId: z.string().min(1, "Product ID is missing"),
  name: z.string().min(3, "Product name is too short"),
  price: z.number().positive("Price must be greater than 0"),
  currency: z.string().length(3),
  url: z.string().url(),
  availability: z.enum(["in_stock", "out_of_stock"])
});
3. Update the DataPipeline
Modify the addData method in the DataPipeline class to use this schema.
const fs = require('fs');

class DataPipeline {
  constructor(outputFile = CONFIG.outputFile) {
    this.itemsSeen = new Set();
    this.outputFile = outputFile;
    this.errorCount = 0;
  }

  // Dedupe on productId using the itemsSeen set
  isDuplicate(data) {
    if (this.itemsSeen.has(data.productId)) return true;
    this.itemsSeen.add(data.productId);
    return false;
  }

  async addData(scrapedData) {
    // Validate the data before it can reach persistent storage
    const validation = ProductSchema.safeParse(scrapedData);

    if (!validation.success) {
      this.errorCount++;
      console.error(`❌ Validation Failed for ${scrapedData.url}:`, validation.error.format());
      return; // Skip writing this item
    }

    if (!this.isDuplicate(validation.data)) {
      const jsonLine = JSON.stringify(validation.data) + '\n';
      await fs.promises.appendFile(this.outputFile, jsonLine, 'utf8');
      console.log('✅ Saved valid item:', validation.data.name);
    }
  }
}
Implementation 2: Python with Pydantic
For Python developers, Pydantic is the standard choice. The Python scrapers in the repo currently use standard dataclasses. We can upgrade these to Pydantic BaseModels for automatic validation.
1. Define the Pydantic Model
Replace the existing @dataclass with a Pydantic model. Pydantic handles type coercion automatically, such as converting a string "12995" to a float 12995.0.
from pydantic import BaseModel, Field, field_validator

class ScrapedData(BaseModel):
    productId: str = Field(min_length=1)
    name: str = Field(min_length=1)
    price: float = Field(gt=0)
    currency: str = Field(default="INR")
    url: str
    availability: str

    @field_validator('availability')
    @classmethod
    def validate_availability(cls, v: str) -> str:
        if v not in ['in_stock', 'out_of_stock']:
            return 'out_of_stock'
        return v
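As a quick check of that coercion behavior, here is a small standalone example (the model is repeated so the snippet runs on its own; the sample record — its productId, URL, and values — is invented for illustration):

```python
from pydantic import BaseModel, Field, field_validator

# Model from the section above, repeated so this snippet is self-contained
class ScrapedData(BaseModel):
    productId: str = Field(min_length=1)
    name: str = Field(min_length=1)
    price: float = Field(gt=0)
    currency: str = Field(default="INR")
    url: str
    availability: str

    @field_validator('availability')
    @classmethod
    def validate_availability(cls, v: str) -> str:
        return v if v in ['in_stock', 'out_of_stock'] else 'out_of_stock'

# A hypothetical raw record straight from the extractor
item = ScrapedData(
    productId="CW2288-111",
    name="Nike Air Force 1",
    price="12995",           # string in, float out: Pydantic coerces it
    url="https://nike.in/t/air-force-1",
    availability="unknown",  # normalized by the validator
)
print(item.price, item.availability)  # 12995.0 out_of_stock
```

A record with `price="0"` or a missing `productId` would instead raise a `ValidationError`, which the pipeline below catches.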
2. Integrate into the Pipeline
Update the add_data method to catch ValidationError.
from pydantic import ValidationError

class DataPipeline:
    def add_data(self, raw_data: dict):
        try:
            # Trigger validation logic
            validated_data = ScrapedData(**raw_data)
            data_dict = validated_data.model_dump()

            if not self.is_duplicate(data_dict):
                with open(self.jsonl_filename, mode="a", encoding="UTF-8") as f:
                    f.write(json.dumps(data_dict) + "\n")
                logger.info(f"Saved: {validated_data.name}")
        except ValidationError as e:
            logger.error(f"❌ Data Validation Error: {e}")
Error Budgets: The Circuit Breaker Pattern
Validation filters out single bad records, but what if the entire site layout changes? If 100% of your requests fail validation, you should stop the scraper to save on proxy costs and compute resources.
Implement a circuit breaker by tracking the failure rate. If the rate exceeds a specific threshold, such as 50% of the last several items, kill the process.
// Inside DataPipeline.addData (Node.js)
this.totalProcessed++;

if (!validation.success) {
  this.errorCount++;
  const failureRate = this.errorCount / this.totalProcessed;

  if (this.totalProcessed > 10 && failureRate > 0.5) {
    console.error("🚨 CRITICAL: Failure rate above 50%. Site layout likely changed. Aborting.");
    process.exit(1);
  }
  return;
}
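For the Python pipeline, a sliding-window version keeps the check honest over long runs: the cumulative counter never forgets an early burst of failures, while a window of the last N results does. This is a sketch; the window size and threshold are illustrative:

```python
from collections import deque

class CircuitBreaker:
    """Trip when the failure rate over the last `window` results exceeds `threshold`."""
    def __init__(self, window=50, threshold=0.5):
        self.results = deque(maxlen=window)  # True = validation failure
        self.threshold = threshold

    def record(self, failed: bool) -> bool:
        """Record one result; return True if the scraper should abort."""
        self.results.append(failed)
        if len(self.results) < self.results.maxlen:
            return False  # not enough data to judge yet
        return sum(self.results) / len(self.results) > self.threshold

breaker = CircuitBreaker(window=10, threshold=0.5)
tripped = False
for failed in [False] * 4 + [True] * 6:  # 60% of the last 10 items failed
    if breaker.record(failed):
        tripped = True
print(tripped)  # True
```

In the real pipeline, `record(...)` returning `True` would be the signal to log a critical error and raise, mirroring the `process.exit(1)` call in the Node.js snippet.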
Closing the Loop: Regenerating Scrapers
When validation triggers a failure alert or trips the circuit breaker, your extraction logic is likely outdated.
Instead of manually hunting for new CSS selectors, use the ScrapeOps AI Scraper Generator. You can feed the failing Nike URL into the generator to receive updated selectors.
- Validate: Catch the break immediately with Zod or Pydantic.
- Detect: Trigger an alert when the failure rate spikes.
- Regenerate: Use ScrapeOps to get new selectors and resume the pipeline.
Summary
Strict schema validation is what separates a simple script from a professional data pipeline. By moving validation into the DataPipeline class, you ensure high-quality data without overcomplicating your extraction logic.
- Silent failures are more dangerous than crashes because they corrupt your dataset over time.
- Zod and Pydantic enforce data types and business rules at runtime.
- Circuit Breakers protect resources by stopping scrapers when site layouts change significantly.
- Verify everything: Never let data hit your database without validation.
To practice, try cloning the Nike.in-Scrapers repository and adding Zod validation to the product_search scrapers to ensure every result contains a valid price.