AI tools like ChatGPT and GitHub Copilot have fundamentally changed the speed at which we can build web scrapers. We can now generate complex Playwright scripts with accurate CSS selectors and extraction logic in seconds. However, a significant gap exists between a script that runs on your laptop and one that survives in a production environment.
AI is excellent at solving the "extraction problem": the *what* of scraping. It often fails, however, at the "operational problem": the *how* of maintaining reliability over time. When a website changes its layout or a proxy fails, a raw AI-generated script often fails silently, returning empty data while reporting a successful run.
This guide takes a raw AI-generated script from the Dermstore.com-Scrapers repository and transforms it into a reliable, production-ready data pipeline.
Prerequisites
To follow along, you should have:
- Intermediate Python knowledge (decorators, dataclasses, and logging).
- Playwright installed (`pip install playwright`).
- A basic understanding of the `dermstore_scraper_product_data_v1.py` script structure found in the repository.
The Audit: Critiquing the Raw AI Code
The baseline script `python/playwright/product_data/scraper/dermstore_scraper_product_data_v1.py` has solid extraction logic, but it contains three major operational blind spots:
- Unstructured Logging: It uses `logging.basicConfig(level=logging.INFO)`. In production, searching through thousands of lines of plain text for a specific error is inefficient. We need structured JSON logs.
- Silent Data Failures: The `ScrapedData` dataclass uses defaults like `price: float = 0.0`. If the scraper fails to find the price selector, it saves a $0.00 price and continues without an error.
- Lack of Health Metrics: There is no way to see the big picture. We don't know the success rate across 1,000 URLs or how many items failed validation.
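The silent-failure mode in the second point is easy to reproduce. A minimal sketch (the field names mirror the article, not necessarily the repository's exact dataclass):

```python
from dataclasses import dataclass

@dataclass
class ScrapedData:
    name: str = ""
    price: float = 0.0  # a missed selector silently falls back to 0.0

# Simulate a run where the price selector matched nothing:
# the default survives and is saved as if it were real data.
item = ScrapedData(name="Wellbel Women Supplement")
print(item.price)  # 0.0, with no error raised anywhere
```

No exception, no warning, just a plausible-looking record with a $0.00 price. This is exactly the case the validation layer in Step 2 is designed to catch.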
We need to move from a script that just scrapes to a system that observes.
Step 1: Implementing Structured Logging
In production, logs are usually sent to a centralized system like Datadog, ELK, or CloudWatch. These tools work best with JSON-structured logs. Instead of a string like "Saved item to output.jsonl", we want a searchable object.
This refactor includes context, such as the URL being processed and the specific error type.
```python
import logging
import json
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        log_record = {
            "timestamp": self.formatTime(record, self.datefmt),
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name
        }
        if hasattr(record, "context"):
            log_record.update(record.context)
        return json.dumps(log_record)

# Set up the structured logger
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("dermstore_production_scraper")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```
Now, when we log an error in the `extract_data` function, we can provide rich context:
```python
# Inside extract_data
except Exception as e:
    logger.error("extraction_failed", extra={
        "context": {
            "url": page.url,
            "error_type": type(e).__name__,
            "attempt": attempts
        }
    })
```
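To confirm the formatter really emits machine-parseable output, here is a self-contained check. It repeats a trimmed-down `JsonFormatter` so the snippet runs on its own; the logger name and URL are illustrative:

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    # Trimmed copy of the formatter above, repeated for a standalone demo
    def format(self, record):
        log_record = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        if hasattr(record, "context"):
            log_record.update(record.context)
        return json.dumps(log_record)

# Capture log output in memory instead of stdout
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(JsonFormatter())

demo = logging.getLogger("demo_structured")
demo.addHandler(handler)
demo.setLevel(logging.INFO)
demo.propagate = False  # keep the root logger out of the demo

demo.error("extraction_failed", extra={"context": {
    "url": "https://example.com/item", "error_type": "TimeoutError"}})

# The log line is valid JSON, so a log platform can index every field
parsed = json.loads(buffer.getvalue())
print(parsed["error_type"])  # TimeoutError
```

The key detail is that `extra` keys become attributes on the `LogRecord`, which is why the formatter can look for `record.context` and merge it into the JSON object.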
Step 2: Data Quality Validation
A scraper that returns 10,000 products with empty names is worse than a scraper that simply crashes. The latter is a known failure; the former is silent data corruption.
We will extend the `DataPipeline` class found in the repository to include a validation step. We'll define critical fields that must exist for the data to be considered valid.
```python
class DataValidationError(Exception):
    """Custom exception for invalid scraped data."""
    pass

class DataPipeline:
    def __init__(self, jsonl_filename="output.jsonl"):
        self.items_seen = set()
        self.jsonl_filename = jsonl_filename
        self.critical_fields = ['name', 'price', 'productId']

    def validate(self, data: ScrapedData):
        """Ensure critical fields are present and logical."""
        if not data.name or len(data.name) < 2:
            raise DataValidationError(f"Invalid Name: {data.name}")
        if data.price <= 0.0 and data.availability == "in_stock":
            # This is a common AI-scraper failure point
            raise DataValidationError(f"Zero price on in-stock item: {data.productId}")

    def add_data(self, scraped_data: ScrapedData):
        try:
            self.validate(scraped_data)
            # ... existing duplicate checks and saving logic ...
        except DataValidationError as e:
            logger.warning("data_validation_failed", extra={"context": {"reason": str(e)}})
```
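To see the rules in action, here is a standalone sketch of the same validation logic. The `ScrapedData` fields below are illustrative stand-ins for the repository's dataclass:

```python
from dataclasses import dataclass

@dataclass
class ScrapedData:
    name: str = ""
    price: float = 0.0
    productId: str = ""
    availability: str = "in_stock"

class DataValidationError(Exception):
    pass

def validate(data: ScrapedData):
    # Same rules as DataPipeline.validate above
    if not data.name or len(data.name) < 2:
        raise DataValidationError(f"Invalid Name: {data.name}")
    if data.price <= 0.0 and data.availability == "in_stock":
        raise DataValidationError(f"Zero price on in-stock item: {data.productId}")

good = ScrapedData(name="Wellbel Women", price=88.0, productId="13495445")
bad = ScrapedData(name="Wellbel Women", price=0.0, productId="13495445")

validate(good)  # passes silently
try:
    validate(bad)
    outcome = "accepted"
except DataValidationError:
    outcome = "rejected"
print(outcome)  # rejected
```

The zero-price item that the raw script would have happily written to `output.jsonl` is now rejected and logged instead of silently corrupting the dataset.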
Step 3: Operational Metrics and Health Checks
To understand if a scraper is healthy, we need to track metrics over the entire lifecycle of the job. This `ScraperMonitor` class tracks successes, failures, and data quality issues.
```python
from datetime import datetime

class ScraperMonitor:
    def __init__(self):
        self.stats = {
            "start_time": datetime.now().isoformat(),
            "pages_processed": 0,
            "success_count": 0,
            "validation_errors": 0,
            "network_errors": 0
        }

    def record_event(self, event_type: str):
        if event_type in self.stats:
            self.stats[event_type] += 1
        self.stats["pages_processed"] += 1

    def get_report(self):
        duration = datetime.now() - datetime.fromisoformat(self.stats["start_time"])
        self.stats["duration_seconds"] = duration.total_seconds()
        return self.stats

monitor = ScraperMonitor()
```
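A quick standalone check of the counting behavior (the class is repeated here so the snippet runs by itself; the three recorded events are invented for the demo):

```python
from datetime import datetime

class ScraperMonitor:
    # Same monitor as above, repeated for a self-contained example
    def __init__(self):
        self.stats = {
            "start_time": datetime.now().isoformat(),
            "pages_processed": 0,
            "success_count": 0,
            "validation_errors": 0,
            "network_errors": 0,
        }

    def record_event(self, event_type: str):
        if event_type in self.stats:
            self.stats[event_type] += 1
        self.stats["pages_processed"] += 1

    def get_report(self):
        duration = datetime.now() - datetime.fromisoformat(self.stats["start_time"])
        self.stats["duration_seconds"] = duration.total_seconds()
        return self.stats

monitor = ScraperMonitor()
for event in ["success_count", "success_count", "network_errors"]:
    monitor.record_event(event)

report = monitor.get_report()
print(report["pages_processed"], report["success_count"], report["network_errors"])
# 3 2 1
```

Note that `record_event` bumps `pages_processed` on every call, so it should be called exactly once per page attempt; calling it again for a secondary event on the same page would inflate the denominator of the success rate.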
Step 4: Putting It Together
We can now integrate these patterns into the main execution flow. The main loop uses the monitor to provide a final job summary, which is vital for debugging cron jobs.
```python
import asyncio
from playwright.async_api import async_playwright

async def main():
    urls = ["https://www.dermstore.com/wellbel-women-supplement-60-capsules/13495445.html"]
    pipeline = DataPipeline(generate_output_filename())
    monitor = ScraperMonitor()

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        for url in urls:
            page = await browser.new_page()
            try:
                await page.goto(url, timeout=60000)
                data = await extract_data(page)
                if data:
                    data.url = url
                    pipeline.add_data(data)
                    monitor.record_event("success_count")
            except Exception as e:
                monitor.record_event("network_errors")
                logger.error("page_processing_failed", extra={"context": {"url": url, "error": str(e)}})
            finally:
                # Close the page even when extraction fails, so tabs don't leak
                await page.close()

        # Final job summary
        report = monitor.get_report()
        logger.info("job_completed", extra={"context": report})

        # Alert if the failure rate is too high
        if report["pages_processed"] > 0:
            success_rate = report["success_count"] / report["pages_processed"]
            if success_rate < 0.8:
                logger.critical("low_success_rate_alert", extra={"context": {"rate": success_rate}})

        await browser.close()

if __name__ == "__main__":
    asyncio.run(main())
```
To Wrap Up
AI-generated scrapers are a great starting point, but they are prototypes rather than finished products. Wrapping the AI's extraction logic in a professional operational layer ensures that data pipelines are resilient and observable.
Recommended approaches for productionizing scrapers:
- Use Structured Logging: Move from text strings to JSON objects to make errors searchable.
- Apply Active Validation: Don't assume AI selectors will work forever. Validate critical fields like price and name.
- Track Job Metrics: Monitor the success-to-failure ratio to trigger alerts when a site layout changes.
- Prioritize Context: Always log the URL and the specific error type to reduce debugging time.
If you use the Node.js implementations in the Dermstore repository, you can apply these same patterns using libraries like `winston` for logging and `zod` for data validation. For more advanced monitoring and proxy rotation, consider integrating the ScrapeOps Monitoring SDK.