Robert N. Gutierrez
The 5-Minute Repair: A Workflow for Surviving Site Redesigns

It’s Monday morning, and your Slack is already blowing up. The product data that usually flows into your database from Dermstore.com has suddenly dried up. You check the logs, and the scraper is still "working"—it connects, it scrolls, and it exits without error—but every product price is 0 and every name is an empty string.

Dermstore, like many major e-commerce platforms, frequently updates its frontend. A simple change from <span class="product-price"> to <div class="price-v2"> is all it takes to break a traditional scraper. Most developers respond by diving into a multi-hour session of manual DOM inspection and trial-and-error regex.

There is a better way. By using a modular "Hot-Swap" workflow, you can repair broken extraction logic in under five minutes without touching your networking, proxy, or browser automation infrastructure. This guide shows you how to use the Dermstore.com-Scrapers repository to maintain your data pipelines with minimal friction.

Prerequisites

To follow this workflow, you need:

  • A basic understanding of Node.js or Python.
  • A cloned copy of the repository: git clone https://github.com/scraper-bank/Dermstore.com-Scrapers.git
  • Familiarity with CSS selectors or Cheerio/Playwright locators.

Phase 1: Diagnosing the Break

Before fixing the scraper, you need to know exactly how it broke. Web scrapers typically fail in one of two ways: Network Blocks or Layout Changes.

If you see a 403 Forbidden or 429 Too Many Requests error in your console, your proxy or anti-bot headers are the problem. If the script runs successfully but produces "hollow" data, the site has redesigned its layout.

In the Dermstore repository, look at the output generated in your .jsonl files. A silent failure looks like this:

{
  "name": "",
  "price": 0,
  "brand": "",
  "url": "https://www.dermstore.com/example-product.production",
  "availability": "out_of_stock"
}

If the url is correct but the name and price are missing, the extraction logic is outdated. The scraper reached the page, but it couldn't read it.
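The two failure modes can be told apart programmatically, which is handy if you want your pipeline to alert on them differently. Here is a minimal sketch of a "hollow record" check; the field names mirror the .jsonl output above, so adjust them to your own schema:

```javascript
// Sketch: classify a scraped record as "hollow" (the page loaded, but the
// selectors extracted nothing). A valid URL plus empty extraction fields
// points to a layout change, not a network block.
function isHollow(record) {
    const requiredStrings = ['name', 'brand'];
    const emptyStrings = requiredStrings.every((key) => !record[key]);
    const emptyPrice = !record.price || record.price === 0;
    return Boolean(record.url) && emptyStrings && emptyPrice;
}

// The silent failure shown above is flagged; a healthy record is not.
console.log(isHollow({ name: '', price: 0, brand: '', url: 'https://www.dermstore.com/x' })); // true
```

A check like this can run over each .jsonl batch and page you before Monday morning does.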


Phase 2: The Bottleneck of Manual Debugging

The traditional fix is slow. You open the browser, right-click "Inspect," find the new class name, update your code, and run the whole script again. If Dermstore has heavy anti-bot protections, just loading the page to test your new selector can take a minute per attempt.

We solve this by isolating the "Extraction Block." In our repository, the code is split between Infrastructure (Puppeteer/Playwright setup, proxy rotation, retries) and Logic (the extractData function).

By treating the extraction function as a standalone "brain," you can swap it out without re-testing the entire body of the scraper.
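As a rough sketch of that separation (the names here are illustrative, not the repo's exact exports), the "brain" is a pure function and the "body" owns everything browser-related:

```javascript
// Logic block: pure and synchronous, so it is trivial to hot-swap and unit-test.
function extractData(doc, url) {
    return { name: doc.name || '', price: doc.price || 0, url };
}

// Infrastructure block: proxies, retries, and browser automation live here
// and do not change when the site's layout does.
async function scrape(url, fetchPage, extract) {
    const doc = await fetchPage(url); // stands in for Puppeteer + proxy rotation
    return extract(doc, url);
}

// Dry run with a stubbed fetcher, so no network or browser is needed:
const stubFetch = async () => ({ name: 'Demo Serum', price: 42 });
scrape('https://www.dermstore.com/demo', stubFetch, extractData)
    .then((result) => console.log(result));
```

Because `extract` is just a parameter, replacing it never risks your proxy or retry configuration.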


Phase 3: The Hot-Swap Workflow

Let’s walk through the repair process using the Node.js Puppeteer implementation located at node/puppeteer/product_data/scraper/dermstore_scraper_product_data_v1.js.

1. Locate the Target Function

Open the script and find the extractData function. This is a "pure" function: it takes the parsed HTML (as a Cheerio handle) and the page URL, and returns a plain data object.

/**
 * Extract structured data from HTML using Cheerio
 */
function extractData($, url) {
    try {
        const outputData = { url };

        // Extraction logic lives here
        const priceText = $("#product-price").text();
        outputData.price = parsePrice(priceText);

        return outputData;
    } catch (error) {
        console.error('Error extracting data:', error);
        return null;
    }
}
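Before touching the live script, it helps to unit-test a candidate replacement against saved selector values, so each attempt costs milliseconds instead of a full anti-bot-protected page load. A hypothetical sketch (the class names and the fakeCheerio stub are invented for the demo; in practice you would load a saved HTML snapshot with cheerio.load):

```javascript
// Stub that mimics just enough of the Cheerio API ($(selector).text()) to
// exercise a candidate extraction function offline.
function fakeCheerio(selectorMap) {
    return (selector) => ({ text: () => selectorMap[selector] ?? '' });
}

// Same helper the article's extractData relies on: strip currency symbols
// and parse the remainder as a number.
function parsePrice(text) {
    const digits = text.replace(/[^0-9.]/g, '');
    return digits ? parseFloat(digits) : 0;
}

// Candidate replacement with new (hypothetical) selectors for the redesign.
function extractDataV2($, url) {
    return {
        name: $('.product-title-v2').text().trim(),
        price: parsePrice($('.price-v2').text()),
        url,
    };
}

// Dry run against captured selector values:
const $ = fakeCheerio({ '.product-title-v2': 'Demo Cream', '.price-v2': '$19.50' });
console.log(extractDataV2($, 'https://www.dermstore.com/demo'));
```

Once the candidate passes offline, it is safe to swap into the real script.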

2. Regenerate the Logic

Instead of manually hunting for selectors, use an AI scraper builder or the ScrapeOps AI Generator to generate a new extraction function based on the updated Dermstore URL.

3. The Hot-Swap

Once you have the new logic, do not replace your entire file. Copy only the internal logic of the new extractData function and paste it into your existing script.

By keeping the DataPipeline class and the puppeteer-extra configuration untouched, you ensure your proxy settings and duplicate detection remain active. You are simply upgrading the logic while keeping the engine running.

4. Handle Schema Changes

Dermstore often moves data into application/ld+json scripts. If the new site version uses JSON-LD, your swapped function should prioritize it, as shown in the repo's implementation:

let jsonData = null;
$("script[type='application/ld+json']").each((i, el) => {
    try {
        const data = JSON.parse($(el).text());
        if (data["@type"] === "Product") {
            jsonData = data; // Use the structured data if available
        }
    } catch (err) {
        // Ignore malformed or unrelated JSON-LD blocks
    }
});
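That priority order can be made explicit in the swapped function. The sketch below is an assumption, not the repo's exact code: pickProduct is a hypothetical helper, and the offers.price path follows the usual schema.org Product shape. It tries every JSON-LD block first and only falls back to selector-derived values:

```javascript
// Prefer JSON-LD when present; fall back to CSS-selector values otherwise.
// scriptContents stands in for the text of each
// <script type="application/ld+json"> tag collected via Cheerio.
function pickProduct(scriptContents, selectorFallback) {
    for (const text of scriptContents) {
        try {
            const data = JSON.parse(text);
            if (data && data['@type'] === 'Product') {
                return { name: data.name, price: parseFloat(data.offers?.price) };
            }
        } catch {
            // Malformed or unrelated JSON blocks are common on product pages; skip them.
        }
    }
    return selectorFallback; // layout-dependent values as a last resort
}
```

Structured data survives redesigns far better than class names, so this ordering keeps future breaks rarer.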

Phase 4: Regression Testing with Example Data

How do you know your repair didn't change the data format? If your database expects price, but your new logic returns product_price, your pipeline will break further downstream.

The repository includes an example-data folder for every scraper. Use the product_data.json file as a baseline.

  1. Run your repaired scraper on a single Dermstore URL.
  2. Compare the keys in your new .jsonl output against node/puppeteer/product_data/example-data/product_data.json.
  3. Ensure that aggregateRating, brand, and productId still exist and follow the same types. For example, price should be a float, not a string.
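The comparison in step 2 can be automated in a few lines. A minimal sketch (schemaDiff is a hypothetical helper, not part of the repo) that checks a fresh record against a baseline from example-data:

```javascript
// Compare the keys and value types of a freshly scraped record against a
// baseline record, e.g. one loaded from example-data/product_data.json.
function schemaDiff(baseline, candidate) {
    const problems = [];
    for (const key of Object.keys(baseline)) {
        if (!(key in candidate)) {
            problems.push(`missing key: ${key}`);
        } else if (typeof candidate[key] !== typeof baseline[key]) {
            problems.push(`type changed: ${key} (${typeof baseline[key]} -> ${typeof candidate[key]})`);
        }
    }
    return problems;
}

// Example: a fix that renamed price to product_price is caught immediately.
const baseline = { name: 'x', price: 1.0, brand: 'x' };
console.log(schemaDiff(baseline, { name: 'y', product_price: 2, brand: 'z' }));
// -> ['missing key: price']
```

An empty array means the repair is schema-compatible and safe to ship.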

Phase 5: Python vs. Node.js Implementation

This workflow isn't limited to JavaScript. If you use the Playwright Python implementation at python/playwright/product_data/scraper/dermstore_scraper_product_data_v1.py, the pattern is identical.

The Python implementation uses an async def extract_data(page: Page) function. While the syntax differs, the modular philosophy remains:

| Component   | Node.js (Puppeteer)           | Python (Playwright)          |
| ----------- | ----------------------------- | ---------------------------- |
| Automation  | puppeteer.launch()            | async_playwright()           |
| Logic Block | function extractData($, url)  | async def extract_data(page) |
| Data Output | DataPipeline.addData()        | DataPipeline.add_data()      |

Because both implementations in the repo follow this structure, you can apply the hot-swap method regardless of your preferred language.


To Wrap Up

Web scraper maintenance doesn't have to be a chore. By isolating your extraction logic from your browser infrastructure, you turn a complex debugging session into a simple "swap and verify" procedure.

Key Takeaways:

  • Diagnose First: Distinguish between network blocks and selector breaks by checking for empty data versus error codes.
  • Modularize: Keep your extractData function separate from your proxy and browser setup.
  • Verify Schema: Use the repository's example-data to ensure your fix doesn't break your database downstream.
  • Automate Repairs: Use AI tools to generate selectors, then manually hot-swap them into your proven infrastructure.

To get started, clone the Dermstore Scrapers repository and run the existing scripts. When the next site update hits, you’ll be ready to fix it in five minutes flat.
