Building a basic web scraper for a handful of pages is a straightforward task. However, when you scale that script to scrape thousands of products from a major retailer like Costco, the complexity shifts from parsing HTML to keeping the process alive.
Costco employs aggressive anti-bot measures, strict rate limits, and a massive catalog that can easily overwhelm a naive script. If your scraper stores data in memory and crashes at product #999, you lose everything. If it gets blocked by a 403 Forbidden error, the entire run stops.
To build a reliable Costco scraper in Node.js, you need to focus on fault tolerance, stream writing, and proxy management. We can look at the architectural patterns found in the Costco.com-Scrapers repository to see how this works in practice.
Prerequisites
To follow along, you should have:
- Node.js installed (v16+ recommended).
- A basic understanding of Puppeteer or Playwright.
- A ScrapeOps API Key for proxy rotation.
The Configuration Strategy: Error Budgets
Production scraping requires a shift in mindset. You must assume that network requests will fail. Instead of using hardcoded values, define a CONFIG object that establishes an "error budget."
The configuration pattern used in the repository's costco_scraper_product_data_v1.js looks like this:
```javascript
const CONFIG = {
  maxRetries: 3,
  maxConcurrency: 1,
  timeout: 180000, // 3 minutes
  outputFile: generateOutputFilename()
};
```
Why these numbers matter:
- Timeout Math: Standard 30-second timeouts are often too aggressive for Costco. When using residential proxies or waiting for heavy JavaScript to render, pages can take over a minute to stabilize. Setting the timeout to 3 minutes ensures you don't kill a request that was actually succeeding slowly.
- Concurrency Control: While it's tempting to set concurrency to 10 or 20, Costco's anti-bot system is sensitive to request patterns. Starting with a single concurrent request prioritizes session stability over raw speed.
- The Error Budget: The maxRetries setting defines the budget. If a page fails three times, the script logs the failure and moves on rather than hanging indefinitely.
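The error-budget idea can be sketched as a small retry wrapper. Note that withRetries and its signature are illustrative, not repository code; in practice you would pass CONFIG.maxRetries as the budget:

```javascript
// Illustrative sketch: spend the retry budget on a failing task, then log
// the failure and move on instead of hanging the whole run.
async function withRetries(task, label, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await task();
    } catch (error) {
      console.warn(`${label}: attempt ${attempt} failed (${error.message})`);
    }
  }
  console.error(`${label}: error budget exhausted, skipping`);
  return null; // the caller records the miss and continues with the next page
}
```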
Proxy Rotation with ScrapeOps
Costco quickly flags and bans datacenter IP addresses. To scrape at scale, you need residential proxies that rotate with every request. The repository integrates ScrapeOps to handle this automatically.
Configure the proxy server within the Puppeteer or Playwright setup:
```javascript
const PROXY_SERVER = 'residential-proxy.scrapeops.io';
const PROXY_PORT = '8181';
const PROXY_USERNAME = 'scrapeops';
const PROXY_PASSWORD = API_KEY; // your ScrapeOps API key

const browser = await puppeteer.launch({
  args: [`--proxy-server=${PROXY_SERVER}:${PROXY_PORT}`]
});

// Residential proxies are authenticated, so supply the credentials per page
const page = await browser.newPage();
await page.authenticate({
  username: PROXY_USERNAME,
  password: PROXY_PASSWORD
});
```
By routing requests through a residential proxy, the scraper appears to Costco as a legitimate home user. When combined with the puppeteer-extra-plugin-stealth, you can bypass common fingerprinting techniques that trigger CAPTCHAs or 403 errors.
The DataPipeline Class: Stream Writing
A common mistake in web scraping is storing all results in a large array and saving it to a file at the very end.
This is risky. If the script crashes after four hours of work, that in-memory array vanishes with the process and your data is gone. Holding 50,000 product objects in memory can also lead to memory exhaustion.
The repository solves this using a DataPipeline class that implements JSONL (JSON Lines) stream writing:
```javascript
class DataPipeline {
  constructor(outputFile = CONFIG.outputFile) {
    this.itemsSeen = new Set();
    this.outputFile = outputFile;
    this.writeFile = promisify(fs.appendFile);
  }

  async addData(scrapedData) {
    if (!this.isDuplicate(scrapedData)) {
      try {
        // Convert the object to a string and append a newline
        const jsonLine = JSON.stringify(scrapedData) + '\n';
        // Write immediately to disk
        await this.writeFile(this.outputFile, jsonLine, 'utf8');
        console.log('Saved item to', this.outputFile);
      } catch (error) {
        console.error('Error saving data:', error);
      }
    }
  }
}
```
The Benefits of JSONL:
- Persistence: Every time addData is called, the item is physically written to disk. A crash only loses the item currently being processed.
- Memory Efficiency: You only hold one product object in memory at a time.
- Database Friendly: Most modern data warehouses, such as BigQuery and Snowflake, prefer JSONL for bulk imports.
Deduplication Logic
When running a scraper with retries, there is a risk of saving the same data twice if a network hiccup occurs after the data is fetched but before the request is marked complete.
The DataPipeline uses a Set called itemsSeen to prevent this. Before writing to the file, the isDuplicate method checks if the item is already recorded.
```javascript
isDuplicate(data) {
  const itemKey = data.productId || JSON.stringify(data);
  if (this.itemsSeen.has(itemKey)) {
    console.warn('Duplicate item found, skipping');
    return true;
  }
  this.itemsSeen.add(itemKey);
  return false;
}
```
Using the Costco productId as a key ensures the final dataset is clean, saving hours of post-processing work.
Implementing Checkpointing: Resuming Scrapes
Production scrapers often need to resume after a fatal error, such as a power outage or system update. Currently, if you restart the script, the itemsSeen Set is empty, meaning the scraper might re-process thousands of items it already saved.
You can upgrade the DataPipeline to support checkpointing by reading the existing file on startup.
The Upgrade: Adding Checkpointing
Add this logic to the DataPipeline constructor to make the scraper resume-aware:
```javascript
class DataPipeline {
  constructor(outputFile = CONFIG.outputFile) {
    this.outputFile = outputFile;
    this.itemsSeen = new Set();
    this.writeFile = promisify(fs.appendFile);

    if (fs.existsSync(this.outputFile)) {
      console.log(`Loading existing data from ${this.outputFile}...`);
      const fileContent = fs.readFileSync(this.outputFile, 'utf8');
      fileContent.split('\n').forEach(line => {
        if (line.trim()) {
          try {
            const item = JSON.parse(line);
            this.itemsSeen.add(item.productId || JSON.stringify(item));
          } catch (e) {
            // Skip malformed lines
          }
        }
      });
      console.log(`Resuming with ${this.itemsSeen.size} items already indexed.`);
    }
  }
}
```
With this modification, the scraper becomes much more resilient. You can stop and start the process at will, and it will only scrape what is missing from the output file.
To Wrap Up
Building a production-grade Costco scraper requires moving beyond simple DOM parsing and focusing on data integrity and network resilience.
- Use Error Budgets: Set long timeouts and reasonable retries to handle network variance.
- Rotate Residential Proxies: Use a service like ScrapeOps to bypass IP-based blocking.
- Stream to JSONL: Write data to disk line-by-line to prevent data loss.
- Implement Checkpointing: Populate your deduplication Set from existing files to allow the scraper to resume after a crash.
To see these patterns in action, you can clone the official repository and try implementing the checkpointing logic yourself.