Erika S. Adkins

Refactoring the Nike Scraper: Replacing In-Memory Deduplication with SQLite

Building a web scraper is usually straightforward. Building a resilient scraper that handles thousands of URLs without flinching is where the challenge lies. If you have been following our Nike.in-Scrapers repository, you will notice the Node.js implementations use a common pattern for deduplication: an in-memory Set.

While this works for small batches, it has a major flaw: volatility. If your script crashes at 90% completion because of a network timeout or a proxy error, your Set vanishes. When you restart, you start from zero, re-scraping data you already have.

This guide walks through refactoring the DataPipeline class in our Nike Puppeteer scraper to use SQLite. This change transforms the scraper from a one-shot script into a resumable data pipeline.

The Problem: In-Memory Deduplication

The current state of the pipeline in node/puppeteer/product_data/scraper/nike_scraper_product_data_v1.js looks like this:

class DataPipeline {
    constructor(outputFile = CONFIG.outputFile) {
        this.itemsSeen = new Set(); // The volatile culprit
        this.outputFile = outputFile;
        this.writeFile = promisify(fs.appendFile);
    }

    isDuplicate(data) {
        const itemKey = JSON.stringify(data);
        if (this.itemsSeen.has(itemKey)) {
            console.warn('Duplicate item found, skipping');
            return true;
        }
        this.itemsSeen.add(itemKey);
        return false;
    }
    // ... addData logic
}

This implementation has three main issues:

  1. No Persistence: If the process exits, everything stored in itemsSeen is lost.
  2. RAM Bloat: As you scrape tens of thousands of products, the Set grows without bound, which can lead to out-of-memory errors.
  3. Fragile, Expensive Keys: Using JSON.stringify(data) as a key is slow and memory-intensive, and two objects with the same fields in a different order produce different keys.
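The key problem is easy to reproduce. JSON.stringify preserves property insertion order, so the same product serialized from two differently ordered objects produces two different Set entries. A quick illustration (the product values are made up):

```javascript
// Two records describing the same product, with fields in a different order.
const a = { productId: 'CW2288-111', name: 'Air Force 1' };
const b = { name: 'Air Force 1', productId: 'CW2288-111' };

// The stringified keys differ, so a Set would treat these as distinct items.
console.log(JSON.stringify(a) === JSON.stringify(b)); // false

// A stable identifier avoids the problem entirely.
console.log(a.productId === b.productId); // true
```

This is one more reason to deduplicate on a dedicated identifier rather than the whole payload.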

The Solution: better-sqlite3

To fix this, we need persistent state that lives on disk. SQLite is a great candidate because it is serverless, requires no configuration, and stores everything in a single file.

We will use the better-sqlite3 library. Unlike the standard sqlite3 package, it has a synchronous API that fits the iterative loops of a web scraper, and it is much faster for high-volume writes.

Prerequisites

To follow along, clone the repository and install the library:

git clone https://github.com/scraper-bank/Nike.in-Scrapers.git
cd Nike.in-Scrapers
npm install better-sqlite3

Refactoring the DataPipeline Class

We can swap the Set for a database connection while keeping the interface of the DataPipeline similar. This avoids a complete rewrite of the scraper logic.

Step 1: Initialization and Schema

First, set up the database in the constructor. We need a table to store the unique identifiers of products we have already processed.

const Database = require('better-sqlite3');

class SqliteDataPipeline {
    constructor(dbPath = 'scraper_history.db', outputFile = CONFIG.outputFile) {
        this.db = new Database(dbPath);
        this.outputFile = outputFile;
        this.writeFile = promisify(fs.appendFile);

        // Create the table if it doesn't exist
        this.db.exec(`
            CREATE TABLE IF NOT EXISTS seen_items (
                id TEXT PRIMARY KEY,
                timestamp INTEGER
            )
        `);

        // Prepare statements for performance
        this.existsStmt = this.db.prepare('SELECT 1 FROM seen_items WHERE id = ?');
        this.insertStmt = this.db.prepare('INSERT INTO seen_items (id, timestamp) VALUES (?, ?)');
    }
}

Step 2: Efficient Deduplication Logic

Instead of checking a Set, we query the database. We will also move away from stringifying the whole object. Nike products usually have a productId or a unique url we can use instead.

    isDuplicate(productId) {
        if (!productId) return false;

        const row = this.existsStmt.get(productId);
        if (row) {
            console.warn(`Item ${productId} already scraped. Skipping...`);
            return true;
        }
        return false;
    }

    recordItem(productId) {
        // better-sqlite3 runs synchronously, so no async/await is needed here
        this.insertStmt.run(productId, Date.now());
    }

Step 3: Integrating with the Save Loop

Finally, update the addData method to tie it together. The script checks the database before saving to the .jsonl file.

    async addData(scrapedData) {
        const pId = scrapedData.productId || scrapedData.url;

        // Guard against missing identifiers: inserting undefined into
        // better-sqlite3 would throw a binding error
        if (!pId) {
            console.error('Could not find a unique ID for item');
            return;
        }

        if (!this.isDuplicate(pId)) {
            try {
                const jsonLine = JSON.stringify(scrapedData) + '\n';
                await this.writeFile(this.outputFile, jsonLine, 'utf8');

                // Persist to DB so we don't scrape this again next time
                this.recordItem(pId);

                console.log('Saved item:', pId);
            } catch (error) {
                console.error('Error saving data:', error);
            }
        }
    }
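The `scrapedData.productId || scrapedData.url` fallback deserves a second look: `||` falls through on any falsy value, so an empty-string productId also falls back to the URL. A small sketch of the behavior (the helper and sample objects are hypothetical, mirroring the expression in addData):

```javascript
// Hypothetical helper mirroring the fallback expression used in addData.
const dedupKey = (item) => item.productId || item.url || null;

console.log(dedupKey({ productId: 'DV1234-001', url: 'https://nike.com/p/1' })); // 'DV1234-001'
console.log(dedupKey({ productId: '', url: 'https://nike.com/p/2' }));           // falls back to the URL
console.log(dedupKey({}));                                                       // null -> item is skipped
```

Whichever key wins must stay stable across runs, otherwise the database cannot recognize a product it has already seen.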

Handling Graceful Shutdowns

When using file-based databases like SQLite, it is best to close the connection when the script finishes or is interrupted. This prevents database corruption.

Add a listener to the main execution flow:

const pipeline = new SqliteDataPipeline();

const shutdown = () => {
    console.log('\nClosing database connection...');
    pipeline.db.close();
    process.exit();
};

process.on('SIGINT', shutdown);  // Ctrl+C
process.on('SIGTERM', shutdown); // kill / container stop

The Complete Refactored Pipeline

The updated DataPipeline now uses a reliable SQLite backend instead of a volatile Set.

const Database = require('better-sqlite3');
const fs = require('fs');
const { promisify } = require('util');

class SqliteDataPipeline {
    constructor(dbPath = 'nike_history.db', outputFile = CONFIG.outputFile) {
        this.db = new Database(dbPath);
        this.outputFile = outputFile;
        this.writeFile = promisify(fs.appendFile);

        this.db.exec(`
            CREATE TABLE IF NOT EXISTS seen_items (
                id TEXT PRIMARY KEY,
                timestamp INTEGER
            ) WITHOUT ROWID; -- Optimization for simple key-value storage
        `);

        this.existsStmt = this.db.prepare('SELECT 1 FROM seen_items WHERE id = ?');
        this.insertStmt = this.db.prepare('INSERT INTO seen_items (id, timestamp) VALUES (?, ?)');
    }

    async addData(scrapedData) {
        const pId = scrapedData.productId || scrapedData.url;

        if (!pId) {
            console.error("Could not find a unique ID for item");
            return;
        }

        if (this.existsStmt.get(pId)) {
            console.warn(`Duplicate found in DB: ${pId}. Skipping.`);
            return;
        }

        try {
            const jsonLine = JSON.stringify(scrapedData) + '\n';
            await this.writeFile(this.outputFile, jsonLine, 'utf8');
            this.insertStmt.run(pId, Date.now());
            console.log('Successfully persisted:', pId);
        } catch (error) {
            console.error('Persistence error:', error);
        }
    }
}

Testing Resilience

The value of this refactor is best seen by testing a failure:

  1. Start the Scraper: Run your script and let it scrape 10–20 products.
  2. Force a Crash: Use Ctrl+C to kill the process.
  3. Restart: Run the script again.

Instead of re-saving the first 20 products, the console will show "Duplicate found in DB... skipping." The scraper is now resumable, allowing it to skip previously processed URLs without wasting bandwidth or proxy credits.
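You can also inspect the history database directly, assuming the sqlite3 command-line tool is installed. The commands below build a throwaway demo.db with the same schema so they are safe to run anywhere; in practice, point them at nike_history.db instead:

```shell
# Create the same schema in a throwaway database and record one item.
sqlite3 demo.db "CREATE TABLE IF NOT EXISTS seen_items (id TEXT PRIMARY KEY, timestamp INTEGER);"
sqlite3 demo.db "INSERT OR IGNORE INTO seen_items VALUES ('CW2288-111', 0);"

# Count how many items the scraper has already processed.
sqlite3 demo.db "SELECT COUNT(*) FROM seen_items;"
```

Because the id column is the primary key, re-running the INSERT OR IGNORE line leaves the count unchanged, which is exactly the dedup guarantee the pipeline relies on.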

To Wrap Up

Moving deduplication logic from RAM to SQLite improves the architecture of the Nike scraper. This approach is more memory-efficient and provides the persistence needed for large-scale data extraction.

Key Takeaways

  • In-memory storage is fine for small scripts, but persistent storage is required for production pipelines.
  • SQLite maintains state across script restarts with no configuration needed.
  • better-sqlite3 provides a high-performance API that simplifies scraping logic.
  • Deduplicate early to save bandwidth and avoid triggering anti-bot detection with redundant requests.

If you are starting a new project, you can use the ScrapeOps AI Scraper Generator to build the base extraction logic, then add your own persistence layer for production use.
