Max B.

Posted on • Originally published at domharvest.github.io

Building a Production Web Scraper: A Real-World Case Study

Building a web scraper that works on your laptop is easy. Making it reliable in production is hard. Here's how I used domharvest-playwright to build a scraper that's been running smoothly for months.

The Challenge

Goal: Scrape product listings from an e-commerce site daily
Volume: ~10,000 products
Requirements:

  • Run daily at 2 AM UTC
  • Handle pagination (200+ pages)
  • Detect and skip unchanged products
  • Alert on failures
  • Store results in PostgreSQL

Why domharvest-playwright?

I evaluated several options:

| Tool | Pro | Con |
| --- | --- | --- |
| Cheerio | Fast | No JavaScript execution |
| Puppeteer | Powerful | Complex API |
| Scrapy | Battle-tested | Python (team uses JS) |
| domharvest-playwright | Simple + JS rendering | New tool |

domharvest won because it handled JavaScript-heavy pages without the complexity of raw Playwright.
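
To give a sense of the API, here's the minimal call shape used throughout this post: create a harvester, point it at a URL with a CSS selector and an extractor function, then close it. This is a sketch assembled from the snippets below, not official documentation:

import { DOMHarvester } from 'domharvest-playwright'

// Launch a headless browser, harvest one page, shut everything down
const harvester = new DOMHarvester({ headless: true })
await harvester.init()

const names = await harvester.harvest(
  'https://example.com/products',  // page to load
  '.product-card',                 // selector matching each item
  (el) => el.querySelector('.product-name')?.textContent?.trim()  // extractor run per match
)

await harvester.close()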

Architecture Overview

┌─────────────┐
│   Cron Job  │
└──────┬──────┘
       │
       ▼
┌─────────────────┐
│  Scraper Worker │
│  (Node.js)      │
└────────┬────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌───────┐  ┌──────────┐
│ Page  │  │ Product  │
│ Queue │  │ Extractor│
└───┬───┘  └────┬─────┘
    │           │
    └─────┬─────┘
          ▼
    ┌──────────┐
    │PostgreSQL│
    └──────────┘

Implementation

1. Page Queue System

Handle pagination reliably:

import { DOMHarvester } from 'domharvest-playwright'

async function scrapeAllPages(baseUrl, maxPages = 200) {
  const harvester = new DOMHarvester({ headless: true })
  await harvester.init()

  const allProducts = []

  for (let page = 1; page <= maxPages; page++) {
    const url = `${baseUrl}?page=${page}`

    try {
      const products = await harvester.harvest(
        url,
        '.product-card',
        extractProduct
      )

      if (products.length === 0) break // No more pages

      allProducts.push(...products)
      console.log(`Page ${page}: ${products.length} products`)

      // Rate limiting
      await sleep(1000)

    } catch (error) {
      console.error(`Failed on page ${page}:`, error)
      // Continue to next page instead of failing completely
    }
  }

  await harvester.close()
  return allProducts
}

// Simple delay helper used for the rate limiting above
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

Key decisions:

  • Break on empty results (detect end of pagination)
  • Continue on individual page failures
  • Rate limiting with a one-second sleep between pages (~1 request/sec)

2. Product Extraction

Extract structured data consistently:

function extractProduct(element) {
  return {
    id: element.querySelector('[data-product-id]')
      ?.getAttribute('data-product-id'),
    name: element.querySelector('.product-name')
      ?.textContent?.trim(),
    price: parsePrice(
      element.querySelector('.price')?.textContent
    ),
    imageUrl: element.querySelector('.product-image')
      ?.getAttribute('src'),
    inStock: !element.querySelector('.out-of-stock'),
    url: element.querySelector('a')
      ?.getAttribute('href'),
    scrapedAt: new Date().toISOString()
  }
}

function parsePrice(priceText) {
  if (!priceText) return null
  const match = priceText.match(/[\d,]+\.?\d*/)?.[0]
  // Strip every thousands separator (not just the first) before parsing
  return match ? parseFloat(match.replace(/,/g, '')) : null
}

Defensive extraction:

  • Optional chaining everywhere (?.)
  • Trim whitespace
  • Parse prices consistently (see the quick checks below)
  • Handle missing elements gracefully
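
A few console.assert checks make the defensive parsing concrete; the inputs below are just illustrative strings, not data from the target site:

// Quick sanity checks for parsePrice
console.assert(parsePrice('$1,299.99') === 1299.99)
console.assert(parsePrice('From $49') === 49)
console.assert(parsePrice('Out of stock') === null)
console.assert(parsePrice(null) === null)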

3. Change Detection

Only store what changed:

import { createHash } from 'crypto'

async function saveProducts(products, db) {
  let newCount = 0
  let updatedCount = 0

  for (const product of products) {
    const hash = hashProduct(product)
    const existing = await db.findProduct(product.id)

    if (!existing) {
      await db.insertProduct({ ...product, hash })
      newCount++
    } else if (existing.hash !== hash) {
      await db.updateProduct(product.id, { ...product, hash })
      updatedCount++
    }
    // Skip unchanged products
  }

  return { newCount, updatedCount }
}

function hashProduct(product) {
  const relevant = {
    price: product.price,
    inStock: product.inStock,
    name: product.name
  }
  return createHash('md5')
    .update(JSON.stringify(relevant))
    .digest('hex')
}

This reduced database writes by 90%.
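
The db helper itself isn't shown in the post; one plausible shape on top of node-postgres (pg) looks like this, assuming a products table whose columns mirror the extracted fields (table and column names are illustrative):

import pg from 'pg'

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL })

const db = {
  findProduct: async (id) => {
    const { rows } = await pool.query(
      'SELECT id, hash FROM products WHERE id = $1',
      [id]
    )
    return rows[0] ?? null
  },
  insertProduct: (p) =>
    pool.query(
      `INSERT INTO products (id, name, price, image_url, in_stock, url, scraped_at, hash)
       VALUES ($1, $2, $3, $4, $5, $6, $7, $8)`,
      [p.id, p.name, p.price, p.imageUrl, p.inStock, p.url, p.scrapedAt, p.hash]
    ),
  updateProduct: (id, p) =>
    pool.query(
      `UPDATE products
         SET name = $2, price = $3, image_url = $4, in_stock = $5,
             url = $6, scraped_at = $7, hash = $8
       WHERE id = $1`,
      [id, p.name, p.price, p.imageUrl, p.inStock, p.url, p.scrapedAt, p.hash]
    ),
  close: () => pool.end()
}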

4. Error Handling & Alerts

Fail gracefully, alert loudly:

async function runScraper() {
  const startTime = Date.now()
  let status = 'success'
  let errorMessage = null

  try {
    const products = await scrapeAllPages(
      'https://example.com/products',
      200
    )

    const { newCount, updatedCount } = await saveProducts(
      products,
      db
    )

    await sendMetrics({
      totalProducts: products.length,
      newProducts: newCount,
      updatedProducts: updatedCount,
      duration: Date.now() - startTime
    })

  } catch (error) {
    status = 'failed'
    errorMessage = error.message

    await sendAlert({
      message: `Scraper failed: ${error.message}`,
      stack: error.stack,
      timestamp: new Date().toISOString()
    })

    throw error
  } finally {
    // Always log the run outcome and release the database connection
    console.log(`Scraper run finished: ${status}`, errorMessage ?? '')
    await db.close()
  }
}

Alerts go to Slack via webhook when things break.
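
sendAlert only needs a few lines against a Slack incoming webhook; this sketch assumes the webhook URL lives in SLACK_WEBHOOK_URL and relies on the global fetch available in Node 18+:

// Post a failure notification to a Slack incoming webhook
async function sendAlert({ message, stack, timestamp }) {
  await fetch(process.env.SLACK_WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `:rotating_light: ${message}\n\`\`\`${stack}\`\`\`\n${timestamp}`
    })
  })
}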

5. Deployment

Running on a small VPS with systemd:

# /etc/systemd/system/product-scraper.timer
[Unit]
Description=Daily product scraper

[Timer]
# Fire once a day at 02:00 UTC
OnCalendar=*-*-* 02:00:00 UTC
Persistent=true

[Install]
WantedBy=timers.target
# /etc/systemd/system/product-scraper.service
[Unit]
Description=Product Scraper Service

[Service]
Type=oneshot
User=scraper
WorkingDirectory=/opt/scraper
ExecStart=/usr/bin/node index.js
StandardOutput=journal
StandardError=journal
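
The index.js that ExecStart launches only needs to invoke the scraper and exit with the right status code. A minimal sketch, assuming the scraper code lives in scraper.js (that module split is an assumption, not from the post):

// index.js – entry point run by the oneshot service
import { runScraper } from './scraper.js'

runScraper()
  .then(() => process.exit(0))
  .catch(() => process.exit(1)) // non-zero exit marks the unit run as failed

Enable the schedule once with systemctl enable --now product-scraper.timer and the timer takes it from there.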

Challenges & Solutions

Challenge 1: Memory Leaks

Problem: After ~5000 products, Node.js process would crash (OOM)

Solution: Process in batches, close pages explicitly

const BATCH_SIZE = 100

for (let i = 0; i < products.length; i += BATCH_SIZE) {
  const batch = products.slice(i, i + BATCH_SIZE)
  await processBatch(batch)

  // Force GC hint (only available when node runs with --expose-gc)
  if (global.gc) global.gc()
}
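
processBatch isn't shown above; if all it does is persist one slice of results, a minimal version could look like this (hypothetical, reusing the helpers from earlier):

// Hypothetical processBatch: persist one slice, then let it fall out of scope
async function processBatch(batch) {
  const { newCount, updatedCount } = await saveProducts(batch, db)
  console.log(`Batch saved: ${newCount} new, ${updatedCount} updated`)
}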

Challenge 2: Flaky Selectors

Problem: Site occasionally changed class names

Solution: Multiple fallback selectors

function extractPrice(element) {
  const selectors = [
    '.price-current',
    '.product-price',
    '[data-price]'
  ]

  for (const selector of selectors) {
    const el = element.querySelector(selector)
    if (el) return parsePrice(el.textContent)
  }

  return null // Price not found
}
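
The same fallback idea generalizes to any field; a small helper keeps the extractors tidy (a variation on the pattern above, not code from the original scraper):

// Return the first element matching any of the candidate selectors
function queryFirst(element, selectors) {
  for (const selector of selectors) {
    const el = element.querySelector(selector)
    if (el) return el
  }
  return null
}

// Inside an extractor: tolerate a couple of naming variants for the product name
const nameEl = queryFirst(element, ['.product-name', '[itemprop="name"]', 'h2'])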

Challenge 3: Rate Limiting

Problem: Got blocked after ~100 pages

Solution: Randomized delays + user agent rotation

const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...',
  // ...more
]

const harvester = new DOMHarvester({
  contextOptions: {
    userAgent: USER_AGENTS[
      Math.floor(Math.random() * USER_AGENTS.length)
    ]
  }
})

// Random delay 1-3 seconds
await sleep(1000 + Math.random() * 2000)

Results

After 3 months in production:

  • ✅ 99.2% uptime
  • ✅ ~300,000 products scraped
  • ✅ Average runtime: 45 minutes
  • ✅ Zero manual interventions needed
  • ✅ Database size: 2.3 GB

Lessons Learned

  1. Batch processing is essential for large datasets
  2. Change detection saves money on storage and bandwidth
  3. Graceful degradation > perfection - skip failed pages, continue scraping
  4. Monitoring is non-negotiable - you need to know when things break
  5. Rate limiting is ethical and practical - don't hammer servers

Code Simplicity Matters

The entire scraper is ~400 lines of code. domharvest-playwright's simple API meant I could focus on business logic, not browser automation complexity.

Try It Yourself

npm install domharvest-playwright

The patterns here work for scraping anything from product listings to job boards to real estate sites.



What's your experience with production scrapers? Share in the comments!
