Building a web scraper that works on your laptop is easy. Making it reliable in production is hard. Here's how I used domharvest-playwright to build a scraper that's been running smoothly for months.
The Challenge
Goal: Scrape product listings from an e-commerce site daily
Volume: ~10,000 products
Requirements:
- Run daily at 2 AM UTC
- Handle pagination (200+ pages)
- Detect and skip unchanged products
- Alert on failures
- Store results in PostgreSQL
Why domharvest-playwright?
I evaluated several options:
| Tool | Pro | Con |
|---|---|---|
| Cheerio | Fast | No JavaScript execution |
| Puppeteer | Powerful | Complex API |
| Scrapy | Battle-tested | Python (team uses JS) |
| domharvest-playwright | Simple + JS rendering | New tool |
domharvest won because it handled JavaScript-heavy pages without the complexity of raw Playwright.
Architecture Overview
┌─────────────┐
│  Cron Job   │
└──────┬──────┘
       │
       ▼
┌─────────────────┐
│ Scraper Worker  │
│    (Node.js)    │
└────────┬────────┘
         │
    ┌────┴──────┐
    ▼           ▼
┌───────┐  ┌──────────┐
│ Page  │  │ Product  │
│ Queue │  │ Extractor│
└───┬───┘  └────┬─────┘
    │           │
    └─────┬─────┘
          ▼
     ┌──────────┐
     │PostgreSQL│
     └──────────┘
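Everything the timer ultimately runs is a small index.js entry point. A sketch of it (the module split is just how I happened to organize things; runScraper is shown in full in section 4):

// index.js — invoked by the systemd timer via `node index.js`
import { runScraper } from './scraper.js'

runScraper()
  .then(() => process.exit(0))
  .catch(() => process.exit(1))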
Implementation
1. Page Queue System
Handle pagination reliably:
import { DOMHarvester } from 'domharvest-playwright'

// Small helper used for rate limiting between requests
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function scrapeAllPages(baseUrl, maxPages = 200) {
  const harvester = new DOMHarvester({ headless: true })
  await harvester.init()

  const allProducts = []

  for (let page = 1; page <= maxPages; page++) {
    const url = `${baseUrl}?page=${page}`

    try {
      const products = await harvester.harvest(
        url,
        '.product-card',
        extractProduct
      )

      if (products.length === 0) break // No more pages

      allProducts.push(...products)
      console.log(`Page ${page}: ${products.length} products`)

      // Rate limiting
      await sleep(1000)
    } catch (error) {
      console.error(`Failed on page ${page}:`, error)
      // Continue to next page instead of failing completely
    }
  }

  await harvester.close()
  return allProducts
}
Key decisions:
- Break on empty results (detect end of pagination)
- Continue on individual page failures
- Built-in rate limiting (1 req/sec)
2. Product Extraction
Extract structured data consistently:
function extractProduct(element) {
  return {
    id: element.querySelector('[data-product-id]')
      ?.getAttribute('data-product-id'),
    name: element.querySelector('.product-name')
      ?.textContent?.trim(),
    price: parsePrice(
      element.querySelector('.price')?.textContent
    ),
    imageUrl: element.querySelector('.product-image')
      ?.getAttribute('src'),
    inStock: !element.querySelector('.out-of-stock'),
    url: element.querySelector('a')
      ?.getAttribute('href'),
    scrapedAt: new Date().toISOString()
  }
}
function parsePrice(priceText) {
  if (!priceText) return null
  const match = priceText.match(/[\d,]+\.?\d*/)?.[0]
  // Strip every thousands separator, not just the first one
  return match ? parseFloat(match.replace(/,/g, '')) : null
}
Defensive extraction:
- Optional chaining (?.) everywhere
- Trim whitespace
- Parse prices consistently
- Handle missing elements gracefully
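As a quick illustration of the last two points, here's what parsePrice does with a few typical inputs:

parsePrice('$1,299.99')   // => 1299.99
parsePrice('Sold out')    // => null
parsePrice(undefined)     // => null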
3. Change Detection
Only store what changed:
import { createHash } from 'crypto'

async function saveProducts(products, db) {
  let newCount = 0
  let updatedCount = 0

  for (const product of products) {
    const hash = hashProduct(product)
    const existing = await db.findProduct(product.id)

    if (!existing) {
      await db.insertProduct({ ...product, hash })
      newCount++
    } else if (existing.hash !== hash) {
      await db.updateProduct(product.id, { ...product, hash })
      updatedCount++
    }
    // Skip unchanged products
  }

  return { newCount, updatedCount }
}

function hashProduct(product) {
  const relevant = {
    price: product.price,
    inStock: product.inStock,
    name: product.name
  }
  return createHash('md5')
    .update(JSON.stringify(relevant))
    .digest('hex')
}
This reduced database writes by 90%.
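The db helpers aren't shown above, so here's a minimal sketch of the shape saveProducts expects, built on node-postgres (pg); the table layout (a products table with id, data, and hash columns) is just for illustration:

import pg from 'pg'

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL })

// Hypothetical schema: products(id TEXT PRIMARY KEY, data JSONB, hash TEXT)
const db = {
  findProduct: async (id) => {
    const { rows } = await pool.query(
      'SELECT id, hash FROM products WHERE id = $1',
      [id]
    )
    return rows[0] ?? null
  },
  insertProduct: (product) =>
    pool.query(
      'INSERT INTO products (id, data, hash) VALUES ($1, $2, $3)',
      [product.id, JSON.stringify(product), product.hash]
    ),
  updateProduct: (id, product) =>
    pool.query(
      'UPDATE products SET data = $2, hash = $3 WHERE id = $1',
      [id, JSON.stringify(product), product.hash]
    ),
  close: () => pool.end()
}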
4. Error Handling & Alerts
Fail gracefully, alert loudly:
async function runScraper() {
  const startTime = Date.now()
  let status = 'success'
  let errorMessage = null

  try {
    const products = await scrapeAllPages(
      'https://example.com/products',
      200
    )

    const { newCount, updatedCount } = await saveProducts(
      products,
      db
    )

    await sendMetrics({
      totalProducts: products.length,
      newProducts: newCount,
      updatedProducts: updatedCount,
      duration: Date.now() - startTime
    })
  } catch (error) {
    status = 'failed'
    errorMessage = error.message

    await sendAlert({
      message: `Scraper failed: ${error.message}`,
      stack: error.stack,
      timestamp: new Date().toISOString()
    })

    throw error
  } finally {
    await db.close()
  }
}
Alerts go to Slack via webhook when things break.
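For reference, a minimal sendAlert along those lines (assuming the webhook URL lives in an environment variable and Node 18+ for the built-in fetch):

async function sendAlert({ message, stack, timestamp }) {
  // SLACK_WEBHOOK_URL is an incoming-webhook URL created in the Slack app settings
  await fetch(process.env.SLACK_WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `:rotating_light: ${message}\n${timestamp}\n\`\`\`${stack}\`\`\``
    })
  })
}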
5. Deployment
Running on a small VPS with systemd:
# /etc/systemd/system/product-scraper.timer
[Unit]
Description=Daily product scraper
[Timer]
OnCalendar=*-*-* 02:00:00 UTC
Persistent=true
[Install]
WantedBy=timers.target
# /etc/systemd/system/product-scraper.service
[Unit]
Description=Product Scraper Service
[Service]
Type=oneshot
User=scraper
WorkingDirectory=/opt/scraper
ExecStart=/usr/bin/node index.js
StandardOutput=journal
StandardError=journal
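Wiring it up is the usual systemd routine:

sudo systemctl daemon-reload
sudo systemctl enable --now product-scraper.timer
systemctl list-timers product-scraper.timer   # confirms the next scheduled run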
Challenges & Solutions
Challenge 1: Memory Leaks
Problem: After ~5,000 products, the Node.js process would crash (OOM)
Solution: Process in batches, close pages explicitly
const BATCH_SIZE = 100

for (let i = 0; i < products.length; i += BATCH_SIZE) {
  const batch = products.slice(i, i + BATCH_SIZE)
  await processBatch(batch)

  // GC hint — only has an effect when Node is started with --expose-gc
  if (global.gc) global.gc()
}
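On the scraping side, one way to keep memory bounded is to recycle the harvester every chunk of pages instead of keeping one instance alive for the whole run. A sketch of that idea, reusing the DOMHarvester calls from section 1 (the 50-pages-per-browser figure is arbitrary, and per-page error handling is omitted for brevity):

async function scrapeInChunks(baseUrl, maxPages = 200, pagesPerBrowser = 50) {
  const allProducts = []
  let done = false

  for (let start = 1; start <= maxPages && !done; start += pagesPerBrowser) {
    // Fresh browser per chunk: its memory is released when it closes
    const harvester = new DOMHarvester({ headless: true })
    await harvester.init()

    try {
      const end = Math.min(start + pagesPerBrowser - 1, maxPages)
      for (let page = start; page <= end; page++) {
        const products = await harvester.harvest(
          `${baseUrl}?page=${page}`,
          '.product-card',
          extractProduct
        )
        if (products.length === 0) { done = true; break } // end of pagination
        allProducts.push(...products)
        await sleep(1000)
      }
    } finally {
      await harvester.close()
    }
  }

  return allProducts
}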
Challenge 2: Flaky Selectors
Problem: Site occasionally changed class names
Solution: Multiple fallback selectors
function extractPrice(element) {
  const selectors = [
    '.price-current',
    '.product-price',
    '[data-price]'
  ]

  for (const selector of selectors) {
    const el = element.querySelector(selector)
    if (el) return parsePrice(el.textContent)
  }

  return null // Price not found
}
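With this helper in place, the price field in extractProduct becomes price: extractPrice(element) rather than depending on a single hard-coded selector.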
Challenge 3: Rate Limiting
Problem: Got blocked after ~100 pages
Solution: Randomized delays + user agent rotation
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...',
  // ...more
]

const harvester = new DOMHarvester({
  contextOptions: {
    userAgent: USER_AGENTS[
      Math.floor(Math.random() * USER_AGENTS.length)
    ]
  }
})

// Random delay 1-3 seconds
await sleep(1000 + Math.random() * 2000)
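In the page loop, this randomized delay takes the place of the fixed sleep(1000) from the first version.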
Results
After 3 months in production:
- ✅ 99.2% uptime
- ✅ ~300,000 products scraped
- ✅ Average runtime: 45 minutes
- ✅ Zero manual interventions needed
- ✅ Database size: 2.3 GB
Lessons Learned
- Batch processing is essential for large datasets
- Change detection saves money on storage and bandwidth
- Graceful degradation > perfection - skip failed pages, continue scraping
- Monitoring is non-negotiable - you need to know when things break
- Rate limiting is ethical and practical - don't hammer servers
Code Simplicity Matters
The entire scraper is ~400 lines of code. domharvest-playwright's simple API meant I could focus on business logic, not browser automation complexity.
Try It Yourself
npm install domharvest-playwright
The patterns here work for scraping anything from product listings to job boards to real estate sites.
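If you want a minimal starting point, this is the same harvest(url, selector, extractFn) shape used throughout this post; swap in your own URL and selectors:

import { DOMHarvester } from 'domharvest-playwright'

const harvester = new DOMHarvester({ headless: true })
await harvester.init()

const items = await harvester.harvest(
  'https://example.com/products?page=1',
  '.product-card',
  (el) => ({ name: el.querySelector('.product-name')?.textContent?.trim() })
)

console.log(items)
await harvester.close()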
What's your experience with production scrapers? Share in the comments!