Max B.

Posted on • Originally published at domharvest.github.io

Inside domharvest-playwright: How I Architected a Production-Ready Web Scraping Tool

When building domharvest-playwright, I wanted to create something that was both powerful and maintainable. Here's how I structured it to handle real-world web scraping challenges.

The Core Architecture

domharvest-playwright is built around three main components:

  1. DOMHarvester Class - The main orchestrator
  2. Browser Management - Playwright lifecycle handling
  3. Data Extraction Pipeline - Selector-based harvesting

Design Principles

Simplicity First
Every architectural decision prioritized simplicity over cleverness. No over-abstraction, no unnecessary patterns.

Fail Fast, Fail Clear
Errors should be obvious and actionable. No silent failures.

Composability
Small, focused methods that can be combined for complex workflows.

Browser Lifecycle Management

const playwright = require('playwright')

class DOMHarvester {
  async init(options = {}) {
    // Launch Chromium; headless by default, overridable per call.
    this.browser = await playwright.chromium.launch({
      headless: options.headless ?? true,
      ...options.browserOptions
    })
    // A fresh context isolates cookies and storage per session.
    this.context = await this.browser.newContext(options.contextOptions)
  }

  async close() {
    // Optional chaining makes close() safe to call even before init().
    await this.context?.close()
    await this.browser?.close()
  }
}

Why this approach?

  • Explicit initialization gives users control
  • Separate context management enables multiple sessions
  • Clean shutdown prevents resource leaks
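
In practice, a session looks something like this. A minimal usage sketch, assuming the package exports the class by name (check index.js for the actual public API):

const { DOMHarvester } = require('domharvest-playwright')

async function main() {
  const harvester = new DOMHarvester()
  await harvester.init({ headless: true })

  try {
    // ...harvest pages here...
  } finally {
    // Release the context and browser even if a harvest throws.
    await harvester.close()
  }
}

main()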

The Harvesting Pipeline

The core harvest() method follows a straightforward flow:

async harvest(url, selector, extractor) {
  // One fresh page per harvest keeps state isolated between calls.
  const page = await this.context.newPage()

  try {
    await page.goto(url, { waitUntil: 'networkidle' })

    // Collect every element matching the selector, then run the
    // user-supplied extractor against each one in the browser context.
    const elements = await page.$$(selector)
    const results = []

    for (const element of elements) {
      const data = await element.evaluate(extractor)
      results.push(data)
    }

    return results
  } finally {
    // Always close the page, even when navigation or extraction throws.
    await page.close()
  }
}

Key decisions:

  • waitUntil: 'networkidle' balances speed and reliability
  • Sequential processing prevents race conditions
  • finally block ensures cleanup even on errors
  • Extractor function runs in browser context for performance
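
Putting it together, a call might look like this (the URL and selector are made up for illustration; run inside an async function after init()):

const headlines = await harvester.harvest(
  'https://example.com/news',   // illustrative URL
  'article h2',                 // illustrative selector
  el => el.textContent.trim()   // runs inside the browser
)
// headlines is a plain array of strings, ready to serialize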

Error Handling Strategy

try {
  await page.goto(url, {
    waitUntil: 'networkidle',
    timeout: 30000
  })
} catch (error) {
  // Surface timeouts with the failing URL baked into the message.
  if (error.name === 'TimeoutError') {
    throw new Error(`Failed to load ${url}: timeout after 30s`)
  }
  throw error
}

I wrap Playwright errors with context-specific messages. This helps users debug without diving into stack traces.
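
A sketch of that pattern with a custom error type (HarvestError here is illustrative, not necessarily the library's exact class from utils/errors.js):

class HarvestError extends Error {
  constructor(message, originalError) {
    // `cause` (Node 16.9+) keeps the underlying Playwright error
    // reachable for anyone who does want the full stack trace.
    super(message, { cause: originalError })
    this.name = 'HarvestError'
  }
}

// e.g. throw new HarvestError(`Failed to load ${url}: timeout after 30s`, error)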

Custom Extraction Support

Beyond selector-based harvesting, harvestCustom() allows arbitrary page evaluation:

async harvestCustom(url, evaluator) {
  const page = await this.context.newPage()

  try {
    await page.goto(url, { waitUntil: 'networkidle' })
    // The evaluator runs in the browser context via page.evaluate,
    // so it can return any serializable value.
    return await page.evaluate(evaluator)
  } finally {
    await page.close()
  }
}

This enables complex scenarios like:

  • Multi-step interactions
  • Conditional logic based on page state
  • Aggregating data from multiple sources
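
For instance, a single evaluator can click a DOM element and then aggregate values from the resulting page state (the URL and selectors are made up):

const stats = await harvester.harvestCustom('https://example.com/stats', () => {
  // Runs entirely in the browser context.
  document.querySelector('#show-details')?.click()

  // Aggregate numbers scattered across the page into one object.
  const cells = [...document.querySelectorAll('table.stats td.value')]
  const total = cells.reduce((sum, td) => sum + Number(td.textContent), 0)
  return { total, rowCount: cells.length }
})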

Testing Architecture

Tests are organized by concern:

test/
├── unit/
│   ├── harvester.test.js
│   └── browser-management.test.js
├── integration/
│   └── harvest-workflow.test.js
└── fixtures/
    └── sample-pages/

Using real HTML fixtures instead of mocking ensures tests catch real-world issues.
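
A sketch of what a fixture-driven test can look like, using Node's built-in test runner (the project's actual test setup may differ, and list.html is a hypothetical fixture):

const { test } = require('node:test')
const assert = require('node:assert')
const path = require('node:path')
const { DOMHarvester } = require('domharvest-playwright')

test('harvests list items from a fixture page', async () => {
  const harvester = new DOMHarvester()
  await harvester.init()

  try {
    // Load the fixture straight from disk; no mock server needed.
    const url = 'file://' + path.resolve('test/fixtures/sample-pages/list.html')
    const items = await harvester.harvest(url, 'li', el => el.textContent)
    assert.ok(items.length > 0)
  } finally {
    await harvester.close()
  }
})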

Performance Considerations

Page Reuse vs. Clean State
I chose to create new pages per harvest for isolation. Slight performance cost, but eliminates entire classes of bugs.

Parallel vs. Sequential
Sequential processing is the default for predictability. Users can parallelize at the application level if needed.
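
Since each harvest() opens its own page, fanning out at the call site is straightforward (a sketch; keep batches small to bound memory):

// Inside an async function, after harvester.init():
const urls = ['https://example.com/a', 'https://example.com/b'] // illustrative
const results = await Promise.all(
  urls.map(url => harvester.harvest(url, '.item', el => el.textContent))
)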

Memory Management
Explicit page cleanup in finally blocks prevents memory leaks during long-running sessions.

Code Organization

src/
├── index.js          # Public API
├── harvester.js      # DOMHarvester class
└── utils/
    ├── validators.js # Input validation
    └── errors.js     # Custom error types

Flat structure. No deep nesting. Easy to navigate.

Lessons Learned

1. Don't Abstract Too Early
I resisted creating a "Strategy Pattern" for different scraping modes. YAGNI was right.

2. Explicit > Implicit
Requiring init() before use feels verbose but prevents confusing initialization bugs.
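
A cheap guard turns the inevitable mistake into an actionable message (a sketch of the idea, not necessarily the library's exact code):

async harvest(url, selector, extractor) {
  if (!this.context) {
    // Fail fast instead of surfacing a cryptic
    // "Cannot read properties of undefined" from deep inside Playwright.
    throw new Error('DOMHarvester not initialized: call init() first')
  }
  // ...rest of harvest() as shown earlier...
}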

3. Browser Automation is I/O Heavy
Network latency dominates. Focus on reliability over micro-optimizations.

4. Error Messages Matter
Users will see errors more than code. Make them helpful.

What's Next?

Future architectural improvements I'm considering:

  • [ ] Plugin system for custom middleware
  • [ ] Built-in retry logic with exponential backoff (sketched below)
  • [ ] Request/response interception hooks
  • [ ] Streaming results for large datasets
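
For the retry item, a wrapper along these lines is roughly what I have in mind (not in the library yet; counts and delays are illustrative):

async function harvestWithRetry(harvester, url, selector, extractor, retries = 3) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await harvester.harvest(url, selector, extractor)
    } catch (error) {
      if (attempt === retries) throw error
      // Back off 1s, 2s, 4s, ... between attempts.
      await new Promise(resolve => setTimeout(resolve, 1000 * 2 ** attempt))
    }
  }
}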

Try It Yourself

npm install domharvest-playwright

The architecture is intentionally simple. Read the source - it's under 500 lines.


What architectural patterns do you use in your scraping tools? Let me know in the comments!
