Max B.

Posted on • Originally published at domharvest.github.io

Inside domharvest-playwright: How I Architected a Production-Ready Web Scraping Tool

When building domharvest-playwright, I wanted to create something that was both powerful and maintainable. Here's how I structured it to handle real-world web scraping challenges.

The Core Architecture

domharvest-playwright is built around three main components:

  1. DOMHarvester Class - The main orchestrator
  2. Browser Management - Playwright lifecycle handling
  3. Data Extraction Pipeline - Selector-based harvesting

Design Principles

Simplicity First
Every architectural decision prioritized simplicity over cleverness. No over-abstraction, no unnecessary patterns.

Fail Fast, Fail Clear
Errors should be obvious and actionable. No silent failures.

Composability
Small, focused methods that can be combined for complex workflows.

Browser Lifecycle Management

const playwright = require('playwright')

class DOMHarvester {
  async init(options = {}) {
    // Launch Chromium; headless by default, overridable per call.
    this.browser = await playwright.chromium.launch({
      headless: options.headless ?? true,
      ...options.browserOptions
    })
    // A fresh context isolates cookies and storage per session.
    this.context = await this.browser.newContext(options.contextOptions)
  }

  async close() {
    // Optional chaining makes close() safe to call even before init().
    await this.context?.close()
    await this.browser?.close()
  }
}

Why this approach?

  • Explicit initialization gives users control
  • Separate context management enables multiple sessions
  • Clean shutdown prevents resource leaks
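
In practice, a session looks something like this. A minimal usage sketch, assuming the package exports the class by name (check index.js for the actual public API):

const { DOMHarvester } = require('domharvest-playwright')

async function main() {
  const harvester = new DOMHarvester()
  await harvester.init({ headless: true })

  try {
    // ...harvest pages here...
  } finally {
    // Release the context and browser even if a harvest throws.
    await harvester.close()
  }
}

main()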

The Harvesting Pipeline

The core harvest() method follows a straightforward flow:

async harvest(url, selector, extractor) {
  // One fresh page per harvest keeps state isolated between calls.
  const page = await this.context.newPage()

  try {
    await page.goto(url, { waitUntil: 'networkidle' })

    // Collect every element matching the selector, then run the
    // user-supplied extractor against each one in the browser context.
    const elements = await page.$$(selector)
    const results = []

    for (const element of elements) {
      const data = await element.evaluate(extractor)
      results.push(data)
    }

    return results
  } finally {
    // Always close the page, even when navigation or extraction throws.
    await page.close()
  }
}

Key decisions:

  • waitUntil: 'networkidle' balances speed and reliability
  • Sequential processing prevents race conditions
  • finally block ensures cleanup even on errors
  • Extractor function runs in browser context for performance
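
Putting it together, a call might look like this (the URL and selector are made up for illustration; run inside an async function after init()):

const headlines = await harvester.harvest(
  'https://example.com/news',   // illustrative URL
  'article h2',                 // illustrative selector
  el => el.textContent.trim()   // runs inside the browser
)
// headlines is a plain array of strings, ready to serialize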

Error Handling Strategy

try {
  await page.goto(url, {
    waitUntil: 'networkidle',
    timeout: 30000
  })
} catch (error) {
  // Surface timeouts with the failing URL baked into the message.
  if (error.name === 'TimeoutError') {
    throw new Error(`Failed to load ${url}: timeout after 30s`)
  }
  throw error
}

I wrap Playwright errors with context-specific messages. This helps users debug without diving into stack traces.
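
A sketch of that pattern with a custom error type (HarvestError here is illustrative, not necessarily the library's exact class from utils/errors.js):

class HarvestError extends Error {
  constructor(message, originalError) {
    // `cause` (Node 16.9+) keeps the underlying Playwright error
    // reachable for anyone who does want the full stack trace.
    super(message, { cause: originalError })
    this.name = 'HarvestError'
  }
}

// e.g. throw new HarvestError(`Failed to load ${url}: timeout after 30s`, error)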

Custom Extraction Support

Beyond selector-based harvesting, harvestCustom() allows arbitrary page evaluation:

async harvestCustom(url, evaluator) {
  const page = await this.context.newPage()

  try {
    await page.goto(url, { waitUntil: 'networkidle' })
    // The evaluator runs in the browser context via page.evaluate,
    // so it can return any serializable value.
    return await page.evaluate(evaluator)
  } finally {
    await page.close()
  }
}

This enables complex scenarios like:

  • Multi-step interactions
  • Conditional logic based on page state
  • Aggregating data from multiple sources
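
For instance, a single evaluator can click a DOM element and then aggregate values from the resulting page state (the URL and selectors are made up):

const stats = await harvester.harvestCustom('https://example.com/stats', () => {
  // Runs entirely in the browser context.
  document.querySelector('#show-details')?.click()

  // Aggregate numbers scattered across the page into one object.
  const cells = [...document.querySelectorAll('table.stats td.value')]
  const total = cells.reduce((sum, td) => sum + Number(td.textContent), 0)
  return { total, rowCount: cells.length }
})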

Testing Architecture

Tests are organized by concern:

test/
├── unit/
│   ├── harvester.test.js
│   └── browser-management.test.js
├── integration/
│   └── harvest-workflow.test.js
└── fixtures/
    └── sample-pages/

Using real HTML fixtures instead of mocking ensures tests catch real-world issues.
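
A sketch of what a fixture-driven test can look like, using Node's built-in test runner (the project's actual test setup may differ, and list.html is a hypothetical fixture):

const { test } = require('node:test')
const assert = require('node:assert')
const path = require('node:path')
const { DOMHarvester } = require('domharvest-playwright')

test('harvests list items from a fixture page', async () => {
  const harvester = new DOMHarvester()
  await harvester.init()

  try {
    // Load the fixture straight from disk; no mock server needed.
    const url = 'file://' + path.resolve('test/fixtures/sample-pages/list.html')
    const items = await harvester.harvest(url, 'li', el => el.textContent)
    assert.ok(items.length > 0)
  } finally {
    await harvester.close()
  }
})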

Performance Considerations

Page Reuse vs. Clean State
I chose to create new pages per harvest for isolation. Slight performance cost, but eliminates entire classes of bugs.

Parallel vs. Sequential
Sequential processing is the default for predictability. Users can parallelize at the application level if needed.
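
Since each harvest() opens its own page, fanning out at the call site is straightforward (a sketch; keep batches small to bound memory):

// Inside an async function, after harvester.init():
const urls = ['https://example.com/a', 'https://example.com/b'] // illustrative
const results = await Promise.all(
  urls.map(url => harvester.harvest(url, '.item', el => el.textContent))
)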

Memory Management
Explicit page cleanup in finally blocks prevents memory leaks during long-running sessions.

Code Organization

src/
├── index.js          # Public API
├── harvester.js      # DOMHarvester class
└── utils/
    ├── validators.js # Input validation
    └── errors.js     # Custom error types

Flat structure. No deep nesting. Easy to navigate.

Lessons Learned

1. Don't Abstract Too Early
I resisted creating a "Strategy Pattern" for different scraping modes. YAGNI was right.

2. Explicit > Implicit
Requiring init() before use feels verbose but prevents confusing initialization bugs.
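
A cheap guard turns the inevitable mistake into an actionable message (a sketch of the idea, not necessarily the library's exact code):

async harvest(url, selector, extractor) {
  if (!this.context) {
    // Fail fast instead of surfacing a cryptic
    // "Cannot read properties of undefined" from deep inside Playwright.
    throw new Error('DOMHarvester not initialized: call init() first')
  }
  // ...rest of harvest() as shown earlier...
}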

3. Browser Automation is I/O Heavy
Network latency dominates. Focus on reliability over micro-optimizations.

4. Error Messages Matter
Users will see errors more than code. Make them helpful.

What's Next?

Future architectural improvements I'm considering:

  • [ ] Plugin system for custom middleware
  • [ ] Built-in retry logic with exponential backoff (sketched below)
  • [ ] Request/response interception hooks
  • [ ] Streaming results for large datasets
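
For the retry item, a wrapper along these lines is roughly what I have in mind (not in the library yet; counts and delays are illustrative):

async function harvestWithRetry(harvester, url, selector, extractor, retries = 3) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await harvester.harvest(url, selector, extractor)
    } catch (error) {
      if (attempt === retries) throw error
      // Back off 1s, 2s, 4s, ... between attempts.
      await new Promise(resolve => setTimeout(resolve, 1000 * 2 ** attempt))
    }
  }
}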

Try It Yourself

npm install domharvest-playwright

The architecture is intentionally simple. Read the source - it's under 500 lines.


What architectural patterns do you use in your scraping tools? Let me know in the comments!
