When building domharvest-playwright, I wanted to create something that was both powerful and maintainable. Here's how I structured it to handle real-world web scraping challenges.
The Core Architecture
domharvest-playwright is built around three main components:
- DOMHarvester Class - The main orchestrator
- Browser Management - Playwright lifecycle handling
- Data Extraction Pipeline - Selector-based harvesting
Design Principles
Simplicity First
Every architectural decision prioritized simplicity over cleverness. No over-abstraction, no unnecessary patterns.
Fail Fast, Fail Clear
Errors should be obvious and actionable. No silent failures.
Composability
Small, focused methods that can be combined for complex workflows.
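As a sketch of what that composition looks like end to end (assuming DOMHarvester is the package's named export; the URL and selector are placeholders):

const { DOMHarvester } = require('domharvest-playwright') // assumed export shape

async function main() {
  const harvester = new DOMHarvester()
  await harvester.init()
  try {
    // small methods, combined: init, harvest, close
    const headlines = await harvester.harvest(
      'https://example.com/news', // placeholder URL
      'article h2',
      el => el.textContent.trim()
    )
    console.log(headlines)
  } finally {
    await harvester.close()
  }
}

main()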
Browser Lifecycle Management
const playwright = require('playwright')

class DOMHarvester {
  async init(options = {}) {
    // Launch Chromium; callers can override any Playwright launch option
    this.browser = await playwright.chromium.launch({
      headless: options.headless ?? true,
      ...options.browserOptions
    })
    this.context = await this.browser.newContext(options.contextOptions)
  }

  async close() {
    // Optional chaining keeps this safe to call even if init() never ran
    await this.context?.close()
    await this.browser?.close()
  }
}
Why this approach?
- Explicit initialization gives users control
- Separate context management enables multiple sessions
- Clean shutdown prevents resource leaks
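Because the context is created separately from the browser, each init()/close() cycle gives you a fresh, isolated session: cookies and storage live in the context, not the browser. A minimal sketch of passing session-level configuration (the options shown are standard Playwright context options):

await harvester.init({
  headless: true,
  contextOptions: {
    userAgent: 'domharvest-example/1.0',
    viewport: { width: 1280, height: 800 },
    locale: 'en-US'
  }
})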
The Harvesting Pipeline
The core harvest() method follows a straightforward flow:
async harvest(url, selector, extractor) {
  const page = await this.context.newPage()
  try {
    await page.goto(url, { waitUntil: 'networkidle' })
    const elements = await page.$$(selector)
    const results = []
    for (const element of elements) {
      // the extractor executes in the browser against each matched element
      const data = await element.evaluate(extractor)
      results.push(data)
    }
    return results
  } finally {
    // always release the page, even if navigation or extraction threw
    await page.close()
  }
}
Key decisions:
- `waitUntil: 'networkidle'` balances speed and reliability
- Sequential processing prevents race conditions
- The `finally` block ensures cleanup even on errors
- The extractor function runs in browser context for performance
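A typical call passes a URL, a CSS selector, and an extractor that runs against each matched element (the URL and selector here are placeholders):

const links = await harvester.harvest(
  'https://example.com',  // placeholder URL
  'a[href]',              // one result per matched element
  el => ({ text: el.textContent.trim(), href: el.href })
)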
Error Handling Strategy
try {
  await page.goto(url, {
    waitUntil: 'networkidle',
    timeout: 30000
  })
} catch (error) {
  if (error.name === 'TimeoutError') {
    throw new Error(`Failed to load ${url}: timeout after 30s`)
  }
  throw error
}
I wrap Playwright errors with context-specific messages. This helps users debug without diving into stack traces.
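The custom error types live in utils/errors.js. A sketch of what such a wrapper might look like (the HarvestError name and fields are illustrative, not the library's actual API):

class HarvestError extends Error {
  constructor(message, { url, cause } = {}) {
    super(message)
    this.name = 'HarvestError'
    this.url = url       // which page failed
    this.cause = cause   // the original Playwright error, for debugging
  }
}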
Custom Extraction Support
Beyond selector-based harvesting, harvestCustom() allows arbitrary page evaluation:
async harvestCustom(url, evaluator) {
  const page = await this.context.newPage()
  try {
    await page.goto(url, { waitUntil: 'networkidle' })
    // the evaluator runs inside the page, with full access to the live DOM
    return await page.evaluate(evaluator)
  } finally {
    await page.close()
  }
}
This enables complex scenarios like:
- Multi-step interactions
- Conditional logic based on page state
- Aggregating data from multiple sources
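For example, a single evaluator can branch on page state and merge data from several selectors in one round trip (the URL and selectors are placeholders; note the evaluator runs in the browser, so it only has DOM APIs, not Node):

const summary = await harvester.harvestCustom('https://example.com/products', () => {
  const banner = document.querySelector('.sale-banner')
  const items = [...document.querySelectorAll('.product')].map(el => ({
    name: el.querySelector('.name')?.textContent.trim(),
    price: el.querySelector('.price')?.textContent.trim()
  }))
  // conditional logic plus aggregation, all in a single evaluation
  return { onSale: Boolean(banner), count: items.length, items }
})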
Testing Architecture
Tests are organized by concern:
test/
├── unit/
│ ├── harvester.test.js
│ └── browser-management.test.js
├── integration/
│ └── harvest-workflow.test.js
└── fixtures/
└── sample-pages/
Using real HTML fixtures instead of mocking ensures tests catch real-world issues.
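A sketch of what an integration test against a fixture looks like (assuming Node's built-in test runner; the fixture filename and import path are hypothetical):

const { test } = require('node:test')
const assert = require('node:assert')
const path = require('node:path')
const { DOMHarvester } = require('../../src/index.js') // path assumed

test('harvests headings from a real HTML fixture', async () => {
  const harvester = new DOMHarvester()
  await harvester.init()
  try {
    // file:// URLs let tests run against real markup without a server
    const url = 'file://' + path.resolve('test/fixtures/sample-pages/headings.html')
    const titles = await harvester.harvest(url, 'h2', el => el.textContent.trim())
    assert.ok(titles.length > 0)
  } finally {
    await harvester.close()
  }
})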
Performance Considerations
Page Reuse vs. Clean State
I chose to create new pages per harvest for isolation. Slight performance cost, but eliminates entire classes of bugs.
Parallel vs. Sequential
Sequential processing is the default for predictability. Users can parallelize at the application level if needed.
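If throughput matters, fan out at the call site: each harvest() opens its own page, so concurrent calls against one harvester don't share mutable state. A sketch (the URLs are placeholders, and unbounded Promise.all is only sensible for small batches):

const urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c']
const results = await Promise.all(
  urls.map(url => harvester.harvest(url, '.item', el => el.textContent.trim()))
)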
Memory Management
Explicit page cleanup in finally blocks prevents memory leaks during long-running sessions.
Code Organization
src/
├── index.js # Public API
├── harvester.js # DOMHarvester class
└── utils/
├── validators.js # Input validation
└── errors.js # Custom error types
Flat structure. No deep nesting. Easy to navigate.
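As a taste of the fail-fast principle in code, here's a guess at the shape of validators.js (illustrative only, not the actual source):

function assertNonEmptyString(value, name) {
  if (typeof value !== 'string' || value.length === 0) {
    throw new TypeError(`${name} must be a non-empty string, got ${typeof value}`)
  }
}

function assertFunction(value, name) {
  if (typeof value !== 'function') {
    throw new TypeError(`${name} must be a function, got ${typeof value}`)
  }
}

module.exports = { assertNonEmptyString, assertFunction }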
Lessons Learned
1. Don't Abstract Too Early
I resisted creating a "Strategy Pattern" for different scraping modes. YAGNI was right.
2. Explicit > Implicit
Requiring init() before use feels verbose but prevents confusing initialization bugs.
3. Browser Automation is I/O Heavy
Network latency dominates. Focus on reliability over micro-optimizations.
4. Error Messages Matter
Users will see errors more than code. Make them helpful.
What's Next?
Future architectural improvements I'm considering:
- [ ] Plugin system for custom middleware
- [ ] Built-in retry logic with exponential backoff (a user-land sketch follows this list)
- [ ] Request/response interception hooks
- [ ] Streaming results for large datasets
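Retry doesn't have to wait for a release. A minimal user-land sketch of exponential backoff around harvest() (the attempt count and delays are arbitrary):

async function harvestWithRetry(harvester, url, selector, extractor, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await harvester.harvest(url, selector, extractor)
    } catch (error) {
      if (i === attempts - 1) throw error  // out of attempts: surface the error
      const delayMs = 1000 * 2 ** i        // 1s, 2s, 4s, ...
      await new Promise(resolve => setTimeout(resolve, delayMs))
    }
  }
}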
Try It Yourself
npm install domharvest-playwright
The architecture is intentionally simple. Read the source - it's under 500 lines.
What architectural patterns do you use in your scraping tools? Let me know in the comments!