Six months ago, I was frustrated with existing web scraping tools. They were either too simple (Cheerio couldn't handle JavaScript) or too complex (raw Playwright had too much boilerplate). So I built domharvest-playwright.
Here's the complete story of how I went from idea to published npm package.
The Problem That Started It All
I needed to scrape product data from a heavily JavaScript-dependent e-commerce site. My options:
Cheerio: Fast but can't execute JavaScript
// Doesn't work on modern SPAs
import * as cheerio from 'cheerio'

const $ = cheerio.load(html)
$('.product').each(...) // Empty - content loaded via JS!
Puppeteer/Playwright: Powerful but verbose
// 20+ lines just to extract some divs
import { chromium } from 'playwright'

const browser = await chromium.launch()
const context = await browser.newContext()
const page = await context.newPage()
await page.goto(url)
const elements = await page.$$('.product')
// ...more boilerplate
await browser.close()
I wanted something in between: JavaScript rendering + simple API.
Design Goals
Before writing code, I defined what "success" looked like:
- Simple API - Scrape data in 5 lines of code
- Handles JavaScript - Modern SPAs shouldn't be a problem
- Reliable - Works in production, not just demos
- Zero config - Smart defaults, optional customization
- Well-tested - I'm not debugging scraper bugs at 2 AM
The API Design Process
Iteration 1: Too Simple
// First attempt - too limited
const data = await scrape(url, '.product')
Problem: No control over extraction logic
Iteration 2: Too Complex
// Second attempt - too much config
const harvester = new Harvester({
  browser: 'chromium',
  headless: true,
  timeout: 30000,
  waitUntil: 'networkidle',
  extractor: new Extractor({
    mode: 'selector',
    transform: true
  })
})
Problem: Configuration hell before writing actual scraping code
Final Design: Just Right
// Final API - simple but flexible
import { harvest } from 'domharvest-playwright'
const products = await harvest(
  'https://example.com',
  '.product',
  (el) => ({
    name: el.querySelector('.name')?.textContent,
    price: el.querySelector('.price')?.textContent
  })
)
Why this works:
- One function call for simple cases
- Extractor function gives full control
- Extractor runs in the browser context (fast - only plain data comes back to Node; see the sketch below)
- Type-safe with JSDoc
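The browser-context point deserves a closer look. Conceptually, the extractor is serialized, sent into the page, and evaluated against every matched element. A minimal sketch of how that could be wired up with Playwright's page.$$eval - a hypothetical runExtractor helper, not the library's actual internals:
// Hypothetical sketch: rebuild the extractor inside the page and map it over
// every element that matches the selector. Only serializable data is returned.
async function runExtractor (page, selector, extractor) {
  return page.$$eval(selector, (elements, source) => {
    const fn = new Function(`return (${source})`)() // recreate the extractor in the browser
    return elements.map(el => fn(el))
  }, extractor.toString())
}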
Technical Decisions
Playwright Over Puppeteer
Initially considered Puppeteer, but Playwright won because:
- Better API: More intuitive method names
- Multi-browser: Chromium, Firefox, WebKit out of the box
- Auto-wait: Built-in waiting for elements (example after this list)
- Active development: Microsoft backing
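Auto-wait is the one that removes a whole class of flaky sleep() calls. A small illustrative example (the page and selectors are made up):
// Locator actions auto-wait: click() waits until the element is attached,
// visible and actionable, so there is no manual polling or setTimeout.
await page.goto('https://example.com')
await page.locator('.load-more').click() // waits for the button to show up
await page.locator('.product').first().waitFor() // explicit wait where you need one
const names = await page.locator('.product .name').allTextContents()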
JavaScript Over TypeScript
Controversial choice. Here's why:
Pros of sticking with JS:
- Lower barrier to contribution
- Faster iteration during development
- No build step needed
- JSDoc provides type hints anyway
Cons:
- No compile-time type checking
- Larger projects benefit from TS
For a library this size (~500 LOC), JavaScript with good JSDoc was sufficient:
/**
 * Harvest elements from a page
 * @param {string} url - Page URL
 * @param {string} selector - CSS selector
 * @param {Function} extractor - Extraction function
 * @returns {Promise<Array>} Extracted data
 */
export async function harvest (url, selector, extractor) {
  // ...
}
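The {Function} type above is deliberately loose. If you want editors to check the extractor callback itself, JSDoc can describe it more precisely - a possible refinement, not what shipped in 1.0.0:
/**
 * @callback Extractor
 * @param {Element} el - A matched DOM element (evaluated inside the page)
 * @returns {any} Serializable data for that element
 */
// ...then the signature uses @param {Extractor} extractor instead of {Function}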
StandardJS for Linting
No configuration. Just install and run:
npm install standard --save-dev
{
  "scripts": {
    "lint": "standard",
    "lint:fix": "standard --fix"
  }
}
Zero debates about semicolons or spacing. More time coding.
Implementation Challenges
Challenge 1: Browser Lifecycle Management
Problem: Users might forget to close the browser
Solution: Explicit init/close pattern
class DOMHarvester {
  async init () {
    if (this.browser) {
      throw new Error('Already initialized')
    }
    this.browser = await playwright.chromium.launch(...)
  }

  async close () {
    await this.browser?.close()
    this.browser = null
  }
}
Also provided convenience function for one-off scrapes:
// Handles lifecycle automatically
export async function harvest (url, selector, extractor) {
  const harvester = new DOMHarvester()
  await harvester.init()
  try {
    return await harvester.harvest(url, selector, extractor)
  } finally {
    await harvester.close()
  }
}
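For multi-page jobs, reusing one instance is cheaper than relaunching a browser per call. A usage sketch - extractProduct and extractReview are placeholder extractors, not part of the package:
// Reuse a single browser across several scrapes, then release it once.
const harvester = new DOMHarvester()
await harvester.init()
try {
  const products = await harvester.harvest('https://example.com/products', '.product', extractProduct)
  const reviews = await harvester.harvest('https://example.com/reviews', '.review', extractReview)
  console.log(products.length, reviews.length)
} finally {
  await harvester.close() // runs even if a harvest throws
}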
Challenge 2: Error Messages
Playwright errors can be cryptic:
Error: Target closed
Wrapped them with context:
try {
  await page.goto(url, { waitUntil: 'networkidle' })
} catch (error) {
  if (error.name === 'TimeoutError') {
    throw new Error(
      `Failed to load ${url}: Page did not reach network idle state within 30s. ` +
      `The site might be slow or blocking automated access.`
    )
  }
  throw error
}
Challenge 3: Testing Against Real Pages
Unit tests with mocks weren't enough. I needed integration tests against real pages.
Solution: Created a fixture server
// test/fixtures/server.js
import express from 'express'
import { readFileSync } from 'fs'

const app = express()

app.get('/products', (req, res) => {
  const html = readFileSync('./fixtures/products.html', 'utf-8')
  res.send(html)
})

export function startServer () {
  return new Promise(resolve => {
    const server = app.listen(3000, () => resolve(server))
  })
}
Now integration tests run against controlled HTML:
describe('harvest()', () => {
  let server

  before(async () => {
    server = await startServer()
  })

  it('extracts products', async () => {
    const products = await harvest(
      'http://localhost:3000/products',
      '.product',
      extractProduct
    )
    expect(products).to.have.length(10)
    expect(products[0].name).to.equal('Product 1')
  })

  after(() => server.close())
})
Publishing to npm
1. Package.json Setup
{
  "name": "domharvest-playwright",
  "version": "1.0.0",
  "description": "Simple DOM harvesting with Playwright",
  "main": "src/index.js",
  "type": "module",
  "engines": {
    "node": ">=16.0.0"
  },
  "keywords": [
    "web-scraping",
    "playwright",
    "dom",
    "scraper",
    "automation"
  ],
  "files": [
    "src/",
    "README.md",
    "LICENSE"
  ]
}
2. Semantic Versioning
Set up automated releases with conventional commits:
npm install --save-dev standard-version
{
  "scripts": {
    "release": "standard-version",
    "release:minor": "standard-version --release-as minor",
    "release:major": "standard-version --release-as major"
  }
}
Now releases are automated:
git commit -m "feat: add custom evaluation support"
npm run release:minor # 1.0.0 → 1.1.0
git push --follow-tags
3. GitHub Actions for CI/CD
# .github/workflows/publish.yml
name: Publish to npm

on:
  push:
    tags:
      - 'v*'

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: '18'
          registry-url: 'https://registry.npmjs.org'
      - run: npm ci
      - run: npm test
      - run: npm publish
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
Push a tag, package publishes automatically.
Documentation with VitePress
Chose VitePress for docs:
npm install --save-dev vitepress
// docs/.vitepress/config.js
export default {
  title: 'domharvest-playwright',
  description: 'Simple DOM harvesting with Playwright',
  themeConfig: {
    nav: [
      { text: 'Guide', link: '/guide/' },
      { text: 'API', link: '/api/' },
      { text: 'GitHub', link: 'https://github.com/domharvest/domharvest-playwright' }
    ]
  }
}
Deployed to GitHub Pages automatically.
Launch Strategy
- GitHub README - Comprehensive with examples
- npm package - Published with good keywords
- Dev.to article - This post!
- Reddit - r/javascript, r/webdev (non-spammy)
- Twitter/Mastodon - Announcement post
Reception & Feedback
First week results:
- 50+ GitHub stars
- 200+ npm downloads
- 5 issues opened (feature requests!)
- 2 pull requests
Unexpected use cases people found:
- SEO auditing tools
- Competitive price monitoring
- Content aggregation for newsletters
- QA automation testing
What I'd Do Differently
1. TypeScript from the start
As the project grew, I missed compile-time checks. Would use TS next time.
2. More examples in docs
Users wanted more real-world examples. Added them later based on issues.
3. Better error recovery
Initial version crashed on navigation timeouts. Should have retried automatically.
4. Telemetry (opt-in)
No idea how people actually use it. Anonymous usage stats would help prioritize features.
Lessons Learned
On Open Source
- Good docs > marketing - People found it organically through search
- Respond fast to issues - Contributors appreciate quick feedback
- Semver matters - Don't break APIs casually
- Examples are documentation - Code speaks louder than words
On API Design
- Start simple, add complexity later - Easy to add features, hard to remove them
- Convenience functions matter - harvest() vs new DOMHarvester(); both are useful
- Fail loudly - Confusing errors waste user time
On JavaScript Libraries
- Tree-shaking is hard without ESM - Export individual functions (see the sketch after this list)
- Peer dependencies are delicate - Let users control Playwright version
- Bundle size matters - Keep core small, extras optional
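On the ESM point, the practical move is a small entry module with named exports so bundlers can drop whatever a user never imports - a sketch, with illustrative file names rather than the real source layout:
// src/index.js - named ESM exports keep the package tree-shakeable
export { DOMHarvester } from './harvester.js'
export { harvest } from './convenience.js'
The peer-dependency point is related: declaring playwright under peerDependencies instead of dependencies lets the version the user installs win.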
What's Next
Roadmap for v2:
- [ ] Retry logic - Auto-retry failed navigations (rough sketch after this list)
- [ ] Request interception - Block images/fonts for speed
- [ ] Stealth mode - Evade basic bot detection
- [ ] Parallel scraping - Scrape multiple URLs concurrently
- [ ] TypeScript rewrite - Better DX for TS users
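Retry logic is the piece I reach for first. A rough sketch of what an auto-retry wrapper could look like - a hypothetical helper, not something in the package today:
// Retry an async operation a few times with linear backoff before giving up.
async function withRetries (fn, { attempts = 3, delayMs = 1000 } = {}) {
  let lastError
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn()
    } catch (error) {
      lastError = error
      if (attempt < attempts) {
        await new Promise(resolve => setTimeout(resolve, delayMs * attempt))
      }
    }
  }
  throw lastError
}

// Usage: wrap the flaky navigation step.
// await withRetries(() => page.goto(url, { waitUntil: 'networkidle' }))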
Try It
npm install domharvest-playwright
import { harvest } from 'domharvest-playwright'

const quotes = await harvest(
  'https://quotes.toscrape.com/',
  '.quote',
  (el) => ({
    text: el.querySelector('.text')?.textContent,
    author: el.querySelector('.author')?.textContent
  })
)

console.log(quotes)
Building this taught me more about API design, testing, and open source than any tutorial could. If you're thinking about publishing a package, just do it. You'll learn by shipping.
Questions? Hit me up in the comments!