Six months ago, I was frustrated with existing web scraping tools. They were either too simple (Cheerio couldn't handle JavaScript) or too complex (raw Playwright had too much boilerplate). So I built domharvest-playwright.
Here's the complete story of how I went from idea to published npm package.
The Problem That Started It All
I needed to scrape product data from a heavily JavaScript-dependent e-commerce site. My options:
Cheerio: Fast but can't execute JavaScript
// Doesn't work on modern SPAs
import * as cheerio from 'cheerio'

const $ = cheerio.load(html)
$('.product').each(...) // Empty - content loaded via JS!
Puppeteer/Playwright: Powerful but verbose
// 20+ lines just to extract some divs
import { chromium } from 'playwright'

const browser = await chromium.launch()
const context = await browser.newContext()
const page = await context.newPage()
await page.goto(url)
const elements = await page.$$('.product')
// ...more boilerplate
await browser.close()
I wanted something in between: JavaScript rendering + simple API.
Design Goals
Before writing code, I defined what "success" looked like:
- Simple API - Scrape data in 5 lines of code
- Handles JavaScript - Modern SPAs shouldn't be a problem
- Reliable - Works in production, not just demos
- Zero config - Smart defaults, optional customization
- Well-tested - I'm not debugging scraper bugs at 2 AM
The API Design Process
Iteration 1: Too Simple
// First attempt - too limited
const data = await scrape(url, '.product')
Problem: No control over extraction logic
Iteration 2: Too Complex
// Second attempt - too much config
const harvester = new Harvester({
  browser: 'chromium',
  headless: true,
  timeout: 30000,
  waitUntil: 'networkidle',
  extractor: new Extractor({
    mode: 'selector',
    transform: true
  })
})
Problem: Configuration hell before writing actual scraping code
Final Design: Just Right
// Final API - simple but flexible
import { harvest } from 'domharvest-playwright'
const products = await harvest(
  'https://example.com',
  '.product',
  (el) => ({
    name: el.querySelector('.name')?.textContent,
    price: el.querySelector('.price')?.textContent
  })
)
Why this works:
- One function call for simple cases
- Extractor function gives full control
- Extractor runs in the browser context (fast - only plain data comes back to Node; see the sketch below)
- Type-safe with JSDoc
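The browser-context point deserves a closer look. Conceptually, the extractor is serialized, sent into the page, and evaluated against every matched element. A minimal sketch of how that could be wired up with Playwright's page.$$eval - a hypothetical runExtractor helper, not the library's actual internals:
// Hypothetical sketch: rebuild the extractor inside the page and map it over
// every element that matches the selector. Only serializable data is returned.
async function runExtractor (page, selector, extractor) {
  return page.$$eval(selector, (elements, source) => {
    const fn = new Function(`return (${source})`)() // recreate the extractor in the browser
    return elements.map(el => fn(el))
  }, extractor.toString())
}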
Technical Decisions
Playwright Over Puppeteer
Initially considered Puppeteer, but Playwright won because:
- Better API: More intuitive method names
- Multi-browser: Chromium, Firefox, WebKit out of the box
- Auto-wait: Built-in waiting for elements (example after this list)
- Active development: Microsoft backing
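Auto-wait is the one that removes a whole class of flaky sleep() calls. A small illustrative example (the page and selectors are made up):
// Locator actions auto-wait: click() waits until the element is attached,
// visible and actionable, so there is no manual polling or setTimeout.
await page.goto('https://example.com')
await page.locator('.load-more').click() // waits for the button to show up
await page.locator('.product').first().waitFor() // explicit wait where you need one
const names = await page.locator('.product .name').allTextContents()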
JavaScript Over TypeScript
Controversial choice. Here's why:
Pros of sticking with JS:
- Lower barrier to contribution
- Faster iteration during development
- No build step needed
- JSDoc provides type hints anyway
Cons:
- No compile-time type checking
- Larger projects benefit from TS
For a library this size (~500 LOC), JavaScript with good JSDoc was sufficient:
/**
 * Harvest elements from a page
 * @param {string} url - Page URL
 * @param {string} selector - CSS selector
 * @param {Function} extractor - Extraction function
 * @returns {Promise<Array>} Extracted data
 */
export async function harvest (url, selector, extractor) {
  // ...
}
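The {Function} type above is deliberately loose. If you want editors to check the extractor callback itself, JSDoc can describe it more precisely - a possible refinement, not what shipped in 1.0.0:
/**
 * @callback Extractor
 * @param {Element} el - A matched DOM element (evaluated inside the page)
 * @returns {any} Serializable data for that element
 */
// ...then the signature uses @param {Extractor} extractor instead of {Function}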
StandardJS for Linting
No configuration. Just install and run:
npm install standard --save-dev
{
  "scripts": {
    "lint": "standard",
    "lint:fix": "standard --fix"
  }
}
Zero debates about semicolons or spacing. More time coding.
Implementation Challenges
Challenge 1: Browser Lifecycle Management
Problem: Users might forget to close the browser
Solution: Explicit init/close pattern
class DOMHarvester {
  async init () {
    if (this.browser) {
      throw new Error('Already initialized')
    }
    this.browser = await playwright.chromium.launch(...)
  }

  async close () {
    await this.browser?.close()
    this.browser = null
  }
}
Also provided convenience function for one-off scrapes:
// Handles lifecycle automatically
export async function harvest (url, selector, extractor) {
  const harvester = new DOMHarvester()
  await harvester.init()
  try {
    return await harvester.harvest(url, selector, extractor)
  } finally {
    await harvester.close()
  }
}
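For multi-page jobs, reusing one instance is cheaper than relaunching a browser per call. A usage sketch - extractProduct and extractReview are placeholder extractors, not part of the package:
// Reuse a single browser across several scrapes, then release it once.
const harvester = new DOMHarvester()
await harvester.init()
try {
  const products = await harvester.harvest('https://example.com/products', '.product', extractProduct)
  const reviews = await harvester.harvest('https://example.com/reviews', '.review', extractReview)
  console.log(products.length, reviews.length)
} finally {
  await harvester.close() // runs even if a harvest throws
}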
Challenge 2: Error Messages
Playwright errors can be cryptic:
Error: Target closed
Wrapped them with context:
try {
  await page.goto(url, { waitUntil: 'networkidle' })
} catch (error) {
  if (error.name === 'TimeoutError') {
    throw new Error(
      `Failed to load ${url}: Page did not reach network idle state within 30s. ` +
      `The site might be slow or blocking automated access.`
    )
  }
  throw error
}
Challenge 3: Testing Against Real Pages
Unit tests with mocks weren't enough. I needed integration tests against real pages.
Solution: Created a fixture server
// test/fixtures/server.js
import express from 'express'
import { readFileSync } from 'fs'

const app = express()

app.get('/products', (req, res) => {
  const html = readFileSync('./fixtures/products.html', 'utf-8')
  res.send(html)
})

export function startServer () {
  return new Promise(resolve => {
    const server = app.listen(3000, () => resolve(server))
  })
}
Now integration tests run against controlled HTML:
describe('harvest()', () => {
  let server

  before(async () => {
    server = await startServer()
  })

  it('extracts products', async () => {
    const products = await harvest(
      'http://localhost:3000/products',
      '.product',
      extractProduct
    )
    expect(products).to.have.length(10)
    expect(products[0].name).to.equal('Product 1')
  })

  after(() => server.close())
})
Publishing to npm
1. Package.json Setup
{
  "name": "domharvest-playwright",
  "version": "1.0.0",
  "description": "Simple DOM harvesting with Playwright",
  "main": "src/index.js",
  "type": "module",
  "engines": {
    "node": ">=16.0.0"
  },
  "keywords": [
    "web-scraping",
    "playwright",
    "dom",
    "scraper",
    "automation"
  ],
  "files": [
    "src/",
    "README.md",
    "LICENSE"
  ]
}
2. Semantic Versioning
Set up automated releases with conventional commits:
npm install --save-dev standard-version
{
  "scripts": {
    "release": "standard-version",
    "release:minor": "standard-version --release-as minor",
    "release:major": "standard-version --release-as major"
  }
}
Now releases are automated:
git commit -m "feat: add custom evaluation support"
npm run release:minor # 1.0.0 → 1.1.0
git push --follow-tags
3. GitHub Actions for CI/CD
# .github/workflows/publish.yml
name: Publish to npm

on:
  push:
    tags:
      - 'v*'

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: '18'
          registry-url: 'https://registry.npmjs.org'
      - run: npm ci
      - run: npm test
      - run: npm publish
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
Push a tag, package publishes automatically.
Documentation with VitePress
Chose VitePress for docs:
npm install --save-dev vitepress
// docs/.vitepress/config.js
export default {
  title: 'domharvest-playwright',
  description: 'Simple DOM harvesting with Playwright',
  themeConfig: {
    nav: [
      { text: 'Guide', link: '/guide/' },
      { text: 'API', link: '/api/' },
      { text: 'GitHub', link: 'https://github.com/domharvest/domharvest-playwright' }
    ]
  }
}
Deployed to GitHub Pages automatically.
Launch Strategy
- GitHub README - Comprehensive with examples
- npm package - Published with good keywords
- Dev.to article - This post!
- Reddit - r/javascript, r/webdev (non-spammy)
- Twitter/Mastodon - Announcement post
Reception & Feedback
First week results:
- 50+ GitHub stars
- 200+ npm downloads
- 5 issues opened (feature requests!)
- 2 pull requests
Unexpected use cases people found:
- SEO auditing tools
- Competitive price monitoring
- Content aggregation for newsletters
- QA automation testing
What I'd Do Differently
1. TypeScript from the start
As the project grew, I missed compile-time checks. Would use TS next time.
2. More examples in docs
Users wanted more real-world examples. Added them later based on issues.
3. Better error recovery
Initial version crashed on navigation timeouts. Should have retried automatically.
4. Telemetry (opt-in)
No idea how people actually use it. Anonymous usage stats would help prioritize features.
Lessons Learned
On Open Source
- Good docs > marketing - People found it organically through search
- Respond fast to issues - Contributors appreciate quick feedback
- Semver matters - Don't break APIs casually
- Examples are documentation - Code speaks louder than words
On API Design
- Start simple, add complexity later - Easy to add features, hard to remove them
- Convenience functions matter - harvest() vs new DOMHarvester(); both are useful
- Fail loudly - Confusing errors waste user time
On JavaScript Libraries
- Tree-shaking is hard without ESM - Export individual functions (see the sketch after this list)
- Peer dependencies are delicate - Let users control Playwright version
- Bundle size matters - Keep core small, extras optional
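On the ESM point, the practical move is a small entry module with named exports so bundlers can drop whatever a user never imports - a sketch, with illustrative file names rather than the real source layout:
// src/index.js - named ESM exports keep the package tree-shakeable
export { DOMHarvester } from './harvester.js'
export { harvest } from './convenience.js'
The peer-dependency point is related: declaring playwright under peerDependencies instead of dependencies lets the version the user installs win.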
What's Next
Roadmap for v2:
- [ ] Retry logic - Auto-retry failed navigations (rough sketch after this list)
- [ ] Request interception - Block images/fonts for speed
- [ ] Stealth mode - Evade basic bot detection
- [ ] Parallel scraping - Scrape multiple URLs concurrently
- [ ] TypeScript rewrite - Better DX for TS users
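Retry logic is the piece I reach for first. A rough sketch of what an auto-retry wrapper could look like - a hypothetical helper, not something in the package today:
// Retry an async operation a few times with linear backoff before giving up.
async function withRetries (fn, { attempts = 3, delayMs = 1000 } = {}) {
  let lastError
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn()
    } catch (error) {
      lastError = error
      if (attempt < attempts) {
        await new Promise(resolve => setTimeout(resolve, delayMs * attempt))
      }
    }
  }
  throw lastError
}

// Usage: wrap the flaky navigation step.
// await withRetries(() => page.goto(url, { waitUntil: 'networkidle' }))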
Try It
npm install domharvest-playwright
import { harvest } from 'domharvest-playwright'

const quotes = await harvest(
  'https://quotes.toscrape.com/',
  '.quote',
  (el) => ({
    text: el.querySelector('.text')?.textContent,
    author: el.querySelector('.author')?.textContent
  })
)

console.log(quotes)
Building this taught me more about API design, testing, and open source than any tutorial could. If you're thinking about publishing a package, just do it. You'll learn by shipping.
Questions? Hit me up in the comments!