DEV Community

Cover image for Building domharvest-playwright: From Idea to npm Package
Max B.
Max B.

Posted on

Building domharvest-playwright: From Idea to npm Package

Six months ago, I was frustrated with existing web scraping tools. They were either too simple (Cheerio couldn't handle JavaScript) or too complex (raw Playwright had too much boilerplate). So I built domharvest-playwright.

Here's the complete story of how I went from idea to published npm package.

The Problem That Started It All

I needed to scrape product data from a heavily JavaScript-dependent e-commerce site. My options:

Cheerio: Fast but can't execute JavaScript

// Doesn't work on modern SPAs
const $ = cheerio.load(html)
$('.product').each(...)  // Empty - content loaded via JS!
Enter fullscreen mode Exit fullscreen mode

Puppeteer/Playwright: Powerful but verbose

// 20+ lines just to extract some divs
const browser = await playwright.chromium.launch()
const context = await browser.newContext()
const page = await context.newPage()
await page.goto(url)
const elements = await page.$$('.product')
// ...more boilerplate
await browser.close()
Enter fullscreen mode Exit fullscreen mode

I wanted something in between: JavaScript rendering + simple API.

Design Goals

Before writing code, I defined what "success" looked like:

  1. Simple API - Scrape data in 5 lines of code
  2. Handles JavaScript - Modern SPAs shouldn't be a problem
  3. Reliable - Works in production, not just demos
  4. Zero config - Smart defaults, optional customization
  5. Well-tested - I'm not debugging scraper bugs at 2 AM

The API Design Process

Iteration 1: Too Simple

// First attempt - too limited
const data = await scrape(url, '.product')
Enter fullscreen mode Exit fullscreen mode

Problem: No control over extraction logic

Iteration 2: Too Complex

// Second attempt - too much config
const harvester = new Harvester({
  browser: 'chromium',
  headless: true,
  timeout: 30000,
  waitUntil: 'networkidle',
  extractor: new Extractor({
    mode: 'selector',
    transform: true
  })
})
Enter fullscreen mode Exit fullscreen mode

Problem: Configuration hell before writing actual scraping code

Final Design: Just Right

// Final API - simple but flexible
import { harvest } from 'domharvest-playwright'

const products = await harvest(
  'https://example.com',
  '.product',
  (el) => ({
    name: el.querySelector('.name')?.textContent,
    price: el.querySelector('.price')?.textContent
  })
)
Enter fullscreen mode Exit fullscreen mode

Why this works:

  • One function call for simple cases
  • Extractor function gives full control
  • Runs in browser context (fast)
  • Type-safe with JSDoc

Technical Decisions

Playwright Over Puppeteer

Initially considered Puppeteer, but Playwright won because:

  • Better API: More intuitive method names
  • Multi-browser: Chrome, Firefox, WebKit out of the box
  • Auto-wait: Built-in waiting for elements
  • Active development: Microsoft backing

JavaScript Over TypeScript

Controversial choice. Here's why:

Pros of sticking with JS:

  • Lower barrier to contribution
  • Faster iteration during development
  • No build step needed
  • JSDoc provides type hints anyway

Cons:

  • No compile-time type checking
  • Larger projects benefit from TS

For a library this size (~500 LOC), JavaScript with good JSDoc was sufficient:

/**
 * Harvest elements from a page
 * @param {string} url - Page URL
 * @param {string} selector - CSS selector
 * @param {Function} extractor - Extraction function
 * @returns {Promise<Array>} Extracted data
 */
async harvest(url, selector, extractor) {
  // ...
}
Enter fullscreen mode Exit fullscreen mode

StandardJS for Linting

No configuration. Just install and run:

npm install standard --save-dev
Enter fullscreen mode Exit fullscreen mode
{
  "scripts": {
    "lint": "standard",
    "lint:fix": "standard --fix"
  }
}
Enter fullscreen mode Exit fullscreen mode

Zero debates about semicolons or spacing. More time coding.

Implementation Challenges

Challenge 1: Browser Lifecycle Management

Problem: Users might forget to close the browser

Solution: Explicit init/close pattern

class DOMHarvester {
  async init() {
    if (this.browser) {
      throw new Error('Already initialized')
    }
    this.browser = await playwright.chromium.launch(...)
  }

  async close() {
    await this.browser?.close()
    this.browser = null
  }
}
Enter fullscreen mode Exit fullscreen mode

Also provided convenience function for one-off scrapes:

// Handles lifecycle automatically
export async function harvest(url, selector, extractor) {
  const harvester = new DOMHarvester()
  await harvester.init()
  try {
    return await harvester.harvest(url, selector, extractor)
  } finally {
    await harvester.close()
  }
}
Enter fullscreen mode Exit fullscreen mode

Challenge 2: Error Messages

Playwright errors can be cryptic:

Error: Target closed
Enter fullscreen mode Exit fullscreen mode

Wrapped them with context:

try {
  await page.goto(url, { waitUntil: 'networkidle' })
} catch (error) {
  if (error.name === 'TimeoutError') {
    throw new Error(
      `Failed to load ${url}: Page did not reach network idle state within 30s. ` +
      `The site might be slow or blocking automated access.`
    )
  }
  throw error
}
Enter fullscreen mode Exit fullscreen mode

Challenge 3: Testing Against Real Pages

Unit tests with mocks weren't enough. I needed integration tests against real pages.

Solution: Created a fixture server

// test/fixtures/server.js
import express from 'express'
import { readFileSync } from 'fs'

const app = express()

app.get('/products', (req, res) => {
  const html = readFileSync('./fixtures/products.html', 'utf-8')
  res.send(html)
})

export function startServer() {
  return new Promise(resolve => {
    const server = app.listen(3000, () => resolve(server))
  })
}
Enter fullscreen mode Exit fullscreen mode

Now integration tests run against controlled HTML:

describe('harvest()', () => {
  let server

  before(async () => {
    server = await startServer()
  })

  it('extracts products', async () => {
    const products = await harvest(
      'http://localhost:3000/products',
      '.product',
      extractProduct
    )

    expect(products).to.have.length(10)
    expect(products[0].name).to.equal('Product 1')
  })

  after(() => server.close())
})
Enter fullscreen mode Exit fullscreen mode

Publishing to npm

1. Package.json Setup

{
  "name": "domharvest-playwright",
  "version": "1.0.0",
  "description": "Simple DOM harvesting with Playwright",
  "main": "src/index.js",
  "type": "module",
  "engines": {
    "node": ">=16.0.0"
  },
  "keywords": [
    "web-scraping",
    "playwright",
    "dom",
    "scraper",
    "automation"
  ],
  "files": [
    "src/",
    "README.md",
    "LICENSE"
  ]
}
Enter fullscreen mode Exit fullscreen mode

2. Semantic Versioning

Set up automated releases with conventional commits:

npm install --save-dev standard-version
Enter fullscreen mode Exit fullscreen mode
{
  "scripts": {
    "release": "standard-version",
    "release:minor": "standard-version --release-as minor",
    "release:major": "standard-version --release-as major"
  }
}
Enter fullscreen mode Exit fullscreen mode

Now releases are automated:

git commit -m "feat: add custom evaluation support"
npm run release:minor  # 1.0.0 → 1.1.0
git push --follow-tags
Enter fullscreen mode Exit fullscreen mode

3. GitHub Actions for CI/CD

# .github/workflows/publish.yml
name: Publish to npm

on:
  push:
    tags:
      - 'v*'

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: '18'
          registry-url: 'https://registry.npmjs.org'

      - run: npm ci
      - run: npm test
      - run: npm publish
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
Enter fullscreen mode Exit fullscreen mode

Push a tag, package publishes automatically.

Documentation with VitePress

Chose VitePress for docs:

npm install --save-dev vitepress
Enter fullscreen mode Exit fullscreen mode
// docs/.vitepress/config.js
export default {
  title: 'domharvest-playwright',
  description: 'Simple DOM harvesting with Playwright',
  themeConfig: {
    nav: [
      { text: 'Guide', link: '/guide/' },
      { text: 'API', link: '/api/' },
      { text: 'GitHub', link: 'https://github.com/domharvest/domharvest-playwright' }
    ]
  }
}
Enter fullscreen mode Exit fullscreen mode

Deployed to GitHub Pages automatically.

Launch Strategy

  1. GitHub README - Comprehensive with examples
  2. npm package - Published with good keywords
  3. Dev.to article - This post!
  4. Reddit - r/javascript, r/webdev (non-spammy)
  5. Twitter/Mastodon - Announcement post

Reception & Feedback

First week results:

  • 50+ GitHub stars
  • 200+ npm downloads
  • 5 issues opened (feature requests!)
  • 2 pull requests

Unexpected use cases people found:

  • SEO auditing tools
  • Competitive price monitoring
  • Content aggregation for newsletters
  • QA automation testing

What I'd Do Differently

1. TypeScript from the start
As the project grew, I missed compile-time checks. Would use TS next time.

2. More examples in docs
Users wanted more real-world examples. Added them later based on issues.

3. Better error recovery
Initial version crashed on navigation timeouts. Should have retried automatically.

4. Telemetry (opt-in)
No idea how people actually use it. Anonymous usage stats would help prioritize features.

Lessons Learned

On Open Source

  • Good docs > marketing - People found it organically through search
  • Respond fast to issues - Contributors appreciate quick feedback
  • Semver matters - Don't break APIs casually
  • Examples are documentation - Code speaks louder than words

On API Design

  • Start simple, add complexity later - Easy to add features, hard to remove them
  • Convenience functions matter - harvest() vs new DOMHarvester() - both are useful
  • Fail loudly - Confusing errors waste user time

On JavaScript Libraries

  • Tree-shaking is hard without ESM - Export individual functions
  • Peer dependencies are delicate - Let users control Playwright version
  • Bundle size matters - Keep core small, extras optional

What's Next

Roadmap for v2:

  • [ ] Retry logic - Auto-retry failed navigations
  • [ ] Request interception - Block images/fonts for speed
  • [ ] Stealth mode - Evade basic bot detection
  • [ ] Parallel scraping - Scrape multiple URLs concurrently
  • [ ] TypeScript rewrite - Better DX for TS users

Try It

npm install domharvest-playwright
Enter fullscreen mode Exit fullscreen mode
import { harvest } from 'domharvest-playwright'

const quotes = await harvest(
  'https://quotes.toscrape.com/',
  '.quote',
  (el) => ({
    text: el.querySelector('.text')?.textContent,
    author: el.querySelector('.author')?.textContent
  })
)

console.log(quotes)
Enter fullscreen mode Exit fullscreen mode

Links:

Building this taught me more about API design, testing, and open source than any tutorial could. If you're thinking about publishing a package, just do it. You'll learn by shipping.

Questions? Hit me up in the comments!

Top comments (0)