Anup Karanjkar

Posted on May 21 • Originally published at wowhow.cloud

Web Scraping with Node.js: Puppeteer vs Cheerio (Complete 2026 Guide)

#webscraping #puppeteervs #nodejsscraping #puppeteerscraping

Web scraping is one of those skills that looks simple until you hit your first JavaScript-rendered page, your first CAPTCHA, or your first IP ban. Cheerio handles 70% of scraping tasks with almost zero overhead. Puppeteer handles the rest. This guide covers both in depth — when to use each, how to build scrapers that survive real sites, and how to extract structured data reliably at scale.

If you are building data pipelines or automation workflows, check out the WOWHOW Tools library for pre-built utilities, and browse the full catalog for data engineering starter kits.

Cheerio: Fast Static Scraping

Cheerio loads raw HTML and gives you a jQuery-like API to query it. No browser, no JavaScript execution — just a DOM parser. It is 10–50x faster than Puppeteer for static pages and uses a fraction of the memory.

import * as cheerio from 'cheerio'

async function scrapeHackerNews(url: string) {
  const response = await fetch(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (compatible; research-bot/1.0)',
    },
  })
  const html = await response.text()
  const $ = cheerio.load(html)

  const stories: Array<{ rank: number; title: string; url: string; points: number }> = []

  $('.athing').each((i, el) => {
    const rank = parseInt($(el).find('.rank').text().replace('.', ''), 10)
    const titleEl = $(el).find('.titleline > a').first()
    const title = titleEl.text().trim()
    const href = titleEl.attr('href') ?? ''

    // points are in the next sibling row (the subtext row directly follows .athing)
    const subtext = $(el).next()
    const points = parseInt(subtext.find('.score').text(), 10) || 0

    stories.push({ rank, title, url: href, points })
  })

  return stories
}
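Hacker News uses relative links for self posts (for example `item?id=...`), so the `href` collected above is not always a full URL. A minimal sketch of resolving it against the page URL, using only the built-in `URL` class; the helper name `toAbsoluteUrl` is hypothetical and not part of Cheerio:

```typescript
// Resolve an href found in scraped HTML against the page it came from.
// Returns null for empty or unparseable input instead of throwing.
function toAbsoluteUrl(href: string, pageUrl: string): string | null {
  if (href.trim() === '') return null
  try {
    // The second argument is the base; absolute hrefs ignore it.
    return new URL(href, pageUrl).toString()
  } catch {
    return null
  }
}

const base = 'https://news.ycombinator.com/news'
console.log(toAbsoluteUrl('item?id=1', base)) // https://news.ycombinator.com/item?id=1
console.log(toAbsoluteUrl('https://example.com/post', base)) // https://example.com/post
```

Calling it as `toAbsoluteUrl(href, url)` before pushing each story would keep every stored link usable outside the page it was found on.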

Cheerio Selector Patterns

Mastering selectors is the key skill. Here are the patterns you will use most:

import * as cheerio from 'cheerio'

function demonstrateSelectors(html: string) {
  const $ = cheerio.load(html)

  // --- attribute selectors
  const dataLinks = $('a[data-track]')           // has attribute
  const exactMatch = $('input[type="submit"]')   // exact value
  const contains   = $('[class*="product-card"]') // contains substring
  const startsWith = $('[href^="https://"]')      // starts with

  // --- positional pseudo-selectors
  const firstCard  = $('.product-card').first()
  const lastCard   = $('.product-card').last()
  const thirdCard  = $('.product-card').eq(2)    // 0-indexed
  const evenRows   = $('tr:nth-child(even)')

  // --- traversal
  const parent       = $('.price').parent()
  const closestForm  = $('input').closest('form')
  const siblings     = $('h2').siblings('p')
  const nextSibling  = $('.title').next()
  const allChildren  = $('.card').children()

  // --- extracting data
  const text    = $('h1').text().trim()
  const html2   = $('.description').html()        // inner HTML as string
  const href    = $('a.cta').attr('href')
  const allHrefs = $('a').map((_, el) => $(el).attr('href')).get()

  // --- filtering
  const nonEmpty = $('p').filter((_, el) => $(el).text().trim().length > 0)
  const withPrice = $('.item').filter((_, el) => $(el).find('.price').length > 0)

  return { text, href, allHrefs }
}

Puppeteer: Dynamic Page Scraping

Use Puppeteer when the content you need is rendered by JavaScript after page load — single-page apps, infinite scroll, lazy-loaded images, login-gated content.

import puppeteer, { type Browser, type Page } from 'puppeteer'

async function launchBrowser(): Promise<Browser> {
  return puppeteer.launch({
    headless: true,
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-gpu',
    ],
  })
}

async function scrapeDynamicPage(
  url: string
): Promise<Array<{ title: string; price: string; imageUrl: string }>> {
  const browser = await launchBrowser()
  const page = await browser.newPage()

  try {
    // block images/fonts to speed up loads
    await page.setRequestInterception(true)
    page.on('request', (req) => {
      if (['image', 'font', 'media'].includes(req.resourceType())) {
        req.abort()
      } else {
        req.continue()
      }
    })

    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30_000 })

    // wait for a specific element before extracting
    await page.waitForSelector('.product-list', { timeout: 10_000 })

    // extract data in page context (runs in browser)
    const items = await page.evaluate(() => {
      return Array.from(document.querySelectorAll('.product-card')).map((el) => ({
        title: el.querySelector('.title')?.textContent?.trim() ?? '',
        price: el.querySelector('.price')?.textContent?.trim() ?? '',
        imageUrl: (el.querySelector('img') as HTMLImageElement | null)?.src ?? '',
      }))
    })

    return items
  } finally {
    await browser.close()
  }
}

Handling Infinite Scroll

async function scrapeInfiniteScroll(page: Page): Promise<string[]> {
  const results: string[] = []
  let previousHeight = 0

  while (true) {
    // extract items currently visible
    const newItems = await page.evaluate(() =>
      Array.from(document.querySelectorAll('.item-title')).map(
        (el) => el.textContent?.trim() ?? ''
      )
    )
    results.push(...newItems.slice(results.length))

    // scroll to bottom
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight))
    await new Promise((r) => setTimeout(r, 1500)) // wait for load

    const newHeight = await page.evaluate(() => document.body.scrollHeight)
    if (newHeight === previousHeight) break // no more content
    previousHeight = newHeight
  }

  return [...new Set(results)] // deduplicate
}
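The loop above stops only when the page height stops growing, so a feed that never ends would keep it running indefinitely. One possible variant caps the number of scrolls. This is a sketch under assumptions: the `ScrollSource` interface, `scrollUntilStable`, and the `maxScrolls` and `settleMs` parameters are hypothetical names, and the page access is passed in as callbacks so the logic can run without a browser.

```typescript
// Hypothetical abstraction over the three page operations the loop needs.
interface ScrollSource {
  collect(): Promise<string[]> // item titles currently in the DOM
  scrollToBottom(): Promise<void>
  getHeight(): Promise<number> // current scroll height of the document
}

async function scrollUntilStable(
  source: ScrollSource,
  maxScrolls = 20, // assumed cap; tune per site
  settleMs = 1500 // wait after each scroll, as in the loop above
): Promise<string[]> {
  const seen = new Set<string>()
  let previousHeight = -1

  for (let i = 0; i < maxScrolls; i++) {
    for (const title of await source.collect()) seen.add(title)

    await source.scrollToBottom()
    await new Promise((resolve) => setTimeout(resolve, settleMs))

    const height = await source.getHeight()
    if (height === previousHeight) break // nothing new was loaded
    previousHeight = height
  }

  // Pick up anything loaded by the final scroll when the cap was reached.
  for (const title of await source.collect()) seen.add(title)
  return Array.from(seen)
}

// Hypothetical wiring for a Puppeteer `page` (not executed here):
// const source: ScrollSource = {
//   collect: () => page.evaluate(() =>
//     Array.from(document.querySelectorAll('.item-title')).map((el) => el.textContent?.trim() ?? '')),
//   scrollToBottom: () => page.evaluate(() => window.scrollTo(0, document.body.scrollHeight)),
//   getHeight: () => page.evaluate(() => document.body.scrollHeight),
// }
```

Collecting into a `Set` makes the result independent of whether the site appends items or re-renders the whole list on each load.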

Rate Limiting — The Most Important Part

Scraping without rate limiting will get you banned. Treat rate limiting as mandatory, not optional.

class RateLimiter {
  private queue: Array<() => void> = []
  private running = 0

  constructor(
    private readonly maxConcurrent: number,
    private readonly minDelayMs: number,
    private readonly maxDelayMs: number
  ) {}

  async acquire(): Promise<() => void> {
    if (this.running >= this.maxConcurrent) {
      await new Promise<void>((resolve) => this.queue.push(() => resolve()))
    }
    this.running++

    // jitter: random delay between min and max
    const delay =
      this.minDelayMs + Math.random() * (this.maxDelayMs - this.minDelayMs)
    await new Promise((r) => setTimeout(r, delay))

    return () => {
      this.running--
      this.queue.shift()?.()
    }
  }
}

// usage
const limiter = new RateLimiter(2, 1000, 3000) // max 2 concurrent, 1-3s delay

async function fetchWithRateLimit(url: string): Promise<string> {
  const release = await limiter.acquire()
  try {
    const res = await fetch(url)
    if (!res.ok) throw new Error(`HTTP ${res.status}`)
    return res.text()
  } finally {
    release()
  }
}

Proxy Rotation

const proxies = [
  'http://proxy1.example.com:8080',
  'http://proxy2.example.com:8080',
  'http://proxy3.example.com:8080',
]

function getProxy(): string {
  return proxies[Math.floor(Math.random() * proxies.length)]
}

// with Puppeteer
async function launchWithProxy(proxy: string): Promise<Browser> {
  return puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${proxy}`, '--no-sandbox'],
  })
}

// with fetch (via undici or https-proxy-agent)
import { ProxyAgent, fetch as undiciFetch } from 'undici'

async function fetchViaProxy(url: string): Promise<string> {
  const proxy = getProxy()
  const dispatcher = new ProxyAgent(proxy)
  const res = await undiciFetch(url, { dispatcher })
  return res.text()
}
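Random selection can keep handing out a proxy that has already been blocked. One possible refinement is a round-robin pool that skips proxies after repeated failures. The sketch below is illustrative only: `ProxyPool`, `reportFailure`, `reportSuccess`, and the `maxFailures` threshold are hypothetical names, not part of Puppeteer or undici.

```typescript
class ProxyPool {
  private index = 0
  private readonly failures = new Map<string, number>()
  private readonly proxies: string[]
  private readonly maxFailures: number

  constructor(proxies: string[], maxFailures = 3) {
    if (proxies.length === 0) throw new Error('ProxyPool needs at least one proxy')
    this.proxies = proxies
    this.maxFailures = maxFailures
  }

  // Round-robin over proxies that are still within their failure budget.
  next(): string {
    for (let i = 0; i < this.proxies.length; i++) {
      const candidate = this.proxies[this.index]
      this.index = (this.index + 1) % this.proxies.length
      if ((this.failures.get(candidate) ?? 0) < this.maxFailures) return candidate
    }
    throw new Error('All proxies are marked unhealthy')
  }

  // Call after a request through this proxy fails (timeout, 403, 429, ...).
  reportFailure(proxy: string): void {
    this.failures.set(proxy, (this.failures.get(proxy) ?? 0) + 1)
  }

  // A successful request resets the failure count for that proxy.
  reportSuccess(proxy: string): void {
    this.failures.delete(proxy)
  }
}
```

A caller would take `pool.next()` in place of `getProxy()`, then report the outcome once the request finishes.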

Retry Logic with Exponential Backoff

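async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (err) {
      // out of attempts: surface the last error to the caller
      if (attempt === maxAttempts) throw err
      // exponential backoff: baseDelayMs, 2x, 4x, ...
      const delay = baseDelayMs * 2 ** (attempt - 1)
      // jitter range is an assumption; the original value was lost in extraction
      const jitter = Math.random() * 500
      await new Promise((r) => setTimeout(r, delay + jitter))
    }
  }
  throw new Error('unreachable')
}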

// usage
const html = await withRetry(() => fetchWithRateLimit(url))
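To make the backoff arithmetic concrete, here is a small sketch that lists the base delays (before jitter) for a given number of attempts, assuming the delay doubles after each failed attempt and starts at `baseDelayMs`. The `backoffDelays` helper and its `capMs` ceiling are hypothetical additions for illustration.

```typescript
// Base delays between attempts, doubling each time and capped at capMs.
function backoffDelays(baseDelayMs: number, maxAttempts: number, capMs = 30_000): number[] {
  const delays: number[] = []
  // A delay follows every failed attempt except the last one.
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    delays.push(Math.min(capMs, baseDelayMs * 2 ** (attempt - 1)))
  }
  return delays
}

console.log(backoffDelays(1000, 3)) // [1000, 2000]
console.log(backoffDelays(1000, 7)) // [1000, 2000, 4000, 8000, 16000, 30000]
```

With the defaults of 3 attempts and a 1000 ms base, a request that keeps failing waits about 3 seconds in total before the error is rethrown. A cap like `capMs` keeps higher attempt counts from producing very long sleeps.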

Structured Data Extraction with Zod Validation

import { z } from 'zod'
import * as cheerio from 'cheerio'

const ProductSchema = z.object({
  id: z.string(),
  title: z.string().min(1),
  price: z.number().positive(),
  currency: z.string().length(3),
  inStock: z.boolean(),
  imageUrl: z.string().url().nullable(),
})

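type Product = z.infer<typeof ProductSchema>

// Element comes from domhandler, the DOM library Cheerio is built on
function extractProduct(
  el: cheerio.Cheerio<import('domhandler').Element>,
  $: cheerio.CheerioAPI
): Product {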
  const priceText = $(el).find('[data-price]').attr('data-price') ?? '0'

  const raw = {
    id: $(el).attr('data-product-id') ?? '',
    title: $(el).find('.product-title').text().trim(),
    price: parseFloat(priceText),
    currency: $(el).find('[data-currency]').attr('data-currency') ?? 'USD',
    inStock: $(el).find('.stock-badge').text().includes('In Stock'),
    imageUrl: $(el).find('img').attr('src') ?? null,
  }

  return ProductSchema.parse(raw) // throws ZodError if invalid
}

Choosing Between Puppeteer and Cheerio

Use Cheerio when: the HTML you need is in the initial response body, you need speed, or you are running many concurrent requests. Use Puppeteer when: content is loaded by JavaScript after initial render, you need to interact with the page (click, scroll, fill forms), or you need to handle authentication flows.

A common pattern is to use Cheerio first, and fall back to Puppeteer only when the Cheerio output is empty — this gives you fast paths for static pages and automatic fallback for dynamic ones.
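The Cheerio-first pattern can be sketched as a small wrapper that takes both scrapers as arguments. This is a minimal illustration, not a full implementation: `scrapeWithFallback` and the `Scraper` type are hypothetical names, and "empty output" is assumed to mean an empty array of items.

```typescript
// Any function that turns a URL into a list of extracted items.
type Scraper<T> = (url: string) => Promise<T[]>

async function scrapeWithFallback<T>(
  url: string,
  scrapeStatic: Scraper<T>, // e.g. fetch + Cheerio
  scrapeDynamic: Scraper<T> // e.g. Puppeteer
): Promise<{ items: T[]; via: 'static' | 'dynamic' }> {
  const items = await scrapeStatic(url)
  if (items.length > 0) return { items, via: 'static' }

  // Static HTML had nothing to extract; pay the browser cost only now.
  return { items: await scrapeDynamic(url), via: 'dynamic' }
}
```

Returning `via` alongside the items makes it easy to log how often the slow path is taken, which helps spot sites that should go straight to Puppeteer.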

What is the difference between `waitUntil: 'networkidle0'` and `'networkidle2'`?

`networkidle0` waits until there have been zero network connections for at least 500 ms, which suits pages that stop all requests once fully loaded. `networkidle2` waits until there have been at most 2 connections for at least 500 ms, which suits pages that keep persistent WebSocket or polling connections open. Use `networkidle2` as the default and switch to `networkidle0` only if you know the page fully quiesces.
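The difference is easiest to see on a timeline of in-flight connections. The sketch below is a simplified model of the rule described above, not Puppeteer's actual implementation; `Sample`, `firstIdleAt`, and the example timeline are all hypothetical.

```typescript
// Number of open connections from atMs until the next sample.
interface Sample {
  atMs: number
  inflight: number
}

// Time at which the page first counts as idle, or null if it never does
// before endMs. Samples must be sorted by atMs.
function firstIdleAt(
  samples: Sample[],
  maxInflight: number, // 0 models networkidle0, 2 models networkidle2
  endMs: number,
  idleMs = 500
): number | null {
  let idleSince: number | null = null
  for (let i = 0; i < samples.length; i++) {
    const current = samples[i]
    const until = i + 1 < samples.length ? samples[i + 1].atMs : endMs
    if (current.inflight <= maxInflight) {
      if (idleSince === null) idleSince = current.atMs
      if (until - idleSince >= idleMs) return idleSince + idleMs
    } else {
      idleSince = null // activity above the threshold resets the window
    }
  }
  return null
}

// A page that loads for 800 ms and then keeps one WebSocket open.
const timeline: Sample[] = [
  { atMs: 0, inflight: 6 },
  { atMs: 800, inflight: 1 },
]
console.log(firstIdleAt(timeline, 0, 5000)) // null
console.log(firstIdleAt(timeline, 2, 5000)) // 1300
```

In this model the open WebSocket keeps the count at 1, so the zero-connection rule is never satisfied and a real `goto` with `networkidle0` would run into its timeout.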
