Anup Karanjkar

Originally published at wowhow.cloud

Web Scraping with Node.js: Puppeteer vs Cheerio (Complete 2026 Guide)

Web scraping looks simple until you hit your first JavaScript-rendered page, your first CAPTCHA, or your first IP ban. As a rule of thumb, Cheerio handles about 70% of scraping tasks with almost zero overhead, and Puppeteer handles the rest. This guide covers both in depth: when to use each, how to build scrapers that survive real sites, and how to extract structured data reliably at scale.

If you are building data pipelines or automation workflows, check out the WOWHOW Tools library for pre-built utilities, and browse the full catalog for data engineering starter kits.

Cheerio: Fast Static Scraping

Cheerio loads raw HTML and gives you a jQuery-like API to query it. There is no browser and no JavaScript execution, just an HTML parser with DOM traversal. For static pages it is typically 10–50x faster than Puppeteer and uses a fraction of the memory.

import * as cheerio from 'cheerio'

async function scrapeHackerNews(url: string) {
  const response = await fetch(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (compatible; research-bot/1.0)',
    },
  })
  const html = await response.text()
  const $ = cheerio.load(html)

  const stories: Array<{ rank: number; title: string; url: string; points: number }> = []

  $('.athing').each((i, el) => {
    const rank = parseInt($(el).find('.rank').text().replace('.', ''), 10)
    const titleEl = $(el).find('.titleline > a').first()
    const title = titleEl.text().trim()
    const href = titleEl.attr('href') ?? ''

    // points are in the next sibling row (the subtext row directly follows .athing)
    const subtext = $(el).next()
    const points = parseInt(subtext.find('.score').text(), 10) || 0

    stories.push({ rank, title, url: href, points })
  })

  return stories
}

Cheerio Selector Patterns

Mastering selectors is the key skill. Here are the patterns you will use most:

import * as cheerio from 'cheerio'

function demonstrateSelectors(html: string) {
  const $ = cheerio.load(html)

  // --- attribute selectors
  const dataLinks = $('a[data-track]')           // has attribute
  const exactMatch = $('input[type="submit"]')   // exact value
  const contains   = $('[class*="product-card"]') // contains substring
  const startsWith = $('[href^="https://"]')      // starts with

  // --- positional pseudo-selectors
  const firstCard  = $('.product-card').first()
  const lastCard   = $('.product-card').last()
  const thirdCard  = $('.product-card').eq(2)    // 0-indexed
  const evenRows   = $('tr:nth-child(even)')

  // --- traversal
  const parent       = $('.price').parent()
  const closestForm  = $('input').closest('form')
  const siblings     = $('h2').siblings('p')
  const nextSibling  = $('.title').next()
  const allChildren  = $('.card').children()

  // --- extracting data
  const text    = $('h1').text().trim()
  const html2   = $('.description').html()        // inner HTML as string
  const href    = $('a.cta').attr('href')
  const allHrefs = $('a').map((_, el) => $(el).attr('href')).get()

  // --- filtering
  const nonEmpty = $('p').filter((_, el) => $(el).text().trim().length > 0)
  const withPrice = $('.item').filter((_, el) => $(el).find('.price').length > 0)

  return { text, href, allHrefs }
}

Puppeteer: Dynamic Page Scraping

Use Puppeteer when the content you need is rendered by JavaScript after page load — single-page apps, infinite scroll, lazy-loaded images, login-gated content.

import puppeteer, { type Browser, type Page } from 'puppeteer'

async function launchBrowser(): Promise<Browser> {
  return puppeteer.launch({
    headless: true,
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-gpu',
    ],
  })
}

async function scrapeDynamicPage(
  url: string
): Promise<Array<{ title: string; price: string; imageUrl: string }>> {
  const browser = await launchBrowser()
  const page = await browser.newPage()

  try {
    // block images/fonts to speed up loads
    await page.setRequestInterception(true)
    page.on('request', (req) => {
      if (['image', 'font', 'media'].includes(req.resourceType())) {
        req.abort()
      } else {
        req.continue()
      }
    })

    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30_000 })

    // wait for a specific element before extracting
    await page.waitForSelector('.product-list', { timeout: 10_000 })

    // extract data in page context (runs in browser)
    const items = await page.evaluate(() => {
      return Array.from(document.querySelectorAll('.product-card')).map((el) => ({
        title: el.querySelector('.title')?.textContent?.trim() ?? '',
        price: el.querySelector('.price')?.textContent?.trim() ?? '',
        imageUrl: (el.querySelector('img') as HTMLImageElement | null)?.src ?? '',
      }))
    })

    return items
  } finally {
    await browser.close()
  }
}

Handling Infinite Scroll

async function scrapeInfiniteScroll(page: Page): Promise<string[]> {
  const results: string[] = []
  let previousHeight = 0

  while (true) {
    // extract items currently visible
    const newItems = await page.evaluate(() =>
      Array.from(document.querySelectorAll('.item-title')).map(
        (el) => el.textContent?.trim() ?? ''
      )
    )
    results.push(...newItems.slice(results.length))

    // scroll to bottom
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight))
    await new Promise((r) => setTimeout(r, 1500)) // wait for load

    const newHeight = await page.evaluate(() => document.body.scrollHeight)
    if (newHeight === previousHeight) break // no more content
    previousHeight = newHeight
  }

  return [...new Set(results)] // deduplicate
}
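
The loop above relies on `slice(results.length)`, which assumes the page only ever appends items. Some feeds virtualize the list and remove off-screen nodes, in which case that assumption may not hold. A minimal sketch of an order-preserving merge that tolerates this, assuming item titles are unique enough to act as keys (`mergeNewItems` is a hypothetical helper, not part of Puppeteer):

```typescript
// Hypothetical helper: merges a freshly extracted batch into the running
// result list, keeping first-seen order and skipping duplicates and blanks.
// Returns how many items were actually new.
function mergeNewItems(seen: Set<string>, results: string[], batch: string[]): number {
  let added = 0
  for (const item of batch) {
    if (item === '' || seen.has(item)) continue
    seen.add(item)
    results.push(item)
    added++
  }
  return added
}
```

Inside the scroll loop, a return value of 0 for a few consecutive rounds could serve as an additional stop condition alongside the height check.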

Rate Limiting — The Most Important Part

Scraping without rate limiting will get you banned. Treat rate limiting as mandatory, not optional.

class RateLimiter {
  private queue: Array<() => void> = []
  private running = 0

  constructor(
    private readonly maxConcurrent: number,
    private readonly minDelayMs: number,
    private readonly maxDelayMs: number
  ) {}

  // resolves with a release() function once a slot is free
  async acquire(): Promise<() => void> {
    if (this.running >= this.maxConcurrent) {
      // wait in line; release() hands its slot straight to the next waiter
      await new Promise<void>((resolve) => this.queue.push(() => resolve()))
    } else {
      this.running++
    }

    // jitter: random delay between min and max
    const delay =
      this.minDelayMs + Math.random() * (this.maxDelayMs - this.minDelayMs)
    await new Promise((r) => setTimeout(r, delay))

    return () => {
      const next = this.queue.shift()
      if (next) {
        next() // pass the slot on, so running never exceeds maxConcurrent
      } else {
        this.running--
      }
    }
  }
}

// usage
const limiter = new RateLimiter(2, 1000, 3000) // max 2 concurrent, 1-3s delay

async function fetchWithRateLimit(url: string): Promise<string> {
  const release = await limiter.acquire()
  try {
    const res = await fetch(url)
    if (!res.ok) throw new Error(`HTTP ${res.status}`)
    return res.text()
  } finally {
    release()
  }
}
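
For batch jobs, a small worker pool is another way to cap concurrency. The sketch below is an alternative technique, not the limiter above: it caps how many requests are in flight but adds no delay or jitter, so it would normally be combined with one. `mapWithConcurrency` and its parameters are hypothetical names.

```typescript
// Hypothetical helper: runs `worker` over `items` with at most `limit`
// calls in flight at once, returning results in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length)
  let nextIndex = 0

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (nextIndex < items.length) {
      const i = nextIndex++
      results[i] = await worker(items[i])
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length))
  await Promise.all(Array.from({ length: runnerCount }, () => runner()))
  return results
}

// usage (assuming the fetchWithRateLimit function from this article):
// const pages = await mapWithConcurrency(urls, 2, fetchWithRateLimit)
```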

Proxy Rotation

const proxies = [
  'http://proxy1.example.com:8080',
  'http://proxy2.example.com:8080',
  'http://proxy3.example.com:8080',
]

function getProxy(): string {
  return proxies[Math.floor(Math.random() * proxies.length)]
}

// with Puppeteer
async function launchWithProxy(proxy: string): Promise<Browser> {
  return puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${proxy}`, '--no-sandbox'],
  })
}

// with fetch (via undici or https-proxy-agent)
import { ProxyAgent, fetch as undiciFetch } from 'undici'

async function fetchViaProxy(url: string): Promise<string> {
  const proxy = getProxy()
  const dispatcher = new ProxyAgent(proxy)
  const res = await undiciFetch(url, { dispatcher })
  return res.text()
}
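
Random selection can pick the same proxy several times in a row and keeps returning proxies that have stopped working. A minimal sketch of a round-robin pool that skips proxies marked as failed, as one possible alternative to `getProxy()`. `ProxyPool`, `next`, and `markFailed` are hypothetical names, not part of Puppeteer or undici.

```typescript
// Hypothetical helper: round-robin proxy selection with failure tracking.
class ProxyPool {
  private readonly proxies: string[]
  private readonly failed = new Set<string>()
  private index = 0

  constructor(proxies: string[]) {
    if (proxies.length === 0) throw new Error('ProxyPool needs at least one proxy')
    this.proxies = proxies
  }

  // Return the next proxy in rotation that has not been marked as failed.
  next(): string {
    for (let i = 0; i < this.proxies.length; i++) {
      const candidate = this.proxies[this.index % this.proxies.length]
      this.index++
      if (!this.failed.has(candidate)) return candidate
    }
    throw new Error('All proxies are marked as failed')
  }

  // Call this after a connection error or a ban response from a proxy.
  markFailed(proxy: string): void {
    this.failed.add(proxy)
  }
}
```

Whether a failed proxy should be retried after a cool-down is a design choice this sketch leaves out.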

Retry Logic with Exponential Backoff

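async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (err) {
      // out of attempts: surface the last error to the caller
      if (attempt === maxAttempts) throw err
      // loop body reconstructed; the exact delay and jitter formulas are assumptions
      const delay = baseDelayMs * 2 ** (attempt - 1) // 1s, 2s, 4s, ...
      const jitter = Math.random() * baseDelayMs
      await new Promise((r) => setTimeout(r, delay + jitter))
    }
  }
  throw new Error('unreachable')
}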

// usage
const html = await withRetry(() => fetchWithRateLimit(url))
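
To make the backoff arithmetic concrete, here is a small sketch that lists the base delays (before jitter) for a given configuration. It assumes the common doubling formula, delay = baseDelayMs × 2^(attempt − 1); `backoffSchedule` is a hypothetical helper for illustration only.

```typescript
// Assumed formula: the wait after failed attempt n is baseDelayMs * 2^(n - 1).
function backoffSchedule(maxAttempts: number, baseDelayMs: number): number[] {
  const delays: number[] = []
  // There is no wait after the final attempt, so this yields maxAttempts - 1 delays.
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    delays.push(baseDelayMs * 2 ** (attempt - 1))
  }
  return delays
}

console.log(backoffSchedule(3, 1000)) // [1000, 2000]
console.log(backoffSchedule(5, 500))  // [500, 1000, 2000, 4000]
```

With the defaults of 3 attempts and a 1000 ms base, a request that keeps failing waits roughly 3 seconds in total, plus jitter, before the error is rethrown.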

Structured Data Extraction with Zod Validation

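import { z } from 'zod'
import * as cheerio from 'cheerio'
import type { Element } from 'domhandler' // DOM node types used by Cheerio

const ProductSchema = z.object({
  id: z.string(),
  title: z.string().min(1),
  price: z.number().positive(),
  currency: z.string().length(3),
  inStock: z.boolean(),
  imageUrl: z.string().url().nullable(),
})

// derive the TypeScript type from the schema so the two cannot drift apart
type Product = z.infer<typeof ProductSchema>

function extractProduct(el: cheerio.Cheerio<Element>, $: cheerio.CheerioAPI): Product {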
  const priceText = $(el).find('[data-price]').attr('data-price') ?? '0'

  const raw = {
    id: $(el).attr('data-product-id') ?? '',
    title: $(el).find('.product-title').text().trim(),
    price: parseFloat(priceText),
    currency: $(el).find('[data-currency]').attr('data-currency') ?? 'USD',
    inStock: $(el).find('.stock-badge').text().includes('In Stock'),
    imageUrl: $(el).find('img').attr('src') ?? null,
  }

  return ProductSchema.parse(raw) // throws ZodError if invalid
}

Choosing Between Puppeteer and Cheerio

Use Cheerio when the HTML you need is in the initial response body, when you need speed, or when you are running many concurrent requests. Use Puppeteer when content is loaded by JavaScript after the initial render, when you need to interact with the page (click, scroll, fill forms), or when you need to handle authentication flows.

A common pattern is to use Cheerio first, and fall back to Puppeteer only when the Cheerio output is empty — this gives you fast paths for static pages and automatic fallback for dynamic ones.
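
The fallback pattern can be sketched as a small wrapper that takes both extractors as arguments. All names here (`Extractor`, `scrapeWithFallback`, `staticExtract`, `dynamicExtract`) are hypothetical, and treating a failed static fetch like an empty result is an assumption you may want to change.

```typescript
// Hypothetical types and names; neither Cheerio nor Puppeteer provides these.
type Extractor<T> = (url: string) => Promise<T[]>

async function scrapeWithFallback<T>(
  url: string,
  staticExtract: Extractor<T>,  // e.g. fetch + cheerio.load
  dynamicExtract: Extractor<T>  // e.g. a Puppeteer-based scraper
): Promise<{ items: T[]; source: 'static' | 'dynamic' }> {
  try {
    const items = await staticExtract(url)
    if (items.length > 0) return { items, source: 'static' }
  } catch {
    // Assumption: a failed static attempt is handled like an empty result.
    // Rethrow here instead if network errors should stop the job.
  }
  // Static path found nothing, so pay the cost of a real browser.
  const items = await dynamicExtract(url)
  return { items, source: 'dynamic' }
}
```

Recording which path produced the data makes it easy to spot sites that always need the browser and route them there directly.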

People Also Ask

Is web scraping legal?

It depends on the site's terms of service, what data you are collecting, and your jurisdiction. Publicly available data with no ToS prohibition is generally safe. Scraping behind authentication, storing personal data, or violating a site's ToS can create legal exposure. Always check robots.txt and the site's ToS before scraping commercially.
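
As a starting point for the robots.txt check, here is a deliberately simplified sketch. It only reads groups that apply to `User-agent: *`, treats rules as plain path prefixes, and lets the longest matching rule win. Wildcards, `$` anchors, and bot-specific groups are ignored, so treat it as an illustration and not a complete parser. `isPathAllowed` is a hypothetical name.

```typescript
// Simplified sketch: prefix matching against `User-agent: *` groups only.
function isPathAllowed(robotsTxt: string, path: string): boolean {
  let inStarGroup = false
  let prevWasAgent = false
  let best = { length: -1, allow: true } // longest matching rule wins

  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.replace(/#.*$/, '').trim() // drop comments
    const sep = line.indexOf(':')
    if (sep === -1) continue
    const field = line.slice(0, sep).trim().toLowerCase()
    const value = line.slice(sep + 1).trim()

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      inStarGroup = (prevWasAgent && inStarGroup) || value === '*'
      prevWasAgent = true
      continue
    }
    prevWasAgent = false

    if (inStarGroup && (field === 'allow' || field === 'disallow')) {
      // An empty Disallow value means "nothing is disallowed", so skip it.
      if (value !== '' && path.startsWith(value) && value.length > best.length) {
        best = { length: value.length, allow: field === 'allow' }
      }
    }
  }
  return best.allow
}
```

For production use, a maintained robots.txt parser is likely a safer choice than hand-rolled matching.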

How do I handle CAPTCHAs in Puppeteer?

For light usage, keep a human in the loop: pause the script and prompt the user to solve the CAPTCHA. For automation, CAPTCHA solving services like 2Captcha or CapSolver provide APIs that return solved tokens. Some CAPTCHAs can be avoided entirely by using real browser fingerprints, the puppeteer-extra stealth plugin (puppeteer-extra-plugin-stealth), and appropriate rate limiting that mimics human behaviour.

What is the difference between waitUntil: 'networkidle0' and 'networkidle2'?

networkidle0 waits until there are zero network connections for 500ms — good for pages that stop all requests when fully loaded. networkidle2 waits until there are at most 2 connections for 500ms — better for pages that maintain persistent WebSocket or polling connections. Use networkidle2 as the default and only switch to networkidle0 if you know the page fully quiesces.
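
That default can be captured in a tiny helper that builds the options object passed to `page.goto`. This is a sketch: `gotoOptionsFor` and `fullyQuiesces` are hypothetical names, and the 30-second timeout simply mirrors the value used earlier in this guide.

```typescript
// Hypothetical helper; WaitUntil mirrors two of Puppeteer's lifecycle values.
type WaitUntil = 'networkidle0' | 'networkidle2'

interface GotoOptions {
  waitUntil: WaitUntil
  timeout: number
}

// Default to networkidle2; opt into networkidle0 only for pages known to go fully quiet.
function gotoOptionsFor(fullyQuiesces: boolean, timeoutMs = 30_000): GotoOptions {
  return {
    waitUntil: fullyQuiesces ? 'networkidle0' : 'networkidle2',
    timeout: timeoutMs,
  }
}

// usage (assuming `page` is a Puppeteer Page):
// await page.goto(url, gotoOptionsFor(false))
```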
