DEV Community

agenthustler

Web Scraping with Node.js: Cheerio, Puppeteer, and Playwright

Node.js is a popular platform for web scraping. This guide compares three widely used tools (Cheerio, Puppeteer, and Playwright) with a practical example for each.

When to Use What

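| Tool       | Best For              | Speed   | JS Rendering |
|------------|-----------------------|---------|--------------|
| Cheerio    | Static HTML parsing   | Fastest | No           |
| Puppeteer  | Chrome automation     | Medium  | Yes          |
| Playwright | Multi-browser testing | Medium  | Yes          |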

Setup

npm install axios cheerio puppeteer playwright
# Playwright downloads its browsers in a separate step
npx playwright install

Cheerio: Fast HTML Parsing

Cheerio provides a jQuery-style API on the server. It parses static HTML without launching a browser.

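const axios = require("axios");
const cheerio = require("cheerio");

// Fetch a page and extract post metadata from its static HTML.
async function scrapeWithCheerio(url) {
  // Send a browser-like User-Agent; some servers reject the default client string.
  const { data } = await axios.get(url, {
    headers: {
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    }
  });

  // Load the HTML into a jQuery-style selector API.
  const $ = cheerio.load(data);
  const results = [];

  // Each element matching article.post becomes one result object.
  $("article.post").each((i, el) => {
    results.push({
      title: $(el).find("h2").text().trim(),
      link: $(el).find("a").attr("href"),
      summary: $(el).find(".summary").text().trim(),
      date: $(el).find("time").attr("datetime")
    });
  });

  return results;
}

// Usage: this file uses require() (CommonJS), where top-level await is not
// available, so handle the returned promise explicitly.
scrapeWithCheerio("https://example-blog.com")
  .then((posts) => console.log(`Found ${posts.length} posts`))
  .catch((err) => console.error(`Scrape failed: ${err.message}`));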

When Cheerio Falls Short

Cheerio cannot execute JavaScript. If the page loads content dynamically (SPAs, infinite scroll, lazy loading), you need a browser-based tool.

Puppeteer: Chrome Automation

Puppeteer controls a headless Chrome browser, which makes it a good fit for JavaScript-heavy sites.
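A minimal sketch of the same extraction done with Puppeteer. The `article.post` selector is carried over from the Cheerio example as an assumption about the target page, and the second parameter (`launcher`) is a hypothetical hook added here so the function can be exercised without a real browser; by default it loads the `puppeteer` package.

```javascript
// Sketch only: selectors and timeouts are assumptions, not values from a real site.
async function scrapeWithPuppeteer(url, launcher = require("puppeteer")) {
  const browser = await launcher.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so client-side rendering can finish.
    await page.goto(url, { waitUntil: "networkidle2", timeout: 30000 });
    await page.waitForSelector("article.post", { timeout: 10000 });

    // The callback runs inside the page, so it uses DOM APIs rather than Cheerio.
    return await page.$$eval("article.post", (nodes) =>
      nodes.map((el) => ({
        title: el.querySelector("h2")?.textContent.trim() ?? "",
        link: el.querySelector("a")?.getAttribute("href") ?? null
      }))
    );
  } finally {
    // Always close the browser, even when navigation or extraction fails.
    await browser.close();
  }
}
```

Closing the browser in `finally` keeps failed runs from leaving Chrome processes behind.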


Playwright: Multi-Browser Power

Playwright drives Chromium, Firefox, and WebKit (the engine behind Safari) through a single, consistent API.
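A minimal sketch of a Playwright scraper that picks the browser engine by name. The `article.post` selector is again an assumption about the target page, and the third parameter (`pw`) is a hypothetical hook for substituting the `playwright` module in tests.

```javascript
// Sketch only: browserName is expected to be "chromium", "firefox", or "webkit".
async function scrapeWithPlaywright(url, browserName = "chromium", pw = require("playwright")) {
  const browserType = pw[browserName];
  if (!browserType || typeof browserType.launch !== "function") {
    throw new Error(`Unknown browser: ${browserName}`);
  }

  const browser = await browserType.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded" });
    // Wait for at least one post to be rendered before extracting.
    await page.waitForSelector("article.post");

    // evaluateAll runs the callback in the page with every matching element.
    return await page.locator("article.post").evaluateAll((nodes) =>
      nodes.map((el) => ({
        title: el.querySelector("h2")?.textContent.trim() ?? "",
        link: el.querySelector("a")?.getAttribute("href") ?? null
      }))
    );
  } finally {
    await browser.close();
  }
}
```

Switching engines is then a one-argument change, for example `scrapeWithPlaywright(url, "webkit")`.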


Building a Production Scraper

A production scraper can combine the tools above: try the fast Cheerio path first, and fall back to a headless browser only when the static HTML does not contain the data.
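A minimal sketch of a scraper that adapts to each site, under the assumption that an empty result from the static path means the page needs JavaScript rendering. `smartScrape` and its two callback parameters are hypothetical names; `staticScraper` could be the `scrapeWithCheerio` function shown earlier, and `browserScraper` any Puppeteer- or Playwright-based function that returns the same shape.

```javascript
// Sketch only: "empty result means JS-rendered page" is a heuristic, not a guarantee.
async function smartScrape(url, staticScraper, browserScraper) {
  try {
    const items = await staticScraper(url);
    if (items.length > 0) {
      return { source: "static", items };
    }
  } catch (err) {
    // A failed static fetch is not fatal; the browser path gets a chance next.
    console.warn(`Static scrape failed for ${url}: ${err.message}`);
  }

  // Slower path: render the page in a real browser.
  const items = await browserScraper(url);
  return { source: "browser", items };
}
```

Returning the `source` alongside the items makes it easy to log how often the expensive browser path is actually needed.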

Scaling Node.js Scrapers

For production scraping:

  • ScraperAPI — proxy rotation and CAPTCHA solving, works with all three tools
  • ThorData — residential proxies for sites that block datacenter IPs
  • ScrapeOps — monitoring dashboard for your scraping pipeline

Performance Tips

  1. Use Cheerio when possible: with no browser to launch or render, it is often around 10x faster than browser-based scraping
  2. Block unnecessary resources in Puppeteer/Playwright
  3. Reuse browser instances instead of launching new ones
  4. Use connection pooling for concurrent requests
  5. Implement retry logic with exponential backoff
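Tips 2 and 5 above can be sketched as follows. `blockHeavyResources` uses Puppeteer's request interception on a `page` object; the set of blocked resource types is an assumption to tune per site. `withRetry` is a hypothetical helper, and its option names (`retries`, `baseDelayMs`, `sleep`) are illustrative rather than taken from any library.

```javascript
// Resource types that rarely matter for data extraction (an assumption; tune per site).
const BLOCKED_TYPES = new Set(["image", "stylesheet", "font", "media"]);

// Tip 2: abort heavy requests on a Puppeteer page. Call this before page.goto().
async function blockHeavyResources(page) {
  await page.setRequestInterception(true);
  page.on("request", (request) => {
    if (BLOCKED_TYPES.has(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  });
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Tip 5: retry a failing async task, doubling the delay after each failure.
async function withRetry(task, { retries = 3, baseDelayMs = 500, sleep = defaultSleep } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt === retries) break;
      // Delays grow as baseDelayMs, 2x, 4x, ... between attempts.
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

A call such as `withRetry(() => scrapeWithCheerio(url))` would wrap the earlier Cheerio example; adding random jitter to the delay is a common refinement when many workers retry at once.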

Conclusion

Pick the right tool for the job: Cheerio for static pages, Puppeteer for Chrome-specific needs, Playwright for multi-browser support. Combine them in a smart scraper that adapts to each target site.

