Web Scraping with Node.js: Cheerio, Puppeteer, and Playwright
Node.js is a popular platform for web scraping. This guide compares three widely used tools (Cheerio, Puppeteer, and Playwright) and shows where each one fits.
When to Use What
| Tool | Best For | Speed | JS Rendering |
|---|---|---|---|
| Cheerio | Static HTML parsing | Fastest | No |
| Puppeteer | Chrome automation | Medium | Yes |
| Playwright | Multi-browser testing | Medium | Yes |
Setup
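Install the packages used in this guide from npm: axios and cheerio for static pages, plus puppeteer and/or playwright for browser automation (npm install axios cheerio puppeteer playwright). Puppeteer downloads a compatible Chrome build during installation by default; Playwright's browsers are fetched separately with npx playwright install.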
Cheerio: Fast HTML Parsing
Cheerio provides a jQuery-style API on the server. It parses static HTML without launching a browser.
const axios = require("axios");
const cheerio = require("cheerio");

// Fetch a page and extract one record per <article class="post">.
async function scrapeWithCheerio(url) {
  const { data } = await axios.get(url, {
    headers: {
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    }
  });
  // Parse the raw HTML string into a queryable document.
  const $ = cheerio.load(data);
  const results = [];
  $("article.post").each((i, el) => {
    results.push({
      title: $(el).find("h2").text().trim(),
      link: $(el).find("a").attr("href"),
      summary: $(el).find(".summary").text().trim(),
      date: $(el).find("time").attr("datetime")
    });
  });
  return results;
}

// Usage: wrapped in an async function because CommonJS has no top-level await
(async () => {
  const posts = await scrapeWithCheerio("https://example-blog.com");
  console.log(`Found ${posts.length} posts`);
})().catch(console.error);
When Cheerio Falls Short
Cheerio cannot execute JavaScript. If the page loads content dynamically (SPAs, infinite scroll, lazy loading), you need a browser-based tool.
Puppeteer: Chrome Automation
Puppeteer drives a headless Chrome or Chromium browser, which makes it a good fit for JavaScript-heavy sites.
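As a minimal sketch of what a Puppeteer scrape might look like: the selectors below (article.post, h2, a) simply mirror the Cheerio example and are assumptions about the target page. The names extractPosts, scrapeWithPuppeteer, and main are hypothetical helpers for illustration.

```javascript
// Sketch only: selectors are assumptions carried over from the Cheerio example.

// Runs inside the page via page.$$eval and receives the matched elements.
function extractPosts(articles) {
  return articles.map((el) => ({
    title: el.querySelector("h2")?.textContent.trim() ?? "",
    link: el.querySelector("a")?.getAttribute("href") ?? null,
  }));
}

// `browser` is a Puppeteer Browser, e.g. the result of puppeteer.launch().
async function scrapeWithPuppeteer(browser, url) {
  const page = await browser.newPage();
  try {
    // Wait until the network is mostly idle so client-side rendering can finish.
    await page.goto(url, { waitUntil: "networkidle2", timeout: 30000 });
    await page.waitForSelector("article.post", { timeout: 10000 });
    return await page.$$eval("article.post", extractPosts);
  } finally {
    // Always release the tab, even if navigation or extraction throws.
    await page.close();
  }
}

// Hypothetical entry point; call main() to run against a real site.
async function main() {
  const puppeteer = require("puppeteer"); // npm install puppeteer
  const browser = await puppeteer.launch();
  try {
    const posts = await scrapeWithPuppeteer(browser, "https://example-blog.com");
    console.log(`Found ${posts.length} posts`);
  } finally {
    await browser.close();
  }
}
```

Passing the browser in as a parameter lets one browser instance serve many pages, which tends to be cheaper than launching a new browser per URL.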
Playwright: Multi-Browser Power
Playwright drives Chromium, Firefox, and WebKit (the engine behind Safari) through a single API.
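A rough sketch of the same scrape with Playwright, assuming the same hypothetical page structure (article.post, h2, a) as the Cheerio example. scrapeWithPlaywright and main are illustrative names, not part of Playwright itself.

```javascript
// Sketch only: selectors are assumptions carried over from the Cheerio example.

// `browserType` is one of Playwright's chromium, firefox, or webkit objects.
async function scrapeWithPlaywright(browserType, url) {
  const browser = await browserType.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded" });
    const posts = page.locator("article.post");
    // Wait for the first post to appear before reading the whole list.
    await posts.first().waitFor({ timeout: 10000 });
    // evaluateAll runs the callback in the page with all matched elements.
    return await posts.evaluateAll((articles) =>
      articles.map((el) => ({
        title: el.querySelector("h2")?.textContent.trim() ?? "",
        link: el.querySelector("a")?.getAttribute("href") ?? null,
      }))
    );
  } finally {
    await browser.close();
  }
}

// Hypothetical entry point: run the same scrape in all three engines.
async function main() {
  const { chromium, firefox, webkit } = require("playwright"); // npm install playwright
  for (const browserType of [chromium, firefox, webkit]) {
    const posts = await scrapeWithPlaywright(browserType, "https://example-blog.com");
    console.log(`${browserType.name()}: ${posts.length} posts`);
  }
}
```

Because the browser type is a parameter, switching engines is a one-argument change, which is the main practical difference from a Chrome-only setup.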
Building a Production Scraper
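One possible way to combine the tools is to try the cheap static path first and fall back to a browser only when it yields nothing. The sketch below is an assumption about how such a scraper could be structured, not a reference implementation; scrapeAdaptive, scrapeStatic, and scrapeRendered are hypothetical names, and the two strategies are passed in as functions.

```javascript
// Sketch only: an adaptive scraper that prefers static parsing and falls
// back to a browser-based strategy. Both strategies are supplied by the caller
// and must return a Promise of an array of items.
async function scrapeAdaptive(url, { scrapeStatic, scrapeRendered }) {
  try {
    const items = await scrapeStatic(url);
    // A non-empty result suggests the content is present in the raw HTML.
    if (items.length > 0) {
      return { source: "static", items };
    }
  } catch (err) {
    // Network errors or blocks on the static path should not abort the run.
    console.warn(`Static scrape failed for ${url}: ${err.message}`);
  }
  // Empty or failed static result: assume the page needs JavaScript rendering.
  const items = await scrapeRendered(url);
  return { source: "rendered", items };
}
```

For example, scrapeStatic could be the scrapeWithCheerio function from earlier, and scrapeRendered any function that loads the page in Puppeteer or Playwright and returns the same record shape.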
Scaling Node.js Scrapers
For production scraping:
- ScraperAPI — proxy rotation and CAPTCHA solving, works with all three tools
- ThorData — residential proxies for sites that block datacenter IPs
- ScrapeOps — monitoring dashboard for your scraping pipeline
Performance Tips
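- Use Cheerio when possible: parsing static HTML avoids launching a browser and is typically far faster (roughly 10x by this guide's estimate; actual gains vary by site)
- Block unnecessary resources (images, fonts, stylesheets) in Puppeteer/Playwright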
- Reuse browser instances instead of launching new ones
- Use connection pooling for concurrent requests
- Implement retry logic with exponential backoff
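The last tip can be sketched as a small wrapper. This is one common variant (exponential backoff with random jitter), shown under assumed defaults; withRetry, backoffDelay, and the option names are hypothetical and not taken from any library.

```javascript
// Sketch only: retry with exponential backoff and jitter. Defaults are
// illustrative assumptions, not recommended production values.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Upper bound on the wait before retry `attempt` (0-based): base * 2^attempt, capped.
function backoffDelay(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs `task` and retries it up to `retries` more times if it throws.
async function withRetry(task, { retries = 3, baseMs = 500, maxMs = 30000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      // Out of retries: surface the last error to the caller.
      if (attempt >= retries) throw err;
      // Jitter: wait a random time below the backoff bound so that
      // concurrent workers do not all retry at the same moment.
      await sleep(Math.random() * backoffDelay(attempt, baseMs, maxMs));
    }
  }
}
```

A call such as withRetry(() => scrapeWithCheerio(url), { retries: 4 }) would then retry a failing request up to four times before giving up.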
Conclusion
Pick the right tool for the job: Cheerio for static pages, Puppeteer for Chrome-specific needs, Playwright for multi-browser support. Combine them in a smart scraper that adapts to each target site.