Crawlee is a web scraping and browser automation framework for Node.js. Built by Apify, it handles proxies, retries, fingerprints, sessions, and storage — so you focus on data extraction, not infrastructure.
## Why Crawlee?
- Anti-blocking — automatic proxy rotation, fingerprint randomization
- Smart retries — exponential backoff, session rotation
- Multiple crawlers — Cheerio (fast HTTP), Playwright (browser), Puppeteer
- Storage — datasets, key-value stores, request queues built-in
- Apify-ready — deploy to Apify Cloud with one command
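The retry behavior is configurable per crawler. A minimal sketch of the relevant options (`maxRequestRetries`, `requestHandlerTimeoutSecs`, and `failedRequestHandler` are real Crawlee options; the values here are illustrative, not recommendations):

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // How many times a failed request is retried (with backoff)
    // before it is marked as permanently failed.
    maxRequestRetries: 5,
    // Abort a hanging handler after this many seconds.
    requestHandlerTimeoutSecs: 30,
    // Called once all retries are exhausted.
    failedRequestHandler({ request }) {
        console.error(`Request ${request.url} failed too many times.`);
    },
    async requestHandler({ $ }) {
        // ... extraction logic
    },
});
```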
## Quick Start

```bash
npx crawlee create my-scraper
cd my-scraper
npm start
```
## Cheerio Crawler (Fast HTTP Scraping)

```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        const title = $('h1').text();
        const price = $('.price').text();
        const description = $('.description').text();

        await Dataset.pushData({
            url: request.url,
            title,
            price,
            description,
        });

        // Follow pagination
        await enqueueLinks({
            selector: '.pagination a.next',
        });
    },
    maxRequestsPerCrawl: 1000,
});

await crawler.run(['https://example-shop.com/products']);
```
## Playwright Crawler (JavaScript-Heavy Sites)

```javascript
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Wait for dynamic content
        await page.waitForSelector('.product-card');

        // Extract data
        const products = await page.$$eval('.product-card', (cards) =>
            cards.map((card) => ({
                name: card.querySelector('h2')?.textContent,
                price: card.querySelector('.price')?.textContent,
                image: card.querySelector('img')?.getAttribute('src'),
            }))
        );
        await Dataset.pushData(products);

        // Click "Load More" once; note that cards added by the click
        // would need another extraction pass (or a loop) to be captured.
        const loadMore = page.locator('button.load-more');
        if (await loadMore.isVisible()) {
            await loadMore.click();
            await page.waitForTimeout(2000);
        }
    },
    headless: true,
    maxRequestsPerCrawl: 500,
});

await crawler.run(['https://spa-shop.com']);
```
## Proxy Configuration

```javascript
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://user:pass@proxy1.com:8080',
        'http://user:pass@proxy2.com:8080',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    // Proxies rotate automatically
    async requestHandler({ request, $ }) {
        // ...
    },
});
```
## Session Management

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    sessionPoolOptions: {
        maxPoolSize: 100,
        sessionOptions: {
            maxUsageCount: 50, // Retire session after 50 uses
        },
    },
    async requestHandler({ session, $ }) {
        if ($('.captcha').length) {
            session.retire(); // Bad session, get a new one
            throw new Error('Captcha detected');
        }
    },
});
```
## Export Data

```javascript
import { Dataset } from 'crawlee';

// Data is auto-saved to ./storage/datasets/default/
// and can be exported as JSON, CSV, etc.
const dataset = await Dataset.open();
await dataset.exportToCSV('output');
await dataset.exportToJSON('output');
```
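Besides file exports, the stored items can be read back programmatically via `Dataset.getData()`, which returns a paginated result object. A short sketch (the `url` and `title` fields assume the Cheerio example above pushed them):

```javascript
import { Dataset } from 'crawlee';

const dataset = await Dataset.open();

// getData() returns { items, total, offset, limit, count }
const { items } = await dataset.getData({ limit: 100 });
for (const item of items) {
    console.log(item.url, item.title);
}
```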
Need production-ready scrapers? Check out my Apify actors — pre-built scrapers for Reddit, HN, Product Hunt, and more. Email spinov001@gmail.com for custom scraping projects.
Crawlee, Scrapy, or Playwright raw — what do you scrape with? Comment below!