I've built scrapers with Puppeteer, Playwright, Cheerio, and Axios. Each time I'd rebuild the same things: request queue, retry logic, proxy rotation, error handling. Then I found Crawlee — all of that comes built-in.
## What Crawlee Offers
Crawlee is open source and free. Out of the box you get:
- Built-in request queue — persistent, resumable
- Auto-retry — failed requests retry automatically
- Proxy rotation — built-in proxy management
- Multiple crawlers — HTTP (Cheerio), Playwright, Puppeteer
- Auto-scaling — adjusts concurrency based on system load
- Storage — datasets, key-value stores, request queues
- TypeScript-first — full type safety
- Apify integration — deploy to Apify with zero changes
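To see what that list saves you, here's a rough sketch of the boilerplate you end up hand-rolling without it: a deduplicating queue plus a retry loop. This is not Crawlee code — `fetchPage` is a hypothetical stand-in for your HTTP call, injected so the loop stays testable.

```js
// A minimal hand-rolled crawl loop: dedup queue + retries, the kind of
// plumbing Crawlee ships built-in. `fetchPage(url)` is a hypothetical
// callback that returns { data, links } or throws on failure.
async function crawl(startUrls, fetchPage, { maxRetries = 3 } = {}) {
  const seen = new Set(startUrls);
  const queue = [...startUrls];
  const results = [];

  while (queue.length > 0) {
    const url = queue.shift();
    let lastError;

    // Retry each URL up to maxRetries extra times before giving up
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        const { data, links = [] } = await fetchPage(url);
        results.push({ url, data });
        // Enqueue newly discovered links exactly once
        for (const link of links) {
          if (!seen.has(link)) {
            seen.add(link);
            queue.push(link);
          }
        }
        lastError = undefined;
        break;
      } catch (err) {
        lastError = err;
      }
    }
    if (lastError) results.push({ url, failed: true });
  }
  return results;
}
```

And that's before persistence, proxy rotation, or concurrency — which is exactly the point.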
## Quick Start

```bash
npx crawlee create my-scraper
cd my-scraper
npm start
```
## HTTP Scraper (Fast)

```js
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    maxConcurrency: 10,
    maxRequestRetries: 3,
    async requestHandler({ request, $, enqueueLinks }) {
        // Extract data
        const title = $('h1').text().trim();
        const price = $('.price').text().trim();
        const description = $('meta[name="description"]').attr('content');

        // Save to dataset
        await Dataset.pushData({
            url: request.url,
            title,
            price,
            description,
            scrapedAt: new Date().toISOString(),
        });

        // Follow links (auto-queued and deduplicated)
        await enqueueLinks({
            selector: '.product-link',
            label: 'PRODUCT',
        });
    },
});

await crawler.run(['https://example-shop.com/products']);

// Export results
const dataset = await Dataset.open();
await dataset.exportToCSV('products');
```
## Browser Scraper (JavaScript Sites)

```js
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 5,
    launchContext: {
        launchOptions: { headless: true },
    },
    async requestHandler({ page, request, enqueueLinks }) {
        // Wait for dynamic content
        await page.waitForSelector('.product-card');

        // Extract data from the rendered page
        const products = await page.$$eval('.product-card', (cards) =>
            cards.map((card) => ({
                name: card.querySelector('h2')?.textContent?.trim(),
                price: card.querySelector('.price')?.textContent?.trim(),
                rating: card.querySelector('.stars')?.getAttribute('data-rating'),
            }))
        );
        await Dataset.pushData(products);

        // Handle pagination
        const nextLink = await page.$('.pagination .next a');
        if (nextLink) {
            await enqueueLinks({ selector: '.pagination .next a' });
        }
    },
});

await crawler.run(['https://spa-shop.com/products']);
```
## Proxy Rotation

```js
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfig = new ProxyConfiguration({
    proxyUrls: [
        'http://user:pass@proxy1.com:8080',
        'http://user:pass@proxy2.com:8080',
        'http://user:pass@proxy3.com:8080',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration: proxyConfig,
    // Crawlee rotates proxies automatically and retries failed requests
    async requestHandler({ request, $ }) {
        // Scraping logic goes here
    },
});
```
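At its simplest, rotation is just cycling through the list. Here's a minimal round-robin sketch of the idea — not Crawlee's actual implementation, which also handles things like session affinity and tiered proxies:

```js
// A minimal round-robin proxy picker, illustrating the core of rotation.
// Each call returns the next proxy URL, wrapping back to the start.
function createProxyRotator(proxyUrls) {
    let index = 0;
    return function nextProxy() {
        const proxy = proxyUrls[index];
        index = (index + 1) % proxyUrls.length; // wrap around the list
        return proxy;
    };
}

const nextProxy = createProxyRotator([
    'http://proxy1.com:8080',
    'http://proxy2.com:8080',
]);
```

With `ProxyConfiguration` you never write this yourself, but it helps to know what "auto-rotation" means when debugging blocked requests.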
## Error Handling & Retries

```js
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 5,
    async requestHandler({ request, $ }) {
        // extractData is your own parsing helper
        const data = extractData($);
        // Throwing marks the request as failed, so Crawlee retries it
        if (!data.title) throw new Error('Missing title, will retry');
        await Dataset.pushData(data);
    },
    // Called once a request has exhausted its retries;
    // note that the error arrives as the second argument
    async failedRequestHandler({ request }, error) {
        console.log(`Failed after 5 retries: ${request.url}`);
        console.log(`Error: ${error.message}`);
        // Log failed URLs for manual review
    },
});
```
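Retries shouldn't fire back-to-back, or you'll hammer a struggling server. The standard fix is exponential backoff with jitter — Crawlee spaces its retries for you, but the arithmetic is worth seeing once:

```js
// Exponential backoff with full jitter: the delay ceiling doubles with
// each attempt (capped at maxMs), and the actual delay is a random value
// below that ceiling so many clients don't retry in lockstep.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 30_000) {
    const ceiling = Math.min(baseMs * 2 ** attempt, maxMs);
    return Math.random() * ceiling; // pick uniformly in [0, ceiling)
}
```

So attempt 0 waits under 500 ms, attempt 3 under 4 s, and everything is capped at 30 s — sensible defaults, though the right numbers depend on the site you're scraping.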
## Deploy to Apify (One Command)

```bash
# Your Crawlee scraper becomes an Apify Actor
apify init
apify push
# Now available on the Apify Store with API, scheduling, and monitoring
```
## Why Crawlee

| Crawlee | Raw Puppeteer/Playwright |
|---|---|
| Auto-retry built in | Build retry logic yourself |
| Persistent request queue | In-memory queue |
| Proxy rotation built in | Manual proxy handling |
| Auto-scaling concurrency | Manual concurrency tuning |
| Export to CSV/JSON | Build export logic yourself |
| TypeScript types included | Add types yourself |
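The "persistent request queue" row matters most in practice: if a 50,000-page crawl dies at page 40,000, an in-memory queue starts over while a persisted one resumes. Here's a toy file-backed queue to illustrate why — a hypothetical sketch, not Crawlee's actual storage format (its `RequestQueue` also adds deduplication and locking):

```js
import { readFileSync, writeFileSync, existsSync } from 'node:fs';

// A toy persistent queue: state lives in a JSON file, so it survives a
// process crash or restart. Purely illustrative of the resumability idea.
class FileQueue {
    constructor(path) {
        this.path = path;
        this.items = existsSync(path)
            ? JSON.parse(readFileSync(path, 'utf8'))
            : [];
    }
    push(url) {
        this.items.push(url);
        this.save();
    }
    shift() {
        const url = this.items.shift();
        this.save();
        return url;
    }
    save() {
        writeFileSync(this.path, JSON.stringify(this.items));
    }
}
```

Kill the process, construct a new `FileQueue` on the same path, and the pending URLs are still there — which is what lets a Crawlee run be stopped and resumed.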
Want ready-made scrapers? Check out my web scraping actors on Apify — pre-built for Google, Amazon, Reddit, and 70+ sites.
Need a custom scraper? Email me at spinov001@gmail.com — I build production scrapers with Crawlee.