Most web scraping tutorials show you `fetch(url)` and call it a day. Then you try it on a real site and get blocked within 10 requests.
I've built scrapers that process millions of pages. Here's what actually works in production — the patterns I use daily for anti-detection, rate limiting, error recovery, and data validation.
The Architecture
A production scraper has five layers:
- Request layer — manages HTTP calls, headers, proxies
- Rate limiter — respects target servers and avoids bans
- Retry/recovery — handles failures gracefully
- Parser — extracts and validates data
- Storage — persists results reliably
Let's build each one.
1. Request Layer: Not Getting Blocked
The number one reason scrapers get blocked isn't IP-based — it's fingerprinting. Sites look at your headers, their order, TLS fingerprint, and behavior patterns.
Rotating User Agents (The Right Way)
Don't use a random list from GitHub circa 2019. Use current, real browser user agents and keep the header order consistent with what that browser actually sends:
```javascript
const USER_AGENTS = [
  // Chrome 120 on Windows
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  // Chrome 120 on Mac
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  // Firefox 121 on Windows
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
];

function buildHeaders(ua) {
  // Chrome and Firefox send headers in different orders
  // and have different default headers. This matters.
  if (ua.includes('Chrome')) {
    return {
      'sec-ch-ua': '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
      'sec-ch-ua-mobile': '?0',
      'sec-ch-ua-platform': ua.includes('Windows') ? '"Windows"' : '"macOS"',
      'Upgrade-Insecure-Requests': '1',
      'User-Agent': ua,
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
      'Accept-Encoding': 'gzip, deflate, br',
      'Accept-Language': 'en-US,en;q=0.9',
    };
  }
  // Firefox headers
  return {
    'User-Agent': ua,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
  };
}
```
Proxy Rotation
If you're making more than a few hundred requests, you need proxies. Here's a rotation pattern that tracks proxy health:
```javascript
class ProxyPool {
  constructor(proxies) {
    this.proxies = proxies.map(p => ({
      url: p,
      failures: 0,
      lastUsed: 0,
      cooldownUntil: 0,
    }));
  }

  getNext() {
    const now = Date.now();
    const available = this.proxies
      .filter(p => p.failures < 5 && p.cooldownUntil < now)
      .sort((a, b) => a.lastUsed - b.lastUsed);
    if (available.length === 0) {
      // Reset proxies with fewer failures instead of crashing
      this.proxies.forEach(p => {
        if (p.failures < 10) { p.failures = 0; p.cooldownUntil = 0; }
      });
      return this.proxies[0];
    }
    const proxy = available[0];
    proxy.lastUsed = now;
    return proxy;
  }

  markFailed(proxy) {
    proxy.failures++;
    // Exponential cooldown: 10s, 20s, 40s, 80s...
    proxy.cooldownUntil = Date.now() + (10000 * Math.pow(2, proxy.failures - 1));
  }

  markSuccess(proxy) {
    proxy.failures = Math.max(0, proxy.failures - 1);
  }
}
```
2. Rate Limiting: Being a Good Citizen
Respect robots.txt. Not just because it's polite — because ignoring it is the fastest way to get your IP range blocked permanently.
```javascript
const robotsParser = require('robots-parser');

class RateLimiter {
  constructor({ requestsPerSecond = 1, respectRobotsTxt = true } = {}) {
    this.minDelay = 1000 / requestsPerSecond;
    this.lastRequest = new Map(); // per domain
    this.robotsCache = new Map();
    this.respectRobotsTxt = respectRobotsTxt;
  }

  async checkRobotsTxt(url) {
    if (!this.respectRobotsTxt) return true;
    const { origin } = new URL(url);
    if (!this.robotsCache.has(origin)) {
      try {
        const res = await fetch(`${origin}/robots.txt`);
        const body = await res.text();
        this.robotsCache.set(origin, robotsParser(`${origin}/robots.txt`, body));
      } catch {
        // If we can't fetch robots.txt, allow (but log it)
        this.robotsCache.set(origin, null);
      }
    }
    const robots = this.robotsCache.get(origin);
    if (!robots) return true;
    // Use Crawl-delay if specified
    const crawlDelay = robots.getCrawlDelay('*');
    if (crawlDelay) {
      this.minDelay = Math.max(this.minDelay, crawlDelay * 1000);
    }
    return robots.isAllowed(url, '*');
  }

  async waitForSlot(url) {
    const { hostname } = new URL(url);
    const allowed = await this.checkRobotsTxt(url);
    if (!allowed) {
      throw new Error(`robots.txt disallows: ${url}`);
    }
    const last = this.lastRequest.get(hostname) || 0;
    const elapsed = Date.now() - last;
    // Add jitter: ±20% randomness to look less bot-like
    const jitter = this.minDelay * (0.8 + Math.random() * 0.4);
    if (elapsed < jitter) {
      await new Promise(r => setTimeout(r, jitter - elapsed));
    }
    this.lastRequest.set(hostname, Date.now());
  }
}
```
The jitter is important. Bots make requests at perfectly regular intervals. Humans don't.
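Here's the jitter computation from `waitForSlot` in isolation. Each delay lands somewhere within ±20% of the base interval, so consecutive requests never fall on a fixed cadence:

```javascript
// Jittered delay as used in waitForSlot: uniform in [0.8, 1.2] × minDelay
function jitteredDelay(minDelay) {
  return minDelay * (0.8 + Math.random() * 0.4);
}

// Five sample gaps for a 1 req/s limiter (minDelay = 1000ms)
const samples = Array.from({ length: 5 }, () => Math.round(jitteredDelay(1000)));
console.log(samples); // e.g. [ 1104, 873, 952, 1187, 816 ] — never evenly spaced
```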
3. Retry Logic: Expect Failures
Production scrapers fail constantly. Networks drop, servers return 503s, proxies die. Your scraper needs to handle all of this without losing progress.
```javascript
async function fetchWithRetry(url, options = {}) {
  const { maxRetries = 3, proxyPool, rateLimiter } = options;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const proxy = proxyPool?.getNext();
    const ua = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
    try {
      await rateLimiter?.waitForSlot(url);

      const controller = new AbortController();
      const timeout = setTimeout(() => controller.abort(), 30000);
      let response;
      try {
        const fetchOptions = {
          headers: buildHeaders(ua),
          signal: controller.signal,
          redirect: 'follow',
        };
        // Add proxy via undici dispatcher or http agent
        if (proxy) {
          fetchOptions.dispatcher = createProxyAgent(proxy.url);
        }
        response = await fetch(url, fetchOptions);
      } finally {
        clearTimeout(timeout); // clear on every path, not just success
      }

      if (response.status === 429) {
        // Rate limited — back off exponentially
        const retryAfter = response.headers.get('retry-after');
        const delay = retryAfter
          ? parseInt(retryAfter, 10) * 1000
          : 5000 * Math.pow(2, attempt);
        console.warn(`Rate limited on ${url}, waiting ${delay}ms`);
        await new Promise(r => setTimeout(r, delay));
        continue;
      }
      if (response.status === 403 || response.status === 407) {
        // Proxy or IP blocked — rotate instead of retrying blindly
        if (proxy) proxyPool.markFailed(proxy);
        continue;
      }
      if (!response.ok) {
        throw new Error(`HTTP ${response.status}`);
      }

      if (proxy) proxyPool.markSuccess(proxy);
      return response;
    } catch (err) {
      if (proxy) proxyPool.markFailed(proxy);
      if (attempt === maxRetries) {
        throw new Error(`Failed after ${maxRetries + 1} attempts: ${url} — ${err.message}`);
      }
      // Exponential backoff with jitter
      const delay = Math.min(30000, 1000 * Math.pow(2, attempt) + Math.random() * 1000);
      console.warn(`Attempt ${attempt + 1} failed for ${url}: ${err.message}. Retrying in ${Math.round(delay)}ms`);
      await new Promise(r => setTimeout(r, delay));
    }
  }
  // All attempts consumed by 429/403 continues
  throw new Error(`Exhausted ${maxRetries + 1} attempts: ${url}`);
}
```
Key details that matter:
- 30-second timeout — don't let requests hang forever
- Respect `Retry-After` headers — the server is telling you exactly when to come back
- Different handling for 429 vs 403 — rate limiting means slow down, blocking means switch proxies
- Exponential backoff with jitter — prevents thundering herd when multiple workers retry simultaneously
4. Data Validation
Never trust scraped data. Sites change layouts, return error pages in 200 responses, or serve different content to bots.
```javascript
const { z } = require('zod');
const cheerio = require('cheerio');

// Define what valid data looks like
const ProductSchema = z.object({
  name: z.string().min(1).max(500),
  price: z.number().positive().max(1_000_000),
  currency: z.enum(['USD', 'EUR', 'GBP']),
  url: z.string().url(),
  scrapedAt: z.date(),
});

function parseProduct(html, sourceUrl) {
  const $ = cheerio.load(html);
  const raw = {
    name: $('h1.product-title').text().trim(),
    price: parseFloat($('.price').text().replace(/[^0-9.]/g, '')),
    // detectCurrency: small helper (not shown) that maps the currency
    // symbol in the price text to an ISO code
    currency: detectCurrency($('.price').text()),
    url: sourceUrl,
    scrapedAt: new Date(),
  };

  const result = ProductSchema.safeParse(raw);
  if (!result.success) {
    console.error(`Validation failed for ${sourceUrl}:`, result.error.issues);
    return null;
  }
  return result.data;
}
```
Using Zod (or any schema validator) catches issues like empty strings from changed selectors, NaN prices, and garbage data — before it hits your database.
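To see why, here's what the parsing pipeline above produces when a redesign leaves `.price` empty — plain JavaScript, no library needed:

```javascript
// A changed selector yields an empty string, and the price pipeline
// from parseProduct turns it into NaN:
const price = parseFloat(''.replace(/[^0-9.]/g, ''));

console.log(price);               // NaN
console.log(price > 0);           // false — so z.number().positive() rejects it
console.log(Number.isNaN(price)); // true — NaN would otherwise reach your DB
```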
5. Storage: Don't Lose Data
Write results incrementally. If your scraper crashes 80% through a 10,000-page job, you don't want to restart from zero.
```javascript
const fs = require('fs');

class JSONLWriter {
  constructor(filepath) {
    this.stream = fs.createWriteStream(filepath, { flags: 'a' });
    this.count = 0;
  }

  write(record) {
    this.stream.write(JSON.stringify(record) + '\n');
    this.count++;
  }

  async close() {
    return new Promise(resolve => this.stream.end(resolve));
  }
}

// Track progress for resumability
class ProgressTracker {
  constructor(filepath) {
    this.filepath = filepath;
    this.completed = new Set();
    this._load();
  }

  _load() {
    try {
      const data = fs.readFileSync(this.filepath, 'utf8');
      data.split('\n').filter(Boolean).forEach(url => this.completed.add(url));
    } catch { /* file doesn't exist yet */ }
  }

  isDone(url) {
    return this.completed.has(url);
  }

  markDone(url) {
    this.completed.add(url);
    fs.appendFileSync(this.filepath, url + '\n');
  }
}
```
JSONL (one JSON object per line) is the best format for scraping output. It's append-friendly, streaming-friendly, and you can process partial files if things go wrong.
Complete Working Example
Here's everything wired together — a scraper that fetches product data with all the production patterns we've covered:
```javascript
const fs = require('fs');
const cheerio = require('cheerio');

async function scrapeProducts(urls) {
  const rateLimiter = new RateLimiter({ requestsPerSecond: 0.5 });
  const writer = new JSONLWriter('./products.jsonl');
  const progress = new ProgressTracker('./progress.log');
  // Optional: const proxyPool = new ProxyPool(['http://proxy1:8080', ...]);

  const concurrency = 3;
  let index = 0;

  const worker = async () => {
    while (index < urls.length) {
      const url = urls[index++];
      if (progress.isDone(url)) continue;
      try {
        const response = await fetchWithRetry(url, {
          rateLimiter,
          // proxyPool,
          maxRetries: 3,
        });
        const html = await response.text();
        const product = parseProduct(html, url);
        if (product) {
          writer.write(product);
        }
        progress.markDone(url);
      } catch (err) {
        console.error(`Skipping ${url}: ${err.message}`);
        // Write failed URLs separately for manual review
        fs.appendFileSync('./failed.log', `${url}\t${err.message}\n`);
      }
    }
  };

  // Run workers in parallel
  await Promise.all(Array.from({ length: concurrency }, worker));
  await writer.close();
  console.log(`Done. ${writer.count} products saved.`);
}
```
What I'd Add Next
For a real production system, you'd also want:
- Structured logging (pino or winston) — `console.log` doesn't cut it when you're debugging why 3% of requests failed at 2 AM
- Metrics — track success rates, response times, proxy health over time
- Queue-based architecture — BullMQ or similar, so you can distribute work across machines
- Headless browser fallback — some pages need JavaScript rendering; use Playwright as a fallback when cheerio gets empty results
- Circuit breaker — if a site starts returning 90% errors, stop hitting it entirely instead of burning through retries
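The circuit breaker is the one I'd reach for first. A minimal counter-based sketch — the class name, thresholds, and method names here are illustrative, not from any particular library:

```javascript
// Minimal circuit breaker: trips open after `threshold` consecutive
// failures, then allows a single probe request after `cooldownMs`.
class CircuitBreaker {
  constructor({ threshold = 10, cooldownMs = 60_000 } = {}) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = 0;
  }

  canRequest(now = Date.now()) {
    if (this.failures < this.threshold) return true; // circuit closed
    return now - this.openedAt >= this.cooldownMs;   // half-open probe
  }

  recordSuccess() {
    this.failures = 0; // close the circuit again
  }

  recordFailure(now = Date.now()) {
    this.failures++;
    if (this.failures >= this.threshold) this.openedAt = now;
  }
}
```

Keep one instance per domain, check `canRequest()` before calling `fetchWithRetry`, and feed the outcome back with `recordSuccess()`/`recordFailure()`.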
Key Takeaways
- Match real browser fingerprints — header order and values matter more than just the User-Agent string
- Add jitter to everything — regular patterns are the easiest thing for anti-bot systems to detect
- Respect rate limits and robots.txt — it's both ethical and practical
- Validate everything — scraped data is untrusted input
- Design for failure — save progress incrementally, retry with backoff, log failures for review
- Start slow — 1 request per second with 3 retries will get you further than 50 requests per second with immediate bans
The difference between a toy scraper and a production one isn't the parsing logic — it's all the infrastructure around it that handles the messy reality of the web.
Built with Node.js 20+. All examples use native fetch (available since Node 18). For proxy support, add undici for the ProxyAgent dispatcher.