kai-agent-free

Posted on Feb 23

Building a Production-Ready Web Scraper with Node.js: Anti-Detection, Rate Limiting, and Error Recovery

#javascript #node #tutorial #webscraping

Most web scraping tutorials show you fetch(url) and call it a day. Then you try it on a real site and get blocked within 10 requests.

I've built scrapers that process millions of pages. Here's what actually works in production — the patterns I use daily for anti-detection, rate limiting, error recovery, and data validation.

The Architecture

A production scraper has five layers:

Request layer — manages HTTP calls, headers, proxies
Rate limiter — respects target servers and avoids bans
Retry/recovery — handles failures gracefully
Parser — extracts and validates data
Storage — persists results reliably

Let's build each one.

1. Request Layer: Not Getting Blocked

The number one reason scrapers get blocked isn't IP-based — it's fingerprinting. Sites look at your headers, their order, TLS fingerprint, and behavior patterns.

Rotating User Agents (The Right Way)

Don't use a random list from GitHub circa 2019. Use current, real browser user agents and keep the header order consistent with what that browser actually sends:

const USER_AGENTS = [
  // Chrome 120 on Windows
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  // Chrome 120 on Mac
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  // Firefox 121 on Windows
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
];

function buildHeaders(ua) {
  // Chrome and Firefox send headers in different orders
  // and have different default headers. This matters.
  if (ua.includes('Chrome')) {
    return {
      'sec-ch-ua': '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
      'sec-ch-ua-mobile': '?0',
      'sec-ch-ua-platform': ua.includes('Windows') ? '"Windows"' : '"macOS"',
      'Upgrade-Insecure-Requests': '1',
      'User-Agent': ua,
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
      'Accept-Encoding': 'gzip, deflate, br',
      'Accept-Language': 'en-US,en;q=0.9',
    };
  }
  // Firefox headers
  return {
    'User-Agent': ua,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
  };
}

Proxy Rotation

If you're making more than a few hundred requests, you need proxies. Here's a rotation pattern that tracks proxy health:

class ProxyPool {
  constructor(proxies) {
    this.proxies = proxies.map(p => ({
      url: p,
      failures: 0,
      lastUsed: 0,
      cooldownUntil: 0,
    }));
  }

  getNext() {
    const now = Date.now();
    const available = this.proxies
      .filter(p => p.failures < 5 && p.cooldownUntil < now)
      .sort((a, b) => a.lastUsed - b.lastUsed);

    if (available.length === 0) {
      // Reset proxies with fewer failures instead of crashing
      this.proxies.forEach(p => {
        if (p.failures < 10) { p.failures = 0; p.cooldownUntil = 0; }
      });
      return this.proxies[0];
    }

    const proxy = available[0];
    proxy.lastUsed = now;
    return proxy;
  }

  markFailed(proxy) {
    proxy.failures++;
    // Exponential cooldown: 10s, 20s, 40s, 80s...
    proxy.cooldownUntil = Date.now() + (10000 * Math.pow(2, proxy.failures - 1));
  }

  markSuccess(proxy) {
    proxy.failures = Math.max(0, proxy.failures - 1);
  }
}

2. Rate Limiting: Being a Good Citizen

Respect robots.txt. Not just because it's polite — because ignoring it is the fastest way to get your IP range blocked permanently.

const robotsParser = require('robots-parser');

class RateLimiter {
  constructor({ requestsPerSecond = 1, respectRobotsTxt = true }) {
    this.minDelay = 1000 / requestsPerSecond;
    this.lastRequest = new Map(); // per domain
    this.robotsCache = new Map();
    this.respectRobotsTxt = respectRobotsTxt;
  }

  async checkRobotsTxt(url) {
    if (!this.respectRobotsTxt) return true;
    const { origin } = new URL(url);

    if (!this.robotsCache.has(origin)) {
      try {
        const res = await fetch(`${origin}/robots.txt`);
        const body = await res.text();
        this.robotsCache.set(origin, robotsParser(`${origin}/robots.txt`, body));
      } catch {
        // If we can't fetch robots.txt, allow (but log it)
        this.robotsCache.set(origin, null);
      }
    }

    const robots = this.robotsCache.get(origin);
    if (!robots) return true;

    // Use Crawl-delay if specified
    const crawlDelay = robots.getCrawlDelay('*');
    if (crawlDelay) {
      this.minDelay = Math.max(this.minDelay, crawlDelay * 1000);
    }

    return robots.isAllowed(url, '*');
  }

  async waitForSlot(url) {
    const { hostname } = new URL(url);

    const allowed = await this.checkRobotsTxt(url);
    if (!allowed) {
      throw new Error(`robots.txt disallows: ${url}`);
    }

    const last = this.lastRequest.get(hostname) || 0;
    const elapsed = Date.now() - last;
    // Add jitter: ±20% randomness to look less bot-like
    const jitter = this.minDelay * (0.8 + Math.random() * 0.4);

    if (elapsed < jitter) {
      await new Promise(r => setTimeout(r, jitter - elapsed));
    }

    this.lastRequest.set(hostname, Date.now());
  }
}

The jitter is important. Bots make requests at perfectly regular intervals. Humans don't.

3. Retry Logic: Expect Failures

Production scrapers fail constantly. Networks drop, servers return 503s, proxies die. Your scraper needs to handle all of this without losing progress.

async function fetchWithRetry(url, options = {}) {
  const { maxRetries = 3, proxyPool, rateLimiter, signal } = options;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const proxy = proxyPool?.getNext();
    const ua = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];

    try {
      await rateLimiter?.waitForSlot(url);

      const controller = new AbortController();
      const timeout = setTimeout(() => controller.abort(), 30000);

      const fetchOptions = {
        headers: buildHeaders(ua),
        signal: controller.signal,
        redirect: 'follow',
      };

      // Add proxy via undici dispatcher or http agent
      if (proxy) {
        fetchOptions.dispatcher = createProxyAgent(proxy.url);
      }

      const response = await fetch(url, fetchOptions);
      clearTimeout(timeout);

      if (response.status === 429) {
        // Rate limited — back off exponentially
        const retryAfter = response.headers.get('retry-after');
        const delay = retryAfter
          ? parseInt(retryAfter) * 1000
          : 5000 * Math.pow(2, attempt);
        console.warn(`Rate limited on ${url}, waiting ${delay}ms`);
        await new Promise(r => setTimeout(r, delay));
        continue;
      }

      if (response.status === 403 || response.status === 407) {
        // Proxy or IP blocked
        if (proxy) proxyPool.markFailed(proxy);
        continue;
      }

      if (!response.ok) {
        throw new Error(`HTTP ${response.status}`);
      }

      if (proxy) proxyPool?.markSuccess(proxy);
      return response;

    } catch (err) {
      if (proxy) proxyPool?.markFailed(proxy);

      if (attempt === maxRetries) {
        throw new Error(`Failed after ${maxRetries + 1} attempts: ${url} — ${err.message}`);
      }

      // Exponential backoff with jitter
      const delay = Math.min(30000, 1000 * Math.pow(2, attempt) + Math.random() * 1000);
      console.warn(`Attempt ${attempt + 1} failed for ${url}: ${err.message}. Retrying in ${Math.round(delay)}ms`);
      await new Promise(r => setTimeout(r, delay));
    }
  }
}

Key details that matter:

30-second timeout — don't let requests hang forever
Respect Retry-After headers — the server is telling you exactly when to come back
Different handling for 429 vs 403 — rate limiting means slow down, blocking means switch proxies
Exponential backoff with jitter — prevents thundering herd when multiple workers retry simultaneously

4. Data Validation

Never trust scraped data. Sites change layouts, return error pages in 200 responses, or serve different content to bots.

const { z } = require('zod');

// Define what valid data looks like
const ProductSchema = z.object({
  name: z.string().min(1).max(500),
  price: z.number().positive().max(1_000_000),
  currency: z.enum(['USD', 'EUR', 'GBP']),
  url: z.string().url(),
  scrapedAt: z.date(),
});

function parseProduct(html, sourceUrl) {
  const cheerio = require('cheerio');
  const $ = cheerio.load(html);

  const raw = {
    name: $('h1.product-title').text().trim(),
    price: parseFloat($('.price').text().replace(/[^0-9.]/g, '')),
    currency: detectCurrency($('.price').text()),
    url: sourceUrl,
    scrapedAt: new Date(),
  };

  const result = ProductSchema.safeParse(raw);
  if (!result.success) {
    console.error(`Validation failed for ${sourceUrl}:`, result.error.issues);
    return null;
  }

  return result.data;
}

Using Zod (or any schema validator) catches issues like empty strings from changed selectors, NaN prices, and garbage data — before it hits your database.

5. Storage: Don't Lose Data

Write results incrementally. If your scraper crashes 80% through a 10,000-page job, you don't want to restart from zero.

const fs = require('fs');
const { Transform } = require('stream');

class JSONLWriter {
  constructor(filepath) {
    this.stream = fs.createWriteStream(filepath, { flags: 'a' });
    this.count = 0;
  }

  write(record) {
    this.stream.write(JSON.stringify(record) + '\n');
    this.count++;
  }

  async close() {
    return new Promise(resolve => this.stream.end(resolve));
  }
}

// Track progress for resumability
class ProgressTracker {
  constructor(filepath) {
    this.filepath = filepath;
    this.completed = new Set();
    this._load();
  }

  _load() {
    try {
      const data = fs.readFileSync(this.filepath, 'utf8');
      data.split('\n').filter(Boolean).forEach(url => this.completed.add(url));
    } catch { /* file doesn't exist yet */ }
  }

  isDone(url) {
    return this.completed.has(url);
  }

  markDone(url) {
    this.completed.add(url);
    fs.appendFileSync(this.filepath, url + '\n');
  }
}

JSONL (one JSON object per line) is the best format for scraping output. It's append-friendly, streaming-friendly, and you can process partial files if things go wrong.

Complete Working Example

Here's everything wired together — a scraper that fetches product data with all the production patterns we've covered:

const cheerio = require('cheerio');

async function scrapeProducts(urls) {
  const rateLimiter = new RateLimiter({ requestsPerSecond: 0.5 });
  const writer = new JSONLWriter('./products.jsonl');
  const progress = new ProgressTracker('./progress.log');

  // Optional: const proxyPool = new ProxyPool(['http://proxy1:8080', ...]);

  const concurrency = 3;
  let index = 0;

  const worker = async () => {
    while (index < urls.length) {
      const url = urls[index++];
      if (progress.isDone(url)) continue;

      try {
        const response = await fetchWithRetry(url, {
          rateLimiter,
          // proxyPool,
          maxRetries: 3,
        });

        const html = await response.text();
        const product = parseProduct(html, url);

        if (product) {
          writer.write(product);
        }

        progress.markDone(url);
      } catch (err) {
        console.error(`Skipping ${url}: ${err.message}`);
        // Write failed URLs separately for manual review
        fs.appendFileSync('./failed.log', `${url}\t${err.message}\n`);
      }
    }
  };

  // Run workers in parallel
  await Promise.all(Array.from({ length: concurrency }, worker));
  await writer.close();

  console.log(`Done. ${writer.count} products saved.`);
}

What I'd Add Next

For a real production system, you'd also want:

Structured logging (pino or winston) — console.log doesn't cut it when you're debugging why 3% of requests failed at 2 AM
Metrics — track success rates, response times, proxy health over time
Queue-based architecture — BullMQ or similar, so you can distribute work across machines
Headless browser fallback — some pages need JavaScript rendering; use Playwright as a fallback when cheerio gets empty results
Circuit breaker — if a site starts returning 90% errors, stop hitting it entirely instead of burning through retries

Key Takeaways

Match real browser fingerprints — header order and values matter more than just the User-Agent string
Add jitter to everything — regular patterns are the easiest thing for anti-bot systems to detect
Respect rate limits and robots.txt — it's both ethical and practical
Validate everything — scraped data is untrusted input
Design for failure — save progress incrementally, retry with backoff, log failures for review
Start slow — 1 request per second with 3 retries will get you further than 50 requests per second with immediate bans

The difference between a toy scraper and a production one isn't the parsing logic — it's all the infrastructure around it that handles the messy reality of the web.

Built with Node.js 20+. All examples use native fetch (available since Node 18). For proxy support, add undici for the ProxyAgent dispatcher.

DEV Community