Getting blocked is the #1 frustration in web scraping. Here's how to avoid it.
Rule 1: Always Set User-Agent
```javascript
const headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
};
```
Without a User-Agent, many sites return 403 immediately.
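A lone User-Agent can still stand out, because real browsers send several other headers with every request. A minimal sketch of a fuller, browser-like header set (the values are illustrative; copy them from a real browser via DevTools):

```javascript
// Illustrative browser-like headers; inspect a real browser's requests
// in DevTools and mirror what it actually sends.
const browserHeaders = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en-US,en;q=0.9',
  'Connection': 'keep-alive',
};
```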
Rule 2: Add Random Delays
```javascript
const delay = (ms) => new Promise(r => setTimeout(r, ms));

for (const url of urls) {
  const data = await scrape(url);
  await delay(1000 + Math.random() * 3000); // 1-4s random
}
```
Fixed delays get detected. Random delays look human.
Rule 3: Rotate User-Agents
```javascript
const agents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
];

const randomAgent = agents[Math.floor(Math.random() * agents.length)];
```
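Note that `randomAgent` is picked once, when the script starts, so every request still shares one identity. To rotate per request, wrap the choice in a small helper (a sketch; `nextAgent` is my name for it):

```javascript
const agents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
];

// Pick a fresh agent for every request instead of once at startup
function nextAgent() {
  return agents[Math.floor(Math.random() * agents.length)];
}

// Usage: fetch(url, { headers: { 'User-Agent': nextAgent() } })
```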
Rule 4: Handle Rate Limits Gracefully
```javascript
async function fetchWithRetry(url, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    const res = await fetch(url, { headers });
    if (res.status === 429) {
      const wait = Math.pow(2, i) * 5000; // exponential backoff
      await delay(wait);
      continue;
    }
    return res;
  }
  throw new Error('Max retries reached');
}
```
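Some servers also send a `Retry-After` header with the 429, telling you exactly how many seconds to wait. A sketch of a backoff calculator that prefers the server's value over the exponential guess (the function name is mine):

```javascript
// Returns how long to wait (ms) before retry number `attempt` (0-based).
// Prefers the server's Retry-After value (seconds) when it's a number.
function backoffMs(attempt, retryAfter) {
  const secs = Number(retryAfter);
  if (retryAfter != null && !Number.isNaN(secs)) return secs * 1000;
  return Math.pow(2, attempt) * 5000; // 5s, 10s, 20s, ...
}

// Inside fetchWithRetry:
//   const wait = backoffMs(i, res.headers.get('Retry-After'));
```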
Rule 5: Use APIs Instead
The best anti-bot strategy: don't scrape HTML at all. Use the site's JSON API.
Many sites serve their data as JSON from endpoints you can call directly: no blocks, no CAPTCHAs, no HTML parsing.
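The pattern: open DevTools' Network tab, find the XHR/fetch call that feeds the page, and hit that endpoint directly. A sketch, with a made-up endpoint and the fetch function injectable so the helper is easy to test:

```javascript
// Hypothetical endpoint; find the real one in DevTools' Network tab.
async function fetchProducts(page = 1, fetchFn = fetch) {
  const res = await fetchFn(`https://example.com/api/products?page=${page}`, {
    headers: { 'Accept': 'application/json' },
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json(); // structured data: no selectors, no brittle parsing
}
```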
When You Actually Need Proxies
- Scraping 10,000+ pages from one site
- Site uses IP-based rate limiting
- Need geographic diversity (prices vary by country)
For most small jobs (under 1,000 pages), delays plus user-agent rotation are enough.
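When proxies are warranted, rotate them per request. A sketch of a round-robin picker (the addresses are placeholders); how you attach the chosen proxy depends on your HTTP client, typically an agent or dispatcher option:

```javascript
// Placeholder addresses: swap in your provider's proxy endpoints.
const proxyPool = [
  'http://proxy-a.example.com:8080',
  'http://proxy-b.example.com:8080',
  'http://proxy-c.example.com:8080',
];

let proxyIndex = 0;

// Round-robin: spreads requests evenly instead of hammering one IP
function nextProxy() {
  return proxyPool[proxyIndex++ % proxyPool.length];
}
```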
More Resources
Getting blocked? I'll handle it. $20-50 depending on complexity. 77 production scrapers. Email: Spinov001@gmail.com | Hire me