Charles

Posted on Jun 8

The Anti-Bot Detection Checklist I Use Before Every Scraping Project

#scraping #bots #tutorial #node

Every scraping project I take on starts with this checklist. Not because I'm paranoid — but because I've learned the hard way that production scrapers fail silently. They return 200 OK with garbage data, or they get rate-limited so gradually you don't notice for days.

This is the systematic approach I've refined over 50+ scraping projects.

Pre-Scraping: Know Your Target

1. Identify the CDN and Protection Stack

Before writing a single line of code, check what you're up against:

# Check CDN and headers
curl -I https://target-site.com

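# Signals worth looking for in the response headers (typical, not guaranteed):
# cf-ray, server: cloudflare          → Cloudflare
# x-datadome, set-cookie: datadome=   → DataDome
# set-cookie: _abck= or bm_sz=        → Akamai Bot Manager
# set-cookie: _pxhd=                  → PerimeterX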

Common protection platforms:

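Cloudflare → Look for the cf-ray header and __cf_bm or cf_clearance cookies (the older __cfduid cookie has been retired)
DataDome → Look for a datadome cookie, x-datadome headers, or datadome in script URLs
PerimeterX → Look for _px-prefixed cookies such as _px3, _pxhd, or _pxff_*
Akamai → Look for _abck and bm_sz cookies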
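
The lookup above can be turned into a small helper. This is a minimal sketch, not an official mapping: the `SIGNATURES` table and the `identifyProtection` name are illustrative, and the marker names are assumptions to verify against a live response from your target.

```javascript
// Heuristic markers per vendor. Assumed, not exhaustive: verify on your target.
const SIGNATURES = {
  cloudflare: { headers: ['cf-ray'], cookies: ['__cf_bm', 'cf_clearance'] },
  datadome: { headers: ['x-datadome'], cookies: ['datadome'] },
  perimeterx: { headers: [], cookies: ['_px3', '_pxvid', '_pxhd'] },
  akamai: { headers: [], cookies: ['_abck', 'bm_sz'] },
};

// headers: plain object of response headers (any casing)
// setCookies: array of raw Set-Cookie header values
function identifyProtection(headers, setCookies = []) {
  const headerNames = Object.keys(headers).map(h => h.toLowerCase());
  const cookieNames = setCookies.map(c => c.split('=')[0].trim());
  return Object.entries(SIGNATURES)
    .filter(([, sig]) =>
      sig.headers.some(h => headerNames.includes(h)) ||
      sig.cookies.some(c => cookieNames.includes(c)))
    .map(([vendor]) => vendor);
}

console.log(identifyProtection(
  { 'CF-RAY': '8a1b2c3d4e5f-AMS', Server: 'cloudflare' },
  ['__cf_bm=abc123; path=/; HttpOnly']
)); // [ 'cloudflare' ]
```

An empty result only means none of these markers showed up on that one response. Some protection stacks reveal themselves only after a few requests or on specific pages.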

2. Check Robots.txt Respectfully

curl https://target-site.com/robots.txt | grep -v "^#"

Don't take this as gospel, but it's a good signal. If they explicitly disallow your use case, that's a red flag.
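
To check a specific path programmatically, here is a simplified sketch. The `isDisallowed` helper is hypothetical and deliberately incomplete: it reads only the `User-agent: *` group and treats each `Disallow` value as a plain prefix, ignoring wildcards and `Allow` overrides that a full parser would honor.

```javascript
// Simplified robots.txt check: "User-agent: *" group only, prefix matching only.
function isDisallowed(robotsTxt, path) {
  let appliesToAll = false;
  let lastWasAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    if (!line) continue;

    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules
      appliesToAll = (lastWasAgent && appliesToAll) || value === '*';
      lastWasAgent = true;
      continue;
    }
    lastWasAgent = false;

    if (field === 'disallow' && appliesToAll && value !== '' && path.startsWith(value)) {
      return true;
    }
  }
  return false;
}

const sample = `
User-agent: *
Disallow: /search
Disallow: /cart/   # checkout flow
`;
console.log(isDisallowed(sample, '/search?q=shoes')); // true
console.log(isDisallowed(sample, '/products/42'));    // false
```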

3. Map the Site's JavaScript Rendering

Some sites are fully static (fast, easy). Others render everything with JavaScript (need Playwright/Puppeteer). Check:

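// Quick check: look for framework markers in the raw (unrendered) HTML.
// Heuristic only: a match means a JS framework is present, which usually
// (not always) means you need a headless browser. Requires Node 18+ for fetch.

async function needsJsRendering(url) {
  const html = await fetch(url).then(r => r.text());
  return /ng-app|<app-root|data-reactroot|__NEXT_DATA__|__NUXT__|id=["'](?:root|app|__next)["']/i.test(html);
}

needsJsRendering('https://target.com')
  .then(needs => console.log('Needs JS rendering:', needs));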
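
A marker check can give false positives on server-rendered pages, so a second option is to compare how much visible text the raw HTML carries against the rendered DOM. The sketch below assumes you already have both strings (the rendered one would come from a headless browser such as Playwright or Puppeteer). The `visibleText` and `needsRendering` names and the 0.5 threshold are illustrative choices, not measured values.

```javascript
// Strip scripts, styles, and tags to approximate the visible text.
function visibleText(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ')
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// True when the raw HTML carries much less text than the rendered DOM.
function needsRendering(rawHtml, renderedHtml, threshold = 0.5) {
  const rawLen = visibleText(rawHtml).length;
  const renderedLen = visibleText(renderedHtml).length;
  if (renderedLen === 0) return false; // nothing rendered, nothing to gain
  return rawLen / renderedLen < threshold;
}

const raw = '<html><body><div id="root"></div><script>boot()</script></body></html>';
const rendered = '<html><body><div id="root"><h1>Widget</h1><p>Price: $10</p></div></body></html>';
console.log(needsRendering(raw, rendered)); // true
```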

Code-Time: Defensive Patterns

4. Rotate User Agents

const USER_AGENTS = [
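  // Chrome on macOS
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  // Edge on Windows (the token is "Edg", not "Edge")
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
  // Firefox on Linux
  'Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0',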
  // Add 10-15 more realistic user agents
];

function randomUA() {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

Never use a single UA string. Rotate through 10+ realistic ones.

5. Respect Retry-After Headers

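// Promise-based timer used by the snippets in this post
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function scrapeWithRetry(url, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    const response = await fetch(url, {
      headers: { 'User-Agent': randomUA() }
    });

    if (response.status === 429) {
      // Retry-After is usually a number of seconds. Fall back to 60s when it
      // is missing or not numeric (the header can also be an HTTP date).
      const retryAfter = Number(response.headers.get('Retry-After')) || 60;
      console.log(`Rate limited. Waiting ${retryAfter}s...`);
      await sleep(retryAfter * 1000);
      continue;
    }

    return response;
  }
  throw new Error('Max retries exceeded');
}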
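
The Retry-After header comes in two forms: a delay in seconds, or an HTTP date. Here is a minimal sketch of a parser that handles both. The `retryAfterMs` name, the injectable `now` argument, and the 60-second fallback are assumptions made for illustration.

```javascript
// Retry-After is either delay-seconds ("120") or an HTTP date
// ("Wed, 21 Oct 2015 07:28:00 GMT"). Returns a wait time in milliseconds.
function retryAfterMs(headerValue, now = Date.now(), fallbackMs = 60_000) {
  if (headerValue == null || headerValue.trim() === '') return fallbackMs;

  const trimmed = headerValue.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;

  const when = Date.parse(trimmed);
  if (Number.isNaN(when)) return fallbackMs; // unparseable: use the fallback
  return Math.max(0, when - now);            // date already passed: no wait
}

console.log(retryAfterMs('120')); // 120000
```

Passing `now` explicitly makes the date branch easy to unit test without mocking the clock.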

6. Detect Block Patterns Early

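// Takes the response body as a string (e.g. await response.text())
function detectBlock(html) {
  const lower = html.toLowerCase(); // block pages vary in capitalization

  // Common block signals (heuristics; tune them per target)
  if (lower.includes('access denied')) return 'access-denied'; // generic WAF/CDN deny page
  if (lower.includes('captcha-delivery.com')) return 'datadome'; // DataDome challenge host
  if (lower.includes('captcha')) return 'captcha';
  if (lower.includes('please enable cookies')) return 'cloudflare';
  if (lower.length < 1000 && lower.includes('checking your browser')) return 'cloudflare-js';

  return null;
}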

7. Add Random Delays Between Requests

function randomDelay(min = 2000, max = 7000) {
  return Math.floor(Math.random() * (max - min) + min);
}

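// sleep(ms) is a promise-based timer, e.g. ms => new Promise(r => setTimeout(r, ms))
// scrape(url) stands for your own fetch-and-parse function
async function scrapeWithDelay(url) {
  await sleep(randomDelay());
  return scrape(url);
}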

Infrastructure: Proxy Rotation

8. Use Residential Proxies (Not Datacenter)

This is the most impactful single change you can make:

Proxy Type            Block Rate                  Cost   Speed
Datacenter            70-90% on protected sites   Cheap  Fast
Rotating Residential  5-15% on protected sites    $$     Medium
ISP Static            <5%                         $$$    Fast

For anything beyond hobby projects, residential proxy rotation is worth the cost. With XCrawl's residential network:

// One line change — everything else stays the same
const xcrawl = new XCrawlScraper({ apiKey: process.env.XCRAWL_API_KEY });
// No more managing proxy lists, rotations, or bans
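
For comparison, this is roughly what managing a proxy list yourself involves. It is a minimal sketch of round-robin rotation with a cooldown for banned proxies. The `ProxyPool` class, the 10-minute cooldown, and the proxy URLs are all hypothetical.

```javascript
class ProxyPool {
  constructor(proxies, cooldownMs = 10 * 60 * 1000) {
    this.proxies = proxies;
    this.cooldownMs = cooldownMs;
    this.bannedUntil = new Map(); // proxy -> timestamp when it may be reused
    this.cursor = 0;
  }

  // Round-robin over proxies that are not cooling down.
  next(now = Date.now()) {
    for (let i = 0; i < this.proxies.length; i++) {
      const proxy = this.proxies[this.cursor];
      this.cursor = (this.cursor + 1) % this.proxies.length;
      if ((this.bannedUntil.get(proxy) ?? 0) <= now) return proxy;
    }
    throw new Error('All proxies are cooling down');
  }

  // Call this when a proxy gets blocked so it is skipped for a while.
  markBanned(proxy, now = Date.now()) {
    this.bannedUntil.set(proxy, now + this.cooldownMs);
  }
}

const pool = new ProxyPool(['http://proxy-a.example:8080', 'http://proxy-b.example:8080']);
const first = pool.next();
pool.markBanned(first);
console.log(pool.next()); // http://proxy-b.example:8080
```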

9. Sticky Sessions for Batching

When you make several requests to the same site in one batch (pagination, multi-step flows), use sticky sessions so they all leave from the same IP and you appear as the same user:

// XCrawl handles this automatically
const result = await xcrawl.scrape('https://site.com/page', {
  stickySession: true // Same proxy for 2 minutes
});
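
Under the hood, a sticky session is a mapping from a session key to one proxy for a limited time. The sketch below illustrates the idea and is not XCrawl's implementation: the `StickySessions` class and the `pickProxy` callback are hypothetical, and the 2-minute TTL mirrors the comment above.

```javascript
// Maps a session key (for example a hostname) to one proxy for ttlMs.
class StickySessions {
  constructor(pickProxy, ttlMs = 2 * 60 * 1000) {
    this.pickProxy = pickProxy; // function returning a proxy URL
    this.ttlMs = ttlMs;
    this.sessions = new Map();  // key -> { proxy, expiresAt }
  }

  proxyFor(key, now = Date.now()) {
    const existing = this.sessions.get(key);
    if (existing && existing.expiresAt > now) return existing.proxy;

    // No live session for this key: pick a fresh proxy and remember it
    const proxy = this.pickProxy();
    this.sessions.set(key, { proxy, expiresAt: now + this.ttlMs });
    return proxy;
  }
}

let counter = 0;
const sticky = new StickySessions(() => `http://proxy-${++counter}.example:8080`);
const a = sticky.proxyFor('site.com');
const b = sticky.proxyFor('site.com');
console.log(a === b); // true
```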

Validation: Before Going Live

10. Validate Data Quality

Never assume a 200 response means good data:

function validateData(data) {
  const required = ['title', 'price', 'url'];
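  // Check for null/undefined/empty explicitly so a legitimate value of 0 isn't treated as missing
  const missing = required.filter(f => data[f] == null || data[f] === '');

  if (missing.length > 0) {
    console.warn('Missing fields:', missing.join(', '));
    return false;
  }

  // Price should already be parsed into a finite number, not a string like "$19.99"
  if (typeof data.price !== 'number' || !Number.isFinite(data.price)) {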
    console.warn('Invalid price type');
    return false;
  }

  return true;
}
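
Since validateData expects price to be a number, scraped price text has to be converted first. Here is a minimal sketch of that step. The `parsePrice` helper is hypothetical and assumes "1,299.99"-style formatting (comma for thousands, dot for decimals).

```javascript
// Converts scraped price text to a number, or null when it can't.
// European formats such as "1.299,99" need different handling.
function parsePrice(text) {
  if (typeof text !== 'string') return null;
  const cleaned = text.replace(/[^0-9.\-]/g, ''); // keep digits, dot, minus
  if (cleaned === '') return null;
  const value = Number(cleaned);
  return Number.isFinite(value) ? value : null;
}

console.log(parsePrice('$1,299.99'));    // 1299.99
console.log(parsePrice('Out of stock')); // null
```

Returning null rather than 0 for unparseable text keeps "no price found" distinct from a real price of zero.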

11. Health Check Monitoring

Set up automated health checks that alert you when your scraper starts returning garbage:

// Run this every hour
async function healthCheck() {
  const testUrl = 'https://target-site.com/product-page';
  // Assumes scrape() resolves to { html, parsed }: the raw body plus the parsed fields
  const result = await scrape(testUrl);

  const blockType = detectBlock(result.html);
  if (blockType) {
    sendAlert(`Scraper blocked by ${blockType}!`);
    return false;
  }

  if (!validateData(result.parsed)) {
    sendAlert('Scraper returning invalid data!');
    return false;
  }

  return true;
}

12. Always Store Raw HTML

This is the most overlooked step. Store every response as raw HTML before parsing:

async function scrapeAndStore(url) {
  const response = await fetch(url);
  const raw = await response.text();

  // Store raw for debugging
  await db.rawResponses.insert({
    url,
    raw_html: raw,
    timestamp: new Date(),
    status: response.status
  });

  // Then parse
  const parsed = parseHTML(raw);
  return parsed;
}

When your parser breaks (and it will), you'll thank yourself for the raw data.
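
If you don't have a database handy, the same idea works with files on disk. This is a minimal sketch using only Node built-ins. The `storeRaw` and `loadRaw` helpers and the file naming scheme are assumptions, not part of the setup above.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');
const zlib = require('node:zlib');
const crypto = require('node:crypto');

// Writes gzip-compressed HTML to <dir>/<sha1 of url>-<timestamp>.html.gz
function storeRaw(dir, url, html, timestamp = Date.now()) {
  fs.mkdirSync(dir, { recursive: true });
  const key = crypto.createHash('sha1').update(url).digest('hex');
  const file = path.join(dir, `${key}-${timestamp}.html.gz`);
  fs.writeFileSync(file, zlib.gzipSync(html));
  return file;
}

// Reads a stored response back as a string, ready to re-run through a parser.
function loadRaw(file) {
  return zlib.gunzipSync(fs.readFileSync(file)).toString('utf8');
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'raw-html-'));
const file = storeRaw(dir, 'https://target-site.com/product-page', '<html>...</html>');
console.log(loadRaw(file)); // <html>...</html>
```

Hashing the URL keeps file names safe and fixed-length, and the timestamp suffix lets you keep several snapshots of the same page.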

The Full Picture

A production-ready scraper isn't just code — it's a system:

Monitoring → Alerting → Health Checks → Data Validation → Backup Parser
     ↑           ↑              ↑                ↑
  Residential Proxies ──────── Sticky Sessions ──── Error Handling

Quick Wins

If you only implement three things from this list:

Residential proxies (biggest win)
Block detection (prevents silent failures)
Store raw HTML (enables debugging)

Everything else is incremental improvement.

Questions about specific anti-bot systems? I've dealt with all of them — drop a comment.

DEV Community
