The Anti-Bot Detection Checklist I Use Before Every Scraping Project
Every scraping project I take on starts with this checklist. Not because I'm paranoid — but because I've learned the hard way that production scrapers fail silently. They return 200 OK with garbage data, or they get rate-limited so gradually you don't notice for days.
This is the systematic approach I've refined over 50+ scraping projects.
Pre-Scraping: Know Your Target
1. Identify the CDN and Protection Stack
Before writing a single line of code, check what you're up against:
# Check CDN and headers
curl -I https://target-site.com
# Look for these common protection headers:
# X-Engine: akamai-html-protection
# X-Served-By: DataDome
# cf-ray: Cloudflare
# X-Bot-Status: blocked
Common protection platforms:
- Cloudflare → Look for cf-ray headers and __cf_bm or cf_clearance cookies (the older __cfduid cookie was retired in 2021)
- DataDome → Look for datadome in headers or scripts
- PerimeterX → Look for _pxff cookies
- Akamai → Look for akamai-html-protection headers
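The signals in this list can be folded into a small lookup so the check is repeatable across targets. This is a minimal sketch, not an authoritative fingerprint database: `identifyProtection` and the `SIGNATURES` table are hypothetical names, the table only mirrors the checklist, and the `__cf_bm` and `cf_clearance` cookie names are my assumption about current Cloudflare deployments.

```javascript
// Hypothetical signature table that mirrors the checklist above (not exhaustive).
// `header` is matched as a substring of any header name or value.
const SIGNATURES = [
  { vendor: 'cloudflare', header: 'cf-ray', cookiePrefixes: ['__cf_bm', 'cf_clearance', '__cfduid'] },
  { vendor: 'datadome', header: 'datadome', cookiePrefixes: ['datadome'] },
  { vendor: 'perimeterx', header: null, cookiePrefixes: ['_pxff'] },
  { vendor: 'akamai', header: 'akamai-html-protection', cookiePrefixes: [] },
];

// headers: plain object of header name -> value; cookieNames: array of cookie names.
// Returns the list of vendors whose signals were seen.
function identifyProtection(headers, cookieNames = []) {
  // Flatten headers into one lowercase string so a needle can hit a name or a value.
  const haystack = Object.entries(headers)
    .map(([name, value]) => `${name}: ${value}`)
    .join('\n')
    .toLowerCase();
  const cookies = cookieNames.map((c) => c.toLowerCase());

  return SIGNATURES.filter(({ header, cookiePrefixes }) =>
    (header !== null && haystack.includes(header)) ||
    cookies.some((c) => cookiePrefixes.some((p) => c.startsWith(p)))
  ).map((s) => s.vendor);
}
```

With `fetch`, `Object.fromEntries(response.headers)` gives a headers object in the shape this function expects.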
2. Check Robots.txt Respectfully
curl https://target-site.com/robots.txt | grep -v "^#"
Don't treat it as gospel, but it's a good signal. If the site explicitly disallows your use case, that's a red flag.
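If you would rather check a specific path in code than read the file by eye, a rough version might look like the following. It is deliberately simplified and is an assumption-laden sketch: `isDisallowed` is a hypothetical helper that only reads `User-agent: *` groups and prefix-matches `Disallow` rules, ignoring `Allow`, wildcards, and per-agent groups that a full parser would handle.

```javascript
// Simplified robots.txt check (hypothetical helper, not a full parser).
// Only considers groups that apply to "*" and treats Disallow values as path prefixes.
function isDisallowed(robotsTxt, path) {
  let inStarGroup = false;
  let prevWasUserAgent = false;
  const disallowed = [];

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue; // blank or malformed line

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines belong to the same group.
      inStarGroup = prevWasUserAgent ? inStarGroup || value === '*' : value === '*';
      prevWasUserAgent = true;
    } else {
      // An empty Disallow value means "nothing is disallowed", so skip it.
      if (field === 'disallow' && inStarGroup && value) disallowed.push(value);
      prevWasUserAgent = false;
    }
  }
  return disallowed.some((prefix) => path.startsWith(prefix));
}
```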
3. Map the Site's JavaScript Rendering
Some sites are fully static (fast, easy). Others render everything with JavaScript (need Playwright/Puppeteer). Check:
// Quick heuristic: look for framework markers in the raw (unrendered) HTML.
// A match suggests client-side rendering; the pattern is loose, so confirm by
// comparing against the page as rendered in a browser.
// Requires Node 18+ for the global fetch.
async function needsJsRendering(url) {
  const html = await fetch(url).then(r => r.text());
  return /ng-app|vue|react|__NEXT_DATA__/i.test(html);
}
needsJsRendering('https://target.com')
  .then(needsJs => console.log('Needs JS rendering:', needsJs));
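A complementary signal is how much visible text the raw HTML actually contains: a page that ships an empty root element and a pile of scripts has almost none. The sketch below is one way to measure that, under stated assumptions: `visibleTextLength` and `likelyNeedsJsRendering` are hypothetical helpers, the regex-based tag stripping is approximate, and the 200-character threshold is an arbitrary starting point to tune per site.

```javascript
// Rough estimate of visible text: drop scripts, styles and tags, collapse whitespace.
// Regex stripping is approximate and only meant for a quick triage.
function visibleTextLength(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ')
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim().length;
}

// Flags pages whose raw HTML carries very little text (threshold is an assumption).
function likelyNeedsJsRendering(html, minVisibleChars = 200) {
  return visibleTextLength(html) < minVisibleChars;
}
```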
Code-Time: Defensive Patterns
4. Rotate User Agents
const USER_AGENTS = [
  // Full, well-formed strings: truncated or mixed-browser UAs are themselves a bot signal
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
  'Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0',
  // Add 10-15 more realistic user agents
];
function randomUA() {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}
Never use a single UA string. Rotate through 10+ realistic ones.
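One possible refinement, assuming you group requests into sessions (for example when using sticky proxies): keep the UA stable within a session and vary it across sessions, so a single "visitor" does not appear to switch browsers between page loads. The sketch below picks a UA deterministically from a session key; `uaForSession` and the simple string hash are hypothetical and not part of the checklist code.

```javascript
// Hypothetical helper: maps a session key to one entry of the UA list, deterministically.
// The same key always yields the same UA; different keys spread across the list.
function uaForSession(sessionKey, userAgents) {
  let hash = 0;
  for (const ch of sessionKey) {
    // Simple 32-bit rolling hash; good enough for spreading keys, not for security.
    hash = (hash * 31 + ch.codePointAt(0)) >>> 0;
  }
  return userAgents[hash % userAgents.length];
}
```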
5. Respect Retry-After Headers
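// Promise-based delay helper used throughout this checklist
const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

async function scrapeWithRetry(url, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    const response = await fetch(url, {
      headers: { 'User-Agent': randomUA() }
    });
    if (response.status === 429) {
      // Retry-After arrives as a string; fall back to 60s if it is missing or not a number
      const retryAfter = Number(response.headers.get('Retry-After')) || 60;
      console.log(`Rate limited. Waiting ${retryAfter}s...`);
      await sleep(retryAfter * 1000);
      continue;
    }
    return response;
  }
  throw new Error('Max retries exceeded');
}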
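Two details are worth handling if you extend this: `Retry-After` may be either a number of seconds or an HTTP date, and when the header is absent a growing, randomized wait tends to behave better than a fixed one. The following is a sketch under those assumptions; `parseRetryAfter`, `backoffDelay`, and `waitTimeMs` are hypothetical helpers, and the base and cap values are arbitrary defaults.

```javascript
// Retry-After is either delay-seconds ("120") or an HTTP-date.
// Returns milliseconds to wait, or null if the value is missing or unparseable.
function parseRetryAfter(value, now = Date.now()) {
  if (value == null || value === '') return null;
  if (/^\d+$/.test(value.trim())) return Number(value) * 1000;
  const date = Date.parse(value);
  return Number.isNaN(date) ? null : Math.max(0, date - now);
}

// Exponential backoff with full jitter: a random wait in [0, min(cap, base * 2^attempt)).
// `random` is injectable so the function can be tested deterministically.
function backoffDelay(attempt, { baseMs = 1000, capMs = 60000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Prefer the server's instruction; fall back to backoff when there is none.
function waitTimeMs(retryAfterHeader, attempt) {
  return parseRetryAfter(retryAfterHeader) ?? backoffDelay(attempt);
}
```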
6. Detect Block Patterns Early
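// Takes the response body as a string (e.g. from await response.text()),
// or an object that carries the body in an html field (an assumption about
// what your scrape() helper returns).
function detectBlock(input) {
  const body = typeof input === 'string' ? input : String(input?.html ?? '');
  const html = body.toLowerCase(); // block pages vary in capitalization
  // Common block signals
  if (html.includes('access denied')) return 'aws-waf';
  if (html.includes('captcha')) return 'captcha';
  if (html.includes('please enable cookies')) return 'cloudflare';
  if (html.length < 1000 && html.includes('checking your browser')) return 'cloudflare-js';
  if (html.includes('datadome')) return 'datadome';
  return null;
}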
7. Add Random Delays Between Requests
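// Returns a random wait in milliseconds, between min and max
function randomDelay(min = 2000, max = 7000) {
  return Math.floor(Math.random() * (max - min) + min);
}
// sleep(ms) is a promise-based delay, e.g. ms => new Promise(r => setTimeout(r, ms));
// scrape(url) is your own fetch-and-parse function
async function scrapeWithDelay(url) {
  await sleep(randomDelay());
  return scrape(url);
}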
Infrastructure: Proxy Rotation
8. Use Residential Proxies (Not Datacenter)
This is the most impactful single change you can make:
| Proxy Type | Block Rate | Cost | Speed |
|---|---|---|---|
| Datacenter | 70-90% on protected sites | Cheap | Fast |
| Rotating Residential | 5-15% on protected sites | $$ | Medium |
| ISP Static | <5% | $$$ | Fast |
For anything beyond hobby projects, residential proxy rotation is worth the cost. With XCrawl's residential network:
// One line change — everything else stays the same
const xcrawl = new XCrawlScraper({ apiKey: process.env.XCRAWL_API_KEY });
// No more managing proxy lists, rotations, or bans
9. Sticky Sessions for Batching
When scraping a single site multiple times, use sticky sessions so you appear as the same user:
// XCrawl handles this automatically
const result = await xcrawl.scrape('https://site.com/page', {
stickySession: true // Same proxy for 2 minutes
});
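If you manage your own proxy list instead, the same idea can be approximated in-process: pin each session key to one proxy until a time-to-live expires, then rotate. This is a minimal sketch with hypothetical names (`StickyProxyPool`, `proxyFor`); the 2-minute default mirrors the comment above, and the proxy identifiers in the usage are placeholders, not real endpoints.

```javascript
// Hypothetical in-process sticky pool: each session key keeps one proxy until its TTL expires.
class StickyProxyPool {
  constructor(proxies, { ttlMs = 120000, now = Date.now } = {}) {
    if (proxies.length === 0) throw new Error('At least one proxy is required');
    this.proxies = proxies;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful for tests
    this.sessions = new Map(); // sessionKey -> { proxy, expiresAt }
    this.cursor = 0;
  }

  proxyFor(sessionKey) {
    const current = this.sessions.get(sessionKey);
    if (current && current.expiresAt > this.now()) return current.proxy;
    // Unseen or expired session: assign the next proxy round-robin.
    const proxy = this.proxies[this.cursor % this.proxies.length];
    this.cursor += 1;
    this.sessions.set(sessionKey, { proxy, expiresAt: this.now() + this.ttlMs });
    return proxy;
  }
}

// Usage with placeholder proxy names:
const pool = new StickyProxyPool(['proxy-a', 'proxy-b', 'proxy-c']);
const proxyForSite = pool.proxyFor('site.com'); // same value for the next 2 minutes
```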
Validation: Before Going Live
10. Validate Data Quality
Never assume a 200 response means good data:
function validateData(data) {
  const required = ['title', 'price', 'url'];
  // Treat only absent or empty values as missing, so a legitimate price of 0 passes
  const missing = required.filter(f => data[f] === undefined || data[f] === null || data[f] === '');
  if (missing.length > 0) {
    console.warn('Missing fields:', missing.join(', '));
    return false;
  }
  if (typeof data.price !== 'number' || Number.isNaN(data.price)) {
    console.warn('Invalid price type');
    return false;
  }
  return true;
}
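A single bad record is usually noise; a jump in the share of bad records across a batch is the stronger sign that you are being served garbage. The sketch below summarizes a batch that way. It is self-contained and hypothetical: `isComplete` is a stripped-down stand-in for the validator above, and the 10% threshold is an assumption to adjust for your data.

```javascript
const REQUIRED_FIELDS = ['title', 'price', 'url'];

// Stand-in validator: a record is complete when every required field has a value.
function isComplete(record) {
  return REQUIRED_FIELDS.every(
    (f) => record[f] !== undefined && record[f] !== null && record[f] !== ''
  );
}

// Summarizes a batch; an empty batch counts as unhealthy because it yielded nothing.
function summarizeBatch(records, { maxInvalidRatio = 0.1 } = {}) {
  const invalid = records.filter((r) => !isComplete(r)).length;
  const ratio = records.length === 0 ? 1 : invalid / records.length;
  return { total: records.length, invalid, ratio, healthy: ratio <= maxInvalidRatio };
}
```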
11. Health Check Monitoring
Set up automated health checks that alert you when your scraper starts returning garbage:
// Run this every hour
async function healthCheck() {
const testUrl = 'https://target-site.com/product-page';
const result = await scrape(testUrl);
const blockType = detectBlock(result);
if (blockType) {
sendAlert(`Scraper blocked by ${blockType}!`);
return false;
}
if (!validateData(result.parsed)) {
sendAlert('Scraper returning invalid data!');
return false;
}
return true;
}
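If hourly checks produce noisy alerts from one-off network hiccups, one optional tweak is to alert only after several consecutive failures. This is a sketch of that idea, not part of the health check above: `FailureStreak` is a hypothetical helper, and the threshold of 3 is an arbitrary default.

```javascript
// Hypothetical helper: tracks consecutive failed health checks.
class FailureStreak {
  constructor(threshold = 3) {
    this.threshold = threshold;
    this.count = 0;
  }

  // Record one outcome. Returns true only at the moment the streak reaches
  // the threshold, so a long outage triggers one alert instead of one per check.
  record(ok) {
    if (ok) {
      this.count = 0;
      return false;
    }
    this.count += 1;
    return this.count === this.threshold;
  }
}
```

In use, you would call `streak.record(await healthCheck())` on each run and send the alert only when it returns `true`.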
12. Always Store Raw HTML
This is the most overlooked step. Store every response as raw HTML before parsing:
async function scrapeAndStore(url) {
const response = await fetch(url);
const raw = await response.text();
// Store raw for debugging
await db.rawResponses.insert({
url,
raw_html: raw,
timestamp: new Date(),
status: response.status
});
// Then parse
const parsed = parseHTML(raw);
return parsed;
}
When your parser breaks (and it will), you'll thank yourself for the raw data.
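Raw pages add up quickly, so one option is to compress each body before storing it and keep a content hash alongside; identical hashes across different URLs can hint that you are being served the same block page. A minimal sketch using Node's built-in `zlib` and `crypto` modules follows. `packRawHtml`, `unpackRawHtml`, and the record shape are hypothetical and independent of whatever database you use.

```javascript
const zlib = require('node:zlib');
const crypto = require('node:crypto');

// Builds a storable record: gzip-compressed body plus a SHA-256 of the original text.
function packRawHtml(url, rawHtml, status) {
  return {
    url,
    status,
    fetchedAt: new Date().toISOString(),
    sha256: crypto.createHash('sha256').update(rawHtml, 'utf8').digest('hex'),
    body: zlib.gzipSync(Buffer.from(rawHtml, 'utf8')),
  };
}

// Restores the original HTML string from a stored record.
function unpackRawHtml(record) {
  return zlib.gunzipSync(record.body).toString('utf8');
}
```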
The Full Picture
A production-ready scraper isn't just code — it's a system:
Monitoring → Alerting → Health Checks → Data Validation → Backup Parser
↑ ↑ ↑ ↑
Residential Proxies ──────── Sticky Sessions ──── Error Handling
Quick Wins
If you only implement three things from this list:
- Residential proxies (biggest win)
- Block detection (prevents silent failures)
- Store raw HTML (enables debugging)
Everything else is incremental improvement.
Questions about specific anti-bot systems? I've dealt with all of them — drop a comment.