The Detection Arms Race
You launch your Puppeteer script. It works perfectly in testing, then fails in production. The site knows you are a bot. But how?
Modern bot detection goes far beyond checking user agents. Let's dive into exactly how sites detect headless browsers and how to defend against each technique.
Detection Method 1: The WebDriver Flag
The simplest check. Browsers under automation expose navigator.webdriver === true, as required by the WebDriver specification:
// What sites check
if (navigator.webdriver) {
    // Block this visitor
}
Defense in Python with Playwright:
from playwright.sync_api import sync_playwright

def create_stealth_browser():
    p = sync_playwright().start()
    browser = p.chromium.launch(
        headless=True,
        args=["--disable-blink-features=AutomationControlled"]
    )
    context = browser.new_context()
    # Remove webdriver flag before any page script runs
    context.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        });
    """)
    return browser, context
Detection Method 2: Chrome DevTools Protocol
Sites look for artifacts left in the page by CDP-based automation tooling. ChromeDriver, for instance, long injected cdc_-prefixed variables into every page:

// Check for ChromeDriver artifacts
if (window.cdc_adoQpoasnfa76pfcZLmcfl_Array ||
    window.cdc_adoQpoasnfa76pfcZLmcfl_Promise) {
    // Automated browser detected
}
This is why undetected-chromedriver exists: it patches the cdc_ artifacts out of the chromedriver binary before launch:

import undetected_chromedriver as uc

driver = uc.Chrome(headless=True)
driver.get("https://nowsecure.nl")  # Bot detection test site
driver.quit()
Detection Method 3: Browser Fingerprinting
Sites build a fingerprint from dozens of browser properties:
// Canvas fingerprinting
const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');
ctx.textBaseline = 'top';
ctx.font = '14px Arial';
ctx.fillText('Hello', 2, 2);
const fingerprint = canvas.toDataURL();
// Headless browsers produce subtly different canvas renders
// WebGL fingerprinting (needs its own canvas: a canvas that already
// holds a '2d' context returns null for getContext('webgl'))
const glCanvas = document.createElement('canvas');
const gl = glCanvas.getContext('webgl');
const ext = gl.getExtension('WEBGL_debug_renderer_info');
const renderer = gl.getParameter(ext.UNMASKED_RENDERER_WEBGL);
// Headless Chrome (software rendering): "Google SwiftShader"
// Real Chrome: "ANGLE (Intel, ...)"
Defense:
# Inject realistic WebGL values
context.add_init_script("""
    const getParameter = WebGLRenderingContext.prototype.getParameter;
    WebGLRenderingContext.prototype.getParameter = function(parameter) {
        // 37445 = UNMASKED_VENDOR_WEBGL, 37446 = UNMASKED_RENDERER_WEBGL
        if (parameter === 37445) return 'Intel Inc.';
        if (parameter === 37446) return 'Intel Iris OpenGL Engine';
        return getParameter.call(this, parameter);
    };
""")
Detection Method 4: Missing Browser APIs
Headless browsers lack, or report inconsistently, certain APIs that real browsers have:

// Notification API check
if (!window.Notification) {
    // Probably headless
}

// Permission inconsistency check: headless Chrome reports
// Notification.permission as 'denied' while the Permissions API
// still answers 'prompt'; real browsers never disagree like this
navigator.permissions.query({name: 'notifications'}).then(perm => {
    if (Notification.permission === 'denied' && perm.state === 'prompt') {
        // Headless browser detected
    }
});

// Plugin check
if (navigator.plugins.length === 0) {
    // Headless browsers historically expose no plugins
}
Defense:
context.add_init_script("""
    // Fake a non-empty plugin list (passes a length check; deeper
    // inspection of the PluginArray internals would still fail)
    Object.defineProperty(navigator, 'plugins', {
        get: () => [1, 2, 3, 4, 5]
    });
    Object.defineProperty(navigator, 'languages', {
        get: () => ['en-US', 'en']
    });
""")
Detection Method 5: Behavioral Analysis
Advanced detection tracks how users interact:
// Mouse movement patterns
let movements = [];
document.addEventListener('mousemove', (e) => {
    movements.push({x: e.clientX, y: e.clientY, t: Date.now()});
});
// Bots move in straight lines or not at all;
// humans have natural curves and micro-movements

// Timing analysis
const loadTime = performance.timing.domContentLoadedEventEnd -
                 performance.timing.navigationStart;
// Bots often load pages unnaturally fast
Defense:
import random
import asyncio

async def human_like_interaction(page):
    # Random mouse movements
    for _ in range(random.randint(3, 7)):
        x = random.randint(100, 800)
        y = random.randint(100, 600)
        await page.mouse.move(x, y, steps=random.randint(10, 25))
        await asyncio.sleep(random.uniform(0.1, 0.5))
    # Scroll naturally
    for _ in range(random.randint(2, 5)):
        delta = random.randint(100, 400)
        await page.mouse.wheel(0, delta)
        await asyncio.sleep(random.uniform(0.5, 1.5))
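Playwright's steps parameter still interpolates the cursor along a straight line between points, which is exactly the pattern detectors flag. A curved path is closer to a human hand. A minimal sketch of one way to generate such a path, using a cubic Bezier curve with randomized control points and jitter (the offset ranges are arbitrary assumptions, not tuned values):

```python
import random

def bezier_mouse_path(x0, y0, x1, y1, steps=30):
    """Generate cursor points along a jittered cubic Bezier curve.

    The two control points are offset randomly from the straight line,
    so the path bows and wobbles the way a human-driven cursor does.
    """
    cx1 = x0 + (x1 - x0) * 0.3 + random.uniform(-80, 80)
    cy1 = y0 + (y1 - y0) * 0.3 + random.uniform(-80, 80)
    cx2 = x0 + (x1 - x0) * 0.7 + random.uniform(-80, 80)
    cy2 = y0 + (y1 - y0) * 0.7 + random.uniform(-80, 80)
    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        x = u**3 * x0 + 3 * u**2 * t * cx1 + 3 * u * t**2 * cx2 + t**3 * x1
        y = u**3 * y0 + 3 * u**2 * t * cy1 + 3 * u * t**2 * cy2 + t**3 * y1
        # Micro-jitter on intermediate points only; endpoints stay exact
        if 0 < i < steps:
            x += random.uniform(-1.5, 1.5)
            y += random.uniform(-1.5, 1.5)
        points.append((x, y))
    return points
```

Feed each point to page.mouse.move(x, y) with a short random sleep between calls, instead of a single move.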
Detection Method 6: TLS Fingerprinting
Your HTTP library's TLS handshake is unique. Python requests has a fingerprint that screams "bot":
# Standard requests - detectable TLS fingerprint
import requests
resp = requests.get(url) # JA3 hash identifies this as Python
# curl_cffi - mimics Chrome's TLS fingerprint
from curl_cffi import requests as cf_requests
resp = cf_requests.get(url, impersonate="chrome120") # Matches real Chrome
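JA3 itself is simple: the server extracts five fields from the TLS ClientHello (version, cipher suites, extensions, elliptic curves, point formats) and MD5-hashes them. A sketch of the hashing step, with illustrative field values rather than a real capture:

```python
import hashlib

def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """Compute a JA3 fingerprint: values dash-joined within a field,
    fields comma-joined, then MD5-hashed."""
    fields = [
        str(version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(c) for c in curves),
        "-".join(str(f) for f in point_formats),
    ]
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode()).hexdigest()

# Example values (illustrative, not from a real handshake)
print(ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0]))
```

Because every TLS stack offers its own cipher and extension lists, this hash clusters all default-configured requests clients together; curl_cffi evades it by replaying Chrome's exact field values.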
Detection Method 7: IP Reputation
Datacenter IPs are in public databases. Sites check instantly:
# Datacenter IP = instant block
# Residential IP = trusted
# Use residential proxies from ThorData
proxies = {"https": "http://user:pass@residential.thordata.com:9000"}
resp = requests.get(url, proxies=proxies)
ThorData provides residential proxies that pass IP reputation checks.
The Complete Stealth Stack
For maximum evasion, combine all defenses:
from playwright.async_api import async_playwright
import random
import asyncio

async def stealth_scrape(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
                "--disable-dev-shm-usage",
            ]
        )
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            locale="en-US",
            timezone_id="America/New_York",
            color_scheme="light",
        )
        # Apply all stealth patches
        await context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
            Object.defineProperty(navigator, 'plugins', {get: () => [1,2,3]});
            Object.defineProperty(navigator, 'languages', {get: () => ['en-US','en']});
            const origGetParameter = WebGLRenderingContext.prototype.getParameter;
            WebGLRenderingContext.prototype.getParameter = function(p) {
                if (p === 37445) return 'Intel Inc.';
                if (p === 37446) return 'Intel Iris OpenGL Engine';
                return origGetParameter.call(this, p);
            };
        """)
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle")
        # Simulate human behavior
        await human_like_interaction(page)
        content = await page.content()
        await browser.close()
        return content
Or Just Use an API
All of the above is a lot of work to maintain. ScraperAPI handles all detection bypass automatically — they keep up with the arms race so you do not have to.
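The integration surface is small: you send your target URL as a query parameter to the API endpoint and get rendered HTML back. A sketch of building such a request, following ScraperAPI's documented pattern at the time of writing (endpoint and parameter names are assumptions; verify against their current docs):

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://api.scraperapi.com"  # assumption: check current docs

def build_scraperapi_url(api_key, target_url, render=False):
    """Build a ScraperAPI request URL: the target URL and options
    travel as query parameters to the API endpoint."""
    params = {"api_key": api_key, "url": target_url}
    if render:
        params["render"] = "true"  # ask the service to run a headless browser
    return f"{API_ENDPOINT}?{urlencode(params)}"

print(build_scraperapi_url("YOUR_KEY", "https://example.com", render=True))
```

Fetching that URL with any HTTP client then delegates the whole stealth stack, proxies included, to the service.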
Monitor your detection rates with ScrapeOps to know when your evasion techniques stop working.
Conclusion
Bot detection is a multi-layered system. No single technique catches every bot, and no single evasion defeats every detector. The key is understanding what each layer checks and ensuring your scraper passes all of them. For production use, a managed proxy service is almost always more cost-effective than maintaining your own stealth infrastructure.