Vhub Systems
Browser Fingerprinting Explained: What Websites Know About Your Scraper (And How to Fix It)

When your Playwright script gets blocked after 10 requests, it's usually not your IP.

Modern anti-bot systems like Cloudflare, PerimeterX, and DataDome use browser fingerprinting to identify automated traffic. They collect 40-60 signals from your browser and score them against a model trained on millions of real user sessions.

Here's what they're actually checking.

Tier 1: The Instant Giveaways

These signals are binary — you either pass or fail:

navigator.webdriver

When Chrome is launched by WebDriver (Playwright, Selenium, Puppeteer), it sets navigator.webdriver = true. Every serious anti-bot system checks this first.

Fix: Patch it at the JS level before any page code runs:

await page.addInitScript(() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => false });
});

Note: This alone isn't enough — other signals will still expose you.

Headless Chrome Detection

Headless Chrome leaks its nature through:

  • navigator.plugins length = 0 (real Chrome exposes several built-in PDF plugin entries)
  • navigator.languages empty or missing
  • Missing Chrome-specific objects (window.chrome, window.chrome.runtime)
  • Implausible screen/window defaults (older headless builds reported window.outerWidth and outerHeight as 0)

Fix: Use playwright-stealth or puppeteer-extra-plugin-stealth to patch all of these.
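To make the detection side concrete, here is an illustrative sketch of the kind of scoring check an anti-bot script runs. The `env` object stands in for the real `window`/`navigator`, and the weights are assumptions for demonstration, not any vendor's actual rules:

```javascript
// Illustrative only: a toy version of the signal scoring anti-bots run.
// `env` stands in for window; weights are invented for demonstration.
function headlessScore(env) {
  let score = 0;
  if (env.navigator.webdriver === true) score += 3;            // instant giveaway
  if ((env.navigator.plugins || []).length === 0) score += 1;  // empty plugin list
  if (!env.navigator.languages || env.navigator.languages.length === 0) score += 1;
  if (!env.chrome || !env.chrome.runtime) score += 1;          // missing window.chrome
  return score; // higher = more bot-like
}
```

Stealth plugins work by patching each of these properties before page scripts run, so a check like this sees human-looking values.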


Tier 2: Behavioral Signals

These are harder to fake because they require time and interaction:

Mouse Movement Entropy

Real users don't move their mouse in straight lines. Anti-bots measure:

  • Path curvature (Bezier curves vs straight lines)
  • Velocity variation (acceleration/deceleration)
  • Micro-movements (natural hand tremor)
  • Time-on-page before first interaction

Fix: Libraries like ghost-cursor generate realistic mouse paths. Still imperfect — trained models can distinguish synthetic curves.
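A minimal sketch of the idea behind curved paths (in the spirit of ghost-cursor, not its actual implementation): a quadratic Bezier from start to end with a randomized control point and uneven per-step timing:

```javascript
// Sketch: a curved mouse path with timing jitter. The control-point
// offset and delay ranges are illustrative assumptions.
function mousePath(start, end, steps = 25) {
  // Random control point near the midpoint bends the path off the straight line.
  const cx = (start.x + end.x) / 2 + (Math.random() - 0.5) * 100;
  const cy = (start.y + end.y) / 2 + (Math.random() - 0.5) * 100;
  const points = [];
  for (let i = 0; i <= steps; i++) {
    const t = i / steps;
    const u = 1 - t;
    points.push({
      x: u * u * start.x + 2 * u * t * cx + t * t * end.x,
      y: u * u * start.y + 2 * u * t * cy + t * t * end.y,
      delayMs: 8 + Math.random() * 12, // uneven inter-point timing
    });
  }
  return points;
}
```

In Playwright you would feed each point to page.mouse.move(x, y) with a short wait between points.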

Click Timing

Real users don't click buttons at exactly 0ms after page load. Anti-bots check:

  • Time between page load and first interaction
  • Time between mousedown and mouseup events
  • Click coordinates (too perfectly centered = bot)

Fix: Add random delays (300-2000ms), use page.mouse.move() to approach the element before clicking.
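A sketch of humanized click parameters, using the 300-2000 ms range above; the hold duration and off-center click band are illustrative assumptions:

```javascript
// Sketch: randomized click timing and coordinates for an element's
// bounding box. Ranges are assumptions, not published thresholds.
function humanClickPlan(box) {
  const preDelayMs = 300 + Math.random() * 1700; // wait before first interaction
  const holdMs = 40 + Math.random() * 90;        // mousedown -> mouseup gap
  // Click somewhere off-center inside the element, never the exact middle.
  const x = box.x + box.width * (0.3 + Math.random() * 0.4);
  const y = box.y + box.height * (0.3 + Math.random() * 0.4);
  return { preDelayMs, holdMs, x, y };
}
```

With Playwright, approach via page.mouse.move(x, y), then issue page.mouse.down() and page.mouse.up() separated by holdMs instead of a single instant click.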

Scroll Patterns

Real users scroll, stop, read, scroll more. Scripts that scroll to 100% of the page in one smooth motion are flagged.

Fix: Simulate reading — scroll to 30%, pause 2-8 seconds, scroll to 60%, pause, etc.
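The scroll-pause-scroll pattern can be generated up front as a schedule. Step sizes and pause ranges here are illustrative assumptions matching the text above:

```javascript
// Sketch: a "read the page" scroll schedule with uneven steps and
// 2-8 second reading pauses.
function scrollPlan() {
  const plan = [];
  let depth = 0;
  while (depth < 1) {
    depth = Math.min(1, depth + 0.2 + Math.random() * 0.15); // uneven scroll steps
    plan.push({
      fraction: depth,                      // scroll to this fraction of the page
      pauseMs: 2000 + Math.random() * 6000, // 2-8 s "reading" pause
    });
  }
  return plan;
}
```

Each step can then be executed with page.evaluate, scrolling to document.body.scrollHeight times the fraction, followed by a wait of pauseMs.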


Tier 3: Hardware & Environment Signals

These are the hardest to fake:

WebGL Fingerprint

WebGL renders a test image using your GPU. The exact output varies by GPU model, driver version, and OS. Anti-bots hash this and compare against known real device profiles.

Virtual machines and cloud environments render WebGL differently than physical hardware — this is a strong signal.

Fix: Inject static vendor/renderer strings that match a common consumer GPU. Stealth plugins patch these values, but cloud VM rendering is hard to fully mask.
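A sketch of the core override, shown as a standalone function for clarity (in practice you would install it on WebGLRenderingContext.prototype via page.addInitScript). The constants come from the WEBGL_debug_renderer_info extension; the GPU strings are example values you would choose:

```javascript
// Sketch: wrap getParameter so queries for the unmasked GPU vendor and
// renderer report a common consumer GPU instead of the VM's software
// renderer. All other queries fall through to the original.
function spoofWebgl(proto, { vendor, renderer }) {
  const UNMASKED_VENDOR_WEBGL = 0x9245;   // from WEBGL_debug_renderer_info
  const UNMASKED_RENDERER_WEBGL = 0x9246;
  const original = proto.getParameter;
  proto.getParameter = function (param) {
    if (param === UNMASKED_VENDOR_WEBGL) return vendor;
    if (param === UNMASKED_RENDERER_WEBGL) return renderer;
    return original.call(this, param);
  };
}
```

Note this only changes the reported strings; the rendered pixels still hash like a VM's output, which is why the image-based checks remain hard to beat.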

Canvas Fingerprint

Similar to WebGL — renders text and shapes, hashes the result. Cloud VMs use Mesa/LLVMpipe rendering which produces distinctive outputs.

Fix: Override HTMLCanvasElement.prototype.getContext to return modified pixel data.
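One common way to modify the pixel data is low-bit noise that is stable within a session (re-reading the canvas must return identical pixels, or the inconsistency itself is a signal). Shown here as a pure function; you would call it from an overridden getImageData installed via page.addInitScript:

```javascript
// Sketch: flip the low bit of each color channel using a seeded PRNG,
// so the canvas hash changes per session but stays consistent across
// reads and the image remains visually identical. Alpha is untouched.
function noiseCanvasPixels(data, seed) {
  let s = seed >>> 0;
  const nextBit = () => ((s = (s * 1664525 + 1013904223) >>> 0) & 1); // simple LCG
  for (let i = 0; i < data.length; i += 4) {
    data[i] ^= nextBit();     // R
    data[i + 1] ^= nextBit(); // G
    data[i + 2] ^= nextBit(); // B
  }
  return data;
}
```

Pick the seed once per browser session so every canvas read hashes the same way.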

Audio Fingerprint

Processes audio through the Web Audio API. The output varies by OS/hardware and is used as a stable identifier.

Fix: Override AudioContext.prototype.createOscillator and related methods.
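A related approach perturbs the sample data itself with sub-audible jitter, which shifts the audio hash. Shown as a pure function for clarity; in practice you would wrap the audio-reading methods via page.addInitScript:

```javascript
// Sketch: add an inaudible random offset to each audio sample so the
// Web Audio fingerprint hash changes. The 1e-7 magnitude is an
// illustrative assumption, far below audible levels.
function noiseAudioSamples(samples) {
  const out = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    out[i] = samples[i] + (Math.random() - 0.5) * 1e-7;
  }
  return out;
}
```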


Tier 4: Network Signals

TLS Fingerprint (JA3)

Even before your browser sends an HTTP request, the TLS handshake reveals your browser type. JA3 hashing of the TLS ClientHello is a strong signal — Python's requests library has a completely different JA3 signature than Chrome.

Fix: Use curl_cffi instead of requests — it impersonates Chrome's TLS fingerprint.

HTTP/2 Fingerprint

HTTP/2 connection parameters (header ordering, stream priorities, window sizes) vary by browser. Python's httpx has a different HTTP/2 fingerprint than Chrome.

Fix: curl_cffi also handles HTTP/2 fingerprint spoofing.


The Practical Checklist

For light scraping (no anti-bot, just rate limits):

  • curl_cffi with Chrome impersonation
  • Rotate IPs
  • Random delays between requests

For moderate anti-bot (basic fingerprinting):

  • Playwright + playwright-stealth
  • Realistic user agents
  • Random delays + scroll simulation

For heavy anti-bot (Cloudflare, PerimeterX):

  • Playwright + stealth + ghost-cursor
  • Residential proxy rotation
  • Real browser profiles with history
  • Budget for CAPTCHA solving services ($2-5/1K)

For enterprise anti-bot (DataDome, Kasada, Akamai Bot Manager):

  • Specialized tools (zenrows, scrapingbee, Brightdata Scraping Browser)
  • Or use a pre-built actor that handles this for you

When to Stop Fighting Anti-Bot

Beyond a certain point, bypassing enterprise-grade anti-bot systems costs more than using a pre-built solution.

For most scraping use cases — B2B contacts, product prices, job listings, social stats — there are pre-built Apify actors that already handle fingerprinting and proxy rotation. The Apify Scrapers Bundle ($29) includes 30 actors covering the most common data types, each configured with appropriate anti-detection settings.

Build the fingerprinting defense yourself when you have a unique target site. Use pre-built actors when the data type is common.


What anti-bot system are you fighting? Drop the domain in the comments and I'll tell you which tier it falls in and the practical fix.

n8n AI Automation Pack ($39) — 5 production-ready workflows
