Alex Chen

Posted on Mar 23

Browser Fingerprinting and CAPTCHAs: Why Headless Chrome Gets Caught and How to Fix It

You launch your Playwright scraper in headless mode. It works on page 1. By page 3, you're staring at a CAPTCHA. Switch to headed mode — no CAPTCHA for 50 pages.

What's going on? Browser fingerprinting. Anti-bot systems don't just check your IP — they analyze dozens of browser properties to decide if you're human. Let's break down exactly what they check and how to handle it.

What Is Browser Fingerprinting?

Every browser exposes properties through JavaScript APIs. Anti-bot services collect these into a "fingerprint" — a unique identifier for your browser session:

// Just a sample of what gets collected.
// The get*() helpers stand in for the collector's own routines.
const fingerprint = {
  userAgent: navigator.userAgent,
  platform: navigator.platform,
  languages: navigator.languages,
  hardwareConcurrency: navigator.hardwareConcurrency,
  deviceMemory: navigator.deviceMemory,
  screenResolution: [screen.width, screen.height],
  colorDepth: screen.colorDepth,
  timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
  webglVendor: getWebGLVendor(),
  webglRenderer: getWebGLRenderer(),
  canvas: getCanvasFingerprint(),
  audioContext: getAudioFingerprint(),
  fonts: getInstalledFonts(),
  plugins: navigator.plugins.length,
  touchSupport: navigator.maxTouchPoints,
};
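To make the "identifier" part concrete, here is a minimal Python sketch of how a bag of properties can collapse into one stable ID. It assumes a SHA-256 hash over canonical JSON; real anti-bot vendors use their own (unpublished) schemes, and the `fingerprint_id` name and sample values are mine, not from any library.

```python
import hashlib
import json


def fingerprint_id(properties: dict) -> str:
    """Collapse collected browser properties into one short, stable ID."""
    # Sorting keys makes the serialization independent of collection order.
    canonical = json.dumps(properties, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Illustrative values only.
headed = {
    "platform": "Win32",
    "plugins": 5,
    "webglRenderer": "ANGLE (NVIDIA GeForce...)",
}
headless = {**headed, "plugins": 0, "webglRenderer": "Google SwiftShader"}

print(fingerprint_id(headed))
print(fingerprint_id(headless))
```

The two IDs differ even though only two properties changed, which is why a single leaking value is enough to separate a headless session from a headed one.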

How Headless Chrome Gets Detected

Headless Chrome has several telltale signs:

1. The navigator.webdriver Flag

// Headless Chrome:
navigator.webdriver  // true ← BUSTED

// Real Chrome:
navigator.webdriver  // undefined or false

2. Missing Plugins

// Real Chrome: has PDF viewer, etc.
navigator.plugins.length  // 3-5

// Headless Chrome:
navigator.plugins.length  // 0 ← suspicious

3. WebGL Renderer

// Real Chrome:
getWebGLRenderer()  // "ANGLE (NVIDIA GeForce...)"

// Headless Chrome:
getWebGLRenderer()  // "Google SwiftShader" ← dead giveaway

4. Chrome Object

// Real Chrome:
window.chrome  // {runtime: {...}, ...}

// Headless Chrome:
window.chrome  // undefined ← missing

5. Permissions API Behavior

// Real Chrome:
navigator.permissions.query({name: "notifications"})
  .then(p => p.state)  // "prompt" or "denied"

// Headless Chrome (notably older builds) can be inconsistent:
// the query reports "prompt" while Notification.permission is "denied"

Detecting These Leaks in Your Scraper

Before trying to fix things, find out what's leaking:

# fingerprint_audit.py
from playwright.sync_api import sync_playwright

DETECTION_SCRIPT = """
() => {
    const results = {};

    // Test 1: webdriver flag
    results.webdriver = navigator.webdriver;

    // Test 2: plugins
    results.pluginCount = navigator.plugins.length;

    // Test 3: languages
    results.languages = navigator.languages;

    // Test 4: chrome object
    results.hasChrome = !!window.chrome;
    results.hasChromeRuntime = !!(
        window.chrome && window.chrome.runtime
    );

    // Test 5: WebGL
    try {
        const canvas = document.createElement('canvas');
        const gl = canvas.getContext('webgl');
        const debugInfo = gl.getExtension(
            'WEBGL_debug_renderer_info'
        );
        results.webglVendor = gl.getParameter(
            debugInfo.UNMASKED_VENDOR_WEBGL
        );
        results.webglRenderer = gl.getParameter(
            debugInfo.UNMASKED_RENDERER_WEBGL
        );
    } catch(e) {
        results.webglError = e.message;
    }

    // Test 6: Permissions
    results.permissionsAPI = !!navigator.permissions;

    // Test 7: Screen dimensions
    results.screen = {
        width: screen.width,
        height: screen.height,
        availWidth: screen.availWidth,
        availHeight: screen.availHeight,
        colorDepth: screen.colorDepth,
    };

    // Test 8: Hardware
    results.hardwareConcurrency = navigator.hardwareConcurrency;
    results.deviceMemory = navigator.deviceMemory;
    results.maxTouchPoints = navigator.maxTouchPoints;

    // Test 9: Headless indicators
    results.userAgent = navigator.userAgent;
    results.platform = navigator.platform;

    return results;
}
"""

def audit_fingerprint():
    with sync_playwright() as p:
        # Test headless
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        headless_fp = page.evaluate(DETECTION_SCRIPT)
        browser.close()

        # Compare with headed
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        headed_fp = page.evaluate(DETECTION_SCRIPT)
        browser.close()

        # Show differences
        print("=== Fingerprint Differences ===")
        for key in headless_fp:
            if headless_fp[key] != headed_fp.get(key):
                print(f"  {key}:")
                print(f"    Headless: {headless_fp[key]}")
                print(f"    Headed:   {headed_fp.get(key)}")

audit_fingerprint()
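Once you have an audit result, you can check it against the tells from the previous section automatically. This is a rough sketch, assuming the result dict uses the keys produced by DETECTION_SCRIPT above; `find_headless_tells` is a hypothetical helper, and the rules are far simpler than what a commercial detector runs.

```python
def find_headless_tells(fp: dict) -> list[str]:
    """Return the headless indicators present in one audit result."""
    tells = []
    if fp.get("webdriver") is True:
        tells.append("navigator.webdriver is true")
    if fp.get("pluginCount", 0) == 0:
        tells.append("no plugins")
    if not fp.get("hasChrome", False):
        tells.append("window.chrome missing")
    if "swiftshader" in str(fp.get("webglRenderer") or "").lower():
        tells.append("software WebGL renderer (SwiftShader)")
    # Headless Chrome's default User-Agent contains "HeadlessChrome".
    if "HeadlessChrome" in str(fp.get("userAgent") or ""):
        tells.append("HeadlessChrome in User-Agent")
    return tells
```

Calling `find_headless_tells(headless_fp)` inside `audit_fingerprint()` gives you a short list to work through before you start patching.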

Fixing the Fingerprint Leaks

Approach 1: Playwright Stealth (Quick Fix)

Use playwright-stealth to patch common detection points:

# pip install playwright-stealth
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Apply stealth patches
    stealth_sync(page)

    page.goto("https://target-site.com")
    # Now navigator.webdriver = false,
    # plugins are spoofed, etc.

Approach 2: Manual Patches (More Control)

from playwright.sync_api import sync_playwright

def apply_stealth(page):
    """Apply individual stealth patches."""

    # 1. Remove webdriver flag
    page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        });
    """)

    # 2. Fake plugins
    page.add_init_script("""
        Object.defineProperty(navigator, 'plugins', {
            get: () => {
                const plugins = [
                    {
                        name: 'Chrome PDF Plugin',
                        description: 'Portable Document Format',
                        filename: 'internal-pdf-viewer',
                        length: 1
                    },
                    {
                        name: 'Chrome PDF Viewer',
                        description: '',
                        filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai',
                        length: 1
                    },
                    {
                        name: 'Native Client',
                        description: '',
                        filename: 'internal-nacl-plugin',
                        length: 2
                    }
                ];
                plugins.length = 3;
                return plugins;
            }
        });
    """)

    # 3. Fake chrome object
    page.add_init_script("""
        window.chrome = {
            runtime: {
                onConnect: null,
                onMessage: null,
            },
            loadTimes: function() {},
            csi: function() {},
            app: {}
        };
    """)

    # 4. Fix permissions query
    page.add_init_script("""
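        // Bind to the Permissions object; calling the bare function
        // throws "Illegal invocation".
        const originalQuery = window.navigator.permissions.query.bind(
            window.navigator.permissions
        );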
        window.navigator.permissions.query = (params) => {
            if (params.name === 'notifications') {
                return Promise.resolve({
                    state: Notification.permission
                });
            }
            return originalQuery(params);
        };
    """)

    # 5. Fake languages
    page.add_init_script("""
        Object.defineProperty(navigator, 'languages', {
            get: () => ['en-US', 'en']
        });
    """)

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
            '--disable-blink-features=AutomationControlled',
        ]
    )

    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        locale="en-US",
        timezone_id="America/New_York",
    )

    page = context.new_page()
    apply_stealth(page)
    page.goto("https://target-site.com")

Approach 3: CDP Connection to Real Browser

The most reliable approach — connect to a real browser instance:

from playwright.sync_api import sync_playwright

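# Launch a real Chrome with remote debugging first.
# Recent Chrome versions only honor the port with a non-default profile
# directory (the path here is just an example):
# chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug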

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp(
        "http://localhost:9222"
    )

    context = browser.contexts[0]
    page = context.new_page()

    # This IS a real browser — nothing to patch
    page.goto("https://target-site.com")
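If your script starts Chrome itself, it has to wait until the debugging port is up before connecting. Here is a small sketch using only the standard library. The helper names, the profile directory, and the timeout are my assumptions; the `/json/version` endpoint and its `webSocketDebuggerUrl` field are part of Chrome's DevTools HTTP interface.

```python
import json
import time
import urllib.request


def chrome_debug_command(chrome_path: str, port: int, profile_dir: str) -> list[str]:
    """Build the argv for a Chrome instance with remote debugging enabled."""
    return [
        chrome_path,
        f"--remote-debugging-port={port}",
        # A dedicated profile directory keeps the debug session separate.
        f"--user-data-dir={profile_dir}",
    ]


def wait_for_cdp(port: int, timeout: float = 10.0) -> str:
    """Poll the DevTools endpoint until Chrome answers; return its WebSocket URL."""
    deadline = time.monotonic() + timeout
    url = f"http://localhost:{port}/json/version"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=1) as resp:
                return json.load(resp)["webSocketDebuggerUrl"]
        except OSError:
            # Port not open yet; retry shortly.
            time.sleep(0.25)
    raise TimeoutError(f"Chrome did not open port {port} within {timeout}s")


# Usage (not run here):
#   subprocess.Popen(chrome_debug_command("chrome", 9222, "/tmp/chrome-debug"))
#   ws_url = wait_for_cdp(9222)
#   browser = p.chromium.connect_over_cdp(ws_url)
```

Polling instead of a fixed sleep keeps startup fast on a warm machine and still tolerates a slow cold start.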

When Stealth Isn't Enough: Solving CAPTCHAs

Even with a clean fingerprint, some sites will still show CAPTCHAs, especially in these situations:

First visit from a new IP
Login/signup flows
After N requests in a session
High-value pages (checkout, pricing)

That's when you need a solving service:

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
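import httpx  # for your solve_captcha() implementation

# extract_sitekey, detect_type, solve_captcha, and extract_data are
# placeholders: supply your own implementations for the target site.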

def scrape_with_stealth_and_solving(url: str):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        stealth_sync(page)

        page.goto(url)

        # Check if we got a CAPTCHA despite stealth
        captcha_el = page.query_selector(
            '[class*="captcha"], '
            '[data-sitekey], '
            '.g-recaptcha, '
            '.h-captcha, '
            '.cf-turnstile'
        )

        if captcha_el:
            print("CAPTCHA detected despite stealth — solving...")
            sitekey = (
                captcha_el.get_attribute("data-sitekey") 
                or extract_sitekey(page.content())
            )
            captcha_type = detect_type(captcha_el)

            # Solve via API
            token = solve_captcha(
                captcha_type=captcha_type,
                sitekey=sitekey,
                url=url
            )

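            # Inject token. Pass it as an argument instead of interpolating
            # it, so quotes in the token cannot break the script.
            page.evaluate(
                """(token) => {
                    const textarea = document.querySelector(
                        'textarea[name*="captcha-response"]'
                    );
                    if (textarea) textarea.value = token;
                }""",
                token,
            )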

            page.click('button[type="submit"]')
            page.wait_for_load_state("networkidle")

        # Now scrape the actual content
        data = extract_data(page)
        browser.close()
        return data
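The scraper above leans on helpers it never defines. As one possible shape for two of them, here is a hedged sketch. The names, the marker table, and the regex are my assumptions, and `detect_type_from_class` takes the element's class string rather than the element itself so it stays easy to test.

```python
import re
from typing import Optional

# Hypothetical marker-to-type table; extend it for the widgets you meet.
_CLASS_MARKERS = {
    "g-recaptcha": "recaptcha",
    "h-captcha": "hcaptcha",
    "cf-turnstile": "turnstile",
}

_SITEKEY_RE = re.compile(r"""data-sitekey=["']([^"']+)["']""")


def detect_type_from_class(class_attr: Optional[str]) -> str:
    """Guess the CAPTCHA vendor from a widget's class attribute."""
    classes = (class_attr or "").split()
    for marker, captcha_type in _CLASS_MARKERS.items():
        if marker in classes:
            return captcha_type
    return "unknown"


def extract_sitekey(html: str) -> Optional[str]:
    """Fall back to scanning raw HTML for the first data-sitekey value."""
    match = _SITEKEY_RE.search(html)
    return match.group(1) if match else None
```

In the scraper you could call `detect_type_from_class(captcha_el.get_attribute("class"))`. `solve_captcha` and `extract_data` depend on your solving service and target page, so they are left out here.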

Fingerprint Consistency Checklist

When setting up your scraper, make sure these are all consistent:

# ✅ Consistent setup
context = browser.new_context(
    # Match the User-Agent to the viewport/platform
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
    viewport={"width": 1920, "height": 1080},  # Desktop
    locale="en-US",
    timezone_id="America/New_York",

    # Screen size should match viewport
    screen={"width": 1920, "height": 1080},

    # Match color scheme to majority of users
    color_scheme="light",
)

# ❌ Inconsistent (will get flagged)
context = browser.new_context(
    user_agent="...iPhone...",       # Says mobile
    viewport={"width": 1920, ...},  # But desktop viewport!
    locale="zh-CN",                 # Chinese locale
    timezone_id="America/New_York", # But US timezone!
)
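You can also lint a context config before launching anything. The sketch below is an assumption-heavy illustration: the locale-to-timezone table is deliberately tiny, the 1024px threshold is arbitrary, and `find_inconsistencies` is a hypothetical helper that takes the same keyword names you would pass to `browser.new_context()`.

```python
# Hypothetical, deliberately small lookup; real traffic needs a fuller table.
LOCALE_TIMEZONE_PREFIXES = {
    "en-US": ("America/", "Pacific/"),
    "zh-CN": ("Asia/",),
}
MOBILE_UA_MARKERS = ("iPhone", "Android", "Mobile")


def find_inconsistencies(config: dict) -> list[str]:
    """Return human-readable mismatches found in a context config."""
    problems = []
    ua = config.get("user_agent", "")
    viewport = config.get("viewport") or {}
    screen = config.get("screen")

    # A mobile User-Agent should not come with a desktop-width viewport.
    if any(m in ua for m in MOBILE_UA_MARKERS) and viewport.get("width", 0) > 1024:
        problems.append("mobile User-Agent with desktop-width viewport")

    # The viewport cannot be larger than the screen it is drawn on.
    if screen and (
        screen["width"] < viewport.get("width", 0)
        or screen["height"] < viewport.get("height", 0)
    ):
        problems.append("viewport larger than screen")

    # Locale and timezone should point at the same part of the world.
    prefixes = LOCALE_TIMEZONE_PREFIXES.get(config.get("locale"))
    tz = config.get("timezone_id", "")
    if prefixes and tz and not tz.startswith(prefixes):
        problems.append("locale and timezone disagree")

    return problems
```

Run it on the config dict first, then unpack the same dict into `browser.new_context(**config)` once the list comes back empty.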

Testing Your Stealth Setup

Run your browser against common detection sites:

DETECTION_SITES = [
    "https://bot.sannysoft.com/",
    "https://arh.antoinevastel.com/bots/areyouheadless",
    "https://infosimples.github.io/detect-headless/",
    "https://fingerprintjs.github.io/fingerprintjs/",
]

def test_stealth(page):
    for site in DETECTION_SITES:
        page.goto(site)
        page.wait_for_timeout(3000)
        page.screenshot(
            path=f"stealth_test_{site.split('/')[2]}.png"
        )
        print(f"Screenshotted: {site}")

The Decision Tree

Start
  ├─ Can you use a real browser (CDP)?
  │   └─ Yes → Best stealth, highest resource cost
  ├─ Is the site moderately protected?
  │   └─ playwright-stealth + good fingerprint config
  ├─ Is the site heavily protected?
  │   └─ Stealth + CAPTCHA solving API for fallback
  └─ Is it an API endpoint (no JS)?
      └─ Skip browser entirely, just solve CAPTCHAs
          via HTTP and submit tokens

Key Takeaways

Headless detection is fingerprint-based — not just User-Agent
WebGL renderer and navigator.webdriver are the biggest tells
Consistency matters — a mobile UA with a desktop viewport is obvious
Stealth plugins patch the common leaks, but determined anti-bot systems need more
Always have a CAPTCHA solver fallback — perfect stealth is impossible

For solving CAPTCHAs when stealth isn't enough, check out passxapi-python — it handles reCAPTCHA, hCaptcha, Turnstile, and FunCaptcha with a unified API.

What's your stealth setup? Have you found detection vectors I didn't cover? Let me know in the comments.

DEV Community