You've spent hours setting up your scraper. Rotating proxies, real Chrome User-Agent, navigator.webdriver set to undefined, perfect headers. You run it. Blocked. Instantly.
What's going on?
Chances are you never even made it to the layer where your headers live.
The problem isn't your IP
When you start with browser automation, the logic feels solid: change the IP and the server doesn't know who you are. In 2015 that was enough. Akamai, Cloudflare and PerimeterX stopped caring that much about your IP a long time ago. What they care about is something else.
They care about how you talk before you say anything.
When your bot opens an HTTPS connection, the first thing that happens isn't an HTTP request. It's a TLS negotiation — the protocol that sets up the encrypted channel. During that negotiation your client sends a message called ClientHello containing, among other things, which cipher suites it supports, which TLS extensions it uses, and in what order it declares them.
That set of characteristics has a hash. It's called JA3. And that hash identifies with reasonable accuracy what software is opening the connection — before reading a single HTTP header, before executing any detection JavaScript, before analyzing anything.
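To make that concrete, here is roughly how a JA3 hash is built: the ClientHello fields are serialized into a comma-separated string (each list joined with dashes, in the exact order the client declared it) and MD5-hashed. This sketch omits one detail of real implementations, which also strip GREASE values before hashing:

```python
import hashlib

def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Compute a JA3-style fingerprint from ClientHello fields.

    Each argument holds the decimal values seen on the wire; the lists
    keep the order the client declared them in, because that order is
    part of the fingerprint.
    """
    fields = [
        str(tls_version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(c) for c in curves),
        "-".join(str(p) for p in point_formats),
    ]
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode()).hexdigest()

# Two clients offering the same ciphers in a different order
# produce different fingerprints:
a = ja3_hash(771, [4865, 4866], [0, 23, 65281], [29, 23], [0])
b = ja3_hash(771, [4866, 4865], [0, 23, 65281], [29, 23], [0])
assert a != b
```

This is why "same ciphers, different client" still gets caught: the declaration order is itself a signal.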
requests has its own JA3. httpx has its own. Playwright's Chromium has its own. And none of them match a commercial Google Chrome installed on a real Windows machine.
The mismatch that gives you away
Here's the specific problem with Playwright: it downloads and uses its own Chromium binary. Technically it's the same codebase as Chrome, but the TLS fingerprint isn't identical to a commercial Chrome build.
The result is a contradiction that antibot systems catch in milliseconds:
- Your headers say: "I'm Google Chrome 124 on Windows 10"
- Your TLS fingerprint says: "I'm an automated Chromium binary"
Blocked. And it doesn't matter how well you've handled the JavaScript evasion layer, because the block happens at the network level. The server never even gets to run anything in the browser.
The real detection layers
It helps to understand the order these systems operate in:
Layer 1 — Network: IP, ASN, TLS fingerprint (JA3/JA4)
Layer 2 — HTTP: Headers, header order, User-Agent, Client Hints
Layer 3 — JavaScript: Canvas, WebGL, AudioContext, navigator.*
Layer 4 — Behavior: Mouse, keyboard, scroll, timing
Most evasion guides talk about layers 3 and 4. They matter, but if you fail at layer 1 the other three are irrelevant. The server never gets to execute them.
What actually works
Option 1: use the system's Chrome binary
If you point Playwright at the Chrome binary installed on the operating system, the TLS fingerprint is that of a real Chrome:
context = await playwright.chromium.launch_persistent_context(
    executable_path="/usr/bin/google-chrome",  # system Chrome, not the bundled Chromium
    user_data_dir="./profile",
    headless=False,                # turn off Playwright's old headless mode...
    args=["--headless=new"],       # ...and use Chrome's own new headless instead
)
The --headless=new flag isn't optional. Playwright's classic headless mode disables the GPU pipeline, so the Canvas fingerprint comes out flat and obviously fake; Akamai's sensor.js catches it immediately. With --headless=new the full graphics stack is preserved.
Option 2: curl-cffi for direct requests
When you don't need JavaScript rendering — APIs, simple endpoints, resource downloads — curl-cffi solves the TLS problem without a browser:
import asyncio
from curl_cffi.requests import AsyncSession, BrowserType

async def fetch_data():
    # impersonate replays Chrome 124's exact ClientHello
    async with AsyncSession(impersonate=BrowserType.chrome124) as session:
        return await session.get("https://example.com/api/data")

response = asyncio.run(fetch_data())
It uses libcurl with patches to reproduce Chrome's exact ClientHello, including cipher suite order and extensions. The resulting JA3 is indistinguishable from a real Chrome.
One detail that often gets missed: the Sec-Fetch-* headers. If you're hitting an API, the mode has to match what a real browser would send in that situation:
# Same-origin AJAX call
headers = {
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Dest": "empty",
}

# Full page load
headers = {
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-User": "?1",
}
Sending Sec-Fetch-Mode: navigate in an AJAX call never happens in a real browser. Akamai knows that.
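One way to keep this consistent is to stop hand-writing the headers per call and derive them from the kind of request instead. A minimal sketch (the helper name `sec_fetch_headers` and the two request kinds are illustrative; real browser traffic has more combinations, such as cross-site requests):

```python
def sec_fetch_headers(kind: str) -> dict[str, str]:
    """Return a coherent Sec-Fetch-* set for a given request kind.

    Only two kinds are modeled here; a real browser emits more
    combinations (cross-site requests, iframes, image loads, ...).
    """
    if kind == "ajax":  # same-origin XHR/fetch call
        return {
            "Sec-Fetch-Site": "same-origin",
            "Sec-Fetch-Mode": "cors",
            "Sec-Fetch-Dest": "empty",
        }
    if kind == "navigate":  # top-level page load
        return {
            "Sec-Fetch-Site": "none",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-User": "?1",
        }
    raise ValueError(f"unknown request kind: {kind}")
```

Centralizing the choice means an API call can never accidentally ship a navigation header set.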
The architecture that holds up best
Combine both: system Chrome for interactive navigation where you need JS, curl-cffi for direct requests where you just need data. The trick is syncing cookies between the two layers — if you don't, the session cookies generated by the browser won't be available in your direct requests and vice versa.
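The syncing step can be sketched as a small conversion. This assumes you read cookies from Playwright via `context.cookies()` (which returns a list of dicts with `name`/`value` keys) and pass them to a requests-style curl-cffi session; the helper name is illustrative:

```python
def cookies_for_session(browser_cookies: list[dict]) -> dict[str, str]:
    """Flatten Playwright's context.cookies() output into the
    name -> value mapping a requests-style session accepts.

    Note: this drops domain/path scoping, so only use it when all
    direct requests target the same site the browser session belongs to.
    """
    return {c["name"]: c["value"] for c in browser_cookies}

# Example with the shape Playwright returns:
raw = [
    {"name": "session_id", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "bm_sz", "value": "tok", "domain": ".example.com", "path": "/"},
]
jar = cookies_for_session(raw)
# then, e.g.: AsyncSession(impersonate="chrome", cookies=jar)
```

For the reverse direction, any Set-Cookie values collected on the direct-request side have to be written back into the browser context before the next navigation.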
JA4: what comes after JA3
JA4 is the successor to JA3, published by FoxIO in 2023. It is more robust largely because it sorts cipher suites and extensions before hashing, which neutralizes the extension-order randomization Chrome introduced precisely to break JA3-style matching. If JA3 was the ID card of your TLS connection, JA4 is the ID card plus the history.
The good news is that curl-cffi covers it too — by impersonating Chrome at the libcurl level, the resulting JA4 is equally coherent.
What fixing TLS doesn't solve
Getting the TLS fingerprint right is necessary but it's not the finish line. Once you pass layer 1 they keep analyzing:
HTTP/2 fingerprinting — similar to JA3 but for the HTTP/2 protocol. The order of SETTINGS frames and WINDOW_UPDATE values also identify the client.
JavaScript fingerprinting — Canvas, WebGL, AudioContext, fonts, navigator.webdriver, window.chrome. This layer requires active spoofing via scripts that run before the page loads.
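The "scripts that run before the page loads" part is what Playwright's `add_init_script` is for. The patch below is a deliberately minimal sketch (the `STEALTH_INIT_SCRIPT` name is mine, and real sensors check far more than these two properties, including whether the patched getters themselves look native):

```python
STEALTH_INIT_SCRIPT = """
// Hide the automation flag before any page script can read it.
Object.defineProperty(Navigator.prototype, 'webdriver', {
  get: () => undefined,
});
// Real Chrome exposes a window.chrome object; automated builds
// sometimes don't.
if (!window.chrome) {
  window.chrome = { runtime: {} };
}
"""

# Registered on the context so it runs in every page, before any of
# the page's own scripts:
#   await context.add_init_script(STEALTH_INIT_SCRIPT)
```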
Behavior — mouse movement patterns, typing speed, timing between actions, scroll. The more advanced systems build a behavioral model per session and detect anomalies.
Each layer adds work. But it also adds robustness. A scraper that only handles TLS falls at the JavaScript layer. One that handles all layers consistently is genuinely hard to tell apart from a real user.
Next up: Canvas fingerprinting in detail — how it actually works, why deterministic per-session noise beats random noise, and how antibot systems detect poorly implemented spoofing.