Double CHEN

The fingerprint layer is why your Playwright + residential proxies still get blocked

The thread that started this

A couple months ago I saw a post on r/webscraping that summed up the current state of things better than I ever could:

"We have scrapers, ordinary ones, browser automation… we use proxies for location based blocking, residential proxies for data centre blockers, we rotate the user agent, we have some third party unblockers too. But often, we still get captchas, and CloudFlare can get in the way too."

Four layers of evasion. Still getting challenge pages. 167 upvotes, 52 comments, archived without a real solution. I've been writing some variant of that person's scraper for the past two years, and I think there's a cleaner answer now than there was when they posted.

This is a writeup of what I think the actual problem is, what I tested, and the CLI-based setup I've ended up with.

The four layers that don't get you past Cloudflare

These are the evasion techniques the OP was already using:

1. User agent rotation

Changes the User-Agent header per request. This catches early-2010s anti-bot rules. Cloudflare, DataDome, and Akamai all moved past it years ago; they don't trust the UA string at all, they build their own fingerprint.
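For concreteness, this is roughly all that layer amounts to. A minimal sketch with the requests library (the URL is a placeholder):

```python
# UA rotation in its entirety: swap one header per request.
# The TLS handshake underneath still identifies a Python client,
# which is the layer modern WAFs actually read.
import random
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

resp = requests.get(
    "https://example.com",
    headers={"User-Agent": random.choice(USER_AGENTS)},
    timeout=10,
)
print(resp.status_code)
```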

2. Residential proxies

Gets you past IP-reputation lists (data-center ASNs are pre-flagged). Useful. But the proxy exit IP is one of dozens of signals — not a primary discriminator on any WAF I've worked with.

3. Geo/location proxies

Solves "this site only serves US users" or similar rate-limit-per-country patterns. Good for what it does. Doesn't affect bot detection.

4. Third-party unblocker services

These are typically a browser-farm-as-a-service with some stealth built in. Works well until the service itself gets fingerprinted (which happens to all of them eventually — when a service becomes popular, anti-bot vendors train their models on it).

What actually gets checked

The top comments on that thread are a good inventory. I've verified each of these against my own traffic:

TLS fingerprint (JA3 / JA3S)
Cipher suite order, extension list, supported groups — all sent during the TLS handshake, before any HTTP. Python requests has a JA3 that screams "non-browser." Curl-cffi and rnet both let you mimic a real browser's TLS, which is a big reason they work better. For browser automation you get this "for free" as long as you're using a real browser underneath.
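On the HTTP-client side, the fix looks like this. A minimal curl-cffi sketch (available impersonation targets vary by version):

```python
# Same GET as plain requests, but the TLS ClientHello (cipher order,
# extension list, GREASE) is copied from a real Chrome build, so the
# JA3 fingerprint no longer reads "Python client".
from curl_cffi import requests

resp = requests.get("https://example.com", impersonate="chrome")
print(resp.status_code)
```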

JavaScript-level fingerprint
This is the big one. Scripts loaded on page render query:

  • navigator.webdriver (false in real Chrome, true under automation; deleting it outright is its own tell)
  • navigator.plugins (Playwright/Puppeteer vanilla: empty array)
  • navigator.languages (should match headers)
  • Canvas rendering hash (deterministic per GPU + driver, the real moat)
  • WebGL renderer string
  • Font list (from document.fonts)
  • Audio context hash
  • window.chrome.runtime (missing in headless Chrome by default)

Plugin-based stealth libraries patch the first three to five of these. The WAF vendors fold the remaining ones into their detection within a release cycle, and you end up in a patch loop. You can check what your own stack reports with the probe below.
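A minimal sketch with the sync Playwright API: it reads back the same properties a detection script queries, which is handy for verifying what a stealth patch actually changed.

```python
from playwright.sync_api import sync_playwright

PROBE = """() => ({
  webdriver: navigator.webdriver,
  plugins: navigator.plugins.length,
  languages: navigator.languages,
  chromeRuntime: !!(window.chrome && window.chrome.runtime),
})"""

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.evaluate(PROBE))  # vanilla headless: webdriver=True, plugins often 0
    browser.close()
```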

Behavioral
Mouse movement curvature, keyboard interval distribution, scroll velocity, click-before-load latency. These matter for high-tier targets (DataDome enterprise, Akamai Bot Manager).
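Playwright's own API can take the edge off the cheapest of these checks. A hedged sketch (the waypoint and timing values are arbitrary, and real behavioral models score far more than this):

```python
import random
import time

def human_ish_move(page, x: float, y: float) -> None:
    """Move to (x, y) via a jittered waypoint instead of teleporting.

    steps=N makes Playwright emit N intermediate mousemove events,
    and the sleeps break up machine-regular action intervals.
    """
    waypoint = (x * 0.5 + random.uniform(-40, 40),
                y * 0.5 + random.uniform(-40, 40))
    for tx, ty in (waypoint, (x, y)):
        page.mouse.move(tx, ty, steps=random.randint(12, 30))
        time.sleep(random.uniform(0.05, 0.25))
```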

Network latency between IP and proxy
This is the gotcha. One comment on the original thread describes a month-long debug where every other signal looked clean, but the timing of browser actions against the measured round-trip didn't match what a single human user would produce. The fix was moving the data-center machine that drives the browser close to the residential proxy's exit node.
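A crude way to see how much latency your proxy hop adds, sketched with the requests library (the proxy URL is a placeholder for your own credentials):

```python
# Compare direct vs. proxied round-trip from the machine driving the
# browser. A large delta is the mismatch described above: actions arrive
# slower than the exit IP's measured RTT would imply for a single user.
import time
import requests

TARGET = "https://example.com"
PROXIES = {"https": "http://user:pass@residential-exit.example:8080"}  # placeholder

def best_rtt(**kwargs) -> float:
    samples = []
    for _ in range(3):
        t0 = time.perf_counter()
        requests.get(TARGET, timeout=10, **kwargs)
        samples.append(time.perf_counter() - t0)
    return min(samples)

direct, proxied = best_rtt(), best_rtt(proxies=PROXIES)
print(f"direct {direct*1000:.0f} ms / proxied {proxied*1000:.0f} ms / "
      f"added {(proxied - direct)*1000:.0f} ms")
```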

What I tested

I built a small test harness: five target sites known to sit behind different anti-bot tiers, with each stack run against each site for 200 sessions over a week. I counted a session as successful when the page loaded with no challenge page.

| Stack | Light WAF | Medium (reCAPTCHA) | Hard (CF Bot Mgmt) | Very Hard (DataDome) |
| --- | --- | --- | --- | --- |
| Plain Playwright | 198/200 | 142/200 | 6/200 | 0/200 |
| Playwright + stealth plugin | 200/200 | 189/200 | 94/200 | 2/200 |
| Playwright + stealth + residential | 200/200 | 195/200 | 127/200 | 11/200 |
| Camoufox (anti-detect browser) | 200/200 | 198/200 | 173/200 | 34/200 |
| browser-act CLI + residential | 200/200 | 200/200 | 191/200 | 47/200 |

The most dramatic gap was on Cloudflare Bot Management: plain Playwright passed 6/200, while the stealth-aware stacks ranged from 94/200 up to 191/200. The differences between the stealth setups themselves were smaller but still significant. DataDome Enterprise remains hard for everything except mobile-device-based approaches.
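For what it's worth, "no challenge page appeared" can be checked mechanically. A sketch of the kind of classifier I mean, with an illustrative (not exhaustive) marker list:

```python
# Classify a loaded page as "challenge" vs. "real content" by known
# interstitial markers. "Just a moment..." is Cloudflare's managed
# challenge <title>; extend the tuple per target.
CHALLENGE_MARKERS = (
    "Just a moment...",
    "Attention Required! | Cloudflare",
    "Verifying you are human",
)

def is_challenge(title: str, html: str) -> bool:
    haystack = f"{title}\n{html}"
    return any(marker in haystack for marker in CHALLENGE_MARKERS)
```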

Working with browser-act CLI

I ended up moving most of my scrapers over to browser-act. Not because it's strictly the highest-scoring option — Camoufox is very close — but because it's a CLI instead of a library. That changed how I write scrapers.

Install

npx skills add browser-act/skills --skill browser-act

(Uses npm's skill system; the CLI itself is a Python package that gets installed on first run.)

The commands that matter

browser-act --session myjob browser open <stealth_id> https://example.com
browser-act --session myjob wait stable
browser-act --session myjob solve-captcha
browser-act --session myjob get markdown
browser-act --session myjob state
browser-act --session myjob click 14
browser-act --session myjob input 7 "search query"

The --session flag keeps your cookie jar and fingerprint persistent across calls. Log in once, reuse the session for subsequent scrapes.

solve-captcha is the built-in Cloudflare Turnstile + reCAPTCHA v2 + hCaptcha solver. In my testing it returned solved: True on Indeed and Product Hunt in under two seconds each. No 2captcha/anti-captcha account needed; the CLI handles it.

get markdown is the one I didn't expect to use as much as I do. It returns an LLM-optimized markdown representation of the page, stripping navigation chrome, scripts, and ad containers. On the Product Hunt AI directory:

  • Raw HTML: 680,193 chars
  • browser-act markdown: 49,272 chars
  • Reduction: 92.7%

If you're running LLM-in-the-loop scraping, that's ~14x fewer input tokens per page. Compounds very fast on high-volume jobs.
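The arithmetic behind those figures, using the rough four-characters-per-token heuristic:

```python
raw_chars, md_chars = 680_193, 49_272
print(f"reduction: {1 - md_chars / raw_chars:.2%}")                # 92.76%
print(f"ratio:     {raw_chars / md_chars:.1f}x")                   # 13.8x
print(f"tokens saved per page: ~{(raw_chars - md_chars) // 4:,}")  # ~157,730
```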

A concrete refactor

Here's a login-then-scrape flow. Old version (Playwright + stealth + retry glue):

from playwright.async_api import async_playwright
from playwright_stealth import stealth_async
import asyncio, random

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            args=['--disable-blink-features=AutomationControlled']
        )
        context = await browser.new_context(
            user_agent='Mozilla/5.0 ...',
            viewport={'width': 1920, 'height': 1080},
        )
        page = await context.new_page()
        await stealth_async(page)
        # ... ~80 more lines: cookie banner dismissal, login form, challenge retry loop

New version, same flow:

#!/bin/bash
SESSION=mywork
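# BID: a stealth browser profile id, per "browser open <stealth_id>" above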
browser-act --session $SESSION browser open $BID https://target.com/login
browser-act --session $SESSION wait stable
browser-act --session $SESSION input 3 "$USER"
browser-act --session $SESSION input 4 "$PASS"
browser-act --session $SESSION click 5
browser-act --session $SESSION wait stable
browser-act --session $SESSION solve-captcha  # handles CF challenge if one appears
browser-act --session $SESSION navigate https://target.com/dashboard
browser-act --session $SESSION get markdown > dashboard.md

My actual scraper went from ~800 lines of Python glue to ~120 lines of bash calling the CLI. Much less to maintain, and the debug story is better — each command prints its result, so I can tee and inspect.
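That same property makes the CLI trivial to drive from Python when bash stops being enough. A minimal subprocess sketch (session name and output path are arbitrary; commands are the ones shown above):

```python
# Every browser-act call is a subprocess whose stdout you can capture,
# log, or assert on; no library bindings needed.
import subprocess
from pathlib import Path

def ba(*args: str, session: str = "mywork") -> str:
    result = subprocess.run(
        ["browser-act", "--session", session, *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

ba("navigate", "https://target.com/dashboard")
ba("wait", "stable")
Path("dashboard.md").write_text(ba("get", "markdown"))
```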

Limits and caveats

Being honest about what this doesn't do:

  • DataDome Enterprise and Akamai Bot Manager are still hard. My 47/200 on DataDome is better than other stacks in the test but not production-viable for aggressive scraping. For those targets you're looking at mobile device farms or paid bypass APIs.
  • Proxy rotation is not automatic; configure with --dynamic-proxy <region> or --custom-proxy <url>. Still need a proxy source if you're at scale.
  • Browser profiles are stored locally. If you want to share a logged-in session across machines, you need to export/import the profile yourself. Not a one-click thing.
  • CLI-instead-of-library is a trade-off: you lose the fine-grained control of the Playwright API. For 80% of scraping flows that's fine; for the 20% where you need, say, CDP-level network interception mid-session, you'd stay with Playwright.

Takeaway

The Reddit thread is accurate: UA rotation + residential proxies + unblockers aren't enough on their own because they don't touch the fingerprint layer, which is what modern WAFs actually gate on. Getting past Cloudflare Bot Management or similar means controlling:

  1. TLS fingerprint (use a real browser or curl-cffi)
  2. JavaScript-level signals (canvas, webdriver, plugins — stealth patches for all of them)
  3. Captcha handling (built-in solver or paid service)
  4. Session persistence (so you're not re-solving on every request)

browser-act is one option that bundles all four; Camoufox + a captcha service is another. The exact tool matters less than recognizing that the fingerprint layer is where the game is now.

Happy to hear if you've tested other setups — especially on the DataDome/Akamai side, where I think the community's collective knowledge is still thin.
