Vhub Systems

Posted on Apr 3

TLS Fingerprinting: How Cloudflare Identifies Scrapers (And How to Bypass It)

#antibot #webscraping #security #python

Cloudflare doesn't just block scrapers by IP. In 2026, one of its main detection mechanisms is TLS fingerprinting: analyzing how your HTTPS connection is established, before any HTTP request is sent.

Here's how it works and what you actually need to do to bypass it.

What TLS Fingerprinting Is

When you make an HTTPS request, your HTTP client and the server perform a TLS handshake. During this handshake, the client announces:

Which TLS versions it supports
Which cipher suites it prefers (and in what order)
Which extensions it includes (SNI, ALPN, etc.)
Which elliptic curves it supports

This combination creates a fingerprint. Different clients produce different fingerprints:

Python's requests library produces a known fingerprint
Chrome 120 on Windows produces a different fingerprint
Firefox 121 on macOS produces another

Cloudflare and similar services maintain databases of these fingerprints. When your scraper connects with Python's default fingerprint, Cloudflare knows before seeing any HTTP headers that this is a programmatic client, not a browser.
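To see the raw material for such a fingerprint, you can ask Python what its default TLS configuration would offer. This is a minimal sketch using only the standard library; the helper name describe_default_tls is made up for illustration, and the output varies with your Python and OpenSSL build.

```python
import ssl


def describe_default_tls() -> dict:
    """Summarize what Python's default SSL context would offer in a handshake."""
    ctx = ssl.create_default_context()
    return {
        "openssl": ssl.OPENSSL_VERSION,
        "min_version": ctx.minimum_version.name,
        "max_version": ctx.maximum_version.name,
        # get_ciphers() returns the enabled suites in priority order,
        # and that order is part of what a fingerprint captures.
        "ciphers": [cipher["name"] for cipher in ctx.get_ciphers()],
    }


if __name__ == "__main__":
    info = describe_default_tls()
    print(info["openssl"], info["min_version"], info["max_version"])
    for name in info["ciphers"]:
        print(" ", name)
```

Comparing this list with what a browser sends (for example via a fingerprint echo service) shows how different the two handshakes look.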

The Standard Python Fingerprint Problem

import requests
r = requests.get("https://cloudflare-protected-site.com")
# Cloudflare sees: the TLS fingerprint of Python's ssl module (OpenSSL), not a browser's
# Blocked: often immediately, sometimes after first request

Python's built-in SSL implementation (via OpenSSL) produces a fingerprint that looks nothing like a browser. Specifically:

Python uses a different cipher suite order
Python typically doesn't include browser-specific TLS extensions
Python's ALPN negotiation differs from Chrome's

This is detectable during the TLS handshake (in the ClientHello message), before any HTTP request is sent.

The curl_cffi Solution

curl_cffi is a Python binding for curl-impersonate, a libcurl build that uses BoringSSL (the TLS library Chrome uses) and is configured to send the same handshake as a real browser.

from curl_cffi import requests as curl_requests

# Impersonate Chrome 120
session = curl_requests.Session(impersonate="chrome120")
response = session.get("https://cloudflare-protected-site.com")
print(response.status_code)  # ideally 200 instead of 403

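Impersonation targets include the following (the exact list depends on your curl_cffi version, so check the project's documentation):

chrome120, chrome119, chrome110 (recommended)
safari17_0, safari15_5
edge101, edge99

Firefox targets such as firefox133 exist only in newer curl_cffi releases.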

When to use which: chrome120 is the default choice. If a site specifically blocks Chrome patterns, try safari17_0 — Safari's fingerprint is less commonly targeted in blocklists.
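The "try Chrome first, then Safari" advice can be sketched as a small fallback loop. This is an illustration under assumptions, not a curl_cffi feature: first_working_target and its fetch parameter are hypothetical names, and treating only 403 as "blocked" is a simplification.

```python
from typing import Callable, Iterable, Optional, Tuple


def _curl_cffi_fetch(url: str, target: str) -> int:
    """Fetch url while impersonating `target`; return the HTTP status code."""
    # Third-party dependency, imported lazily so the helper below can be
    # exercised with a stand-in fetch function.
    from curl_cffi import requests as curl_requests

    return curl_requests.get(url, impersonate=target, timeout=20).status_code


def first_working_target(
    url: str,
    targets: Iterable[str] = ("chrome120", "safari17_0"),
    fetch: Callable[[str, str], int] = _curl_cffi_fetch,
) -> Optional[Tuple[str, int]]:
    """Return (target, status) for the first target not answered with 403."""
    for target in targets:
        try:
            status = fetch(url, target)
        except Exception:
            # Unsupported target name or network error: move on to the next.
            continue
        if status != 403:
            return target, status
    return None
```

Once a target works for a site, reuse it in a Session rather than probing on every request.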

Advanced: httpx with TLS Customization

For more control, httpx with a custom TLS configuration can get you closer to browser fingerprints:

import asyncio
import ssl

import httpx

# Custom SSL context that moves Python's cipher order closer to Chrome's
def create_chrome_ssl_context() -> ssl.SSLContext:
    ctx = ssl.create_default_context()
    # Chrome-like preference order for TLS 1.2 suites (simplified).
    # TLS 1.3 suites (TLS_AES_128_GCM_SHA256 and friends) cannot be set or
    # reordered through set_ciphers(), so they are not listed here.
    ctx.set_ciphers("ECDH+AESGCM:ECDH+CHACHA20:DHE+AESGCM")
    return ctx

# This alone isn't sufficient for modern Cloudflare, but helps with basic fingerprinting
async def fetch(url: str) -> httpx.Response:
    async with httpx.AsyncClient(verify=create_chrome_ssl_context()) as client:
        return await client.get(url)

r = asyncio.run(fetch("https://cloudflare-protected-site.com"))

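In practice: for serious Cloudflare bypassing, curl_cffi is more reliable than manual SSL context configuration. The TLS fingerprint involves dozens of parameters (extension order, supported groups, signature algorithms, ALPN, and more), most of which Python's ssl module does not expose. curl_cffi covers them by building on BoringSSL, the TLS library Chrome uses.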

JA3 and JA3N Fingerprinting

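The specific fingerprint format most services use is called JA3 (developed at Salesforce). It builds a string from five ClientHello fields, separating the fields with commas and joining the decimal values inside each field with hyphens:

SSLVersion
Ciphers
Extensions
EllipticCurves
EllipticCurvePointFormats

The MD5 hash of that string is compared against a database. Python requests has been reported to produce 7dc465e28e1a62b68be994b34ae9eb24, a well-known scraper fingerprint, although the exact hash depends on the Python and OpenSSL versions in use.

JA3N is a normalized variant that sorts the extension list before hashing, because recent Chrome versions randomize extension order and would otherwise produce a different JA3 hash on almost every connection. Neither variant is easy to spoof without using the actual client's TLS stack.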

Check your current fingerprint:

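import requests

# Ask a fingerprint echo service what it saw from this client.
# Use the same library your scraper uses: shelling out to curl would
# report curl's fingerprint, not Python's.
data = requests.get("https://tls.peet.ws/api/all", timeout=20).json()

# Field names assumed from the service's response format at the time of writing
print(data.get("tls", {}).get("ja3_hash"))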

Or visit tls.peet.ws in a browser vs. from your scraper to see the difference.

What curl_cffi Doesn't Fix

TLS fingerprinting is one layer of bot detection. Even with a perfect TLS fingerprint, you'll still be detected if:

1. Your HTTP headers are wrong:

# Wrong: missing or out-of-order headers
headers = {"User-Agent": "Mozilla/5.0", "Accept": "*/*"}

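# Right: match the header order and values of the browser you impersonate
# (the User-Agent must match it too; curl_cffi's impersonate option sets one by default)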
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
}

2. Your behavior patterns are robotic:

Zero delay between requests
Perfect regularity (no human-like variation)
Missing referer headers on subsequent page loads
No cookie handling
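The pacing and referer points above can be sketched as a small crawl loop. This is an illustration under assumptions: crawl_in_order and jittered_delay are hypothetical helpers, fetch stands in for whatever session call you use (a session object would also take care of cookies), and the 0.5 to 2.0 second range is just an example.

```python
import random
import time
from typing import Callable, Dict, Iterable, List, Optional


def jittered_delay(
    low: float = 0.5, high: float = 2.0, rng: Optional[random.Random] = None
) -> float:
    """Pick a random pause so requests are not perfectly regular."""
    return (rng or random).uniform(low, high)


def crawl_in_order(
    urls: Iterable[str],
    fetch: Callable[[str, Dict[str, str]], str],
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Fetch urls one by one, pausing between them and chaining Referer."""
    bodies: List[str] = []
    previous: Optional[str] = None
    for url in urls:
        # After the first page, claim the previous page as the referer.
        headers = {"Referer": previous} if previous else {}
        if previous is not None:
            sleep(jittered_delay(rng=rng))
        bodies.append(fetch(url, headers))
        previous = url
    return bodies
```

With curl_cffi, fetch could be something like lambda url, headers: session.get(url, headers=headers).text.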

3. JavaScript challenges aren't solved:
Cloudflare's highest protection level (Under Attack Mode) serves a JavaScript challenge that must be solved before you see any content. This requires a real browser (for example, Playwright), not just TLS spoofing.
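Before escalating to a browser, it helps to recognize when a response is a challenge page rather than an ordinary block. The sketch below relies on the cf-mitigated: challenge response header that Cloudflare documents for challenge pages. The body markers are heuristic assumptions that may change, and the function name is made up for illustration.

```python
from typing import Mapping


def looks_like_cloudflare_challenge(
    status: int, headers: Mapping[str, str], body: str = ""
) -> bool:
    """Guess whether a response is a Cloudflare challenge page."""
    lowered = {key.lower(): value.lower() for key, value in headers.items()}

    # Documented signal: challenge responses carry this header.
    if lowered.get("cf-mitigated") == "challenge":
        return True

    # Heuristic fallback (assumption): typical status codes plus page markers.
    if status in (403, 503) and "cloudflare" in lowered.get("server", ""):
        text = body.lower()
        return "just a moment" in text or "challenge-platform" in text
    return False
```

If this returns True, retrying with different headers is unlikely to help; that is the point at which a browser-based approach becomes necessary.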

Practical Decision Tree

Is the site behind Cloudflare?
├── No → requests or httpx works fine
└── Yes → check protection level
    ├── Basic (static content loads) → curl_cffi with chrome impersonation
    ├── Anti-bot (5-second check) → curl_cffi + proper headers + cookie handling
    └── Under Attack Mode (JS challenge) → playwright with stealth mode
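The same tree can be written as a lookup, which is handy if you keep per-site settings in config. The Protection labels and the choose_tooling name are made up for this sketch; the recommendations are the ones from the tree above.

```python
from enum import Enum


class Protection(Enum):
    NONE = "none"                  # not behind Cloudflare
    BASIC = "basic"                # static content loads
    ANTI_BOT = "anti-bot"          # 5-second check
    UNDER_ATTACK = "under-attack"  # JavaScript challenge


def choose_tooling(level: Protection) -> str:
    """Map a protection level to the approach recommended in the tree."""
    return {
        Protection.NONE: "requests or httpx",
        Protection.BASIC: "curl_cffi with chrome impersonation",
        Protection.ANTI_BOT: "curl_cffi + proper headers + cookie handling",
        Protection.UNDER_ATTACK: "playwright with stealth mode",
    }[level]


if __name__ == "__main__":
    print(choose_tooling(Protection.ANTI_BOT))
```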

curl_cffi in Production

from curl_cffi import requests as curl_requests
import time, random

session = curl_requests.Session(impersonate="chrome120")

def scrape_cloudflare_site(url: str) -> str:
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "https://www.google.com/",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "cross-site",
    }

    # Small random delay
    time.sleep(random.uniform(0.5, 2.0))

    response = session.get(url, headers=headers, timeout=20)

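    # 403 is the usual block; 429 and 503 can also indicate rate limiting or a challenge
    if response.status_code in (403, 429, 503):
        raise RuntimeError(f"Blocked: {response.status_code}")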

    return response.text

# The session maintains cookies across requests automatically

Key notes:

Session() reuses the connection (faster, and maintains cookies)
Include Referer: google.com for first page load (natural navigation pattern)
Random delays are important — constant 0-latency requests are a detection signal
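In production you will usually want to retry a blocked request a few times with growing, randomized pauses instead of failing on the first block. A minimal sketch follows. The helper name fetch_with_backoff, the Blocked exception, the status list, and the delay values are all assumptions to tune per site.

```python
import random
import time
from typing import Callable, Optional, Tuple


class Blocked(RuntimeError):
    """Raised when every attempt was answered with a block status."""


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
    blocked_statuses: Tuple[int, ...] = (403, 429, 503),
) -> str:
    """Call fetch() until it returns a non-blocked status, backing off between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    rng = rng or random.Random()
    status = 0
    for attempt in range(attempts):
        status, text = fetch()
        if status not in blocked_statuses:
            return text
        if attempt < attempts - 1:
            # Exponential backoff plus jitter: 1-2s, 2-3s, 4-5s, ...
            sleep(base_delay * (2 ** attempt) + rng.uniform(0, base_delay))
    raise Blocked(f"still blocked after {attempts} attempts (last status {status})")
```

Here fetch is any zero-argument callable returning (status_code, text), for example a small wrapper around session.get(url, headers=headers, timeout=20).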

The Arms Race

Bot detection evolves continuously. What works in April 2026 may not work in October 2026. The current state, as rough estimates:

Pure Python requests: blocked on most Cloudflare-protected sites
curl_cffi with chrome impersonation: works on 70-80% of Cloudflare sites
playwright + stealth: works on ~90% but 5-10x slower
Residential proxies + playwright: works on 95%+ but costs $5-15/GB

The progression from free to expensive matches the anti-bot sophistication you're dealing with.

Production Anti-Bot Ready Scrapers

If you need scrapers that already handle Cloudflare, I maintain 35 Apify actors with built-in anti-bot handling — proxy rotation, browser fingerprinting, and retry logic are included.

Apify Scrapers Bundle — €29 — one-time download. All actors run on Apify's infrastructure (no server needed).

DEV Community