Modern anti-bot infrastructure doesn't just look at what you send - it looks at how you connect. TLS fingerprinting is one of the most effective and least-understood layers of bot detection. Here's how we pulled it apart.
The problem
At Clerix, we build real-time intelligence infrastructure. That means our systems need to maintain clean, stable connections to extract structured data at scale - deterministically, reliably, and without triggering detection layers that have nothing to do with the content of the request.
One day, connections that had been stable for months started failing silently. No HTTP error. No rate limit. Just... nothing. The connection would complete, a valid response would come back, and then subsequent requests from the same session would be met with garbage data or empty bodies.
We ruled out IP reputation. We ruled out cookie or session state. We eventually isolated it to the TLS handshake itself.
What is TLS fingerprinting?
When a client initiates a TLS connection, it sends a ClientHello message. This message contains:
- Cipher suites - the list of encryption algorithms the client supports, in order
- Extensions - features like SNI, ALPN, session tickets, supported groups
- Compression methods
- TLS version
The combination of these fields - especially the order - forms a near-unique fingerprint of the client library being used.
The two dominant fingerprinting standards you'll encounter are:
JA3 - MD5 hash of: TLS version, ciphers, extensions, elliptic curves, and elliptic curve point formats. Developed by Salesforce, widely used.
JA3N / JA3S - Variants: JA3N sorts extensions before hashing (to stay stable when clients randomize extension order), while JA3S fingerprints the server's ServerHello response instead.
Here's what a JA3 string looks like before hashing:
771,4866-4867-4865-49196-49200-159-52393-52392-52394-49195-49199-158-49188-49192-107-49187-49191-103-49162-49172-57-49161-49171-51-157-156-61-60-53-47-255,0-11-10-13172-16-22-23-49-13-43-45-51-21,29-23-24-25,0
Breaking it down:
- 771 = TLS 1.2
- 4866-4867-... = cipher suite list (decimal)
- 0-11-10-... = extension types
- 29-23-24-25 = supported elliptic curves
- 0 = EC point formats
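The hash itself is nothing exotic: JA3 is just the MD5 of that comma-separated string. A minimal sketch (field lists shortened from the example above for readability):

```python
import hashlib

# JA3 string layout: TLSVersion,Ciphers,Extensions,EllipticCurves,ECPointFormats
# (shortened field lists; a real ClientHello produces the longer string above)
ja3_string = "771,4866-4867-4865,0-11-10,29-23-24-25,0"
ja3_hash = hashlib.md5(ja3_string.encode()).hexdigest()
print(ja3_hash)  # 32-character hex digest
```

Change any field, or merely the order within a field, and the digest changes completely.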
Every major HTTP client has a recognizable JA3. curl has one. Python's requests library has one. Node's https module has one. They're catalogued and blacklisted.
Capturing the handshake
Our first step was passive observation - capturing what our outbound ClientHello actually looked like from the server's perspective.
We stood up a simple TLS inspection proxy using mitmproxy with a custom addon:
from mitmproxy import ctx, tls

class TLSInspector:
    # mitmproxy calls this hook with the raw ClientHello before the handshake completes
    def tls_clienthello(self, data: tls.ClientHelloData):
        hello = data.client_hello
        ctx.log.info(f"SNI: {hello.sni}")
        ctx.log.info(f"Ciphers: {hello.cipher_suites}")
        ctx.log.info(f"Extensions: {[ext_type for ext_type, _ in hello.extensions]}")

addons = [TLSInspector()]
We also used Wireshark with the tls display filter and exported the ClientHello bytes directly:
tls.handshake.type == 1
Then we fed the raw bytes into a local JA3 calculator to confirm what hash we were generating. The result matched what we expected: our hash was showing up in commercial threat intel feeds as "non-browser."
What makes a fingerprint detectable
The key insight is that TLS fingerprints aren't just about which features you claim to support - they're about the default behavior of the underlying library.
Python's ssl module, for example, hardcodes cipher suite order based on OpenSSL's compiled defaults. Even if you upgrade TLS versions, the order is deterministic and well-known.
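You can see those defaults directly from the standard library; the order of this list is the order a stock Python client advertises in its ClientHello:

```python
import ssl

# Enumerate the cipher suites Python/OpenSSL will offer by default,
# in the exact order they appear on the wire for this build.
context = ssl.create_default_context()
cipher_names = [c["name"] for c in context.get_ciphers()]
print(len(cipher_names), cipher_names[:3])
```

Every process linked against the same OpenSSL build prints the same list, which is precisely why the fingerprint is so stable.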
Specific tells we found:
- Cipher suite ordering - Python/OpenSSL prefers different suites than Chrome's BoringSSL
- Extension presence and order - Chrome includes a compress_certificate extension (type 27). Most HTTP libraries don't.
- GREASE values - Chrome injects random "garbage" values (GREASE - Generate Random Extensions And Sustain Extensibility) into the handshake to prevent protocol ossification. JA3 implementations strip these before hashing, and JA3N goes further and sorts the remaining extensions.
- Padding extension - Chrome pads its ClientHello to avoid certain sizes that trigger middlebox bugs. Pure library clients don't.
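GREASE values are easy to recognize mechanically: per RFC 8701 they follow the 0x?A?A pattern, where both bytes are equal and each low nibble is 0xA. A quick enumeration (the helper name is ours, not from any library):

```python
# RFC 8701 GREASE values: 0x0a0a, 0x1a1a, ..., 0xfafa (16 in total).
def is_grease(value: int) -> bool:
    # Low nibble of each byte must be 0xA, and both bytes must match.
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)

grease_values = [v for v in range(0x10000) if is_grease(v)]
print([hex(v) for v in grease_values])
```

Detection-side code uses exactly this kind of mask to strip GREASE before hashing, so injecting it only helps if the rest of the handshake is coherent too.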
Spoofing the fingerprint
Once we understood the signal, we had several options:
Option 1: Use a browser engine directly
Tools like Playwright/Puppeteer use actual browser TLS stacks. Effective, but heavyweight for infrastructure that needs throughput.
Option 2: Patch OpenSSL at runtime
Possible, fragile, not portable.
Option 3: Use a library that gives you control
This is the approach that scaled. Libraries like curl-impersonate compile curl against BoringSSL (Chrome's TLS library) and expose Chrome's exact cipher/extension order. There are Python wrappers (curl_cffi) that expose this at the session level:
from curl_cffi import requests as cffi_requests
session = cffi_requests.Session(impersonate="chrome120")
response = session.get("https://target.example.com")
Under the hood, this sends a ClientHello that is byte-for-byte identical to Chrome 120. Same cipher suites, same extensions, same GREASE, same padding.
We benchmarked this against our previous stack:
| Approach | JA3 hash | Detection rate | Throughput |
|---|---|---|---|
| requests + httpx | d9f4be3f... (Python/OpenSSL) | High | High |
| Playwright | Chrome-identical | Very low | Low |
| curl_cffi (Chrome120) | Chrome-identical | Very low | High |
curl_cffi won on both axes.
JA4 - the newer generation
JA3 has known weaknesses: it's easy to spoof once you know it's being checked. JA4 was introduced by FoxIO in 2023 as a more robust successor.
JA4 encodes:
- Protocol version
- SNI presence
- Number of ciphers
- Number of extensions
- First ALPN value
- Sorted cipher suites (order-independent)
- Sorted extensions (order-independent)
The sorting is the key difference - it makes JA4 resistant to order-shuffling spoofs. However, it also means the feature set itself becomes the signal. If you claim to support exactly the extensions that Chrome supports, you'll match Chrome's JA4 - regardless of order.
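To make that structure concrete, here's a simplified sketch of assembling a JA4-style fingerprint from those fields. This is illustrative only - the real JA4 spec has additional rules (GREASE stripping, ALPN character encoding, and more) - but it shows how the counts and the sorted, truncated hashes fit together:

```python
import hashlib

def ja4_sketch(tls_version: str, sni_present: bool, ciphers: list[int],
               extensions: list[int], alpn: str) -> str:
    # Human-readable prefix: transport, version, SNI flag, counts, ALPN.
    prefix = (
        "t"                              # transport: TCP
        + tls_version                    # e.g. "13" for TLS 1.3
        + ("d" if sni_present else "i")  # SNI to a domain vs. a bare IP
        + f"{len(ciphers):02d}"          # number of cipher suites
        + f"{len(extensions):02d}"       # number of extensions
        + alpn                           # first ALPN value, e.g. "h2"
    )
    # Order-independent parts: sort, then hash and truncate to 12 hex chars.
    cipher_part = hashlib.sha256(
        ",".join(f"{c:04x}" for c in sorted(ciphers)).encode()).hexdigest()[:12]
    ext_part = hashlib.sha256(
        ",".join(f"{e:04x}" for e in sorted(extensions)).encode()).hexdigest()[:12]
    return f"{prefix}_{cipher_part}_{ext_part}"

result = ja4_sketch("13", True, [0x1301, 0x1302, 0x1303], [0, 10, 11, 43, 51], "h2")
print(result)
```

Because the cipher and extension lists are sorted before hashing, shuffling their order on the wire changes nothing - only changing the *set* of features does.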
This is an arms race. Detection moves to behavioral signals when the cryptographic ones get spoofed.
What we learned
A few things that weren't obvious at the start:
TLS fingerprinting is almost always one layer in a stack. Defeating JA3 alone rarely wins. Real detection systems combine JA3/JA4 with HTTP/2 fingerprinting (stream weights, header order, SETTINGS frames), TCP fingerprinting (TTL, window size, options), and behavioral analysis. Solving the TLS layer just moves you to the next one.
Normalization matters more than individual values. The most suspicious thing isn't any single cipher - it's inconsistency. A ClientHello that claims to be Chrome but uses Python's HTTP/2 stack is incoherent and trivially flagged.
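A toy illustration of that kind of coherence check (the header orders and the ruleset here are illustrative, not real detection logic - though Chrome's pseudo-header order really is :method, :authority, :scheme, :path):

```python
# Chrome sends HTTP/2 pseudo-headers in this order; many libraries differ.
CHROME_PSEUDO_HEADER_ORDER = [":method", ":authority", ":scheme", ":path"]

def looks_coherent(claimed_client: str, observed_order: list[str]) -> bool:
    # Flag sessions whose TLS layer claims Chrome but whose HTTP/2
    # framing doesn't match Chrome's known behavior.
    if claimed_client == "chrome":
        return observed_order == CHROME_PSEUDO_HEADER_ORDER
    return True  # no rule for other clients in this toy example

# A "Chrome" TLS fingerprint paired with a different pseudo-header order:
print(looks_coherent("chrome", [":method", ":scheme", ":authority", ":path"]))
```

Real systems run dozens of these cross-layer consistency rules, which is why spoofing one layer in isolation fails.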
Impersonation fidelity has to go all the way down. curl_cffi handles the TLS layer. But you still need to match HTTP/2 pseudo-header order, SETTINGS frame parameters, and window update behavior. Fortunately, since curl_cffi uses curl's full stack (including nghttp2), the HTTP/2 framing matches Chrome's as well.
The bigger picture
TLS fingerprinting is a microcosm of a broader dynamic in the infrastructure space: the signal keeps moving lower in the stack. It started with IP reputation, moved to cookies and headers, then to TLS, now increasingly to TCP and even timing characteristics.
At Clerix, this is the layer we operate at. Understanding these mechanisms - not just working around them but properly modeling them - is what makes the difference between infrastructure that works in controlled conditions and infrastructure that holds up under production adversarial conditions.
If you're building anything in this space and want to go deeper, the most useful starting points are:
- tlsfingerprint.io - passive fingerprinting service
- tls.peet.ws - shows your live JA3/JA4/HTTP2 fingerprint
- The curl_cffi source on GitHub - the implementation is surprisingly readable
Clerix is real-time intelligence infrastructure for agentic systems. We extract structured data from the web at enterprise scale.