What actually happens during a TLS handshake and why does it sometimes fail even with a valid certificate?

#webdev #security #web #devops

The short answer: a TLS handshake is a multi-step cryptographic negotiation, and most production failures have nothing to do with the certificate being invalid. They happen inside the negotiation itself.
Here's what the handshake actually involves and where things go wrong on real infrastructure.

The core sequence:
The client sends a ClientHello listing the TLS versions it supports, its cipher suites, a random nonce and, critically, a Server Name Indication (SNI) extension that tells the server which hostname it is requesting. On shared infrastructure with multiple virtual hosts behind a reverse proxy, SNI is how the right certificate gets selected. If a load balancer doesn't pass SNI through correctly, the server typically falls back to a default certificate for the wrong hostname, and nothing on the server side flags it.
The server responds with its chosen cipher suite, its own nonce and its certificate chain (not just the leaf certificate, but the full chain including intermediates). Key exchange happens next: in TLS 1.2 via either an RSA-encrypted pre-master secret or an ephemeral Diffie-Hellman exchange, in TLS 1.3 via a redesigned ephemeral key exchange that starts in the ClientHello and saves an entire round trip. Finally, both sides exchange Finished messages that authenticate the full handshake transcript. Only then is the session considered established.
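To see what a given server actually negotiates, a small client is often enough. The following is a minimal sketch using Python's standard ssl module; the function names make_client_context and describe_handshake are made up for this example, and it assumes outbound network access to the host you pass in.

```python
import socket
import ssl


def make_client_context() -> ssl.SSLContext:
    """Default client context: verifies chain and hostname, TLS 1.2 or newer."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def describe_handshake(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Complete a handshake with host and report what was negotiated."""
    ctx = make_client_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        # server_hostname is sent as the SNI extension in the ClientHello
        # and is also the name the certificate is checked against.
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            cipher_name, _protocol, secret_bits = tls.cipher()
            return {
                "tls_version": tls.version(),
                "cipher_suite": cipher_name,
                "secret_bits": secret_bits,
                "leaf_subject": tls.getpeercert().get("subject"),
            }
```

Calling describe_handshake("yourdomain.com") from a host behind the same load balancer as your services is a quick way to check whether the certificate selected for that SNI value is the one you expect.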

Why valid certificates still cause failures:
The most common production issue I've encountered isn't certificate validity; it's session resumption. Full TLS handshakes involve expensive asymmetric operations. To avoid repeating them on every reconnect, TLS uses session tickets: the server encrypts the session state with a symmetric ticket key that only it knows and hands the result to the client. On reconnect, the client presents the ticket and both sides skip most of the asymmetric work.
This breaks silently when multiple server instances don't share the same ticket key. One instance issues a ticket, the client reconnects to a different instance, that instance can't decrypt the ticket and it falls back to a full handshake. No error is thrown. You'll only notice this as unexpectedly elevated TLS CPU load across your fleet.
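The key dependency is easy to see in a toy model. The sketch below is not a real TLS ticket format: the Instance class is hypothetical, and it only authenticates the ticket with HMAC rather than encrypting it, which is enough to show why an instance holding a different key has to fall back.

```python
import hashlib
import hmac
import json
import os
from typing import Optional


class Instance:
    """Toy stand-in for one TLS-terminating server (not a real TLS stack)."""

    def __init__(self, ticket_key: bytes):
        self.ticket_key = ticket_key

    def issue_ticket(self, session_state: dict) -> bytes:
        # Real tickets are encrypted as well as authenticated; this model
        # only authenticates, which is enough to show the key dependency.
        body = json.dumps(session_state, sort_keys=True).encode()
        tag = hmac.new(self.ticket_key, body, hashlib.sha256).digest()
        return tag + body

    def resume(self, ticket: bytes) -> Optional[dict]:
        """Return the session state, or None to signal a full handshake."""
        tag, body = ticket[:32], ticket[32:]
        expected = hmac.new(self.ticket_key, body, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            return None  # silent fallback: no error reaches the client
        return json.loads(body)


shared_key = os.urandom(32)
a = Instance(shared_key)
b_shared = Instance(shared_key)
b_isolated = Instance(os.urandom(32))

ticket = a.issue_ticket({"cipher": "TLS_AES_128_GCM_SHA256"})
print(b_shared.resume(ticket))    # resumes: the session state is returned
print(b_isolated.resume(ticket))  # None: falls back to a full handshake
```

In a real fleet the equivalent fix is distributing the same ticket key to every instance and rotating it on a schedule, rather than letting each process generate its own at startup.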

The 0-RTT problem in TLS 1.3:
TLS 1.3 is faster and more secure by default: it removed much of the legacy negotiation surface that made downgrade attacks possible. But its 0-RTT resumption feature, which lets clients send data before the handshake completes, introduces a replay risk: an attacker who captures the early data can send it again. For idempotent GET requests this is typically acceptable; for POST endpoints with side effects, it's a real threat that most teams haven't evaluated.
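One way to evaluate it is to refuse replayable requests at the application layer. The sketch below assumes your TLS-terminating proxy marks requests that arrived as early data with the Early-Data: 1 header described in RFC 8470; the function name reject_if_replayable and the choice of which methods count as safe are assumptions for this example, not a standard API.

```python
from typing import Mapping, Optional

# Methods treated as safe to replay here. This is an assumption: a GET
# handler with side effects in your application would need to be excluded.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}
TOO_EARLY = 425  # HTTP status code defined in RFC 8470


def reject_if_replayable(method: str, headers: Mapping[str, str]) -> Optional[int]:
    """Return 425 if the request should be retried after the handshake, else None."""
    sent_early = any(
        name.lower() == "early-data" and value.strip() == "1"
        for name, value in headers.items()
    )
    if sent_early and method.upper() not in SAFE_METHODS:
        return TOO_EARLY
    return None


print(reject_if_replayable("POST", {"Early-Data": "1"}))  # 425
print(reject_if_replayable("GET", {"Early-Data": "1"}))   # None
```

A client that receives 425 is expected to retry once the handshake has completed, so the request is delayed rather than lost.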

Certificate chain completeness:
Browsers are forgiving: many cache intermediates they have seen before, and some fetch missing ones automatically. Backend clients such as curl, OpenSSL-based tools and internal API clients generally do not. A chain missing its intermediate will fail in backend calls with a generic verification error that is easy to mistake for an expired certificate. This becomes acutely painful after renewals where the issuing intermediate changes.
Always validate your chain with openssl s_client -connect yourdomain.com:443 -servername yourdomain.com -showcerts and check the Verify return code line in the output. It behaves the way your backend services do, not the way browsers do.
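The same check can be scripted so that a missing intermediate and an expired certificate are reported differently. This is a rough sketch using Python's ssl module, which, like other OpenSSL-based clients, does not fetch missing intermediates. The numeric codes are OpenSSL's X509 verification error codes; the labels and the function names classify_verify_code and check_chain are this example's own.

```python
import socket
import ssl

# OpenSSL X509_V_ERR_* codes; the labels are this sketch's own wording.
VERIFY_CODE_LABELS = {
    10: "certificate has expired",
    18: "self-signed certificate",
    19: "self-signed certificate in chain",
    20: "issuer not found locally (often a missing intermediate)",
    21: "unable to verify leaf signature (often a missing intermediate)",
}


def classify_verify_code(code: int) -> str:
    """Map an OpenSSL verification error code to a readable label."""
    return VERIFY_CODE_LABELS.get(code, f"other verification failure (code {code})")


def check_chain(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Handshake with strict verification and report why it failed, if it did."""
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host):
                return "chain verified"
    except ssl.SSLCertVerificationError as exc:
        return classify_verify_code(exc.verify_code)
```

Running check_chain against each endpoint right after a renewal catches a changed intermediate before your internal callers do.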

Takeaway:
Treat TLS as a distributed protocol with failure modes at the negotiation, resumption and chain-validation layers — not as a binary installed/not-installed state. The certificate is usually the last thing that's actually wrong.

Wrote up the full breakdown here if anyone wants the deeper dive: What Actually Happens During a TLS Handshake
