- Book: System Design Pocket Guide: Fundamentals — Core Building Blocks for Scalable Systems
- Me: xgabriel.com | GitHub
Half the load-balancer advice on the internet is from 2018. Back then "L7 because HTTP" was a fine default. In 2026, with gRPC streaming, HTTP/3 over QUIC, and edge-first stacks doing TLS termination two hops before your VPC, that default ships outages.
If you've inherited a stack from someone who picked their LB before any of this existed, you're probably running L7 where L4 would be cheaper and L4 where you needed L7 a year ago. The fix isn't to pick a side. It's to walk the request path one hop at a time and ask the right question at each hop.
The 2018 default doesn't survive 2026
The old story was: L4 for raw speed, L7 for HTTP smarts. Both halves are now wrong in important ways.
L4 isn't automatically faster. A modern Envoy or ALB on warm connections lands inside 1–2 ms of an NLB for HTTP/1.1, and HTTP/2 multiplexing makes L7 look better than L4 on per-request cost. The "L4 is fast" intuition came from a world where L7 boxes parsed every header on every TCP connection. They don't do that anymore.
L7 isn't automatically smart either. The smarts moved to the edge. Cloudflare and Fastly already routed by path, ran your WAF, fingerprinted bots, and re-encrypted traffic before it hit your VPC. If you also pay an L7 inside your VPC to do path routing, you're doing the same parse twice: once at the edge for security and global routing, once at origin for per-service routing.
And then gRPC, HTTP/3, and mTLS broke the rest of the defaults. gRPC streaming needs HTTP/2 end-to-end; one hop that downgrades to HTTP/1.1 and you lose streaming. HTTP/3 is QUIC over UDP, so your L4 has to actually forward UDP, not just TCP. And where mTLS terminates decides who reads the cert; the rest of the design follows from that one choice.
The five questions that actually decide
Skip the LB feature comparison. These five answers pick your stack:
1. Where does TLS terminate? Edge, origin LB, or service? If a regulator says "mTLS pass-through to the service," your origin LB does L4 and nothing else. If your edge handles TLS and you re-encrypt to origin, you can run L7 at origin freely.
2. Is any of your traffic gRPC streaming or bidirectional? Bidi gRPC needs HTTP/2 end-to-end. Your L7 must speak HTTP/2 upstream, not downgrade. ALB does. NGINX-OSS does only via the grpc_pass directive; plain proxy_pass will silently break long-lived streams.
3. Do you serve HTTP/3 at the edge? QUIC is UDP/443. Your edge needs to terminate it (Cloudflare and Fastly do). If you accept HTTP/3 inside your VPC too, your L4 has to forward UDP. Most NLBs and ALBs don't, at least not for the H3 handshake the way you want.
4. Do you need sticky sessions? Cookie-based stickiness is L7-only. IP-hash stickiness at L4 looks fine until your client base sits behind a corporate NAT and 4000 users hash to the same backend.
5. Who owns the WAF? If it's the edge (Cloudflare WAF, AWS WAF on CloudFront), origin doesn't need L7 for security; it needs L7 for routing. If it's origin, you're stuck with L7 there and need to decide between ALB+WAF or Envoy+a WAF filter chain.
You answer those five. The stack follows.
L4 in 2026: smaller footprint than you think
L4 still wins, but on a narrower set of jobs than the old wisdom suggests:
- Raw TCP and UDP that isn't HTTP. Postgres replicas behind an NLB. Redis Cluster. MQTT brokers. Game-server UDP fan-out. All L4.
- mTLS pass-through to the service. A fintech that needs cert validation inside the service for audit reasons can't terminate at the edge or LB. NLB or Envoy in TCP-proxy mode forwards the encrypted bytes.
- QUIC fan-out if you ever do origin-side HTTP/3, which most teams don't but some CDNs themselves do.
- Very high connection counts with cheap per-conn cost. Game lobbies, IoT, financial market data feeds.
Outside those, L4 is the wrong tool. The advice "use L4 for performance" doesn't apply when your bottleneck is anywhere except per-packet cost, and it almost never is.
An AWS NLB target group for a Redis Cluster looks like this. Note Protocol: TCP, and preserve_client_ip.enabled set to 'true' so the backend still sees the client IP through the L4 hop. proxy_protocol_v2.enabled stays 'false' here: stock Redis doesn't parse the PROXY protocol header, so turn it on only for backends that do:

    TargetGroup:
      Type: AWS::ElasticLoadBalancingV2::TargetGroup
      Properties:
        Name: redis-cluster-tg
        Port: 6379
        Protocol: TCP
        TargetType: ip
        VpcId: vpc-0abc123
        HealthCheckProtocol: TCP
        HealthCheckIntervalSeconds: 10
        HealthyThresholdCount: 2
        UnhealthyThresholdCount: 2
        TargetGroupAttributes:
          # Only enable for backends that parse PROXY protocol v2; Redis does not.
          - Key: proxy_protocol_v2.enabled
            Value: 'false'
          # Keep the client IP as the source address the backend sees.
          - Key: preserve_client_ip.enabled
            Value: 'true'
          # Drain for 30 seconds instead of the 300-second default.
          - Key: deregistration_delay.timeout_seconds
            Value: '30'
The 30-second deregistration is the gotcha. The default is 300 seconds, which means a rolling Redis deploy keeps draining nodes for five minutes and your MOVED redirect storms hang around. Drop it to 30s for state-light services and re-check for state-heavy ones.
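To put rough numbers on that, here is a back-of-the-envelope sketch in JavaScript. It assumes a rolling deploy that replaces nodes in fixed-size batches, one batch at a time, with each batch waiting out the full deregistration delay. The function name and the six-node cluster are made up for illustration, and real deploys may drain faster if connections close early.

```javascript
// Rough upper bound on time spent draining during a rolling deploy.
// Assumption: batches are replaced sequentially and each one waits the
// full deregistration delay before the next batch starts.
function drainSeconds(nodeCount, batchSize, deregistrationDelaySec) {
  const batches = Math.ceil(nodeCount / batchSize);
  return batches * deregistrationDelaySec;
}

// Hypothetical 6-node cluster, replaced one node at a time.
console.log(drainSeconds(6, 1, 300)); // 1800 (seconds, with the default)
console.log(drainSeconds(6, 1, 30));  // 180 (seconds, with the tuned value)
```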
L7 in 2026: where the real work happens
L7 owns everything that needs the request, not just the connection:
- Path and header-based routing to different services from one entry.
- gRPC streaming with proper HTTP/2 upstream.
- Retries with idempotency awareness. Only retry GET/PUT or methods marked safe, never blindly retry POST.
- Outlier detection and circuit breaking at the per-upstream level.
- Request hedging for low-latency reads.
- Header rewriting, JWT validation, rate limiting per route.
The Envoy snippet below routes /api/v1/orders to one cluster, /grpc.OrderService/* to a gRPC cluster, and applies a per-route retry policy that respects idempotency. Notice the http2_protocol_options block on the upstream cluster. Without it your gRPC streams will downgrade and silently break:
    static_resources:
      listeners:
        - name: main
          address:
            socket_address: { address: 0.0.0.0, port_value: 8080 }
          filter_chains:
            - filters:
                - name: envoy.filters.network.http_connection_manager
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                    stat_prefix: ingress
                    route_config:
                      virtual_hosts:
                        - name: app
                          domains: ["*"]
                          routes:
                            # gRPC: no overall timeout, long idle timeout for streams
                            - match: { prefix: "/grpc.OrderService/" }
                              route:
                                cluster: orders_grpc
                                timeout: 0s
                                idle_timeout: 600s
                            # REST: retry only idempotent methods
                            - match: { prefix: "/api/v1/orders" }
                              route:
                                cluster: orders_rest
                                retry_policy:
                                  retry_on: "5xx,reset,connect-failure"
                                  num_retries: 2
                                  per_try_timeout: 2s
                                  retriable_request_headers:
                                    - name: ":method"
                                      string_match:
                                        safe_regex:
                                          regex: "GET|PUT|DELETE"
                    http_filters:
                      - name: envoy.filters.http.router
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
      # Endpoint definitions (load_assignment) are not shown for either cluster.
      clusters:
        - name: orders_grpc
          type: STRICT_DNS
          connect_timeout: 1s
          lb_policy: ROUND_ROBIN
          typed_extension_protocol_options:
            envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
              "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
              explicit_http_config:
                http2_protocol_options: {} # required for gRPC upstream
          outlier_detection:
            consecutive_5xx: 5
            interval: 10s
            base_ejection_time: 30s
            max_ejection_percent: 50
        - name: orders_rest
          type: STRICT_DNS
          connect_timeout: 1s
          lb_policy: LEAST_REQUEST
The retriable_request_headers block is the part most teams skip and then regret. Without it, Envoy will happily retry a POST /api/v1/orders that timed out, and now the customer has two orders.
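The same rule applies if you retry in application code instead of at the proxy. Below is a minimal JavaScript sketch, not Envoy's implementation: sendWithRetries and the send callback are hypothetical names, and the method allowlist simply mirrors the regex in the config above. It retries on a thrown error or a 5xx status, and only for allowlisted methods.

```javascript
// Methods treated as retriable here, mirroring the regex in the Envoy config.
const RETRIABLE_METHODS = new Set(["GET", "PUT", "DELETE"]);

// Calls send(request) and retries on a thrown error or a 5xx status, but only
// when the method is in RETRIABLE_METHODS. `send` is a hypothetical async
// function that resolves to an object with a numeric `status`.
async function sendWithRetries(send, request, maxRetries = 2) {
  const canRetry = RETRIABLE_METHODS.has(request.method.toUpperCase());
  const attempts = canRetry ? maxRetries + 1 : 1;
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      const response = await send(request);
      // Return on success, or when there are no attempts left.
      if (response.status < 500 || i === attempts - 1) return response;
    } catch (err) {
      lastError = err;
    }
  }
  // Only reached when the final attempt threw.
  throw lastError;
}
```

A POST goes through exactly once regardless of the outcome, so a timed-out order is surfaced to the caller rather than silently duplicated.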
For an L7 routing snippet at the edge, here's a Cloudflare Worker that splits traffic by path and adds an mTLS-aware header before forwarding to origin:
    // cloudflare worker: edge L7 routing
    export default {
      async fetch(request, env) {
        const url = new URL(request.url);
        // Copy the incoming headers once. Spreading a Headers object
        // ({ ...request.headers }) yields an empty object and drops them all.
        const headers = new Headers(request.headers);
        // gRPC streaming requests skip the worker logic
        // (they shouldn't be hitting this hostname anyway)
        if (request.headers.get("content-type")?.startsWith("application/grpc")) {
          return new Response("use grpc.example.com", { status: 421 });
        }
        // admin paths go to a different origin, with stricter WAF rules
        if (url.pathname.startsWith("/admin")) {
          headers.set("x-edge-route", "admin");
          return fetch(`https://admin-origin.example.com${url.pathname}${url.search}`, {
            method: request.method,
            headers,
            body: request.body,
          });
        }
        // everything else: re-encrypt to ALB, add client cert header
        // Drop any client-supplied copy so origin only trusts the edge's value.
        headers.delete("x-client-cert-issuer");
        const cert = request.cf?.tlsClientAuth?.certIssuerDN;
        if (cert) headers.set("x-client-cert-issuer", cert);
        headers.set("x-edge-route", "default");
        return fetch(`https://origin.example.com${url.pathname}${url.search}`, {
          method: request.method,
          headers,
          body: request.body,
        });
      },
    };
And for the team still on NGINX, the upstream block that actually works for HTTP/2-upstream gRPC plus health checks. NGINX-OSS doesn't ship active health checks (commercial only), so passive ejection via max_fails is what you get:
    upstream orders_grpc {
        # max_fails before passive ejection, 30s cool-off
        server 10.0.1.10:9000 max_fails=3 fail_timeout=30s;
        server 10.0.1.11:9000 max_fails=3 fail_timeout=30s;
        server 10.0.1.12:9000 max_fails=3 fail_timeout=30s backup;
        keepalive 32;
    }

    server {
        # on nginx 1.25.1+, prefer "listen 443 ssl;" plus "http2 on;"
        listen 443 ssl http2;
        server_name api.example.com;
        # ssl_certificate / ssl_certificate_key not shown

        location /grpc.OrderService/ {
            # must be grpc_pass, NOT proxy_pass, for streams to work
            grpc_pass grpcs://orders_grpc;
            grpc_read_timeout 600s;  # bidi streams need this
            grpc_send_timeout 600s;
            # the /grpc_fallback location is not shown here
            error_page 502 = /grpc_fallback;
        }
    }
grpc_pass vs proxy_pass is the bug that takes a team two days to find. proxy_pass will accept the connection and even return the first frame. The streaming side just stops working at random later.
Envoy, ALB, Cloudflare, NGINX: scored honestly
The four tools in the typical 2026 stack, scored against the five questions above. One-line caveats only:
| Tool | mTLS pass-through | gRPC streaming (HTTP/2 upstream) | HTTP/3 (UDP/QUIC) | Sticky sessions | WAF | Best at |
|---|---|---|---|---|---|---|
| Envoy | Yes (TCP proxy) | Yes (full bidi) | Yes (since 1.20) | Yes (cookie + ring hash) | Filter chain or ext | Service mesh, polyglot fleets |
| ALB | No (terminates) | Yes (HTTP/2 upstream) | No (NLB needed) | Yes (cookie) | AWS WAF integration | AWS-native HTTP shop |
| Cloudflare | Pass-through opt | Yes (proxied or pass-through) | Yes (QUIC native) | No native (Workers KV) | Cloudflare WAF | Edge, global, DDoS, bot mitigation |
| NGINX-OSS | Yes (stream{}) | Partial (grpc_pass only) | Yes (since 1.25) | Yes (cookie or IP-hash) | ModSecurity (extra) | On-prem, single-region origin |
The two caveats hiding inside that table: ALB doesn't do mTLS pass-through (it terminates and you get the client cert in a header), and NGINX-OSS lacks active health checks. Both are fine when you know about them, painful when you don't.
Three stacks that work in 2026
Stack A: Cloudflare → ALB → ECS, for HTTP-first SaaS
Edge does TLS, WAF, bot mitigation, and global routing. ALB does path-based routing to ECS services. ECS tasks speak HTTP/1.1 or HTTP/2 internally. Nothing fancy, ships fast, scales to mid-billions of requests/month before you outgrow it.
Sticky sessions live on the ALB target group with cookie-based stickiness. mTLS, if any, terminates at the edge and the client cert thumbprint is forwarded as Cf-Client-Cert-Sig. The internal hop between ALB and ECS uses HTTP, secured by VPC + security groups, not TLS. That's a tradeoff most teams make and most regulators accept.
Stack B: Cloudflare → Envoy mesh, for polyglot microservices
Edge still terminates TLS and does WAF. From the edge, traffic enters an Envoy ingress gateway, then flows through a service mesh (Istio, Linkerd, or hand-rolled Envoy sidecars) for east-west traffic.
This is the stack you want when you have 30+ services in 4 languages and need per-service retries, outlier detection, and observability without code changes. The mesh sidecars do the L7 work everywhere. The ALB-equivalent here is the Envoy ingress, configured almost identically to the snippet above.
The cost: every pod has a sidecar, every hop has Envoy overhead, your control plane is a real piece of infrastructure to operate. If you don't have 20+ services, this is overkill and a regular ALB beats it.
Stack C: Anycast L4 → Envoy with mTLS pass-through, for fintech/regulated
When a regulator says "the service validates the client certificate" (not the edge, not the LB, the service), you can't terminate TLS anywhere upstream. Anycast L4 (an NLB with cross-region anycast, or a hyperscaler equivalent) sits at the edge in TCP-proxy mode. Behind it, Envoy in tcp_proxy filter mode forwards the encrypted bytes to the destination service.
The Envoy here isn't doing L7 work. It's doing connection routing and observability. The actual TLS handshake, including client cert validation, happens in the service process. This is the only stack in this list where origin Envoy is L4, not because L4 is faster, but because the regulator demanded it.
The decision tree
Walk it top-to-bottom, take the first match:
- Non-HTTP traffic? (TCP, UDP, MQTT, raw Postgres) → L4 at the edge AND L4 at origin.
- mTLS must terminate inside the service? → L4 pass-through end-to-end. Envoy in tcp_proxy mode at origin.
- gRPC streaming present? → L7 at origin, HTTP/2 upstream (Envoy or ALB, not NGINX-OSS with proxy_pass).
- HTTP/3 at the edge? → CDN with QUIC support (Cloudflare, Fastly). Origin can be HTTP/2 over TCP.
- Pure HTTP/1.1 or HTTP/2 REST? → L7 at edge for WAF and routing, L7 at origin for service routing. ALB or Envoy.
Most teams sit on rule 5. Most teams overbuild because they picked their LB before answering the five questions above.
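The first-match walk can be written down as a small function. This is a sketch only: the field names on the traffic object are hypothetical, the recommendation strings paraphrase the rules above, and it assumes anything that matches none of the first four rules is plain HTTP REST.

```javascript
// First-match walk of the decision tree. Field names are hypothetical.
function pickStack(traffic) {
  if (traffic.nonHttp) {
    return { rule: 1, recommendation: "L4 at the edge and L4 at origin" };
  }
  if (traffic.mtlsTerminatesInService) {
    return { rule: 2, recommendation: "L4 pass-through end-to-end, Envoy tcp_proxy at origin" };
  }
  if (traffic.grpcStreaming) {
    return { rule: 3, recommendation: "L7 at origin with HTTP/2 upstream (Envoy or ALB)" };
  }
  if (traffic.http3AtEdge) {
    return { rule: 4, recommendation: "CDN with QUIC support, HTTP/2 over TCP to origin" };
  }
  // Assumed default: pure HTTP/1.1 or HTTP/2 REST.
  return { rule: 5, recommendation: "L7 at edge for WAF and routing, L7 at origin (ALB or Envoy)" };
}

// A service with gRPC streaming and HTTP/3 at the edge stops at rule 3,
// because the walk takes the first match.
console.log(pickStack({ grpcStreaming: true, http3AtEdge: true }).rule); // 3
```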
When you get it wrong
Three failure modes that show up over and over:
The 60-second gRPC death. Team puts an ALB in front of a gRPC service. Bidi streams open. After 60 seconds without traffic, every long-running stream gets reset. The team chases it for two days. Cause: the ALB default idle timeout is 60 seconds. Streaming RPCs that legitimately sit idle for longer than that (a chat that pauses, an event feed with quiet periods) get nuked. Fix: raise the load balancer attribute idle_timeout.timeout_seconds (it lives on the ALB itself, not on the target group; 1800 is common) and emit application-level keepalive pings on the stream.
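An application-level keepalive can be as small as a timer that writes a no-op message on the stream. The sketch below is illustrative, not tied to any gRPC library: stream.write and the ping payload are hypothetical stand-ins, and the peer is assumed to ignore ping messages.

```javascript
// Writes a ping on the stream every intervalMs, regardless of other traffic,
// and returns a function that stops the timer.
// `stream.write` and the ping payload are hypothetical stand-ins.
function startKeepalive(stream, intervalMs, ping = { type: "ping" }) {
  const timer = setInterval(() => stream.write(ping), intervalMs);
  return () => clearInterval(timer);
}

// Usage sketch: ping every 20 s, well inside a 60 s idle timeout.
// const stop = startKeepalive(call, 20_000);
// call.on("end", stop);
```

Pick an interval comfortably shorter than the smallest idle timeout on the path, not just the ALB's.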
The double-TLS spike. Cloudflare terminates TLS, re-encrypts to origin. Origin Envoy terminates TLS again, re-encrypts to upstream service. Service terminates TLS a third time. Each hop adds ~5 ms of handshake on cold connections and tens of microseconds per packet on warm ones. p99 starts climbing during deploys when connections are cold. Fix: connection pooling with keepalive on every hop, or better, fewer TLS terminations. Usually you can get down to two: the edge handles public TLS once, and a single mTLS hop covers origin Envoy to service.
The sticky-session bug behind corporate NAT. Team enables IP-hash stickiness on an NGINX L4 because their app needs session affinity. Works fine in QA. Production: 4000 employees of one enterprise customer sit behind a single NAT IP, they all hash to one backend, that backend melts. Fix: cookie-based stickiness at L7. If you can't, switch to a real session store and treat the LB as stateless.
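The skew is easy to reproduce with a toy model. The sketch below hashes an affinity key onto four backends, once by client IP and once by a per-session cookie. It is a simplification: the FNV-1a hash, the backend count, and the cookie values are all assumptions for illustration, and real load balancers use their own hashing and cookie schemes.

```javascript
// Tiny deterministic string hash (FNV-1a, 32-bit), for illustration only.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Counts how many clients land on each backend for a given affinity key.
function spread(clients, keyOf, backendCount) {
  const counts = new Array(backendCount).fill(0);
  for (const client of clients) {
    counts[fnv1a(keyOf(client)) % backendCount] += 1;
  }
  return counts;
}

// 4000 users behind one NAT address (203.0.113.7 is a documentation IP).
const clients = Array.from({ length: 4000 }, (_, i) => ({
  ip: "203.0.113.7",
  sessionCookie: `session-${i}`,
}));

const byIp = spread(clients, (c) => c.ip, 4);
const byCookie = spread(clients, (c) => c.sessionCookie, 4);
console.log("ip-hash:", byIp);         // one backend receives all 4000
console.log("cookie-hash:", byCookie); // load is split across the backends
```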
The pattern across all three: a default that was correct for a different era. Always check the defaults before assuming the LB is fine.
If this was useful
The five-question framework above is one chunk of a larger pattern in the System Design Pocket Guide: Fundamentals. The book walks through the rest of the request path (edge, ingress, mesh, service-to-service) with the same "ask the right question per hop" lens, plus the failure modes that show up when you skip a question. If you found this useful, the chapter on traffic ingress and the one on service discovery extend exactly what's here.
Which hop is the messiest in your stack right now: edge, origin LB, or the service-to-service mesh? Drop the shape in the comments. Curious which of the three failure modes above is the one biting people most in 2026.
