Gabriel Anhaia

Posted on May 24

Request Hedging: The Tail-at-Scale Technique Most Teams Skip

#systemdesign #performance #distributedsystems #reliability


Your p99 is 2 seconds. Your p50 is 80 milliseconds. That gap is what wakes you up. It's also what your SLO actually measures, and the tricks you've been reaching for (bigger pods, more replicas, a fatter cache) barely move it.

Jeff Dean and Luiz André Barroso wrote about this in 2013 in a paper called The Tail at Scale (CACM, Vol. 56, No. 2). The headline measurement, from a large fan-out service at Google: a single leaf request had a p99 of 10ms, but the p99 for all leaves to finish was 140ms. Tail latency compounds with fan-out. You can't outrun that by tuning the average.

One of the techniques they shipped at Google to fix it (request hedging) is dead simple, well understood, and somehow still missing from most production codebases in 2026. So let's put it back on the table.

The tail-at-scale problem still applies

The math hasn't changed since 2013. If a single backend has a 1% chance of being slow on any given request, a request that fans out to 100 backends has a 1 - (0.99)^100 ≈ 63% chance of hitting at least one slow backend. Your end-to-end latency is the max across all backends, not the average.

The same effect shows up in non-fanout systems too. A user-facing request that depends on three serial calls (auth, profile, recommendations) pays for every hop's tail: with a 1% slow rate per hop, roughly 3% of requests hit at least one slow hop. Each hop has its own bad day, and the bad days line up more often than your intuition expects.

Causes haven't changed either: GC pauses, hot keys, kernel scheduling glitches, contended locks, noisy neighbors on shared hardware, NIC queue backpressure. You can fix individual sources, but you'll never zero them out. The tail is structural.

That's why Dean and Barroso framed the solution as tail-tolerant techniques, by analogy with fault tolerance, rather than latency-reduction tricks. Hedging is the one you should ship first.

What hedging actually is

The pattern: send the request to backend A. If you haven't gotten a response by the time you'd expect a slow-but-not-broken reply (say, your p95), send the same request to backend B. Take whichever response comes back first. Cancel the loser.

That's it. No retries on failure. No queue. You're betting that the slow tail of one backend isn't correlated with the slow tail of another, which is usually true when the slowness comes from GC, kernel, or local hot-spot causes.

The Dean/Barroso paper reports a Google benchmark that reads 1,000 keys from a BigTable table spread across 100 servers: sending a hedge after a 10ms delay cut the 99.9th-percentile latency for retrieving all the values from 1,800ms to 74ms, while sending only about 2% more requests. Those numbers depend on your workload, but the shape holds.

The cost of hedging

You're sending extra requests, so you're doing extra work. The question is how much.

If you hedge at the p95, then by definition only about 5% of requests trigger a second call. Of those, the second call usually completes faster than the first would have, but you still pay for one extra request 5% of the time. That puts the overhead at roughly 5% extra requests, and keeping the extra work that low assumes you cancel the loser fast enough that it stops working as soon as the winner returns.

If you hedge at the p50, half of your requests fire a second call, so backend traffic goes up by roughly 50%. Don't do that.

If you hedge at the p99 you're too late. The user has already noticed.

The threshold matters. Measure it from production data, not from a load test.

A working Go implementation

The minimum viable Go version uses two goroutines, a buffered channel, and context.WithCancel. Real, no foo/bar:

package hedged

import (
    "context"
    "errors"
    "time"
)

type Result struct {
    Body []byte
    Err  error
}

// Fetch issues the primary request, then fires a hedge after p95 latency.
// The first non-error response wins. The loser is cancelled.
func Fetch(parent context.Context, p95 time.Duration, call func(context.Context) ([]byte, error)) ([]byte, error) {
    ctx, cancel := context.WithCancel(parent)
    defer cancel() // always cancel the loser, plus tear down on parent cancel

    results := make(chan Result, 2) // buffered so the loser's send never blocks

    go func() {
        body, err := call(ctx)
        results <- Result{body, err}
    }()

    hedgeTimer := time.NewTimer(p95)
    defer hedgeTimer.Stop()

    select {
    case r := <-results:
        // primary returned before the hedge fired
        return r.Body, r.Err
    case <-hedgeTimer.C:
        // primary is slow; fire the hedge
    case <-ctx.Done():
        return nil, ctx.Err()
    }

    go func() {
        body, err := call(ctx)
        results <- Result{body, err}
    }()

    // Take the first successful response. If one attempt fails, keep waiting
    // for the other; only report an error once both attempts have failed.
    var errs []error
    for i := 0; i < 2; i++ {
        select {
        case r := <-results:
            if r.Err == nil {
                return r.Body, nil
            }
            errs = append(errs, r.Err)
        case <-parent.Done():
            return nil, parent.Err()
        }
    }
    return nil, errors.Join(errs...) // requires Go 1.20+
}

The defer cancel() line does the heavy lifting. The moment the function returns (winner found, error raised, or parent cancelled), the still-in-flight loser sees its context die and should abort whatever it was doing. If call is a net/http request built with http.NewRequestWithContext, the client abandons the in-flight request as soon as the context is cancelled, though work the backend has already started may still run to completion. That's how you keep the overhead at single-digit percent instead of double.

One subtle bit: the results channel is buffered to 2 so the loser's send doesn't deadlock when nobody's reading. Skip that and you leak goroutines.

A working Python implementation

Same pattern in asyncio. The trick is asyncio.wait with FIRST_COMPLETED, then explicit cancellation of the pending task:

import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def hedged_call(
    p95: float,
    call: Callable[[], Awaitable[T]],
) -> T:
    primary = asyncio.create_task(call())

    # wait up to p95 for the primary
    done, _pending = await asyncio.wait({primary}, timeout=p95)
    if primary in done:
        return primary.result()

    # primary is slow; fire the hedge
    hedge = asyncio.create_task(call())
    done, pending = await asyncio.wait(
        {primary, hedge},
        return_when=asyncio.FIRST_COMPLETED,
    )

    # cancel the loser; await it so we don't leak the task
    for task in pending:
        task.cancel()
    for task in pending:
        try:
            await task
        except (asyncio.CancelledError, Exception):
            pass  # loser exceptions are uninteresting

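    # both tasks can land in `done` together; prefer one that succeeded
    winner = next(
        (t for t in done if not t.cancelled() and t.exception() is None),
        next(iter(done)),
    )
    return winner.result()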

If call() wraps an httpx.AsyncClient.get, task.cancel() propagates down and closes the connection. If it wraps a requests.get running in a thread executor, cancelling the task only abandons the awaiting future; the worker thread keeps running the request to completion, which is exactly the case where hedging makes your load worse, not better. Use async clients all the way down or skip hedging at this layer.

Hedging at the load balancer

You don't have to write any of the above if your traffic flows through Envoy. The hedge_policy on a route does it for you:

route_config:
  name: profile_service
  virtual_hosts:
    - name: profile
      domains: ["profile.internal"]
      routes:
        - match: { prefix: "/" }
          route:
            cluster: profile_cluster
            timeout: 1s
            retry_policy:
              retry_on: "5xx,reset"
              num_retries: 2
              per_try_timeout: 0.4s
            hedge_policy:
              # initial_requests and additional_request_chance are omitted:
              # Envoy's API docs mark them as not implemented.
              hedge_on_per_try_timeout: true

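The hedge_on_per_try_timeout: true flag is the one that matters. When the per-try timeout (400ms here) fires, Envoy issues a second request without cancelling the first and returns whichever succeeds first. The load balancer picks the host for the hedge, so it is not guaranteed to be a different upstream unless you also configure a retry host predicate. The flag only takes effect alongside a retry_policy like the one above. Pair this with a per_try_timeout set to your measured p95 and you've got hedging without writing a line of Go or Python.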

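Istio runs Envoy as its data plane, so if you're on a service mesh you may already have the primitive sitting there, unused. Check whether your Istio version surfaces hedge_policy in its routing API; if it doesn't, an EnvoyFilter patch is the usual way to reach it.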

The application-layer version is more flexible. You can hedge selectively based on request type, vary the threshold per endpoint, or hedge against a different backend entirely. The mesh version is easier to ship and harder to get wrong. Pick based on whether you need that flexibility.

When NOT to hedge

A short list of places hedging will hurt you:

Non-idempotent operations. POST /payments, POST /counter/increment, anything that mutates state. Two requests means two payments. There is no clever idempotency-key trick that makes this safe by default. You have to engineer the dedup explicitly and you usually shouldn't bother.
Expensive backends. If each call costs you GPU seconds or a $0.02 LLM token bill, hedging means paying twice for 5% of requests. Do the math on your unit economics.
Backends with their own cascading retries. If service B retries service C three times before returning, your hedge fires a second tree of retries. The amplification gets ugly fast.
Stateful sessions. WebSockets, gRPC streaming, anything sticky. Hedging assumes the request is a pure function of its input.

The rule of thumb: hedge read paths that hit shared backends. Don't hedge anything else.

The outage gotcha: pair it with a circuit breaker

This is the part everyone misses, and it's the part that turns hedging from a tail-latency win into an outage amplifier.

When a backend gets unhealthy, every request crosses your p95 threshold. Every request triggers a hedge. You've just doubled your traffic to a backend that was already failing. The hedges hit the same dying instances. They time out too. You retry. The cluster dies faster.

Dean and Barroso flagged this in the paper. The mitigation is non-optional: hedging must be coupled with a circuit breaker that opens when the failure rate or hedge rate crosses a ceiling.

Concretely:

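Track your hedge rate, measured as the share of primaries that cross the hedge threshold, so the signal keeps updating while hedging is switched off. If it exceeds, say, 15% over a 10-second window, stop hedging entirely until it drops below 10%. The percentages are workload-specific; pick numbers that mean "this isn't tail latency, this is an outage."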
Track downstream failure rate. Standard circuit-breaker behaviour: open on errors, half-open to probe, close on recovery. The hedge logic should only fire when the breaker is closed.
Cap concurrent hedged requests. A bounded semaphore that drops the hedge attempt (not the primary) when saturated.
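The third item, the cap on concurrent hedges, fits in a few lines of Go using a buffered channel as a semaphore. HedgeLimiter and its method names are hypothetical, and the right cap is something you'd have to size for your own backend.

```go
package main

// HedgeLimiter caps the number of hedges in flight at once.
// It is a hypothetical helper: a buffered channel used as a semaphore.
type HedgeLimiter struct {
	slots chan struct{}
}

// NewHedgeLimiter allows at most max concurrent hedges.
func NewHedgeLimiter(max int) *HedgeLimiter {
	return &HedgeLimiter{slots: make(chan struct{}, max)}
}

// TryAcquire never blocks. It returns false when every slot is taken, in
// which case the caller skips the hedge and keeps waiting on the primary.
func (l *HedgeLimiter) TryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot; call it once per successful TryAcquire,
// after the hedge attempt has finished.
func (l *HedgeLimiter) Release() {
	<-l.slots
}
```

The non-blocking acquire is the point of the design: a saturated limiter should cost the request nothing except the hedge it didn't send.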

In code, the simplest form wraps the hedge call in a breaker check:

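// breaker is assumed to be a package-level circuit breaker exposing
// Allow() bool and Record(success bool); it is not defined in this post.
func FetchSafe(ctx context.Context, p95 time.Duration, call func(context.Context) ([]byte, error)) ([]byte, error) {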
    if !breaker.Allow() {
        // circuit open: skip hedging entirely, just run the primary
        return call(ctx)
    }
    body, err := Fetch(ctx, p95, call)
    breaker.Record(err == nil)
    return body, err
}
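The hedge-rate ceiling can sit next to that breaker as a separate gate. Below is one possible shape, with loud assumptions: HedgeGate is a hypothetical type, it counts the last N requests rather than a 10-second window to keep the sketch short, and the trip and reset thresholds are whatever you pass in.

```go
package main

import "sync"

// HedgeGate disables hedging when too many recent primaries were slow and
// re-enables it only after the rate falls below a lower threshold (hysteresis).
// Hypothetical sketch: count-based window, not the time-based one in the text.
type HedgeGate struct {
	mu       sync.Mutex
	window   []bool // ring buffer; true = primary crossed the hedge threshold
	next     int    // index of the oldest entry, overwritten next
	filled   int
	slow     int
	tripAt   float64
	resetAt  float64
	disabled bool
}

// NewHedgeGate tracks the last size requests, e.g. NewHedgeGate(1000, 0.15, 0.10).
func NewHedgeGate(size int, tripAt, resetAt float64) *HedgeGate {
	if size < 1 {
		size = 1
	}
	return &HedgeGate{window: make([]bool, size), tripAt: tripAt, resetAt: resetAt}
}

// Observe records whether the primary crossed the hedge threshold,
// regardless of whether a hedge was actually sent.
func (g *HedgeGate) Observe(slow bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.filled == len(g.window) {
		if g.window[g.next] {
			g.slow-- // the oldest observation falls out of the window
		}
	} else {
		g.filled++
	}
	g.window[g.next] = slow
	if slow {
		g.slow++
	}
	g.next = (g.next + 1) % len(g.window)
	if g.filled < len(g.window) {
		return // not enough data yet to judge
	}
	rate := float64(g.slow) / float64(g.filled)
	if !g.disabled && rate > g.tripAt {
		g.disabled = true
	} else if g.disabled && rate < g.resetAt {
		g.disabled = false
	}
}

// Allow reports whether hedges may currently be sent.
func (g *HedgeGate) Allow() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return !g.disabled
}
```

Inside Fetch, that would mean calling Observe(true) when the hedge timer fires and Observe(false) when the primary returns first, then checking Allow() before launching the second goroutine. Observing slow primaries even while hedging is off is what lets the gate reopen on its own.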

Envoy's hedge policy doesn't pair this for you automatically. You configure the outlier detection separately, and you need both: outlier detection ejects bad upstreams, hedging covers the tail of the healthy ones.

What to ship this week

Measure your real p95 per endpoint. Pick the three highest-fanout read paths in your system. Add hedging at the p95 threshold with a circuit breaker. Watch your p99 drop and your overhead stay under 10%.

If you're on Envoy or Istio, ship the hedge_policy first — it's a config change, no code. If you're in a service that calls a database or a downstream API directly, ship the application-layer version with proper context cancellation. Either way, instrument the hedge rate as a first-class metric. The day it spikes is the day you'll be glad you wired up the breaker.

The 2013 paper framed this as building a predictably responsive whole out of less-predictable parts, which it called tail-tolerant design. Thirteen years on, that's still the right framing. You can't make the tail go away. You can race it.

What's the p99/p50 gap on your hottest read path right now, and what's stopping you from shipping a hedge against it this week?

If this was useful

Hedging sits in a family of patterns (timeouts, retries, circuit breakers, load shedding, backpressure) that decide whether your system survives its own scale. The System Design Pocket Guide: Fundamentals walks through the trade-offs in the latency and reliability chapters, with the same level of "show me the actual config" that this post tried to hit.