Sivagurunathan Velayutham

# Beyond Round Robin: Building a Token-Aware Load Balancer for LLMs

In my previous experiment, I was trying to find the best model for a given task. The approach was to send the same request to multiple LLMs in parallel and return whichever responded first. Users got faster responses, but every request burned GPU cycles across multiple servers, most of which went to waste.

That raised an obvious question: instead of racing backends against each other, what if the load balancer could pick the right one upfront?

## Why Traditional Load Balancing Breaks Down for LLMs

Standard load balancers route traffic using Round Robin, Least Connections, or health-based metrics. These strategies assume requests have roughly equal cost. That assumption breaks with LLMs.

A 10-token prompt ("Translate 'hello' to French") and a 4,000-token prompt ("Analyze this codebase") both count as one connection. Least Connections will happily stack three heavy prompts on one server while another sits idle. The result is head-of-line blocking on the overloaded node, and wasted capacity elsewhere.

Connection count is not a proxy for computational cost. Token count is.

## The Insight

LLM inference has two phases: prefill (processing the input prompt) and decode (generating tokens sequentially). Prefill time scales directly with input token count. A 4,000-token prompt consumes significantly more GPU time during prefill than a 10-token one.

If the balancer can estimate token count before routing, it can maintain a running total of in-flight tokens per backend and route to the node with the lowest total. It's the same least-loaded pattern used in distributed systems, but the metric is tokens instead of connections. The algorithm becomes: pick the backend where `current_in_flight_tokens + new_request_tokens` is lowest.
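That selection rule fits in a few lines of Go. This is a minimal sketch (the `backend` struct and `pickBackend` names are mine, not the project's), using `sync/atomic` for the per-backend counter:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// backend tracks the running total of in-flight tokens for one upstream
// server. Names here are illustrative, not taken from the project's code.
type backend struct {
	url      string
	inFlight atomic.Int64
}

// pickBackend implements the selection rule: choose the backend where
// current_in_flight_tokens + new_request_tokens is lowest.
func pickBackend(backends []*backend, newTokens int64) *backend {
	var best *backend
	var bestTotal int64
	for _, b := range backends {
		total := b.inFlight.Load() + newTokens
		if best == nil || total < bestTotal {
			best, bestTotal = b, total
		}
	}
	return best
}

func main() {
	a := &backend{url: "http://gpu-a:8000"}
	b := &backend{url: "http://gpu-b:8000"}
	a.inFlight.Store(4000) // a heavy prompt is already running on a

	fmt.Println(pickBackend([]*backend{a, b}, 30).url)
	// → http://gpu-b:8000
}
```

With plain Least Connections, both backends look identical here; weighting by tokens sends the new request to the idle node.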

## Architecture

I built this as an L7 reverse proxy in Go, sitting between clients and a cluster of LLM backends.

*(Mermaid diagram: token-aware load balancer architecture)*

The request lifecycle:

  1. Intercept the incoming JSON body and extract the prompt
  2. Tokenize using a tiktoken-compatible encoder
  3. Route to the backend with the lowest in-flight token count
  4. Increment that backend's token counter before proxying
  5. Forward the request through `httputil.ReverseProxy`
  6. Decrement the counter once the backend responds

I chose Go because `net/http`, `httputil.ReverseProxy`, and `sync/atomic` cover almost everything needed here. The only external dependency is `tiktoken-go` for tokenization.

## The Body-Read Problem

In Go, `r.Body` is an `io.ReadCloser`. It can only be read once. The balancer needs to read it for tokenization and still forward the original payload to the backend.

The fix: read the body into a `[]byte`, run the tokenizer against that slice, then reassign `r.Body` with `io.NopCloser(bytes.NewReader(body))`. The downstream proxy sees an intact body.

This is a well-known concern in any L7 proxy that inspects payloads, but it is easy to overlook when you are building one for the first time.

## Separating Middleware from RoundTripper

The token-aware load balancer splits its logic across two layers.

**Middleware** (`http.Handler` wrapper) handles request validation, error responses (400, 503), and stores the computed token count in the request context. Anything that might reject a request lives here.

**RoundTripper** (`http.RoundTripper` implementation) handles transport-level concerns: setting the destination URL and managing the token counter lifecycle. The decrement happens after the backend response is received, which maps naturally to the `RoundTrip` call boundary.

## Results

I ran both strategies against the same setup: three backend servers, each simulating LLM compute time by sleeping in proportion to input token count (±20% jitter to mimic real variance). Three payload sizes were used: small (~30ms), large (~2750ms), and huge (~7500ms). Traffic was mixed, with each request randomly picking a payload size.

### High Contention (50% heavy, 50% small, concurrency=30, 60 requests)

| Metric | Round Robin | Token Aware | Improvement |
| --- | --- | --- | --- |
| Average Latency | 2.58s | 2.27s | -12% |
| P90 Latency | 8.60s | 7.78s | -10% |

### Heavy Workload (80% heavy, 20% small, concurrency=5, 60 requests)

| Metric | Round Robin | Token Aware | Improvement |
| --- | --- | --- | --- |
| Average Latency | 4.45s | 4.20s | -6% |
| P90 Latency | 8.67s | 8.57s | -1% |

The gains are most visible under high contention. At concurrency=30, average latency drops 12% and P90 drops 10%. The reason is straightforward: small requests no longer get stuck behind heavy ones because the balancer routes by computational weight, not connection count.

A 12% improvement across 3 simulated backends is likely a floor, not a ceiling. Real workloads with wider token variance and higher concurrency should amplify the difference.

## What's Next

This is a simplified implementation. Production systems would need health checks with automatic backend removal, streaming (SSE) support with per-chunk token tracking, output token estimation for more accurate load prediction, and observability through Prometheus or equivalent.

The code is on GitHub.
