SGLang v0.5.14: LPLB Expert-Parallel Load Balancing

What: The SGLang v0.5.14 release ships LPLB — a linear-programming load balancer for serving mixture-of-experts models, where the experts are split across many GPUs and each step routes every token to a few of them.

Why: In expert-parallel MoE serving, token routing is uneven and shifts every step, so one overloaded GPU stalls the whole step at a sync barrier; evening that load is what unlocks throughput on big MoE models like DeepSeek-V4.

vs prior: Earlier setups used static, hand-tuned expert placement and ate the imbalance; LPLB keeps redundant replicas of the hot experts and solves a small linear program each step to minimize the busiest GPU's share of the work.

Think of it as

A warehouse store opening duplicate counters to even out the longest line.

       40% of this step's tokens want one hot expert

   WITHOUT LPLB                 WITH LPLB (3 replicas)
   ┌──────────────┐             ┌──────────────┐
   │ GPU1 ####### │ 40%         │ GPU1 ##      │ 14%
   │ GPU2 #       │  5%         │ GPU2 ##      │ 14%
   │ GPU3 #       │  5%         │ GPU3 ##      │ 14%
   └──────┬───────┘             └──────┬───────┘
          ▼                            ▼
   barrier waits on GPU1        lanes finish together
   ✗ others mostly idle         ✓ idle time deleted

a customer = a token routed to its experts this step
a specialty counter = an expert (a sub-network in a mixture-of-experts model)
a checkout lane = a GPU the experts are spread across
one counter mobbed while others sit idle = per-GPU load imbalance
duplicate copies of the busy counter = redundant expert replicas
the floor manager who evens the longest line each wave = LPLB

Quick glossary

MoE (Mixture-of-Experts) — A model whose feed-forward layer is split into many experts (sub-networks); a small router sends each token to only a few. Total parameters are huge, but the active ones per token stay small. DeepSeek-V4 is a large MoE model.

Expert parallelism (EP) — The serving layout that spreads a MoE's experts across many GPUs, because all the experts together do not fit on one. Each step, tokens must be shipped to whichever GPU holds their chosen expert and the results shipped back.

Load imbalance — When this step's router sends far more tokens to some experts than others, the GPUs holding the popular experts get swamped while the rest sit idle. The pattern is data-dependent, so it shifts batch to batch.

Redundant expert replicas — Keeping extra copies of the hot experts on several GPUs so their token load can be split, instead of one GPU owning a popular expert alone. The balancer decides how to divide each expert's tokens among its copies.

LPLB — SGLang's Linear-Programming Load Balancer. Each step it solves a tiny linear program over the current token counts to assign load across replicas so the maximum per-GPU load is as small as possible (a min-max objective).

Waterfill — The second expert-parallel balancer the release ships alongside LPLB. SGLang names it but does not detail how it works; the name points to a classic water-filling heuristic — fill the least-loaded replica first — which would be a lighter alternative to solving the LP each step.

All-to-all — The expert-parallel communication step that ships tokens out to their experts' GPUs and the results back. It runs every layer and waits for the slowest GPU, which is why imbalance is so costly here.
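To tie the glossary terms together, here is a minimal sketch of how skewed routing turns into per-GPU imbalance under a static placement. Everything in it is an assumption for illustration: the weighted random draw stands in for a learned router, and `route_top_k`, `per_gpu_load`, the 16-expert, 8-GPU layout and the skew are hypothetical, not SGLang code.

```python
import random
from collections import Counter

def route_top_k(num_tokens, num_experts, k, weights, rng):
    """Pick k distinct experts per token, biased by `weights`.

    The weighted draw is a stand-in for a learned router (an assumption).
    """
    counts = Counter()
    for _ in range(num_tokens):
        chosen = set()
        while len(chosen) < k:
            chosen.add(rng.choices(range(num_experts), weights=weights)[0])
        counts.update(chosen)
    return counts

def per_gpu_load(counts, experts_per_gpu, num_gpus):
    """Static placement: expert e lives on GPU e // experts_per_gpu."""
    loads = [0] * num_gpus
    for expert, n in counts.items():
        loads[expert // experts_per_gpu] += n
    return loads

rng = random.Random(0)
weights = [8.0] + [1.0] * 15   # expert 0 is "hot" this step (assumed skew)
counts = route_top_k(num_tokens=1000, num_experts=16, k=2, weights=weights, rng=rng)
loads = per_gpu_load(counts, experts_per_gpu=2, num_gpus=8)

print("tokens per GPU:", loads)
print("busiest / average:", max(loads) / (sum(loads) / len(loads)))
```

The GPU that hosts expert 0 ends up well above the average, and because the barrier waits for it, that ratio is roughly how much longer the step takes than a perfectly even one would.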

The news. On June 26, 2026, the SGLang team released v0.5.14, with work from 56 contributors. The headline is 5x higher throughput at the same interactivity when serving DeepSeek-V4 on NVIDIA GB300, driven by two new expert-parallel load balancers, Waterfill and LPLB (a linear-programming load balancer), plus CuteDSL prefill kernels for Blackwell and int8 checkpoint pooling for linear-attention prefix caches.

Picture a warehouse store at peak rush. The checkout lanes are the GPUs; the specialty counters (deli, pharmacy, bakery) are the model's experts, and because no single lane can hold them all, the store spreads the counters across the lanes. That spread is expert parallelism: a mixture-of-experts model has too many experts to fit on one GPU, so they live across many, and at each decode step the router sends every customer (token) to the few counters they need. The trouble is that the rush is lumpy. This wave, everyone wants the deli; next wave, the pharmacy. So one counter gets mobbed while the rest stand idle, and the store can't close out the rush until that longest line clears.

That last clause is the whole problem, because the lanes do not finish independently. Every GPU has to meet at a sync barrier — the all-to-all that ships tokens to their experts and the answers back — and that barrier waits for the slowest lane. The GPU holding this step's most popular expert therefore sets the pace for all of them, and the fast lanes burn the difference as idle time. Add more GPUs and the imbalance can get worse, not better, because the hot expert still lives on one lane while you have paid for more lanes to stand around.

SGLang v0.5.14's fix is to stop letting one counter bottleneck the floor. It keeps redundant replicas of the hot experts — duplicate deli counters on several lanes — and then, each wave, the floor manager solves a quick assignment problem: given how many customers want each counter right now, divide every counter's line across its copies so the busiest lane does as little as possible. That floor manager is LPLB, and "as little as possible" is literal: it solves a small linear program whose objective is to minimize the maximum per-GPU load (a min-max). Waterfill is the other balancer the release pairs it with, and SGLang does not spell out how it works. The name, though, points to a classic water-filling heuristic — fill the least-loaded replica first — which would be a lighter alternative to running the LP every step.
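The min-max assignment described above can be written down as a small linear program. This is a generic sketch, not SGLang's actual formulation, and the notation is assumed for illustration: n_e is the number of tokens routed to expert e this step, R(e) is the set of replicas of e, x_{e,r} is how many of those tokens go to replica r, g(e,r) is the GPU hosting that replica, and t is an upper bound on every GPU's load.

```latex
\begin{aligned}
\min_{x,\,t}\quad & t \\
\text{s.t.}\quad & \sum_{r \in R(e)} x_{e,r} = n_e && \text{for every expert } e \\
& \sum_{(e,r)\,:\,g(e,r) = j} x_{e,r} \le t && \text{for every GPU } j \\
& x_{e,r} \ge 0 && \text{for every replica}
\end{aligned}
```

Minimizing t while every GPU's total stays at or below t is the standard way to turn a min-max objective into a linear one. The solution is fractional, so a real balancer would still need to round it to whole tokens.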

Hold the layout fixed and walk the imbalance math (illustrative: the release reports only the end-to-end 5x). Say 8 GPUs serve a batch, and the router sends 40% of this step's tokens to one hot expert that lives on a single GPU, while another GPU draws just 5%. The step can't end until that one GPU finishes its 40%. If expert compute time scales with token count, the GPU that drew 5% is done after an eighth of that time and waits out the rest: you own 8 GPUs but move at the speed of the busiest one. Now place 3 replicas of that hot expert and let LPLB split its tokens across them: its share per GPU falls from 40% toward about 14%, the barrier wait shrinks sharply, and the lanes finish much closer together. The win isn't a faster kernel; it's deleting the idle time that imbalance was manufacturing.
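The same walk-through as a few lines of arithmetic, under stated assumptions: step time is modeled as the busiest GPU's token share, the hot expert is split evenly across its replicas, and the background loads on the other seven GPUs are made up so the total comes to 100%. The helper names are hypothetical.

```python
# Toy model: every GPU waits at the barrier for the busiest one.
# All loads are illustrative percentages of one step's tokens, not measurements.

def barrier_time(loads):
    """Step time is set by the most loaded GPU (the barrier waits for it)."""
    return max(loads)

def replicate_evenly(loads, hot_gpu, hot_share, replica_gpus):
    """Take the hot expert's share off hot_gpu and split it evenly over replica_gpus."""
    new_loads = list(loads)
    new_loads[hot_gpu] -= hot_share
    for g in replica_gpus:
        new_loads[g] += hot_share / len(replica_gpus)
    return new_loads

# Assumed background: GPU 0 holds only the hot expert (40%); the other 60% is
# spread over the remaining seven GPUs, two of them drawing just 5%.
before = [40, 5, 5, 10, 10, 10, 10, 10]
after = replicate_evenly(before, hot_gpu=0, hot_share=40, replica_gpus=[0, 1, 2])

print("busiest GPU before:", barrier_time(before))
print("busiest GPU after: ", round(barrier_time(after), 1))
print("relative step time:", round(barrier_time(after) / barrier_time(before), 2))
```

With these assumed loads the busiest GPU drops from 40% to about 18%, because the two GPUs that receive replicas already carry 5% each. An even split is the simplest policy; a load-aware split can do better by giving less of the hot expert to GPUs that are already busy.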

Expert-parallel balancing	How it assigns load	Per-step cost	Balance quality
Static / hand-tuned placement	fixed expert→GPU map, set before serving	~none	poor under shifting, data-dependent routing
Waterfill (this release)	the release's second balancer; name implies water-filling, internals not detailed	—	a lighter companion to LPLB (inferred from the name)
LPLB (this release)	solves a linear program to minimize the busiest GPU's load	a small solve each step	tightest — a min-max optimum over replicas (SGLang v0.5.14)

Where it earns its keep is exactly the regime DeepSeek-V4 lives in: a large MoE served with expert parallelism across many Blackwell GPUs, where the all-to-all and its sync barrier are a leading cost in each decode step. The release's headline — 5x higher throughput at the same interactivity — is a goodput claim: more tokens per second without making any single user wait longer. Read it as the lanes finishing together instead of seven of them waiting on one — the same hardware, far less idle time.

Related explainers

SGLang v0.5.12 — TokenSpeed MLA backend — the prior SGLang release, a kernel-level cache-write win rather than a scheduling one
Manifold Power Iteration — MoE router alignment — the other MoE balance problem: which expert a token picks (router design), not where that expert runs
GLM-5.2 — active vs total parameters — why MoE serving is its own discipline: huge total weights, small active compute per token

FAQ

What is LPLB (linear-programming load balancing)?

LPLB is the Linear-Programming Load Balancer added in SGLang v0.5.14. When a mixture-of-experts model is served with expert parallelism — its experts split across many GPUs — the router sends an uneven, step-by-step-changing number of tokens to each expert, so some GPUs get swamped while others idle. LPLB keeps redundant replicas of the hot experts and, each step, solves a small linear program over the current token counts to divide every expert's load across its replicas so the maximum per-GPU load is minimized. Evening the load shrinks the wait at the all-to-all sync barrier that gates each decode step.

Why does expert-parallel MoE serving need load balancing at all?

Because expert parallelism makes the GPUs finish a step together, not independently. Every layer runs an all-to-all that ships tokens to their experts' GPUs and the results back, and that barrier waits for the slowest GPU. Since token-to-expert routing is data-dependent and shifts every batch, whichever GPU holds this step's most popular expert becomes the bottleneck for all of them — and the rest burn the difference as idle time. Without balancing, adding more GPUs can even make it worse, because the hot expert still lives on one GPU. SGLang reports a 5x throughput gain at the same interactivity for DeepSeek-V4 on NVIDIA GB300 once the load is evened.

How does LPLB differ from Waterfill, and from a MoE router?

Waterfill and LPLB are the two expert-parallel balancers the release ships, both aimed at spreading each step's token load across expert replicas. SGLang details LPLB — it solves a linear program for a tight min-max balance at a small per-step cost — but does not spell out Waterfill's internals; the name points to a classic water-filling heuristic (fill the least-loaded replica first), which would be a lighter alternative to an LP solve. Both differ from the MoE router: the router decides which expert each token should go to (a quality choice about the model's output), whereas the balancers decide where, among the redundant copies of that chosen expert, the work actually runs (a serving choice about GPU utilization).
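For intuition only, here is what "fill the least-loaded replica first" can look like for a single hot expert. This is the classic water-filling idea the name suggests, not SGLang's Waterfill, whose internals the release does not describe. The function name `waterfill_split` and the loads are hypothetical.

```python
def waterfill_split(base_loads, hot_tokens):
    """Split hot_tokens over replicas so the lowest-loaded GPUs fill first.

    base_loads[i] is the load already on the GPU hosting replica i.
    Returns the hot-expert tokens (possibly fractional) given to each replica.
    """
    order = sorted(range(len(base_loads)), key=lambda i: base_loads[i])
    level = base_loads[order[0]]   # current "water level"
    remaining = hot_tokens
    filled = 1                     # GPUs already raised to the level
    while filled < len(order):
        next_base = base_loads[order[filled]]
        needed = (next_base - level) * filled   # tokens to reach the next GPU's load
        if needed >= remaining:
            break
        remaining -= needed
        level = next_base
        filled += 1
    level += remaining / filled    # spread what is left over the filled GPUs
    return [max(0.0, level - b) for b in base_loads]

base = [0, 5, 5]                   # assumed loads on the three replica GPUs
shares = waterfill_split(base, hot_tokens=40)
print("hot-expert tokens per replica:", [round(s, 2) for s in shares])
print("resulting GPU loads:", [round(b + s, 2) for b, s in zip(base, shares)])
```

Here the three GPUs all end at the same level (about 16.67), which is the best possible maximum for one expert on its own. Once many experts share the same GPUs their splits interact, and that coupled problem is the kind a linear program is suited to.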

Originally posted on Learn AI Visually.