AMD ATOM + ATOMesh: Prefill/decode Disaggregation on ROCm


What: AMD shipped ATOM + ATOMesh, a ROCm-native LLM serving stack whose headline trick is prefill/decode disaggregation — splitting the two phases of inference onto separate pools of GPUs instead of crowding them onto one.

Why: Prefill and decode have opposite bottlenecks — prefill is compute-bound, decode is memory-bandwidth-bound — so running them on the same worker wastes hardware and lets one long prompt stall everyone else's token stream.

vs prior: A co-located server (vanilla single-pool vLLM) interleaves prefill and decode on the same GPUs; disaggregation runs each on its own pool tuned for its bottleneck, paying for it by shipping the KV cache across the interconnect between them.

Think of it as

A restaurant kitchen that splits the prep station from the plating line.

   ORDER (the prompt)
        │
        ▼
  ┌──────────────┐    KV cart      ┌──────────────┐
  │ PREP STATION │   down the      │ PLATING LINE │
  │  (prefill)   │═══ hallway ════▶│   (decode)   │──▶ tokens
  │ compute-heavy│  (KV transfer)  │ memory-bound │
  └──────────────┘                 └──────────────┘
   chops a whole                    plates dishes
   order at once                    one at a time

prefill = the prep cook chopping a whole order's ingredients in one compute-heavy burst
decode = the plating cook building dishes one at a time, going back to the fridge for each plate
KV cache = the fridge of prepped ingredients every plate reaches into
disaggregation = giving prep and plating their own stations and staff, each tuned to its job
KV-cache transfer = wheeling the prep cart down the hallway from prep to the plating line
KV-aware scheduling = sending each order to the line whose fridge already holds its prep

Quick glossary

Prefill — The first phase of inference: the model reads your entire prompt in parallel in one pass, building the KV cache. It does a lot of math per byte of memory it touches, so it is compute-bound.

Decode — The second phase: the model generates one output token at a time, and each step must read the whole KV cache plus all the weights to produce that single token. It moves a lot of memory for little math, so it is memory-bandwidth-bound.

KV cache — The stored keys and values for every token already processed, so the model never recomputes them. It is the dominant memory cost of inference — and, in a disaggregated stack, the thing that has to travel from the prefill pool to the decode pool.

Compute-bound vs memory-bound — The roofline distinction: a job is compute-bound when the GPU's math units are the limit, and memory-bound when memory bandwidth is. Prefill and decode sit on opposite sides of that line, which is the whole reason to split them.

Disaggregation — Running prefill and decode on separate pools of workers instead of one shared pool, so each pool can be sized and scheduled for its own bottleneck.

KV-aware scheduling — A scheduler that routes a request with knowledge of where its KV-cache blocks already live — so it can reuse a cached prefix (prefix caching) or steer a request to the worker that avoids a transfer.

ROCm / AITER / MORI / Instinct — ROCm is AMD's CUDA-equivalent software stack and Instinct its datacenter GPU line. AITER supplies the optimized ROCm kernels (the analogue of CUDA kernels), while MORI handles the distributed, RDMA-style communication for tensor/expert parallelism (AMD's own collective library, RCCL, is the closer NCCL analogue).
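To make the KV-aware scheduling entry concrete, here is a minimal sketch of prefix-aware routing. It is not ATOMesh's scheduler: the `Worker` class, the `route` function, and the tiny `BLOCK_SIZE` are all hypothetical names and values chosen for illustration. The idea it shows is just "send the request to the worker that already holds the longest prefix of its KV blocks."

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (hypothetical; real engines use larger blocks)


def block_keys(tokens: list[int]) -> list[tuple[int, ...]]:
    """Split a token sequence into full blocks, each keyed by the whole prefix up to it."""
    keys = []
    for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
        keys.append(tuple(tokens[:end]))
    return keys


@dataclass
class Worker:
    name: str
    cached: set = field(default_factory=set)  # prefix keys whose KV blocks live here

    def cached_prefix_blocks(self, tokens: list[int]) -> int:
        """Count how many leading blocks of this request are already cached."""
        hits = 0
        for key in block_keys(tokens):
            if key not in self.cached:
                break  # a prefix is only reusable up to the first miss
            hits += 1
        return hits

    def admit(self, tokens: list[int]) -> None:
        """Record that this worker now holds KV blocks for the request."""
        self.cached.update(block_keys(tokens))


def route(workers: list[Worker], tokens: list[int]) -> Worker:
    """Pick the worker with the longest cached prefix; ties go to the first worker."""
    best = max(workers, key=lambda w: w.cached_prefix_blocks(tokens))
    best.admit(tokens)
    return best
```

A request that shares a system prompt with earlier traffic lands on the worker that already cached that prompt, so only the new suffix needs prefill. A production scheduler would also weigh load and memory pressure, which this sketch ignores.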

The news. On June 16, 2026, AMD published ATOM + ATOMesh, a paired ROCm-native LLM serving stack for Instinct GPUs, shipped as an early (alpha) preview. ATOM is an AITER-optimized inference engine (kernel acceleration via AITER, distributed communication via MORI); ATOMesh is the orchestration layer on top: it exposes an OpenAI-compatible API, manages multiple engine backends, and applies prefill/decode disaggregation and KV-aware scheduling. AMD evaluated the stack serving DeepSeek-V4-Pro on Instinct hardware. In AMD's framing it deliberately mirrors the vLLM/SGLang design: the same serving primitives, now on AMD silicon.

Picture a restaurant kitchen where one cook does everything. First they prep an order — chopping, slicing, mixing every ingredient the dish needs, all at once, in a furious burst of knife work. Then they plate it — assembling the dish one component at a time, walking back to the fridge for each piece. Prep is a flat-out, hands-busy job; plating is a lot of trips to the fridge and not much knife work. Cram both onto one cook and they fight: a big prep order makes every waiting plate go cold, and during the slow plating trips the knives sit idle. That single overloaded cook is one GPU running an LLM, and the two jobs are prefill and decode.

When a model answers, it first runs prefill: it reads your entire prompt in one parallel pass, doing dense matrix math and filling the KV cache. Then it runs decode: it emits output one token per step, and every step drags the whole KV cache and all the weights out of memory to produce that single token. Prefill is compute-bound, limited by the GPU's math units, while decode is memory-bandwidth-bound, limited by how fast it can stream the cache and weights out of memory. They are the prep cook and the plating cook: opposite appetites, forced to share one station.
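A rough way to see the split is to count math per byte moved (arithmetic intensity) for a single weight matrix. The sketch below is a back-of-envelope model, not a measurement: the 8192 × 8192 layer, the 16-bit weights, and the GPU's peak numbers are all assumptions picked for illustration, and it ignores attention, batching, and caching effects.

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for multiplying `tokens` activations by one d_in x d_out weight matrix."""
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per weight per token
    # Bytes moved: the weights once, plus input and output activations.
    bytes_moved = bytes_per_elem * (d_in * d_out + tokens * d_in + tokens * d_out)
    return flops / bytes_moved


def bound(intensity: float, peak_flops: float, bandwidth_bytes_per_s: float) -> str:
    """Classify against the roofline ridge point (peak FLOPs divided by memory bandwidth)."""
    ridge = peak_flops / bandwidth_bytes_per_s
    return "compute-bound" if intensity > ridge else "memory-bound"


# Hypothetical GPU: 1000 TFLOP/s peak and 5 TB/s memory bandwidth (ridge = 200 FLOP/byte).
PEAK_FLOPS = 1000e12
BANDWIDTH = 5e12

prefill = arithmetic_intensity(tokens=2048, d_in=8192, d_out=8192)  # whole prompt at once
decode = arithmetic_intensity(tokens=1, d_in=8192, d_out=8192)      # one token per step

print(f"prefill: {prefill:.0f} FLOP/byte -> {bound(prefill, PEAK_FLOPS, BANDWIDTH)}")
print(f"decode:  {decode:.2f} FLOP/byte -> {bound(decode, PEAK_FLOPS, BANDWIDTH)}")
```

Under these assumptions prefill lands well above the ridge and single-request decode sits near 1 FLOP per byte, far below it. Batching many decode requests raises the effective `tokens` value, which is why the decode pool wants large batches.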

That opposite-appetites problem is why a single shared worker wastes hardware. Pack prefill and decode together and a long prompt's prefill burst blocks the queue of decode steps behind it — a head-of-line stall — while the memory-bound decodes leave the expensive compute units sitting idle. You can never shape one machine to be right for both jobs at once.
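The head-of-line stall is easy to see in a toy timeline. The sketch below serializes work items on one worker and measures the gap between consecutive decode steps; the 20 ms decode step and 400 ms prefill are made-up durations, and the model ignores batching and any prefill-chunking tricks a real server might use.

```python
def inter_token_gaps(schedule: list[tuple[str, float]]) -> list[float]:
    """Run (kind, duration_ms) items back to back on one worker and
    return the time between consecutive decode-step completions."""
    clock = 0.0
    finish_times = []
    for kind, duration in schedule:
        clock += duration
        if kind == "decode":
            finish_times.append(clock)
    return [later - earlier for earlier, later in zip(finish_times, finish_times[1:])]


# Co-located: someone else's long prefill lands in the middle of a token stream.
colocated = [("decode", 20), ("decode", 20), ("prefill", 400), ("decode", 20), ("decode", 20)]

# Disaggregated decode pool: the prefill runs elsewhere, so only decode steps queue here.
decode_pool = [("decode", 20), ("decode", 20), ("decode", 20), ("decode", 20)]

print(inter_token_gaps(colocated))    # one gap balloons while the prefill runs
print(inter_token_gaps(decode_pool))  # steady gaps
```

In the co-located timeline one user's token stream pauses for the full length of another user's prefill; on a dedicated decode pool the gaps stay flat.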

Disaggregation is the fix: give prep and plating their own stations. Prefill runs on one pool of GPUs, scheduled for compute-heavy bursts; decode runs on a separate pool, scheduled for steady memory-bound streaming with large batches. When a request finishes prefill, the prefill worker hands its KV cache across the interconnect to a decode worker, which then streams the tokens out. Each pool is now sized and tuned for the one bottleneck it actually has — and AMD's ATOMesh is the orchestration layer that does exactly this routing on ROCm. This is the same playbook vLLM and SGLang made standard; ATOM + ATOMesh shows AMD building a ROCm-native path to it.
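The three-step flow described above (prefill, hand off the KV cache, decode) can be sketched with two toy worker classes. Everything here is hypothetical scaffolding for illustration, not the ATOM or ATOMesh API: there is no model, the "tokens" are placeholders, and the transfer is just an in-process method call.

```python
from dataclasses import dataclass


@dataclass
class KVCache:
    request_id: str
    num_tokens: int  # how many tokens have keys/values stored


class PrefillWorker:
    def prefill(self, request_id: str, prompt_tokens: list[int]) -> KVCache:
        # One parallel pass over the whole prompt; a real engine would run the model here.
        return KVCache(request_id, num_tokens=len(prompt_tokens))


class DecodeWorker:
    def __init__(self) -> None:
        self.received: dict[str, KVCache] = {}

    def receive(self, kv: KVCache) -> None:
        # Stand-in for the KV transfer across the interconnect.
        self.received[kv.request_id] = kv

    def decode(self, request_id: str, max_new_tokens: int) -> list[str]:
        kv = self.received[request_id]
        output = []
        for step in range(max_new_tokens):
            output.append(f"tok{step}")  # placeholder token, no real model
            kv.num_tokens += 1           # each new token extends the cache
        return output


def serve(prefill_worker: PrefillWorker, decode_worker: DecodeWorker,
          request_id: str, prompt_tokens: list[int], max_new_tokens: int) -> list[str]:
    kv = prefill_worker.prefill(request_id, prompt_tokens)   # 1. prefill on pool A
    decode_worker.receive(kv)                                # 2. hand the KV cache over
    return decode_worker.decode(request_id, max_new_tokens)  # 3. decode on pool B
```

The point of the shape is that step 2 is the only coupling between the pools, so each pool can be sized and scheduled independently as long as that handoff stays cheap.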

But disaggregation is not free, and the bill comes due at the handoff. After prefill, the KV cache has to physically travel from the prefill pool to the decode pool. For a 70B-class model with a 2,048-token prompt, that cache is 2 (keys and values) × 80 layers × 8 KV heads × 128 head dim × 2,048 tokens × 2 bytes per value ≈ 0.67 GB (illustrative: Llama-3.1-70B with grouped-query attention in 16-bit precision). Move it over PCIe 4.0 and you pay roughly 21 ms; over NVLink, about 0.75 ms, a ~28× gap. The size comes from the formula above and the times follow from each interconnect's bandwidth (the figures correspond to roughly 32 GB/s and 900 GB/s); none of them was measured on ATOM. That gap is why disaggregated stacks live or die by their interconnect, and why KV-aware scheduling tries to dodge the transfer entirely by steering a request to a worker that already holds its prefix.
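The arithmetic above is easy to reproduce. In this sketch the model shape comes from the paragraph, while the two bandwidth values (32 GB/s and 900 GB/s) are assumptions chosen because they reproduce the quoted times; real links add latency and protocol overhead that pure bandwidth math leaves out.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache: keys and values (the leading 2) for every layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


def transfer_ms(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time in milliseconds, ignoring latency and protocol overhead."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


# Shape from the text: 80 layers, 8 KV heads, head dim 128, 2,048-token prompt, 2-byte values.
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=2048)

# Assumed bandwidths (not from AMD's post): about 32 GB/s and about 900 GB/s.
slow = transfer_ms(size, bandwidth_gb_per_s=32)
fast = transfer_ms(size, bandwidth_gb_per_s=900)

print(f"KV cache: {size / 1e9:.2f} GB")
print(f"32 GB/s link:  {slow:.1f} ms")
print(f"900 GB/s link: {fast:.2f} ms")
print(f"gap: {slow / fast:.0f}x")
```

Because the cache grows linearly with prompt length, doubling the prompt doubles both transfer times, which makes the slow link progressively more painful for long-context requests.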

Phase	What it processes	Bottleneck (roofline)	What it wants from the hardware
Prefill	The whole prompt, in one parallel pass	Compute-bound — high arithmetic intensity	Raw matmul throughput; fewer, fatter GPUs
Decode	One output token per step, reading the full KV cache	Memory-bandwidth-bound — low arithmetic intensity	Memory bandwidth and large batches to amortize the weight reads

The honest caveat: ATOM + ATOMesh ship as an early (alpha) preview, and AMD's post describes the mechanism, not head-to-head numbers. It reports that ATOMesh mirrors the vLLM/SGLang design and was evaluated serving DeepSeek-V4-Pro, but the post text gives no usable throughput or latency figures, so treat performance as not yet quantified and check the source for benchmarks. The KV-transfer figures above are illustrative, sized to a representative model rather than measured on ATOM. But the durable lesson stands: once you see that prefill and decode sit on opposite sides of the roofline, "one GPU does both" stops looking efficient, and a serving stack's real job is to split the two phases and move the KV cache between them cheaply.

Related explainers

SGLang v0.5.12 — TokenSpeed MLA backend — SGLang is one half of the vLLM/SGLang design ATOMesh mirrors; this is the engine-level optimization that lives inside a pool like ATOM's.
HuggingFace — Async continuous batching — the other lever for keeping decode workers busy; disaggregation and continuous batching are complementary ways to fight the same memory-bound decode problem.
Tangram — Per-head KV cache budgets — shrinks the KV cache itself, which is exactly the payload a disaggregated stack has to transfer between pools.
Spec-decode latency — Load-dependent latency model — models how decode latency moves with load, the phase disaggregation isolates onto its own pool.

FAQ

What is prefill/decode disaggregation?

It is a serving design that runs the two phases of LLM inference on separate pools of GPUs. Prefill — reading the whole prompt in one parallel, compute-heavy pass — runs on one pool, and decode — generating output one token at a time, bottlenecked by memory bandwidth — runs on another. After prefill, the request's KV cache is transferred across the interconnect to a decode worker. Splitting them lets each pool be sized and scheduled for its own bottleneck instead of compromising on one shared machine.

Why split prefill and decode onto separate GPUs?

Because they have opposite bottlenecks. Prefill is compute-bound (limited by the GPU's math units), while decode is memory-bandwidth-bound (limited by how fast it streams the KV cache and weights out of memory). On one shared worker a long prefill stalls the decode steps queued behind it, and the memory-bound decodes leave the compute units idle. Running each phase on hardware tuned for its own limit avoids that mutual interference — at the cost of moving the KV cache between the two pools.

What do AMD's ATOM and ATOMesh add, and how do they relate to vLLM and SGLang?

ATOM is a ROCm-native inference engine (optimized kernels via AITER, cross-GPU communication via MORI) and ATOMesh is the orchestration layer above it — an OpenAI-compatible API that applies prefill/decode disaggregation and KV-aware scheduling. AMD describes it as deliberately mirroring the vLLM/SGLang design, so the contribution is not a new algorithm but the same modern serving primitives brought to AMD Instinct GPUs — a second-vendor implementation of the stack the LLM Serving track teaches.

Originally posted on Learn AI Visually.