Training gets the headlines. Inference gets the bill. If you run LLMs in production, inference is almost certainly your biggest AI line item — a meter running 24/7 on every request. The gap between naive and optimized serving is routinely 5-10x in cost and 3-5x in latency.
The bottleneck is memory, not compute
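During token generation, LLM inference is memory-bandwidth bound. An H100 (SXM) has ~3.35 TB/s of memory bandwidth against ~989 TFLOPS of dense FP16 compute, so it needs roughly 295 FLOPs per byte read to keep the compute units busy. Autoregressive decoding does about one FLOP per byte of FP16 weights for each sequence in the batch, so at modest batch sizes you're using only ~10-20% of that compute, waiting on weights and KV-cache to stream from memory. Every optimization attacks the same root cause: move less data, use it better.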
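The arithmetic behind that claim can be sketched in a few lines. This is a back-of-envelope estimate, not a benchmark: it uses the two H100 figures quoted above, assumes roughly 2 FLOPs per parameter per generated token, and ignores KV-cache reads and kernel overheads. The 13B-parameter model and the function name `decode_step_estimate` are hypothetical, chosen only for illustration.

```python
# Back-of-envelope roofline for one decode step (illustrative, not a benchmark).
HBM_BANDWIDTH_BYTES_PER_S = 3.35e12  # ~3.35 TB/s, figure quoted in the text
PEAK_FLOPS = 989e12                  # ~989 TFLOPS FP16, figure quoted in the text


def decode_step_estimate(n_params: float, bytes_per_param: float, batch_size: int) -> dict:
    """Estimate one decode step, ignoring KV-cache reads and kernel overheads."""
    # Every step streams all weights from memory once, regardless of batch size.
    weight_bytes = n_params * bytes_per_param
    # Assumption: ~2 FLOPs (multiply + add) per parameter per generated token.
    flops = 2 * n_params * batch_size
    memory_time = weight_bytes / HBM_BANDWIDTH_BYTES_PER_S
    compute_time = flops / PEAK_FLOPS
    step_time = max(memory_time, compute_time)
    return {
        "bound": "memory" if memory_time >= compute_time else "compute",
        "compute_utilization": compute_time / step_time,
        "tokens_per_s": batch_size / step_time,
    }


if __name__ == "__main__":
    # Hypothetical 13B-parameter model stored in FP16 (2 bytes per parameter).
    for batch in (1, 32, 512):
        est = decode_step_estimate(n_params=13e9, bytes_per_param=2, batch_size=batch)
        print(batch, est["bound"], f"{est['compute_utilization']:.1%}")
```

Under these assumptions a single request uses well under 1% of peak compute, a batch of 32 lands around 11%, and the step only becomes compute-bound somewhere near a batch of 300. That is why batching and moving fewer bytes per token matter so much.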
The levers that matter
- KV cache. At long contexts and high concurrency it's often bigger than the weights. PagedAttention (vLLM) pages the cache in fixed-size blocks, like OS virtual memory, dropping waste from 60-80% to near-zero → 2-3x more concurrent requests. Prefix caching reuses the KV for shared system prompts / few-shot / RAG context. GQA (grouped-query attention) shares each K/V head across several query heads, shrinking the cache at the architecture level.
- Continuous batching. Swap finished sequences out and new ones in every step — utilization goes from ~20-30% to 80-90%. The single biggest throughput win, and why vLLM / SGLang / TensorRT-LLM exist.
- Quantization. FP8 / INT8 / INT4 (AWQ, GPTQ) move less data → lower cost and latency, smaller footprint. 8-bit is near-lossless for most tasks; 4-bit is often an acceptable trade. Validate on your own eval set.
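- Speculative decoding. A small draft model proposes several tokens; the big model verifies them in one forward pass and keeps the longest accepted prefix → lower latency with no quality loss, because the verification step preserves the big model's output distribution.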
- Right-size the model. The cheapest token is the one you never compute on an oversized model — route easy requests down, hard ones up.
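To make the KV-cache lever concrete, here is a minimal sizing sketch. The model shape (32 layers, 32 query heads, head dimension 128, FP16 cache), the request lengths, the 2048-token reservation, and the 16-token block size are all assumptions for illustration. The functions are hypothetical helpers, not any framework's API, and the paged model only counts the unused tail of each request's last block.

```python
import math


def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies (one K and one V vector per layer)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def contiguous_waste(seq_lens: list[int], max_len: int) -> float:
    """Fraction of reserved slots unused when each request reserves max_len up front."""
    reserved = len(seq_lens) * max_len
    return 1 - sum(seq_lens) / reserved


def paged_waste(seq_lens: list[int], block_size: int = 16) -> float:
    """Fraction unused when memory is handed out in fixed-size blocks on demand."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return 1 - sum(seq_lens) / reserved


if __name__ == "__main__":
    # Hypothetical model: full multi-head attention vs. GQA with 8 KV heads.
    mha = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
    gqa = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
    print(f"MHA: {mha // 1024} KiB/token, GQA: {gqa // 1024} KiB/token")

    # Hypothetical mix of in-flight request lengths, in tokens.
    lens = [100, 500, 1200, 50]
    print(f"contiguous waste: {contiguous_waste(lens, max_len=2048):.1%}")
    print(f"paged waste:      {paged_waste(lens):.1%}")
```

With these made-up numbers the cache costs 512 KiB per token with full multi-head attention and 128 KiB with 8 KV heads, and reserved-but-unused memory falls from about 77% to about 2% when allocation moves from worst-case reservations to small blocks. Real request mixes will differ, so it is worth plugging in your own length distribution.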
In practice
Use a real serving framework (vLLM, SGLang, TensorRT-LLM) rather than hand-rolling. Measure your actual prompt/response shapes first — long shared prefixes favour prefix caching, high concurrency favours batching, long outputs favour KV-cache and quantization work. Track cost-per-1k-tokens, throughput, and tail latency — the numbers the business actually feels.
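The three metrics named above are simple to compute once you log per-request latency and aggregate token throughput. The sketch below is one way to do it. The GPU price, throughput, and latency samples are invented, and both helper functions are hypothetical; the percentile uses the nearest-rank definition, which is one of several common conventions.

```python
import math


def cost_per_1k_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Serving cost in USD per 1,000 generated tokens for one GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1000


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Assumed figures: a $3.00/hour GPU sustaining 2,500 tokens/s across all requests.
    print(f"${cost_per_1k_tokens(3.00, 2500):.6f} per 1k tokens")

    # Invented end-to-end latencies in milliseconds; one slow outlier dominates the tail.
    latencies_ms = [110, 120, 115, 130, 125, 118, 122, 140, 135, 900]
    print("p50:", percentile(latencies_ms, 50), "ms")
    print("p99:", percentile(latencies_ms, 99), "ms")
```

Tracking the median alongside the tail is a deliberate choice here: a single 900 ms outlier barely moves p50 but defines p99, and the tail is usually what users notice.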
Inference optimization is where AI economics are won or lost. The techniques are well understood and together routinely cut serving cost 5-10x — often the deciding factor in whether an AI feature ships at all.
Full version on the VSBD blog.