"LLM Inference Optimization: The Line Item That Decides If Your AI Ships"

#ai #llm #machinelearning #performance

Training gets the headlines. Inference gets the bill. If you run LLMs in production, inference is almost certainly your biggest AI line item — a meter running 24/7 on every request. The gap between naive and optimized serving is routinely 5-10x in cost and 3-5x in latency.

The bottleneck is memory, not compute

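During token generation, LLM inference is memory-bandwidth bound. An H100 offers ~3.35 TB/s of memory bandwidth against ~989 TFLOPS of dense FP16 compute, so the GPU needs roughly 300 FLOPs of work per byte read to stay busy. Autoregressive decoding delivers far less than that: each step streams the full weights and KV cache from memory to produce one token per sequence, so you often use only ~10-20% of the available compute, and less still at small batch sizes. Every optimization attacks the same root cause: move less data, and get more work out of each byte you do move.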

The levers that matter

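KV cache. At long contexts and high concurrency it can outgrow the weights. PagedAttention (vLLM) pages the cache like OS virtual memory, dropping memory lost to fragmentation and over-reservation from 60-80% to near-zero → 2-3x more concurrent requests. Prefix caching reuses the KV for shared system prompts / few-shot / RAG context. Grouped-query attention (GQA) shrinks the cache at the architecture level by sharing each key/value head across several query heads.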
Continuous batching. Swap finished sequences out and new ones in at every decoding step instead of waiting for the whole batch to finish; GPU utilization goes from ~20-30% to 80-90%. Often the single biggest throughput win, and a core reason vLLM / SGLang / TensorRT-LLM exist.
Quantization. FP8 / INT8 / INT4 (AWQ, GPTQ) move less data → lower cost and latency, smaller footprint. 8-bit is near-lossless for most tasks; 4-bit is often an acceptable trade. Validate on your own eval set.
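Speculative decoding. A small draft model proposes several tokens; the big model verifies them in one forward pass and keeps the prefix it agrees with. With the standard acceptance rule the output matches what the big model would have produced on its own → lower latency with no quality loss.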
Right-size the model. The cheapest token is the one you never compute on an oversized model — route easy requests down, hard ones up.
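For the KV cache lever, the sizing arithmetic is worth doing once by hand. The sketch below estimates cache size from model shape and compares memory reserved by per-request preallocation against block-based (paged) allocation. The model shape (32 layers, 128-dim heads), the 16-token block size, and the request lengths are illustrative assumptions; the function names are hypothetical and not part of any framework's API.

```python
from typing import Sequence


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache: K and V tensors for every layer, token, and sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def reserved_tokens_contiguous(actual_lens: Sequence[int], max_len: int) -> int:
    """Naive scheme: every request reserves room for max_len tokens up front."""
    return len(actual_lens) * max_len


def reserved_tokens_paged(actual_lens: Sequence[int], block_size: int = 16) -> int:
    """Paged scheme: each request holds only whole blocks it has started to fill."""
    return sum(-(-n // block_size) * block_size for n in actual_lens)


def waste_fraction(actual_lens: Sequence[int], reserved: int) -> float:
    """Share of reserved cache slots that hold no token."""
    return 1.0 - sum(actual_lens) / reserved


if __name__ == "__main__":
    gib = 2 ** 30
    # Assumed shape: 32 layers, 128-dim heads, FP16 cache, 4096 tokens, 32 sequences.
    mha = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=32)  # 32 KV heads
    gqa = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=32)   # 8 KV heads
    print(f"full multi-head cache: {mha / gib:.0f} GiB, GQA cache: {gqa / gib:.0f} GiB")

    lens = [100, 300, 50, 700]  # hypothetical current sequence lengths
    naive = reserved_tokens_contiguous(lens, max_len=2048)
    paged = reserved_tokens_paged(lens)
    print(f"naive waste: {waste_fraction(lens, naive):.0%}, "
          f"paged waste: {waste_fraction(lens, paged):.0%}")
```

With these assumed numbers the cache comes to 64 GiB with 32 KV heads and 16 GiB with 8, several times the roughly 13 GiB a 7B-parameter model needs for FP16 weights. The four sample requests waste about 86% of a preallocated cache but only about 3% of a paged one, because paging loses at most one partial block per sequence.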
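The continuous batching lever is easiest to see in a toy scheduler. The simulation below is a deliberately simplified sketch: one step produces one token for every active sequence, prefill cost is ignored, and requests are served in arrival order. The request lengths and function names are made up for illustration and do not reflect how any particular framework schedules work.

```python
from typing import List, Sequence


def static_batching_steps(output_lens: Sequence[int], batch_size: int) -> int:
    """Fixed batches: each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(output_lens), batch_size):
        steps += max(output_lens[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lens: Sequence[int], batch_size: int) -> int:
    """A slot freed by a finished sequence is refilled before the next step."""
    queue: List[int] = list(output_lens)[::-1]  # pop() yields arrival order
    running: List[int] = []                     # tokens left per active sequence
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots.
        while queue and len(running) < batch_size:
            running.append(queue.pop())
        steps += 1
        # Every active sequence emits one token; drop those that just finished.
        running = [left - 1 for left in running if left > 1]
    return steps


def utilization(output_lens: Sequence[int], batch_size: int, steps: int) -> float:
    """Fraction of (step, slot) pairs that produced a useful token."""
    return sum(output_lens) / (steps * batch_size)


if __name__ == "__main__":
    lens = [10, 200, 15, 20, 180, 12, 25, 30]  # hypothetical output lengths
    for name, fn in (("static", static_batching_steps),
                     ("continuous", continuous_batching_steps)):
        steps = fn(lens, 4)
        print(f"{name}: {steps} steps, {utilization(lens, 4, steps):.0%} slot utilization")
```

On this sample workload the static scheduler needs 380 steps at about 32% slot utilization, while the continuous one needs 200 steps at about 62%. The gain comes entirely from short requests no longer waiting behind the longest sequence in their batch, so the benefit grows with the spread in output lengths.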
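The speculative decoding lever can be sketched with two toy "models". This is the greedy variant only, where a drafted token is kept when it equals the big model's own greedy choice; sampling-based decoding needs a probabilistic acceptance rule that is not shown here. The toy target and draft functions are invented stand-ins, and the verification loop only emulates what a real system does in a single batched forward pass.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def toy_target(ctx: Sequence[int]) -> int:
    """Stand-in for the big model's greedy next-token choice."""
    return (ctx[-1] * 7 + 3) % 50


def toy_draft(ctx: Sequence[int]) -> int:
    """Stand-in draft model: agrees with the target except at every 3rd position."""
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 50


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One draft-then-verify round; returns the tokens to append."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposal (one batched pass in a real system).
    accepted: List[int] = []
    for token in proposed:
        expected = target(context + accepted)
        if token != expected:
            accepted.append(expected)  # first mismatch: use the target's token
            return accepted
        accepted.append(token)

    # 3. Everything matched: the same pass also yields one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: Sequence[int],
             n_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_tokens; also report how many target passes were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds


if __name__ == "__main__":
    tokens, rounds = generate(toy_target, toy_draft, [1], n_tokens=20)
    print(f"generated {len(tokens) - 1} tokens in {rounds} target passes")
```

Every appended token is, by construction, the target's own greedy choice for its position, so the output is identical to decoding with the target alone. The speedup depends on how often the draft agrees: in this toy setup each round yields three tokens, so 20 tokens take 7 target passes instead of 20.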

In practice

Use a real serving framework (vLLM, SGLang, TensorRT-LLM) rather than hand-rolling. Measure your actual prompt/response shapes first — long shared prefixes favour prefix caching, high concurrency favours batching, long outputs favour KV-cache and quantization work. Track cost-per-1k-tokens, throughput, and tail latency — the numbers the business actually feels.
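The three metrics above are simple to compute once you log per-request latency and aggregate throughput. A minimal sketch follows, assuming a flat hourly GPU price and the nearest-rank percentile definition; the price, throughput, and latency samples are placeholders, not measurements.

```python
import math
from typing import Sequence


def cost_per_1k_tokens(gpu_hourly_usd: float, n_gpus: int,
                       tokens_per_second: float) -> float:
    """Serving cost in USD per 1,000 generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd * n_gpus / tokens_per_hour * 1000


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Placeholder inputs: one GPU at $2.50/hour sustaining 2,500 tokens/s.
    print(f"cost: ${cost_per_1k_tokens(2.50, 1, 2500):.6f} per 1k tokens")

    latencies_ms = [120, 95, 110, 400, 105, 98, 102, 2500, 115, 101]
    print(f"p50: {percentile(latencies_ms, 50)} ms, p99: {percentile(latencies_ms, 99)} ms")
```

In the placeholder data the median latency is 105 ms while the 99th percentile is 2500 ms, which is the reason to track the tail rather than the average: a single slow request barely moves the mean but is exactly what a user notices.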

Inference optimization is where AI economics are won or lost. The techniques are well understood and together routinely cut serving cost 5-10x — often the deciding factor in whether an AI feature ships at all.

Full version on the VSBD blog.