
AI Tech Connect

Posted on • Originally published at aitechconnect.in

Cut self-hosted LLM serving costs: quantise, batch, speculate

Originally published on AI Tech Connect.

What you need to know

Cost is a serving problem, not a model problem. The same open-weight model can cost 5–8x more to serve naively than to serve optimised on the same GPU. You change the runtime, not the weights.

Inference has two phases. Prefill is parallel and compute-bound; decode is sequential and memory-bandwidth-bound. The KV cache is the memory that grows with sequence length and batch size, and it is the dominant cost at scale.

Four levers stack. Quantisation (weights and KV cache), continuous batching, speculative decoding and prefix caching attack different bottlenecks, so their gains largely multiply rather than overlap.

Speculative decoding and prefix caching are free quality-wise. Quantisation is the one lever that can degrade output, so it is the one you must gate behind an eval.…
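To make the KV cache point concrete, here is a minimal sketch of the usual size estimate for a standard decoder-only transformer: two tensors (keys and values) per layer, each holding one vector per KV head per token. The model shape below (32 layers, 8 KV heads, head dimension 128) and the workload (4096 tokens, batch of 16) are illustrative assumptions, not figures from the article, and the function name is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_element):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both keys and values.
    Real runtimes add some overhead (paging, alignment) not modelled here.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


GIB = 1024 ** 3

# Assumed, illustrative model shape and workload.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128,
             seq_len=4096, batch_size=16)

fp16_cache = kv_cache_bytes(**shape, bytes_per_element=2)  # 16-bit cache
int8_cache = kv_cache_bytes(**shape, bytes_per_element=1)  # 8-bit cache

print(f"16-bit KV cache: {fp16_cache / GIB:.1f} GiB")
print(f" 8-bit KV cache: {int8_cache / GIB:.1f} GiB")
```

Under these assumptions the cache scales linearly with both sequence length and batch size, and storing it in 8 bits instead of 16 halves the footprint, which is memory a batching scheduler could spend on more concurrent sequences.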


Read the full article on AI Tech Connect →
