DEV Community

Yash Pritwani

Posted on • Originally published at techsaas.cloud

LLM Inference Optimization: Batching, Quantization, and Speculative Decoding


LLM Inference Optimization: Cut Costs 80% Without Cutting Quality

If you're serving LLM inference in production, you're probably paying 5-10x more than you need to. The default configurations of most serving frameworks optimize for simplicity, not efficiency.

Three techniques — continuous batching, quantization, and speculative decoding — can cut your inference costs by 80% and latency by 60%. Here's how each works and when to use them.

Technique 1: Continuous Batching

The Problem with Naive Batching

Traditional batching waits for N requests to arrive, then processes them together. This creates a latency-throughput tradeoff: small batches waste GPU cycles, large batches add waiting time.

Continuous Batching (Iteration-Level Scheduling)

Instead of batching at the request level, continuous batching schedules at the token level. New requests can join a running batch between token generations, and completed requests leave immediately.

# vLLM handles this automatically
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-70B",
    tensor_parallel_size=4,
    max_num_batched_tokens=32768,  # Total tokens across all requests in batch
    max_num_seqs=256,              # Max concurrent sequences
)

Impact: 3-5x throughput improvement over naive batching. Latency for individual requests stays low because they don't wait for a full batch to form.
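The gain from iteration-level scheduling is easy to see in a toy simulation. Everything here is assumed for illustration — the request count, output lengths, and batch size are made up, and real schedulers (vLLM, TGI) also manage KV-cache memory, which this sketch ignores:

```python
# Toy simulation of static vs continuous batching. ASSUMPTIONS: the
# request count, output lengths, and batch size are invented; real
# schedulers also account for KV-cache memory.
import random

random.seed(0)
lengths = [random.randint(10, 200) for _ in range(64)]  # output tokens per request
BATCH = 8                                               # sequences per decode step

def static_batching_steps(lengths, batch_size):
    # Each batch runs until its LONGEST sequence finishes, so short
    # sequences hold their GPU slot while doing nothing.
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    # Finished sequences leave immediately and queued requests join,
    # so slots stay busy for as long as any work remains.
    queue, running, steps = list(lengths), [], 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        running = [n - 1 for n in running if n > 1]  # n == 1 finishes this step
        steps += 1
    return steps

print("static:    ", static_batching_steps(lengths, BATCH))
print("continuous:", continuous_batching_steps(lengths, BATCH))
```

Continuous batching completes the same workload in noticeably fewer decode steps, because no GPU slot ever idles waiting for a batch-mate to finish.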

Benchmarks

| Serving Framework | Requests/sec (Llama-3-70B) | P50 Latency | P99 Latency |
|---|---|---|---|
| Naive batching | 12 | 2.1s | 8.4s |
| vLLM (continuous) | 47 | 0.8s | 2.1s |
| TGI (continuous) | 41 | 0.9s | 2.4s |

Technique 2: Quantization

Quantization reduces the precision of model weights from FP16 (16-bit) to INT8 or INT4, dramatically reducing memory usage and increasing inference speed.
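The memory savings fall straight out of the arithmetic. A minimal sketch of symmetric per-tensor INT8 quantization — note the assumptions: random weights, a made-up 1024×1024 layer, and a single per-tensor scale, whereas production schemes like GPTQ and AWQ use per-channel scales and calibration data (the storage math is identical):

```python
# Symmetric per-tensor INT8 quantization sketch. ASSUMPTIONS: random
# weights, a 1024x1024 layer, one global scale. GPTQ/AWQ use
# per-channel scales plus calibration, but store the same 1 byte/weight.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float16)  # FP16: 2 bytes/weight

scale = np.abs(w).max() / 127.0                 # map [-max, max] onto [-127, 127]
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # 1 byte/weight
w_restored = w_int8.astype(np.float16) * scale  # dequantized at inference time

print(f"{w.nbytes / 2**20:.0f} MiB -> {w_int8.nbytes / 2**20:.0f} MiB")
print("max abs error:",
      float(np.abs(w.astype(np.float32) - w_restored.astype(np.float32)).max()))
```

The worst-case rounding error per weight is about half the scale — small relative to the weights themselves, which is why quality loss stays in the low single digits.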

The Tradeoff

| Precision | Memory (70B model) | Speed vs FP16 | Quality Loss |
|---|---|---|---|
| FP16 | 140GB | 1x (baseline) | 0% |
| INT8 (GPTQ) | 70GB | 1.5-2x | <1% |
| INT4 (AWQ) | 35GB | 2-3x | 1-3% |
| INT4 (GGUF) | 35GB | 2-3x | 1-5% |

AWQ (Activation-aware Weight Quantization) is our recommendation for production. It preserves quality better than naive INT4 by identifying and protecting salient weight channels.

from vllm import LLM

# Serve a 70B model on a single A100 80GB (impossible with FP16)
llm = LLM(
    model="TheBloke/Llama-3-70B-AWQ",
    quantization="awq",
    tensor_parallel_size=1,  # Single GPU!
    gpu_memory_utilization=0.9,
)

When NOT to Quantize

  • Code generation models (precision matters for syntax)
  • Mathematical reasoning (quantization loses numerical precision)
  • Models smaller than 13B (the quality loss is proportionally larger)

Technique 3: Speculative Decoding

The insight: use a small, fast "draft" model to generate candidate tokens, then verify them in parallel with the large model. If the draft model is right (which it often is for common patterns), you get the speed of the small model with the quality of the large one.

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-70B",
    speculative_model="meta-llama/Llama-3-8B",  # Draft model
    num_speculative_tokens=5,  # Generate 5 draft tokens per step
)

Impact: 1.5-2.5x speedup for generation-heavy workloads. The speedup is highest when the output is predictable (common language patterns, structured data) and lowest for creative/novel outputs.
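A rough model shows why the acceptance rate dominates. Assume (simplification) that each draft token is accepted independently with probability `alpha`, and that `draft_cost` — the draft model's forward-pass cost relative to the target's — is a guessed ratio; real workloads only approximate both:

```python
# Back-of-envelope speedup model for speculative decoding. ASSUMPTIONS:
# independent per-token acceptance probability `alpha`; `draft_cost` is
# a guessed cost ratio (draft forward pass / target forward pass).
def expected_speedup(alpha, k, draft_cost=0.1):
    # Expected tokens per verification step: the accepted prefix of the
    # k drafts, plus the one token the target model always supplies.
    tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    # Step cost: k draft forwards plus one (batched) target forward.
    cost = k * draft_cost + 1
    return tokens / cost

print(round(expected_speedup(0.8, 5), 2))  # predictable output: ~2.5x
print(round(expected_speedup(0.3, 5), 2))  # creative output: below 1x
```

At a 30% acceptance rate the speedup drops below 1x — the draft model becomes pure overhead, which is exactly the creative-workload failure mode described above.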

Combining All Three

The techniques stack. Here's the configuration we use for a production chatbot serving 10K requests/hour:

llm = LLM(
    model="TheBloke/Llama-3-70B-AWQ",     # INT4 quantization
    quantization="awq",
    speculative_model="TheBloke/Llama-3-8B-AWQ",  # Quantized draft
    num_speculative_tokens=5,
    tensor_parallel_size=2,                 # 2x A100 40GB
    max_num_batched_tokens=32768,           # Continuous batching
    max_num_seqs=256,
)

Results vs naive FP16 serving:

  • Throughput: 12 req/s → 89 req/s (7.4x)
  • P50 latency: 2.1s → 0.4s (5.2x faster)
  • GPU cost: 4x A100 80GB → 2x A100 40GB (60% cost reduction)
  • Quality: <2% regression on MMLU benchmark

Common Mistakes

Before diving into infrastructure recommendations, avoid these pitfalls we've seen repeatedly:

  1. Quantizing without benchmarking on YOUR data. Generic benchmarks (MMLU, HumanEval) don't reflect your use case. A model that scores well on academic benchmarks might hallucinate on your domain-specific queries after quantization. Always evaluate on a test set from your actual production traffic.

  2. Using speculative decoding for creative tasks. Speculative decoding works best when the output is predictable — structured data, common language patterns, templated responses. For creative writing or novel reasoning, the draft model's predictions are wrong more often, reducing the speedup to near zero.

  3. Ignoring cold start latency. vLLM's first request after loading a model takes 5-10x longer than subsequent requests due to CUDA kernel compilation. If your traffic is bursty, keep models warm with synthetic heartbeat requests.

  4. Over-optimizing throughput at the expense of latency. Increasing batch size improves throughput but hurts tail latency. For interactive applications (chatbots, autocomplete), optimize for P95 latency first, then tune throughput.
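For mistake 3, the keep-warm heartbeat can be a background thread hitting the server on a timer. A sketch — the URL, model name, and request payload below are placeholders for your own OpenAI-compatible vLLM endpoint, not a prescribed API:

```python
# Keep-warm heartbeat sketch. ASSUMPTIONS: the URL, model name, and
# payload are placeholders for your own OpenAI-compatible vLLM server.
import json
import threading
import urllib.request

def send_heartbeat(url="http://localhost:8000/v1/completions"):
    body = json.dumps({
        "model": "llama-3-70b-awq",   # placeholder model name
        "prompt": "ping",
        "max_tokens": 1,
    }).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=10).read()
    except OSError:
        pass  # server busy or restarting; the next beat retries

def keep_warm(interval_s=60):
    stop = threading.Event()
    def loop():
        # Event.wait doubles as an interruptible sleep.
        while not stop.wait(interval_s):
            send_heartbeat()
    threading.Thread(target=loop, daemon=True).start()
    return stop  # call stop.set() to end the heartbeat

stop = keep_warm(interval_s=60)
```

One token per minute costs effectively nothing and keeps the compiled CUDA kernels and paged KV-cache allocator hot between traffic bursts.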

Infrastructure Recommendations

For Startups (< $5K/month inference budget)

  • Use vLLM with AWQ quantization on a single A100 40GB
  • Start with Llama-3-8B-AWQ — surprisingly capable for most use cases
  • Add speculative decoding if latency matters more than throughput
  • Monitor with Prometheus — track tokens/second, queue depth, and P95 latency

For Mid-Market ($5K-$50K/month)

  • vLLM cluster with continuous batching and tensor parallelism
  • A/B test INT8 vs INT4 quantization for your specific use case
  • Implement request routing: simple queries to 8B model, complex to 70B
  • Add semantic caching (Redis + embeddings) for repeated queries — cuts 30-40% of inference calls
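The semantic-caching idea from the last bullet, in miniature. This is a toy: a bag-of-words vector stands in for real embeddings, a Python list stands in for Redis, and the 0.9 similarity threshold is an arbitrary choice you would tune on your own traffic:

```python
# Toy semantic cache. ASSUMPTIONS: bag-of-words vectors instead of real
# embeddings, an in-memory list instead of Redis, arbitrary threshold.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.9):
        self.entries = []          # list of (embedding, response)
        self.threshold = threshold

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]         # cache hit: skip the LLM call entirely
        return None

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache(threshold=0.9)
cache.put("what is your refund policy", "Refunds within 30 days...")
print(cache.get("what is your refund policy ?"))  # near-duplicate -> hit
print(cache.get("how do I reset my password"))    # unrelated -> None
```

The production version is the same loop with a real embedding model and a Redis vector index — every hit is one inference call you never pay for.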

For Enterprise ($50K+/month)

  • Triton Inference Server for multi-model serving and advanced scheduling
  • Custom quantization calibrated on your domain data
  • Speculative decoding with fine-tuned draft models
  • Multi-region deployment with intelligent routing based on model availability and latency

Need help optimizing your LLM inference costs? We've deployed inference stacks that serve millions of requests at a fraction of the typical cost. Book a consultation or explore our AI infrastructure services.
