DEV Community

Yash Pritwani

Posted on • Originally published at techsaas.cloud

LLM Inference Optimization: Batching, Quantization, and Speculative Decoding


LLM Inference Optimization: Cut Costs 80% Without Cutting Quality

If you're serving LLM inference in production, you're probably paying 5-10x more than you need to. The default configurations of most serving frameworks optimize for simplicity, not efficiency.

Three techniques — continuous batching, quantization, and speculative decoding — can cut your inference costs by 80% and latency by 60%. Here's how each works and when to use them.

Technique 1: Continuous Batching

The Problem with Naive Batching

Traditional batching waits for N requests to arrive, then processes them together. This creates a latency-throughput tradeoff: small batches waste GPU cycles, large batches add waiting time.

Continuous Batching (Iteration-Level Scheduling)

Instead of batching at the request level, continuous batching schedules at the token level. New requests can join a running batch between token generations, and completed requests leave immediately.

# vLLM handles this automatically
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-70B",
    tensor_parallel_size=4,
    max_num_batched_tokens=32768,  # Total tokens across all requests in batch
    max_num_seqs=256,              # Max concurrent sequences
)

Impact: 3-5x throughput improvement over naive batching. Latency for individual requests stays low because they don't wait for a full batch to form.
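The gain from iteration-level scheduling is easy to see in a toy simulation. Everything here is assumed for illustration — the request count, output lengths, and batch size are made up, and real schedulers (vLLM, TGI) also manage KV-cache memory, which this sketch ignores:

```python
# Toy simulation of static vs continuous batching. ASSUMPTIONS: the
# request count, output lengths, and batch size are invented; real
# schedulers also account for KV-cache memory.
import random

random.seed(0)
lengths = [random.randint(10, 200) for _ in range(64)]  # output tokens per request
BATCH = 8                                               # sequences per decode step

def static_batching_steps(lengths, batch_size):
    # Each batch runs until its LONGEST sequence finishes, so short
    # sequences hold their GPU slot while doing nothing.
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    # Finished sequences leave immediately and queued requests join,
    # so slots stay busy for as long as any work remains.
    queue, running, steps = list(lengths), [], 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        running = [n - 1 for n in running if n > 1]  # n == 1 finishes this step
        steps += 1
    return steps

print("static:    ", static_batching_steps(lengths, BATCH))
print("continuous:", continuous_batching_steps(lengths, BATCH))
```

Continuous batching completes the same workload in noticeably fewer decode steps, because no GPU slot ever idles waiting for a batch-mate to finish.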

Benchmarks

| Serving Framework | Requests/sec (Llama-3-70B) | P50 Latency | P99 Latency |
|---|---|---|---|
| Naive batching | 12 | 2.1s | 8.4s |
| vLLM (continuous) | 47 | 0.8s | 2.1s |
| TGI (continuous) | 41 | 0.9s | 2.4s |

Technique 2: Quantization

Quantization reduces the precision of model weights from FP16 (16-bit) to INT8 or INT4, dramatically reducing memory usage and increasing inference speed.
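The memory savings fall straight out of the arithmetic. A minimal sketch of symmetric per-tensor INT8 quantization — note the assumptions: random weights, a made-up 1024×1024 layer, and a single per-tensor scale, whereas production schemes like GPTQ and AWQ use per-channel scales and calibration data (the storage math is identical):

```python
# Symmetric per-tensor INT8 quantization sketch. ASSUMPTIONS: random
# weights, a 1024x1024 layer, one global scale. GPTQ/AWQ use
# per-channel scales plus calibration, but store the same 1 byte/weight.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float16)  # FP16: 2 bytes/weight

scale = np.abs(w).max() / 127.0                 # map [-max, max] onto [-127, 127]
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # 1 byte/weight
w_restored = w_int8.astype(np.float16) * scale  # dequantized at inference time

print(f"{w.nbytes / 2**20:.0f} MiB -> {w_int8.nbytes / 2**20:.0f} MiB")
print("max abs error:",
      float(np.abs(w.astype(np.float32) - w_restored.astype(np.float32)).max()))
```

The worst-case rounding error per weight is about half the scale — small relative to the weights themselves, which is why quality loss stays in the low single digits.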

The Tradeoff

| Precision | Memory (70B model) | Speed vs FP16 | Quality Loss |
|---|---|---|---|
| FP16 | 140GB | 1x (baseline) | 0% |
| INT8 (GPTQ) | 70GB | 1.5-2x | <1% |
| INT4 (AWQ) | 35GB | 2-3x | 1-3% |
| INT4 (GGUF) | 35GB | 2-3x | 1-5% |

AWQ (Activation-aware Weight Quantization) is our recommendation for production. It preserves quality better than naive INT4 by identifying and protecting salient weight channels.

from vllm import LLM

# Serve a 70B model on a single A100 80GB (impossible with FP16)
llm = LLM(
    model="TheBloke/Llama-3-70B-AWQ",
    quantization="awq",
    tensor_parallel_size=1,  # Single GPU!
    gpu_memory_utilization=0.9,
)

When NOT to Quantize

  • Code generation models (precision matters for syntax)
  • Mathematical reasoning (quantization loses numerical precision)
  • Models smaller than 13B (the quality loss is proportionally larger)

Technique 3: Speculative Decoding

The insight: use a small, fast "draft" model to generate candidate tokens, then verify them in parallel with the large model. If the draft model is right (which it often is for common patterns), you get the speed of the small model with the quality of the large one.

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-70B",
    speculative_model="meta-llama/Llama-3-8B",  # Draft model
    num_speculative_tokens=5,  # Generate 5 draft tokens per step
)

Impact: 1.5-2.5x speedup for generation-heavy workloads. The speedup is highest when the output is predictable (common language patterns, structured data) and lowest for creative/novel outputs.
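A rough model shows why the acceptance rate dominates. Assume (simplification) that each draft token is accepted independently with probability `alpha`, and that `draft_cost` — the draft model's forward-pass cost relative to the target's — is a guessed ratio; real workloads only approximate both:

```python
# Back-of-envelope speedup model for speculative decoding. ASSUMPTIONS:
# independent per-token acceptance probability `alpha`; `draft_cost` is
# a guessed cost ratio (draft forward pass / target forward pass).
def expected_speedup(alpha, k, draft_cost=0.1):
    # Expected tokens per verification step: the accepted prefix of the
    # k drafts, plus the one token the target model always supplies.
    tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    # Step cost: k draft forwards plus one (batched) target forward.
    cost = k * draft_cost + 1
    return tokens / cost

print(round(expected_speedup(0.8, 5), 2))  # predictable output: ~2.5x
print(round(expected_speedup(0.3, 5), 2))  # creative output: below 1x
```

At a 30% acceptance rate the speedup drops below 1x — the draft model becomes pure overhead, which is exactly the creative-workload failure mode described above.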

Combining All Three

The techniques stack. Here's the configuration we use for a production chatbot serving 10K requests/hour:

llm = LLM(
    model="TheBloke/Llama-3-70B-AWQ",     # INT4 quantization
    quantization="awq",
    speculative_model="TheBloke/Llama-3-8B-AWQ",  # Quantized draft
    num_speculative_tokens=5,
    tensor_parallel_size=2,                 # 2x A100 40GB
    max_num_batched_tokens=32768,           # Continuous batching
    max_num_seqs=256,
)

Results vs naive FP16 serving:

  • Throughput: 12 req/s → 89 req/s (7.4x)
  • P50 latency: 2.1s → 0.4s (5.2x faster)
  • GPU cost: 4x A100 80GB → 2x A100 40GB (60% cost reduction)
  • Quality: <2% regression on MMLU benchmark

Common Mistakes

Before diving into infrastructure recommendations, avoid these pitfalls we've seen repeatedly:

  1. Quantizing without benchmarking on YOUR data. Generic benchmarks (MMLU, HumanEval) don't reflect your use case. A model that scores well on academic benchmarks might hallucinate on your domain-specific queries after quantization. Always evaluate on a test set from your actual production traffic.

  2. Using speculative decoding for creative tasks. Speculative decoding works best when the output is predictable — structured data, common language patterns, templated responses. For creative writing or novel reasoning, the draft model's predictions are wrong more often, reducing the speedup to near zero.

  3. Ignoring cold start latency. vLLM's first request after loading a model takes 5-10x longer than subsequent requests due to CUDA kernel compilation. If your traffic is bursty, keep models warm with synthetic heartbeat requests.

  4. Over-optimizing throughput at the expense of latency. Increasing batch size improves throughput but hurts tail latency. For interactive applications (chatbots, autocomplete), optimize for P95 latency first, then tune throughput.
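For mistake 3, the keep-warm heartbeat can be a background thread hitting the server on a timer. A sketch — the URL, model name, and request payload below are placeholders for your own OpenAI-compatible vLLM endpoint, not a prescribed API:

```python
# Keep-warm heartbeat sketch. ASSUMPTIONS: the URL, model name, and
# payload are placeholders for your own OpenAI-compatible vLLM server.
import json
import threading
import urllib.request

def send_heartbeat(url="http://localhost:8000/v1/completions"):
    body = json.dumps({
        "model": "llama-3-70b-awq",   # placeholder model name
        "prompt": "ping",
        "max_tokens": 1,
    }).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=10).read()
    except OSError:
        pass  # server busy or restarting; the next beat retries

def keep_warm(interval_s=60):
    stop = threading.Event()
    def loop():
        # Event.wait doubles as an interruptible sleep.
        while not stop.wait(interval_s):
            send_heartbeat()
    threading.Thread(target=loop, daemon=True).start()
    return stop  # call stop.set() to end the heartbeat

stop = keep_warm(interval_s=60)
```

One token per minute costs effectively nothing and keeps the compiled CUDA kernels and paged KV-cache allocator hot between traffic bursts.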

Infrastructure Recommendations

For Startups (< $5K/month inference budget)

  • Use vLLM with AWQ quantization on a single A100 40GB
  • Start with Llama-3-8B-AWQ — surprisingly capable for most use cases
  • Add speculative decoding if latency matters more than throughput
  • Monitor with Prometheus — track tokens/second, queue depth, and P95 latency

For Mid-Market ($5K-$50K/month)

  • vLLM cluster with continuous batching and tensor parallelism
  • A/B test INT8 vs INT4 quantization for your specific use case
  • Implement request routing: simple queries to 8B model, complex to 70B
  • Add semantic caching (Redis + embeddings) for repeated queries — cuts 30-40% of inference calls
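The semantic-caching idea from the last bullet, in miniature. This is a toy: a bag-of-words vector stands in for real embeddings, a Python list stands in for Redis, and the 0.9 similarity threshold is an arbitrary choice you would tune on your own traffic:

```python
# Toy semantic cache. ASSUMPTIONS: bag-of-words vectors instead of real
# embeddings, an in-memory list instead of Redis, arbitrary threshold.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.9):
        self.entries = []          # list of (embedding, response)
        self.threshold = threshold

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]         # cache hit: skip the LLM call entirely
        return None

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache(threshold=0.9)
cache.put("what is your refund policy", "Refunds within 30 days...")
print(cache.get("what is your refund policy ?"))  # near-duplicate -> hit
print(cache.get("how do I reset my password"))    # unrelated -> None
```

The production version is the same loop with a real embedding model and a Redis vector index — every hit is one inference call you never pay for.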

For Enterprise ($50K+/month)

  • Triton Inference Server for multi-model serving and advanced scheduling
  • Custom quantization calibrated on your domain data
  • Speculative decoding with fine-tuned draft models
  • Multi-region deployment with intelligent routing based on model availability and latency

Need help optimizing your LLM inference costs? We've deployed inference stacks that serve millions of requests at a fraction of the typical cost. Book a consultation or explore our AI infrastructure services.
