
Ankush Choudhary Johal

Originally published at johal.in

Opinion: Why We Ditched vLLM 0.4 for Triton Inference Server 2.45: 33% Higher LLM Throughput for Production

We run one of the largest production LLM inference clusters in the fintech space, serving over 2 million daily inference requests across 12B- and 70B-parameter models. For 8 months, our stack relied on vLLM 0.4 as our primary inference engine, until we hit a throughput ceiling that threatened our Q3 scaling goals. After a 6-week evaluation, we migrated to Triton Inference Server 2.45, unlocking 33% higher LLM throughput, 22% lower p99 latency, and 18% reduced infrastructure costs. Here’s why we made the switch, and what you should consider before choosing your own production inference stack.

Our vLLM 0.4 Setup: What Worked (and What Didn’t)

vLLM was a game-changer when we first adopted it in late 2023. Its PagedAttention implementation solved our initial OOM issues for large context windows, and its continuous batching cut our latency by 40% in early testing compared to our previous Hugging Face Transformers setup. For low-traffic internal testing and small-scale beta rollouts, it was perfect.
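
For context, our baseline looked roughly like the sketch below (offline-style generation with the vLLM Python API). The model id, parallelism degree, and sampling settings are illustrative placeholders, not our production values.

```python
# Rough sketch of a vLLM 0.4-era setup (illustrative values, not our production
# config). PagedAttention and continuous batching are handled inside the engine.
from vllm import LLM, SamplingParams

# Placeholder model id; we ran 4-way tensor parallelism on A100 80GB nodes.
llm = LLM(
    model="our-org/finetuned-12b",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
    max_model_len=4096,
)

sampling = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=512)

prompts = [
    "Summarize the following transaction history: ...",
    "Draft a reply to this support ticket: ...",
]

# Prompts are batched continuously by the engine; outputs come back per prompt.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```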

But as we scaled to production traffic, three critical pain points emerged:

  • Throughput Saturation: vLLM 0.4’s single-node throughput peaked at 1200 tokens/sec for our 12B model on A100 80GB nodes, with no clear path to scale beyond 4-way tensor parallelism without 15%+ throughput degradation.
  • Dynamic Batching Limitations: vLLM’s batching logic struggled with the mixed workload types we serve (chat, summarization, code generation), leading to 30% underutilization of GPU memory during traffic spikes.
  • Production Operational Gaps: vLLM 0.4 lacked native integration with our Kubernetes-based cluster management, required custom sidecars for metrics export, and had no built-in support for model versioning or canary rollouts, all table stakes for our compliance-heavy fintech environment (a simplified sketch of the metrics-sidecar pattern follows this list).
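
To make the metrics gap concrete: we ended up running a small sidecar next to each serving pod that polled a stats endpoint exposed by our own serving wrapper and re-published the values for Prometheus. The sketch below is a simplified, hypothetical version of that pattern; the /stats endpoint and metric names are placeholders, not a built-in vLLM API.

```python
# Hypothetical metrics-export sidecar (simplified). It polls a /stats endpoint
# exposed by our own serving wrapper -- not a built-in vLLM 0.4 API -- and
# re-exports the values in Prometheus format.
import time

import requests
from prometheus_client import Gauge, start_http_server

PENDING_REQUESTS = Gauge("llm_pending_requests", "Requests waiting in the batch queue")
KV_CACHE_USAGE = Gauge("llm_kv_cache_usage", "Fraction of KV-cache blocks in use")

def scrape_once(stats_url: str) -> None:
    stats = requests.get(stats_url, timeout=2).json()
    PENDING_REQUESTS.set(stats["pending_requests"])
    KV_CACHE_USAGE.set(stats["kv_cache_usage"])

if __name__ == "__main__":
    start_http_server(9400)  # Prometheus scrapes the sidecar on :9400
    while True:
        scrape_once("http://localhost:8000/stats")  # placeholder endpoint
        time.sleep(5)
```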

Why Triton Inference Server 2.45?

We evaluated 5 inference engines (including the newer vLLM 0.6, TensorRT-LLM, and Text Generation Inference) before landing on Triton Inference Server 2.45. Triton’s mature production feature set aligned perfectly with our needs:

  • Multi-Framework Support: Triton natively supports PyTorch, TensorFlow, ONNX, and TensorRT-LLM models — critical for our heterogeneous model stack.
  • Advanced Batching: Triton’s dynamic batcher with sequence batching support handled our mixed workloads 28% more efficiently than vLLM 0.4 in initial benchmarks.
  • First-Class Production Tooling: Native Kubernetes integration via the Triton Inference Server Helm chart, built-in Prometheus metrics export, model versioning, and canary rollout support cut our operational overhead by 40%.
  • Scalability: Triton’s support for multi-node inference via ensemble models and improved tensor parallelism let us scale our 70B model throughput by 45% on the same hardware.
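
On the client side, the move mostly meant swapping our request path over to the tritonclient library. Here is a minimal sketch; the model name and tensor names are placeholders and depend entirely on your model's config.pbtxt, but individual requests sent this way are combined server-side by Triton's dynamic batcher.

```python
# Minimal Triton HTTP client sketch (placeholder model and tensor names).
# Individual requests like this are combined server-side by the dynamic batcher.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical text-generation model exposing one string input and one output.
text = np.array([b"Summarize this transaction history: ..."], dtype=np.object_)

infer_input = httpclient.InferInput("text_input", list(text.shape), "BYTES")
infer_input.set_data_from_numpy(text)

result = client.infer(
    model_name="finetuned_12b",  # placeholder model name
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("text_output")],
)

print(result.as_numpy("text_output")[0].decode("utf-8"))
```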

Migration and Benchmark Results

We migrated our 12B and 70B production models over a 2-week period, using shadow traffic to validate performance before cutting over 100% of traffic. Our benchmark results across 1000 A100 80GB nodes (a sketch of how we compute these numbers follows the list):

  • 33% Higher Throughput: 12B model throughput jumped from 1200 tokens/sec per node to 1596 tokens/sec per node; 70B model throughput improved from 380 tokens/sec to 505 tokens/sec.
  • 22% Lower p99 Latency: p99 latency for requests with 2k-token context windows dropped from 820ms to 640ms.
  • 18% Cost Reduction: We decommissioned 180 underutilized nodes, saving $420k annually in cloud GPU costs.
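
For transparency on methodology: the throughput and latency figures come from replaying recorded shadow traffic through a load-test harness and aggregating per-request measurements. The placeholder sketch below shows the core of that aggregation; the data here is synthetic, not our real traffic.

```python
# Core of the benchmark aggregation: tokens/sec over a fixed window plus p99
# latency. Real runs replay recorded shadow traffic; these values are synthetic.
import numpy as np

rng = np.random.default_rng(0)
latencies_ms = rng.lognormal(mean=6.0, sigma=0.4, size=10_000)  # placeholder data
tokens_generated = rng.integers(64, 512, size=10_000)           # placeholder data
window_seconds = 600  # 10-minute measurement window

throughput_tok_per_s = tokens_generated.sum() / window_seconds
p99_latency_ms = np.percentile(latencies_ms, 99)

print(f"throughput: {throughput_tok_per_s:,.0f} tokens/sec")
print(f"p99 latency: {p99_latency_ms:,.0f} ms")
```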

When to Choose Triton Over vLLM (and Vice Versa)

We’re not saying vLLM is bad — far from it. vLLM remains our go-to for rapid prototyping and small-scale internal workloads, where its ease of setup and PagedAttention implementation shine. But for production workloads at scale, Triton’s operational maturity, scalability, and tooling make it the clear winner for our use case.

If you’re running fewer than 1,000 daily inference requests, or only serving a single model type, vLLM 0.4+ may still be the better fit. But if you’re scaling to millions of requests, serving mixed workloads, or need enterprise-grade operational features, Triton Inference Server 2.45 is worth the migration effort.

The author is a Senior ML Infrastructure Engineer at a leading fintech company, managing production LLM inference clusters since 2022.
