Most people benchmark inference engines on throughput. Tokens per second, batch size limits, latency percentiles. But when you're training agents with reinforcement learning, there's a metric that matters more: correctness. A silent bug in your inference stack doesn't just slow you down—it poisons your training data, and you won't know for weeks.
The vLLM team just shipped V1, and buried in the release notes are fixes that should make anyone running RL training take notice. They found and corrected subtle correctness issues in how V0 handled certain token sequences under grouped query attention. The kind of bugs that don't crash your job but quietly shift your reward model's understanding of what "good" looks like.
Why RL is Unforgiving
Supervised fine-tuning is forgiving. If your inference engine produces slightly different logits for 0.1% of tokens, the gradient updates average out. RL is different. You're generating rollouts, computing advantages, updating policy and value networks in tight loops. A correctness bug doesn't average out—it compounds. Your policy learns from corrupted rollouts. Your value function trains on garbage advantages. By the time you notice the loss curve looks weird, you've burned thousands of GPU hours.
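Here is a minimal numerical sketch of that difference, with made-up numbers: zero-mean noise on per-token logits shrinks when you average over tokens, but a small systematic log-prob bias accumulates over a whole rollout and distorts every sequence-level quantity built on top of it.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len = 512
per_token_bias = 1e-3  # assumed systematic log-prob error per token (illustrative)

# SFT-like case: zero-mean noise averages out across tokens.
noise = rng.normal(0.0, 1e-3, size=seq_len)
print("mean of zero-mean noise:", noise.mean())  # ~0

# RL rollout case: a systematic bias accumulates over the trajectory,
# distorting the sequence-level probability ratio multiplicatively.
trajectory_logprob_error = per_token_bias * seq_len
print("sequence-level ratio distortion:", np.exp(trajectory_logprob_error))
# exp(0.512) ≈ 1.67 — roughly a 67% distortion, from a 0.1%-scale per-token error.
```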
The vLLM V0 bugs were subtle enough to pass standard tests. They manifested under specific conditions: long contexts with particular attention patterns, batched generations with heterogeneous lengths, certain temperature settings. Exactly the conditions you hit when training agents that need to explore environments, maintain state, and generate variable-length reasoning traces.
What Changed in V1
The V1 rewrite isn't just a refactor. The team rebuilt the attention backends with correctness as the primary constraint, then optimized. They added comprehensive property-based testing that generates random sequences and verifies equivalence against a reference implementation. They caught edge cases in rotary position embeddings that only appeared at context lengths above 16k tokens.
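To make the property-based idea concrete, here is a hedged sketch of what such a test can look like. This is not vLLM's actual test suite: `fast_attention` is a hypothetical stand-in for the optimized backend under test, and the reference is a naive softmax attention.

```python
# A minimal sketch (not vLLM's own tests) of property-based equivalence testing:
# generate random inputs and check the optimized kernel against a naive reference.
import numpy as np
from hypothesis import given, settings, strategies as st

def reference_attention(q, k, v):
    """Naive softmax attention, used as ground truth."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def fast_attention(q, k, v):
    """Hypothetical stand-in for the optimized kernel under test."""
    return reference_attention(q, k, v)  # swap in the real backend here

@settings(max_examples=200)
@given(
    seq_len=st.integers(min_value=1, max_value=128),
    head_dim=st.sampled_from([32, 64, 128]),
    seed=st.integers(min_value=0, max_value=2**31 - 1),
)
def test_attention_matches_reference(seq_len, head_dim, seed):
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((seq_len, head_dim))
    k = rng.standard_normal((seq_len, head_dim))
    v = rng.standard_normal((seq_len, head_dim))
    np.testing.assert_allclose(
        fast_attention(q, k, v),
        reference_attention(q, k, v),
        rtol=1e-5, atol=1e-6,
    )
```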
More importantly, they changed how they think about the PagedAttention algorithm. V0 optimized for throughput first. V1 optimizes for correctness first, then recovers throughput. The result is an engine that generates identical outputs to reference implementations across the test matrix, while still maintaining competitive performance.
The Production Lesson
If you're running RL training at scale, you need to audit your inference stack for correctness, not just speed. Run equivalence tests against a reference implementation on your actual training distribution. Generate thousands of rollouts with both engines and compare reward distributions. Monitor KL estimates between your policy and the reference policy for drift.
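One way to run that kind of audit, sketched under assumptions: you have already dumped per-token log-probs for the same prompts and the same sampled continuations from both engines. The file names and loader below are hypothetical, not part of any engine's API.

```python
# Rough sketch of an equivalence audit between two inference stacks.
# Assumes aligned per-token log-probs for identical token sequences from both
# engines; the loader and file names are hypothetical.
import numpy as np

def load_logprobs(path):
    """Hypothetical loader: one array of per-token log-probs per rollout."""
    return np.load(path, allow_pickle=True)

candidate = load_logprobs("rollouts_candidate_engine.npy")
reference = load_logprobs("rollouts_reference_engine.npy")

per_rollout_kl = []
for lp_c, lp_r in zip(candidate, reference):
    # Mean per-token log-prob gap on identical tokens is a Monte Carlo
    # estimate of KL(candidate || reference) under the candidate's samples.
    per_rollout_kl.append(float(np.mean(lp_c - lp_r)))

per_rollout_kl = np.asarray(per_rollout_kl)
print(f"mean KL estimate:    {per_rollout_kl.mean():.6f}")
print(f"p99 |KL| estimate:   {np.quantile(np.abs(per_rollout_kl), 0.99):.6f}")

# Flag divergence beyond a tolerance you trust (assumed value, tune for your setup).
assert abs(per_rollout_kl.mean()) < 1e-3, "engines disagree beyond tolerance"
```

The same aligned rollouts can feed a reward-distribution comparison: score both sets with your reward model and check that the distributions match, not just the means.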
vLLM V1 is a reminder that infrastructure for agent training has different requirements than infrastructure for chatbots. When your model is generating its own training data, correctness isn't a nice-to-have. It's the foundation everything else builds on.
The throughput numbers in V1 are good. But the correctness guarantees are what make it production-ready for RL.