EVAL #008: NVIDIA Just Open-Sourced an Inference Engine. Now What?
By Ultra Dune | EVAL — The AI Tooling Intelligence Report | March 25, 2026
GTC happened. The model wave hit. And the inference stack will never look the same.
This was the densest week in AI tooling since the original ChatGPT launch sent everyone scrambling to ship embeddings. PyTorch 2.7 landed with native FP4. vLLM and SGLang both dropped major releases within 48 hours of each other. Transformers shipped support for four new model families simultaneously. And then NVIDIA walked into the room and open-sourced Dynamo — a full inference orchestration framework that competes directly with every serving engine in the ecosystem.
If you deploy models in production, this week changed your decision matrix. Let's break it down.
The Eval: NVIDIA Dynamo and the Inference Stack Shakeup
The Announcement Nobody Expected
At GTC 2026, Jensen Huang did what Jensen does best — he made an announcement that sounds like a partnership but is actually a declaration of war. NVIDIA open-sourced Dynamo, a Rust-and-Python inference orchestration framework designed for multi-node, multi-GPU LLM serving at data center scale.
On the surface, Dynamo is "just an orchestration layer" that sits on top of existing inference engines. It can use vLLM or TensorRT-LLM as its execution backend. NVIDIA positioned it as complementary — a rising tide that lifts all boats.
I don't buy it. Here's why.
What Dynamo Actually Does
Dynamo's core insight is that prefill and decode have fundamentally different hardware profiles. Prefill (processing your input tokens) is compute-bound — it wants raw FLOPS. Decode (generating output tokens one at a time) is memory-bandwidth-bound — it wants fast memory access. Every current inference engine runs both phases on the same GPU, which means every GPU is suboptimal for at least one phase at any given moment.
Dynamo disaggregates these phases. Separate GPU pools handle prefill and decode independently, connected by NIXL — a zero-copy, RDMA-enabled KV cache transfer library. A smart routing layer (the "Planner") makes real-time decisions about request placement based on KV cache locality, GPU load, and prefix sharing.
Think of it this way: Dynamo is to inference engines what Kubernetes is to containers. It doesn't replace the runtime. It orchestrates deployment topology.
The key components:
- Disaggregated prefill/decode: Independent GPU pools for each phase, independently scalable
- NIXL KV cache transfer: Zero-copy RDMA transfers between prefill and decode nodes — sub-millisecond for typical cache sizes
- Smart routing: KV-cache-aware request routing. If a system prompt is already cached on a node, your request goes there
- MoE-aware scheduling: Expert-parallel routing for Mixture-of-Experts models, routing tokens to the GPUs hosting relevant experts
- Elastic scaling: Dynamic GPU allocation based on workload demand
- Rust control plane: High-performance orchestration with Python APIs for ecosystem compatibility
- Backend agnostic: vLLM or TensorRT-LLM as execution backends
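The routing idea is easy to see in miniature. Below is a toy sketch in plain Python, not the Dynamo API: the worker names, the `route` function, and the scoring heuristic are all invented for illustration. The principle is the one described above: prefer the worker that already holds the longest cached prefix of the incoming request, breaking ties on load.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the common token prefix between two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens: list[str], workers: dict[str, dict]) -> str:
    """Pick the worker with the most reusable KV cache; break ties on load.

    `workers` maps a worker name to {"cached": [...tokens...], "load": float}.
    """
    def score(name: str) -> tuple[int, float]:
        w = workers[name]
        # More cached prefix is better; among equals, lower load is better.
        return (shared_prefix_len(request_tokens, w["cached"]), -w["load"])
    return max(workers, key=score)

workers = {
    "decode-a": {"cached": ["<sys>", "You", "are", "helpful"], "load": 0.9},
    "decode-b": {"cached": ["<sys>", "You"], "load": 0.1},
    "decode-c": {"cached": [], "load": 0.0},
}
# The request shares 4 tokens with decode-a's cache, so it routes there
# despite the higher load (a real planner weighs load far more carefully).
print(route(["<sys>", "You", "are", "helpful", "Hi"], workers))  # -> decode-a
```

The real Planner also folds in GPU topology and prefill/decode pool state, but cache locality as a first-class routing signal is the core of it.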
How It Compares
Here's the honest comparison matrix. I'm including vLLM 0.8.3, SGLang 0.5.2, and TGI 3.2.0 — all released the same week as Dynamo.
Scope and Architecture:
vLLM, SGLang, and TGI are inference engines. They manage batching, attention, KV caches, and token generation on a single node (with basic tensor/pipeline parallelism for multi-GPU). Dynamo operates above this layer — it manages fleets of inference workers across a cluster.
Disaggregated Serving:
Only Dynamo has first-class support. vLLM 0.8.3 has it as "experimental." SGLang and TGI don't. This is a genuine architectural advantage. NVIDIA's own benchmarks claim 3x throughput for long-context workloads. Third-party tests confirm meaningful gains at scale, with the caveat that KV transfer overhead eats into the benefit for short requests.
Multi-Node Intelligence:
vLLM and SGLang handle multi-GPU within a node well. Cross-node, they rely on basic tensor parallelism. Dynamo provides cluster-wide routing with KV cache locality awareness. If you're running inference across 8+ nodes, nothing else does this.
Structured Output:
Dynamo doesn't touch this — it relies on the backend engine. Here's where vLLM and SGLang shine independently. Both now integrate XGrammar, which achieves <3% overhead for JSON schema-constrained generation. SGLang's jump-forward decoding goes further — it skips 40-60% of LLM forward passes for known JSON schemas by identifying deterministic token sequences. More on this below.
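Jump-forward decoding is simpler than it sounds, and a toy loop makes the mechanism clear. The sketch below is not the XGrammar or SGLang API; the transition table and function names are invented. The idea is exactly the one above: whenever the grammar permits exactly one next token, emit it directly and skip the forward pass.

```python
def generate(model_step, allowed_next, max_steps=20):
    """Toy jump-forward loop.

    `allowed_next(ctx)` returns the set of grammar-legal next tokens;
    `model_step(ctx, allowed)` is only called when there is a real choice.
    Returns (tokens, model_calls).
    """
    out, calls = [], 0
    for _ in range(max_steps):
        allowed = allowed_next(tuple(out))
        if not allowed:
            break                        # grammar accepts: generation done
        if len(allowed) == 1:
            tok = next(iter(allowed))    # jump forward: no forward pass
        else:
            calls += 1
            tok = model_step(tuple(out), allowed)
        out.append(tok)
    return out, calls

# Grammar for {"name": <string>}: everything except the value is forced.
TABLE = {
    (): {'{"name":'},
    ('{"name":',): {'"Ada"', '"Bob"'},           # the only free choice
    ('{"name":', '"Ada"'): {'}'},
    ('{"name":', '"Bob"'): {'}'},
    ('{"name":', '"Ada"', '}'): set(),
    ('{"name":', '"Bob"', '}'): set(),
}
tokens, calls = generate(
    model_step=lambda ctx, allowed: sorted(allowed)[0],  # stand-in "LLM"
    allowed_next=lambda ctx: TABLE[ctx],
)
print(tokens, calls)  # 3 tokens emitted with a single model call
```

For rigid schemas, most of the output is structural punctuation and keys, which is why skipping the forced tokens eliminates such a large fraction of forward passes.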
The Catch:
Dynamo's value is almost entirely at scale. If you're running 1-4 GPUs, raw vLLM or SGLang will outperform Dynamo because the orchestration overhead isn't justified. The break-even point appears to be around 8+ GPUs across multiple nodes with high-concurrency workloads.
The Strategic Read
Here's what the community is missing. NVIDIA doesn't release open-source tools out of generosity. They release them to drive GPU adoption. Dynamo is optimized for NVIDIA hardware. It uses NIXL, which relies on NVIDIA's networking stack. It integrates tightest with TensorRT-LLM. The Planner is tuned for NVIDIA GPU topologies.
This is the NVIDIA playbook: open-source the software layer to lock in the hardware layer. If Dynamo becomes the standard orchestration framework for inference at scale, every new cluster deployment becomes a conversation about NVIDIA networking in addition to NVIDIA GPUs.
Is that bad? Not necessarily. The Apache 2.0 license is real. The code is on GitHub. But understand the incentive structure: Dynamo's job is to make you need more NVIDIA GPUs, not fewer.
When to Use What
Here's my honest recommendation:
Use vLLM 0.8.3 if: You're running single-node or basic multi-GPU inference. The radix tree prefix caching is excellent. The model support is broadest. The community is largest. Calibration-free FP8 on Blackwell is production-ready.
Use SGLang 0.5.2 if: You need the fastest structured output generation (jump-forward + XGrammar is unbeatable). You're deploying DeepSeek-R2 or any MoE model (their custom MoE kernel is 35% faster than vLLM on DeepSeek-R2-671B). You want cutting-edge throughput and don't mind a smaller community.
Use TGI 3.2.0 if: You're in the HuggingFace ecosystem, need multi-LoRA serving (dynamic adapter loading with per-request selection, up to 50 concurrent adapters), or want the simplest path from model hub to production.
Use Dynamo if: You're operating at data center scale (8+ nodes), need disaggregated prefill/decode, want KV-cache-aware routing across a cluster, or you're already deep in the NVIDIA enterprise stack. If you aren't sure whether you need Dynamo, you don't need Dynamo.
The Bigger Picture
The inference stack is being decomposed. A year ago, you picked one engine and it handled everything. Now we have specialized layers — execution engines (vLLM, SGLang, TGI), orchestration frameworks (Dynamo), structured generation engines (XGrammar), and quantization toolchains (TorchAO). Each layer is independently optimizable.
This is what mature infrastructure looks like. It's also more complex to operate. The paradox of choice is real: five options, each optimal for a different deployment profile, none of them wrong.
My bet: Dynamo will matter enormously for cloud providers and large enterprises. SGLang will eat vLLM's lead on performance benchmarks. TGI will stay the best "it just works" option for HuggingFace users. And in 12 months, someone will build a meta-framework that abstracts over all of them. Probably NVIDIA.
The Changelog
1. PyTorch 2.7.0 — The Foundation Shifts
The biggest release this week by impact radius. 1,200+ commits from 350+ contributors.
- torch.compile for LLMs: Auto KV cache management, dynamic sequence lengths without graph breaks. 45% faster compilation. This alone justifies the upgrade for any training pipeline.
- FlexAttention: Sliding window, MLA (Multi-head Latent Attention), Mixture-of-Attention patterns now hit 95% of FlashAttention-3 performance with full programmability.
- Native FP4 support: torch.float4_e2m1 for Blackwell GPUs via TorchAO. MXFP4 (microscaling FP4 with a shared 8-bit exponent per 32-element block) preserves 96-98% of FP8 accuracy with 2x memory savings. No calibration data needed, unlike INT4, which requires careful calibration. Still experimental, and only hardware-accelerated on Blackwell.
- Async Tensor Parallelism: Overlaps communication with computation for distributed training.
- Context Parallel: Long-context training up to 2M tokens.
If you haven't upgraded from PyTorch 2.6, this is the release worth the migration pain.
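To make the MXFP4 mechanism concrete, here is a pure-Python sketch, emphatically not TorchAO's implementation. It uses the e2m1 magnitude grid (0, 0.5, 1, 1.5, 2, 3, 4, 6) and a shared power-of-two scale per block, as described above, but all rounding details are simplified for readability.

```python
import math

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable FP4 magnitudes

def mxfp4_block(values):
    """Quantize/dequantize one microscaled block: a shared power-of-two
    scale maps the block's max magnitude onto the FP4 grid, then each
    element snaps to the nearest representable value times that scale."""
    amax = max(abs(v) for v in values) or 1.0
    # Shared scale (stored as an 8-bit exponent in real MXFP4).
    scale = 2.0 ** math.floor(math.log2(amax / max(E2M1)))
    def snap(v):
        mag = min(E2M1, key=lambda g: abs(g - abs(v) / scale))
        return math.copysign(mag * scale, v)
    return [snap(v) for v in values]

vals = [0.11, -0.8, 0.27, 1.9, -3.2, 0.05]
# Large values survive well; tiny ones round to zero, which is why the
# per-block (rather than per-tensor) scale matters for accuracy.
print(mxfp4_block(vals))
```

Because the scale is derived directly from the block's own max, there is no calibration pass over sample data, which is the practical difference from INT4 that the bullet above points at.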
2. vLLM 0.8.3 — Prefix Caching Gets Serious
- Radix tree prefix caching: 60% TTFT reduction for shared system prompts. If you're running agents that share a long system prompt, this changes your latency profile overnight.
- Multi-node TP redesign: 40% reduced latency on 8-node configs with new NCCL tuning for H100/H200.
- Calibration-free FP8 W8A8: <0.5% accuracy degradation with per-channel FP8 on Blackwell. No calibration dataset needed. Just quantize and serve.
- New models: Gemma 3, Command A, DeepSeek-R2, Llama 4 Scout/Maverick.
- Breaking: Dropped PagedAttention v1, requires CUDA 12.4+, dropped Python 3.9.
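The prefix-caching win is easy to demonstrate with a toy trie. This sketch is not vLLM's radix tree (which stores actual KV blocks and handles eviction); the class and method names are invented. It only shows why agents sharing one long system prompt see such a large TTFT drop.

```python
class RadixCache:
    """Toy prefix cache: stores token sequences in a trie so a new request
    can reuse the (notional) KV entries of its longest seen prefix."""
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def cached_prefix_len(self, tokens):
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

cache = RadixCache()
system = ["<sys>"] + ["rule"] * 200            # long shared system prompt
cache.insert(system + ["user", "question", "1"])

req = system + ["user", "question", "2"]
hit = cache.cached_prefix_len(req)
# Only the non-shared tail needs prefill compute: that skipped prefix
# is where the claimed TTFT reduction comes from.
print(f"{hit}/{len(req)} tokens served from cache")
```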
3. SGLang 0.5.2 — The Speed King
SGLang is making a serious play for the inference crown.
- Data parallelism overhaul: Zero-overhead DP scheduler, mixed DP+TP configs. 2x throughput on multi-GPU setups.
- DeepSeek-R2 optimizations: Custom MoE kernel + MLA decoding kernel delivers 35% over vLLM 0.8.2 on DeepSeek-R2-671B. That's not a benchmark trick — it's a genuine engineering lead on the most popular open MoE model.
- Jump-forward JSON decoding: Up to 5x speedup for structured output via XGrammar v0.2. Eliminates 40-60% of LLM forward passes by detecting deterministic token sequences in constrained generation. Combined with XGrammar's <1μs per-token mask computation, structured output overhead is now effectively zero.
- Eagle-2 speculative decoding: Up to 2.8x speedup with auto draft model selection.
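Speculative decoding in general (Eagle-2 included) follows one draft-and-verify loop. The toy below shows the greedy variant only; it is not the Eagle-2 algorithm, and the stand-in models are invented. A real engine verifies all proposed tokens in one batched target pass, where this sketch checks them sequentially for clarity.

```python
def speculative_step(draft, target, ctx, k=4):
    """One toy draft-and-verify round: the cheap draft proposes k tokens;
    the target keeps the longest agreeing prefix plus its own next token."""
    proposal, d_ctx = [], list(ctx)
    for _ in range(k):
        t = draft(tuple(d_ctx))
        proposal.append(t)
        d_ctx.append(t)
    accepted, v_ctx = [], list(ctx)
    for t in proposal:
        if target(tuple(v_ctx)) != t:
            break                        # first disagreement: stop accepting
        accepted.append(t)
        v_ctx.append(t)
    accepted.append(target(tuple(v_ctx)))   # target's "bonus" token is free
    return accepted

# Stand-in models: the draft matches the target for the first 3 positions.
target_text = ["the", "cat", "sat", "on", "the", "mat"]
target = lambda ctx: target_text[len(ctx)] if len(ctx) < len(target_text) else "<eos>"
draft = lambda ctx: target_text[len(ctx)] if len(ctx) < 3 else "<unk>"

out = speculative_step(draft, target, ctx=[])
print(out)  # multiple tokens accepted per target pass instead of one
```

The speedup comes entirely from the acceptance rate, which is why auto draft selection (picking a draft that agrees with the target often) is the headline feature here.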
4. HuggingFace Transformers 4.51.0 — The Great Model Intake
Four new model families in one release. That's unprecedented.
- New architectures: Gemma 3, Command A, Llama 4 (Scout + Maverick MoE), DeepSeek-R2-Distill.
- QuantoConfig: Native FP4 and FP8 quantization config. Run from_pretrained() with FP4 on Blackwell.
- KV cache quantization: Auto FP8 KV caches cut memory 40-50%.
- MLA support: First-class Multi-head Latent Attention for DeepSeek-style architectures.
5. TGI 3.2.0 — Multi-LoRA Production Ready
TGI remains the quiet workhorse. This release makes it the best option for adapter serving.
- Speculative Decoding 2.0: Medusa-2 heads with auto-calibrated speculation length. 2.5x speedup.
- Multi-LoRA serving: Dynamic adapter loading/unloading, per-request adapter selection, up to 50 concurrent adapters. If you're serving personalized models, this is the easiest path.
- Flash Attention 3 on Blackwell, plus 30% faster cold start.
6. LangChain 0.3.22 — MCP Goes Mainstream
- Multi-agent orchestration: AgentSupervisor and AgentTeam classes — no LangGraph required.
- Full MCP support: Model Context Protocol client/server for cross-framework tool discovery. This matters. When LangChain and LlamaIndex both ship MCP in the same week, it's no longer experimental — it's the standard.
- 40% fewer core dependencies. Finally.
7. llama.cpp b5189 — The Local Hero Keeps Shipping
- Gemma 3 QAT (Quantization-Aware Training) support.
- Vulkan backend performance improvements.
- RPC multi-host inference — run a model across multiple machines over the network.
- ARM SVE2 optimizations for the growing ARM server market.
GTC 2026 Rapid Fire
Beyond Dynamo, the GTC announcements that matter for tooling engineers:
Vera Rubin GPU — Production H2 2026. 288GB HBM4. 3-4x over Blackwell Ultra for training. Start planning your 2027 cloud migrations now.
DGX Spark — $3,999 personal AI supercomputer. Grace Blackwell GB10, 128GB unified memory, 1 PFLOP FP4. Ships May 2026. The first sub-$5K device with native CUDA. Can run 200B-parameter models in FP4. The catch: 273 GB/s memory bandwidth means token generation is slower than a Mac Studio M4 Ultra (819 GB/s) for pure inference. But CUDA compatibility with 95%+ of the ML ecosystem makes it the dev box for anyone who needs to fine-tune, not just serve.
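The bandwidth caveat is worth checking with arithmetic. At batch size 1, decode must stream the full weight set from memory for every token, so throughput is roughly bandwidth divided by model size. The sketch below uses only the numbers quoted above and ignores KV cache traffic and overhead, so treat it as a ceiling, not a benchmark.

```python
def decode_tokens_per_sec(bandwidth_gb_s, params_b, bytes_per_param):
    """Rough batch-1 decode ceiling: every generated token streams the
    full weight set, so throughput ~= bandwidth / model size in GB."""
    model_gb = params_b * bytes_per_param
    return bandwidth_gb_s / model_gb

# 200B parameters in FP4 (0.5 bytes/param) = ~100 GB of weights.
spark = decode_tokens_per_sec(273, 200, 0.5)    # DGX Spark
mac = decode_tokens_per_sec(819, 200, 0.5)      # Mac Studio M4 Ultra
print(f"DGX Spark ~{spark:.1f} tok/s, M4 Ultra ~{mac:.1f} tok/s")
```

A sub-3 tok/s ceiling on a 200B model confirms the framing: the Spark is a fine-tuning and CUDA development box, not an inference workhorse.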
CUDA 13 — Vera Rubin support, improved multi-GPU scheduling. The foundation for next-gen compute.
Spectrum-X800 — 800 Gb/s AI networking. NVIDIA now competes with traditional network vendors. The full-stack play deepens.
The Signal
1. NVIDIA Is Becoming an Infrastructure Company
Count the layers NVIDIA now owns: chips (GPUs), networking (Spectrum-X, NVLink), serving software (Dynamo, TensorRT-LLM, Triton), desktop hardware (DGX Spark/Station), development frameworks (NeMo, RAPIDS), and a model catalog (100+ NIM microservices). The only layer they're missing is the cloud itself — and they're the largest customer of every major cloud provider.
The "NVIDIA tax" isn't just about GPU margins anymore. It's about every layer of the stack gently steering you toward more NVIDIA hardware. Dynamo is free and open-source. It also works best on NVIDIA networking with NVIDIA GPUs running TensorRT-LLM. This isn't a conspiracy — it's a business model. And it's working.
2. Structured Output Is Solved
XGrammar + jump-forward decoding has made grammar-constrained generation essentially free. Sub-3% overhead. Sub-microsecond mask computation. No more choosing between "fast inference" and "guaranteed JSON output."
For agent builders, this is the most underrated development of the year. Reliable structured output was the bottleneck for production agent systems. That bottleneck is gone. Both vLLM and SGLang ship it out of the box. If you're still parsing raw text output with regex, stop. The tooling has caught up.
3. CoreWeave Goes Public, AI GPU Cloud Grows Up
CoreWeave (CRWV) is now trading publicly at ~$35B valuation and expanding its Blackwell Ultra clusters. They've committed as a launch partner for Vera Rubin. As the largest pure-play GPU cloud on public markets, CoreWeave is the bellwether for GPU demand. Meanwhile, Cisco closed its $1.3B acquisition of AI networking startup Infrastellar, signaling that purpose-built AI networking is now a major enterprise category.
The infrastructure layer of AI is maturing fast. The question is no longer "can we get GPUs?" but "which orchestration layer do we standardize on?" — and that's exactly the question Dynamo was designed to answer.
Skill of the Week: XGrammar Structured Output
This week's topic yields a clear skill: deploying structured JSON output with near-zero overhead using XGrammar on vLLM and SGLang.
The technique: Use XGrammar as the structured output backend (default in vLLM 0.7+ and SGLang 0.4+). Define your output schema once. The engine handles grammar compilation (<10ms), precomputes bitmasks for 95%+ of vocabulary tokens, and applies <1μs per-token constraint checking. On SGLang, enable jump-forward decoding to skip 40-60% of LLM calls for deterministic JSON tokens.
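The bitmask step is the part worth internalizing, so here is a minimal sketch of it in plain Python. This is not XGrammar's API; the vocabulary, mask encoding, and function names are invented. It only shows why per-token constraint checking is cheap: the expensive grammar analysis happens once at compile time, and decode-time masking is a bitwise pass over the logits.

```python
NEG_INF = float("-inf")

def apply_bitmask(logits, mask_bits):
    """Apply a precomputed grammar bitmask to logits: disallowed token
    ids are set to -inf so sampling can never choose them."""
    return [l if (mask_bits >> i) & 1 else NEG_INF
            for i, l in enumerate(logits)]

# Precomputed offline per grammar state (the compile-time step).
VOCAB = ["{", "}", '"key"', ":", "hello", "42"]
state_mask = 0b000001     # at the start of a JSON object, only "{" is legal

logits = [1.2, 3.0, 0.5, -0.2, 2.4, 0.9]
masked = apply_bitmask(logits, state_mask)
best = max(range(len(masked)), key=lambda i: masked[i])
# Greedy pick is forced to the structurally valid token even though the
# raw logits preferred something else.
print(VOCAB[best])  # -> "{"
```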
If you're building agents that need to output tool calls, API payloads, or structured data — this is how you do it without paying a throughput tax.
Full skill pack available at: github.com/softwealth/eval-report-skills
EVAL is the weekly AI tooling intelligence report for ML engineers and AI agents. We eval the tools so you can ship the models.
Subscribe: buttondown.com/ultradune
Skill Packs: github.com/softwealth/eval-report-skills
Twitter: @eval_report
Next week: The FP4 Practical Guide — when to quantize, which format to use, and the benchmarks nobody else is running.