A trace of a hybrid Mamba-Transformer MoE inference run, broken down by layer type. The MoE all-to-all collective stalls dominate the tail. The dashboards saw 96% GPU utilization the entire window.
TL;DR
Hybrid Mamba-Transformer architectures (Nemotron 3 Nano Omni, Jamba and friends) shipped at speed in late April. These models break the assumptions vLLM and SGLang dashboards make about prefill/decode shape: Mamba state-space layers have one runtime profile, Transformer attention has another, MoE router blocks have a third (with all-to-all collective comm). The aggregate looks fine on a duty-cycle counter; the per-layer tail is full of hybrid MoE stalls nobody is decomposing. We trace one and decompose it.
What changed in late April
NVIDIA Nemotron 3 Nano Omni (Apr 28, open multimodal MoE) is the most prominent recent shipment, but it is one of several. The shape is consistent: a hybrid Mamba-Transformer backbone with mixture-of-experts routing, tuned to claim higher throughput than pure-Transformer baselines at comparable parameter counts.
On the inference engine side, vLLM and SGLang already track per-request metrics: TTFT, ITL, throughput. They do not yet decompose those metrics by layer type. For pure-Transformer models, the decomposition is mostly uninteresting (every layer has roughly the same runtime profile). For hybrid MoE, the decomposition is the entire story.
Three layer types, three runtime shapes
We captured a 60-second inference trace on a TensorDock H100 running a hybrid Mamba-Transformer MoE checkpoint and broke the kernel-launch events down by layer type:
layer type         n calls   p50 (us)   p99 (us)   tail ratio (p99/p50)
-----------------------------------------------------------------------
Mamba SSM            3,840         42         95   2.3x
Transformer attn     1,920         88        320   3.6x
MoE all-to-all         640        180     12,400   69x
The aggregate runtime distribution looks moderate: a 50us median and a 300us p99. Broken down by layer type, the MoE all-to-all calls have a p99 that is 69x their median, dominating wall time despite being only 10% of all calls (640 of 6,400). The Mamba layers are tight and predictable. The Transformer attention is bursty because of variable-length prefill. The MoE all-to-all is where the model spends its tail.
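The ratios and call shares quoted above can be re-derived from the table in a few lines. This is only arithmetic on the published figures; the names `LAYERS` and `summarize` are made up for the illustration.

```python
# Figures copied from the table above: (calls, p50_us, p99_us) per layer type.
LAYERS = {
    "Mamba SSM": (3840, 42, 95),
    "Transformer attn": (1920, 88, 320),
    "MoE all-to-all": (640, 180, 12400),
}


def summarize(layers):
    """Return {layer: (share_of_calls, p99_over_p50)} for each layer type."""
    total_calls = sum(calls for calls, _, _ in layers.values())
    return {
        name: (calls / total_calls, p99 / p50)
        for name, (calls, p50, p99) in layers.items()
    }


if __name__ == "__main__":
    for name, (share, ratio) in summarize(LAYERS).items():
        print(f"{name:<18} {share:6.1%} of calls   p99/p50 = {ratio:.1f}x")
```

The point of keeping call share next to the tail ratio is that a small share with a very large ratio is exactly the shape an aggregate percentile tends to smooth over.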
The dashboard saw none of this
Throughout the same 60-second window, nvidia-smi reported 95-97% GPU utilization. DCGM SM_ACTIVE averaged 92%. The vLLM-style metrics showed a median TTFT of 220ms, within target. None of those signals captured the per-layer-type variance, because they are all duty-cycle or end-to-end measurements that aggregate over the run.
The MoE all-to-all stall pattern is a classic case of throughput bottlenecked by the slowest participant: when one expert routing pattern produces an unbalanced communication step, the entire batch waits. The eBPF trace catches it because every cudaLaunchKernel and cudaStreamSynchronize call is recorded with a timestamp and caller stack, so the per-layer decomposition is just a SQL query over the captured events.
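To make the "entire batch waits" effect concrete, here is a toy model. It assumes the collective step finishes only when the busiest participant finishes, and that a participant's time scales linearly with the tokens routed to it. The function name, the `us_per_token` constant, and the token counts are all hypothetical, chosen for illustration rather than taken from the trace.

```python
def all_to_all_step_us(tokens_per_rank, us_per_token=0.5):
    """Toy model: the step takes as long as its busiest participant.

    us_per_token is an invented cost constant, not a measured value.
    """
    return max(tokens_per_rank) * us_per_token


# Two routing outcomes with the same total token count (4096).
balanced = [1024, 1024, 1024, 1024]
skewed = [256, 256, 256, 3328]

if __name__ == "__main__":
    for label, tokens in (("balanced", balanced), ("skewed", skewed)):
        step = all_to_all_step_us(tokens)
        print(f"{label:<9} total={sum(tokens)}  step={step:.0f}us")
```

Under these assumptions the total work is identical in both cases, but the skewed step is several times longer, and a duty-cycle counter would report the waiting participants as busy either way.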
What the per-layer decomposition tells the engine
Once the decomposition is in front of you, the engine choices change:
- Batch routing: penalize batches where the expert distribution is highly unbalanced. The cost of the all-to-all is currently invisible to the routing logic.
- Layer-pairing scheduler: avoid co-scheduling two MoE all-to-all calls on the same NCCL stream. Mamba and Transformer attn calls overlap cleanly; MoE all-to-all does not.
- Per-expert capacity: when one expert consistently produces tail-heavy all-to-all timings, raise the capacity factor on that expert until the imbalance flattens.
All three are reasonable engine-side fixes. None of them is reachable without per-layer-type runtime data.
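The first and third ideas both need a number for "how unbalanced is this batch". A minimal sketch follows, assuming the router exposes the expert index chosen for each token and using the common MoE definition of capacity (capacity factor times tokens divided by experts). The function names are hypothetical, and the capacity factor here is global for simplicity, whereas the per-expert variant described above would keep one factor per expert.

```python
import math
from collections import Counter


def expert_capacity(num_tokens, num_experts, capacity_factor):
    """Per-expert token budget: capacity_factor * tokens / experts, rounded up."""
    return math.ceil(capacity_factor * num_tokens / num_experts)


def imbalance(expert_ids, num_experts):
    """Busiest expert's load divided by the mean load (1.0 = perfectly balanced)."""
    if not expert_ids:
        return 1.0
    counts = Counter(expert_ids)
    return max(counts.values()) / (len(expert_ids) / num_experts)


def overflow(expert_ids, num_experts, capacity_factor):
    """Map each over-budget expert to the number of tokens beyond its budget."""
    cap = expert_capacity(len(expert_ids), num_experts, capacity_factor)
    return {e: n - cap for e, n in Counter(expert_ids).items() if n > cap}
```

A scheduler could, for example, deprioritize batches whose `imbalance` exceeds some threshold, or raise the capacity factor until `overflow` comes back empty. Where to set either threshold is something only the per-layer timings can tell you.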
Try it on your own checkpoint
Capture a trace under load:
sudo ingero check
sudo ingero trace --duration 60s --db /tmp/hybrid.db
# Aggregate per-kernel-name runtime distribution
ingero query --db /tmp/hybrid.db \
"SELECT name,
        count(*),
        percentile_cont(0.5) WITHIN GROUP (ORDER BY duration_us) AS p50,
        percentile_cont(0.99) WITHIN GROUP (ORDER BY duration_us) AS p99
 FROM events
 WHERE source='cuda'
 GROUP BY name
 ORDER BY p99 DESC
 LIMIT 20"
The kernel names usually give away which layer type each row belongs to: Mamba layers launch conv1d and selective_scan_fwd kernels, Transformer attention launches fused_attention, and MoE all-to-all shows up as nccl_all_to_all or framework-specific dispatch wrappers.
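Rolling the per-kernel rows up into per-layer-type rows can be sketched in Python. The `events` table and its `name`, `duration_us`, and `source` columns are taken from the query above. Treating the trace database as something `sqlite3` can open is an assumption, as are the substring patterns, which you would adjust to the kernel names your framework actually emits. The percentile is a plain nearest-rank estimate, so it may differ slightly from an interpolating `percentile_cont`.

```python
import math
import sqlite3

# Substring -> layer type, based on the kernel names mentioned above.
PATTERNS = [
    ("selective_scan", "Mamba SSM"),
    ("conv1d", "Mamba SSM"),
    ("fused_attention", "Transformer attn"),
    ("all_to_all", "MoE all-to-all"),
]


def layer_type(kernel_name):
    """Classify a kernel name by the first matching substring."""
    lowered = kernel_name.lower()
    for needle, label in PATTERNS:
        if needle in lowered:
            return label
    return "other"


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[rank - 1]


def per_layer_stats(conn):
    """Return {layer_type: (n_calls, p50_us, p99_us)} for CUDA events."""
    rows = conn.execute(
        "SELECT name, duration_us FROM events WHERE source = 'cuda'"
    )
    buckets = {}
    for name, duration_us in rows:
        buckets.setdefault(layer_type(name), []).append(duration_us)
    stats = {}
    for label, values in buckets.items():
        values.sort()
        stats[label] = (
            len(values),
            percentile(values, 0.5),
            percentile(values, 0.99),
        )
    return stats
```

If the assumption about the file format holds, `per_layer_stats(sqlite3.connect("/tmp/hybrid.db"))` would produce a table in the same shape as the one earlier in the post. Anything that lands in the "other" bucket is a hint that the patterns need extending.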
Reading on hybrid architectures
Three public references for the hybrid-architecture regime: NVIDIA Nemotron 3 Nano Omni (April 28, 2026) is the most prominent recent open hybrid Mamba-Transformer MoE checkpoint and the source of the kernel-name patterns shown above; the Mamba paper (arXiv 2312.00752) describes the state-space layer’s structural difference from Transformer attention; and the vLLM documentation explains the prefill/decode batching model the per-layer decomposition above breaks against.
Hybrid models, hybrid stalls
When the architecture stops being uniform, the metrics that aggregate across the architecture stop being useful. Hybrid MoEs need per-layer-type decomposition to surface the regimes where one layer type dominates tail latency. eBPF gives the decomposition for free; the only thing missing is the SQL query that asks the right question. As more hybrid architectures ship, the dashboard layer will need to catch up – or the engineers running them will keep going under the dashboard with kernel-level traces instead.
Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running hybrid Mamba-Transformer MoE inference and seeing tail latency the dashboards do not explain.
Related reading
- GPU utilization is a counter, not a cause – the parent argument: utilization aggregates over the architecture
- tracing a distributed training stall across nodes – cross-rank collective stalls, fleet-mode
- 11-second time to first token on a healthy vLLM server – tail-latency decomposition on the vLLM side
