Ingero Team

Posted on Jun 15 • Originally published at ingero.io

Hybrid Mamba-Transformer MoEs Hide Their Stalls in Places Dashboards Do Not Look

#ai #llm #machinelearning #performance

A trace of a hybrid Mamba-Transformer MoE inference run, broken down by layer type. The MoE all-to-all collective stalls dominate the tail. The dashboards saw 96% GPU utilization the entire window.

TL;DR

Hybrid Mamba-Transformer architectures (Nemotron 3 Nano Omni, Jamba and friends) shipped at speed in late April. These models break the assumptions vLLM and SGLang dashboards make about prefill/decode shape: Mamba state-space layers have one runtime profile, Transformer attention has another, MoE router blocks have a third (with all-to-all collective comm). The aggregate looks fine on a duty-cycle counter; the per-layer tail is full of hybrid MoE stalls nobody is decomposing. We trace one and decompose it.

What changed in late April

NVIDIA Nemotron 3 Nano Omni (Apr 28, open multimodal MoE) is the most prominent recent shipment, but it is one of several. The shape is consistent: a hybrid Mamba-Transformer backbone with mixture-of-experts routing, tuned to claim higher throughput than pure-Transformer baselines at comparable parameter counts.

On the inference engine side, vLLM and SGLang already track per-request metrics: TTFT, ITL, throughput. They do not yet decompose those metrics by layer type. For pure-Transformer models, the decomposition is mostly uninteresting (every layer has roughly the same runtime profile). For hybrid MoE, the decomposition is the entire story.

Three layer types, three runtime shapes

We captured a 60-second inference trace on a TensorDock H100 running a hybrid Mamba-Transformer MoE checkpoint and broke the kernel-launch events down by layer type:

layer type      n calls   p50 (us)   p99 (us)   tail ratio
----------------------------------------------------------
Mamba SSM         3,840         42         95         2.3x
Transformer attn  1,920         88        320         3.6x
MoE all-to-all      640        180     12,400        69x

The aggregate runtime distribution looks moderate: median 50us, p99 300us. The decomposition shows that the MoE all-to-all calls have a 69x tail ratio (p99 divided by p50), dominating wall time despite making up only a tenth of the calls (640 of 6,400). The Mamba layers are tight and predictable. The Transformer attention is bursty because of variable-length prefill. The MoE all-to-all is where the model spends its tail.
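To make the tail-ratio column concrete, here is a minimal Python sketch of the same decomposition. It assumes the events have already been exported as (layer type, duration in microseconds) pairs. The function names and the sample values are illustrative and are not taken from the captured trace.

```python
from collections import defaultdict

def percentile(values, q):
    """Linearly interpolated percentile of values, with q in [0, 1]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no samples")
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def tail_by_layer(events):
    """Group (layer_type, duration_us) pairs and report p50, p99 and p99/p50."""
    groups = defaultdict(list)
    for layer_type, duration_us in events:
        groups[layer_type].append(duration_us)
    report = {}
    for layer_type, durations in groups.items():
        p50 = percentile(durations, 0.50)
        p99 = percentile(durations, 0.99)
        report[layer_type] = {
            "n": len(durations),
            "p50": p50,
            "p99": p99,
            "tail_ratio": p99 / p50,
        }
    return report

# Synthetic samples for illustration only (NOT the trace from this post).
events = (
    [("mamba", 40.0)] * 90 + [("mamba", 60.0)] * 10
    + [("moe_all_to_all", 180.0)] * 90 + [("moe_all_to_all", 12000.0)] * 10
)
report = tail_by_layer(events)
for layer_type, row in report.items():
    print(layer_type, row)
```

Running the same function over all events without the grouping step gives the single aggregate distribution, which is the view that blends the three layer types together.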

The dashboard saw none of this

Throughout the same 60-second window, nvidia-smi reported 95-97% GPU utilization. DCGM SM_ACTIVE was at 92% mean. The vLLM-style metrics showed median TTFT 220ms – within target. None of those signals captured the per-layer-type variance, because they are all duty-cycle or end-to-end measurements that aggregate over the run.

The MoE all-to-all stall pattern is a classic case of throughput bottlenecked by the slowest participant: when one expert routing pattern produces an unbalanced communication step, the entire batch waits. The eBPF trace catches it because every cudaLaunchKernel and cudaStreamSynchronize call is recorded with timestamp + caller stack, so the per-layer decomposition is just a SQL query over the captured events.
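A toy model shows why an unbalanced routing pattern stalls the whole batch. This sketch assumes the collective finishes only when the most loaded expert has received its tokens. The per-token cost and fixed overhead are made-up constants, not measurements.

```python
def all_to_all_step_us(tokens_per_expert, us_per_token, base_us):
    """Toy cost model: the step is gated by the most loaded expert."""
    return base_us + us_per_token * max(tokens_per_expert)

def mean_based_estimate_us(tokens_per_expert, us_per_token, base_us):
    """What an average-load view would predict for the same step."""
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    return base_us + us_per_token * mean_load

# Same total token count (256), different distribution across 4 experts.
balanced = [64, 64, 64, 64]
unbalanced = [200, 20, 20, 16]

# Hypothetical constants chosen for illustration.
US_PER_TOKEN = 2.0
BASE_US = 50.0

for name, loads in (("balanced", balanced), ("unbalanced", unbalanced)):
    actual = all_to_all_step_us(loads, US_PER_TOKEN, BASE_US)
    averaged = mean_based_estimate_us(loads, US_PER_TOKEN, BASE_US)
    print(f"{name}: max-gated={actual}us mean-based={averaged}us")
```

Both batches carry the same number of tokens, so a mean-based estimate is identical for the two. Only the max-gated cost separates them, which is the same reason aggregate counters miss the stall.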

What the per-layer decomposition tells the engine

Once the decomposition is in front of you, the engine choices change:

Batch routing: penalize batches where the expert distribution is highly unbalanced. The cost of the all-to-all is currently invisible to the routing logic.
Layer-pairing scheduler: avoid co-scheduling two MoE all-to-all calls on the same NCCL stream. Mamba and Transformer attn calls overlap cleanly; MoE all-to-all does not.
Per-expert capacity: when one expert consistently produces tail-heavy all-to-all timings, raise the capacity factor on that expert until the imbalance flattens.

All three are reasonable engine-side fixes. None of them is reachable without per-layer-type runtime data.
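As a rough illustration of the batch-routing idea, the sketch below scores a batch by how unevenly its tokens land on experts. The max-over-mean score, the 1.5 threshold and the function names are assumptions made for this example. They are not part of vLLM, SGLang or Ingero.

```python
from collections import Counter

def expert_imbalance(assignments, num_experts):
    """Return max expert load divided by mean expert load (1.0 = balanced)."""
    if num_experts <= 0:
        raise ValueError("num_experts must be positive")
    counts = Counter(assignments)
    loads = [counts.get(expert, 0) for expert in range(num_experts)]
    mean_load = sum(loads) / num_experts
    if mean_load == 0:
        return 0.0
    return max(loads) / mean_load

def should_penalize(assignments, num_experts, threshold=1.5):
    """Flag batches whose expert distribution exceeds a hypothetical threshold."""
    return expert_imbalance(assignments, num_experts) > threshold

# Flat lists of expert ids, one entry per routed token (illustrative data).
balanced_batch = [0, 1, 2, 3, 0, 1, 2, 3]
skewed_batch = [0, 0, 0, 0, 0, 0, 1, 2]

print(expert_imbalance(balanced_batch, 4), should_penalize(balanced_batch, 4))
print(expert_imbalance(skewed_batch, 4), should_penalize(skewed_batch, 4))
```

A score like this would only be a starting point. Choosing the threshold, or deciding when to raise a per-expert capacity factor instead, depends on the per-layer timings the trace provides.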

Try it on your own checkpoint

Capture a trace under load:

sudo ingero check
sudo ingero trace --duration 60s --db /tmp/hybrid.db

# Aggregate per-kernel-name runtime distribution
ingero query --db /tmp/hybrid.db \
  "SELECT name, count(*), percentile_cont(0.5) WITHIN GROUP (ORDER BY duration_us) AS p50, percentile_cont(0.99) WITHIN GROUP (ORDER BY duration_us) AS p99 FROM events WHERE source='cuda' GROUP BY name ORDER BY p99 DESC LIMIT 20"

The kernel names give away which layer type each row belongs to: Mamba layers call into conv1d and selective_scan_fwd, Transformer attention calls into fused_attention, and MoE all-to-all calls into nccl_all_to_all or framework-specific dispatch wrappers.
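If you export the query results, a small classifier can attach a layer type to each row. This is a minimal sketch that assumes plain substring matching on the kernel names listed above is good enough. The labels and the `layer_type` helper are illustrative, and framework-specific dispatch wrappers would need patterns of their own.

```python
# Substrings taken from the kernel names mentioned in the text.
# The labels on the left are illustrative, not an Ingero schema.
LAYER_PATTERNS = (
    ("mamba_ssm", ("conv1d", "selective_scan_fwd")),
    ("transformer_attn", ("fused_attention",)),
    ("moe_all_to_all", ("nccl_all_to_all",)),
)

def layer_type(kernel_name):
    """Map a kernel name to a layer type by substring match, else 'other'."""
    lowered = kernel_name.lower()
    for label, needles in LAYER_PATTERNS:
        if any(needle in lowered for needle in needles):
            return label
    return "other"

# Example names are hypothetical; real names depend on the framework build.
for name in ("selective_scan_fwd_kernel", "fused_attention_fwd",
             "nccl_all_to_all", "some_gemm_kernel"):
    print(name, "->", layer_type(name))
```

Anything that falls into "other" is worth inspecting by hand, since unmatched dispatch wrappers would otherwise drop out of the per-layer totals.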

Reading on hybrid architectures

Three public references for the hybrid-architecture regime: NVIDIA Nemotron 3 Nano Omni (April 28, 2026) is the most prominent recent open hybrid Mamba-Transformer MoE checkpoint and the source of the kernel-name patterns shown above; the Mamba paper (arXiv 2312.00752) describes the state-space layer’s structural difference from Transformer attention; and the vLLM documentation explains the prefill/decode batching model the per-layer decomposition above breaks against.

Hybrid models, hybrid stalls

When the architecture stops being uniform, the metrics that aggregate across the architecture stop being useful. Hybrid MoEs need per-layer-type decomposition to surface the regimes where one layer type dominates tail latency. eBPF gives the decomposition for free; the only thing missing is the SQL query that asks the right question. As more hybrid architectures ship, the dashboard layer will need to catch up – or the engineers running them will keep going under the dashboard with kernel-level traces instead.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running hybrid Mamba-Transformer MoE inference and seeing tail latency the dashboards do not explain.
