Manoranjan Rajguru

Posted on Jun 10

How Xiaomi Cracked 1,000 Tokens/Second on a 1-Trillion Parameter Model: A Deep Dive into LLM Inference Optimization

#ai #llm #machinelearning #gpu

Meta Description: Xiaomi's MiMo-V2.5-Pro-UltraSpeed just shattered the 1,000 tokens/second barrier on a 1T-parameter model using commodity GPUs. This deep dive unpacks the FP4 quantization, DFlash speculative decoding, and TileRT persistent engine techniques that made it possible — with runnable SGLang code.

Table of Contents

The Problem: Why Trillion-Parameter Inference Is a Different Beast
Technique 1 — Expert-Only FP4 (MXFP4) Quantization
Technique 2 — DFlash: Block-Diffusion Speculative Decoding
Technique 3 — TileRT: Persistent Engine Kernels & Warp Specialization
The Full Stack: Why Hardware-Software Co-Design Was Non-Negotiable
Deploying It Yourself: SGLang Setup & Code Examples
Benchmarks & Performance Numbers
What 1,000 TPS Means for Agentic AI System Design
Conclusion & What's Next

Introduction

On June 8, 2026, Xiaomi's MiMo team published a result that stopped the Hacker News front page cold: 1,000+ tokens per second decode throughput on a 1-trillion-parameter model, running on a single 8-GPU commodity node.

To appreciate why this is extraordinary, consider the baseline. State-of-the-art 1T-parameter models (the kind powering frontier reasoning, coding agents, and multi-step tool use) have historically decoded at 30–80 tokens/second at best on standard GPU clusters, and far less when serving long-context requests. Pushing past the 1,000 TPS threshold is therefore a 12–33× jump over that baseline, not an incremental speedup. It is, as TileRT's engineering team put it, an operation under entirely different dimensions of hardware reality.

This post is a full technical autopsy of how they did it. We'll cover three interlocking techniques — Expert-Only FP4 Quantization, DFlash Block-Diffusion Speculative Decoding, and TileRT's Persistent Engine Kernel — and why none of them would have worked in isolation. We'll end with runnable SGLang deployment code and a discussion of what this means for the engineers building agentic systems today.

LLM inference optimization just reached a new frontier. Let's tear it apart.

1. The Problem: Why Trillion-Parameter Inference Is a Different Beast

Before we look at the solutions, it's worth being precise about what makes a 1T-parameter model fundamentally harder to serve than a 70B model — not just quantitatively, but structurally.

The memory bandwidth wall. LLM decoding is memory-bandwidth-bound, not compute-bound. Each forward pass during autoregressive generation reads every active model weight from HBM once per token. At FP8 (1 byte per value), a dense 1T-parameter model would have to stream roughly 1 TB of weights per generated token. An H100 GPU has ~3.35 TB/s of HBM3 bandwidth, so even if the weights fit on one GPU (they do not, at 80 GB per device), bandwidth alone would cap decoding at ~3 tokens/second before accounting for KV cache, attention, and routing overhead. Scaling to 8 GPUs with tensor parallelism multiplies the available bandwidth, but inter-GPU communication (AllReduce/AllGather) adds latency that compounds at scale.
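
The ceiling in that paragraph is a single division, and it helps to see how it moves with the number of bytes actually read per token. The sketch below is a back-of-envelope model only: it assumes decode is purely weight-bandwidth-bound, ignores KV cache reads, communication, and kernel overhead, and treats 8 GPUs as perfectly aggregated bandwidth. The function name and the scenarios are illustrative; the 3.35 TB/s, 1T, and ~42B figures are the ones quoted in this post.

```python
def bandwidth_bound_tps(params_read: float, bytes_per_param: float,
                        hbm_bytes_per_s: float) -> float:
    """Upper bound on decode tokens/s if every token must stream
    `params_read` weights from HBM exactly once."""
    return hbm_bytes_per_s / (params_read * bytes_per_param)


H100_BW = 3.35e12  # bytes/s, the HBM3 bandwidth quoted in the text

# Assumption: the "FP4" row pretends every active weight is 4-bit,
# which overstates the gain because attention stays at higher precision.
scenarios = {
    "dense 1T, FP8, 1 GPU": bandwidth_bound_tps(1.0e12, 1.0, H100_BW),
    "42B active, FP8, 1 GPU": bandwidth_bound_tps(42e9, 1.0, H100_BW),
    "42B active, FP8, 8 GPUs": bandwidth_bound_tps(42e9, 1.0, 8 * H100_BW),
    "42B active, FP4, 8 GPUs": bandwidth_bound_tps(42e9, 0.5, 8 * H100_BW),
}

for name, tps in scenarios.items():
    print(f"{name:26s} {tps:8.1f} tokens/s (idealized ceiling)")
```

Under these idealized assumptions the sparse model's single-GPU ceiling lands near the top of the 30–80 TPS range cited in the introduction, which is one way to see why bytes read per token is the lever the rest of this post keeps pulling.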

The Mixture-of-Experts (MoE) routing problem. Frontier 1T-parameter models — including MiMo-V2.5-Pro — are sparse MoE architectures, meaning each token activates only a subset of expert FFN layers (MiMo activates ~42B parameters per token out of 1.02T total). This reduces FLOPs per token dramatically, but it creates expert routing imbalance: some experts get hot while others sit idle, and the routing decisions fragment data locality on the GPU's memory hierarchy.

The execution gap problem at scale. Traditional inference frameworks decompose the model into individual operators (GEMM, RMSNorm, RoPE, Softmax, KV cache write, etc.) and launch them sequentially. Each operator boundary incurs host-side launch overhead, hardware sync barriers, and global memory round-trips. At 30–80 TPS, these overheads are amortized. At 1,000 TPS, where each token has a budget of roughly 1 ms, those same overheads become the dominant bottleneck. RMSNorm alone, a trivially cheap operation, can impose tens of microseconds of fragmentation at this cadence.

Breaking through 1,000 TPS required attacking all three dimensions simultaneously. The MiMo + TileRT team's answer was a three-pronged co-design:

Shrink what you load → FP4 quantization
Verify more per forward pass → DFlash speculative decoding
Eliminate execution gaps → TileRT persistent kernels

Let's go deep on each.

2. Technique 1 — Expert-Only FP4 (MXFP4) Quantization

The Bandwidth Bottleneck in Numbers

The standard weapon against memory-bandwidth limits is quantization: reduce the number of bits per weight to reduce the bytes you need to load per token. The progression has gone FP16 → INT8/FP8 → now FP4. At FP4 (4 bits per value), you halve the memory footprint vs. FP8 and quarter it vs. FP16, so the same HBM bandwidth delivers twice as many weights per second as at FP8.

The problem with naïve FP4 quantization is representational collapse. A 4-bit code has only 16 bit patterns, so every weight must snap to one of at most 16 values. Apply it uniformly across the attention projections, layer norms, and embedding layers, which require fine numerical precision for stable attention score computation and output distribution, and you'll see immediate degradation in reasoning, math, and code generation quality.

The MoE Insight: Experts Are More Tolerant

The key insight the MiMo team leveraged is that in an MoE model, not all layers are equal in their sensitivity to quantization noise. The MoE Expert FFN layers hold the vast majority of parameters (~95% at the 1T scale) and are empirically more tolerant to bit-width reduction. This makes intuitive sense: each expert is a relatively narrow FFN that specializes in a domain, and small perturbations in its weights are softened by the router's selection logic and the residual stream.

By contrast, attention projections (Q, K, V, and output projections) are globally sensitive — every token passes through them, and the dot-product attention mechanism amplifies any numerical noise in the key and query projections into attention score errors.

MXFP4 with QAT

The specific format used is MXFP4 (Microscaling FP4) with a block size of 32: a shared-scale format from the MX (Microscaling) specification that groups 32 values under a common power-of-two scale factor (an 8-bit shared exponent), with each element stored as a 4-bit E2M1 float (1 sign bit, 2 exponent bits, 1 mantissa bit). This is meaningfully better than naïve INT4 at preserving the dynamic range of expert weight distributions.
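
To make the block format concrete, here is a minimal pure-Python sketch of "fake quantizing" one 32-element block: pick a shared power-of-two scale from the block's largest magnitude, then snap each element to the nearest E2M1 magnitude. It is an illustration under stated assumptions, not the kernel or the exact rounding rule used by MiMo or TileRT; the function name and the tie-breaking behavior are my own choices.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32


def quantize_block(values):
    """Return (shared_exponent, dequantized_values) for one block."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # Shared power-of-two scale: align the block's largest magnitude with
    # the top E2M1 binade (values 4..6, i.e. exponent 2).
    exponent = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** exponent
    dequantized = []
    for v in values:
        mag = min(abs(v) / scale, 6.0)  # clamp to the largest grid value
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))  # nearest grid point
        dequantized.append(math.copysign(q * scale, v))
    return exponent, dequantized


block = [i / 8 for i in range(-16, 16)]  # 32 values in [-2.0, 1.875]
exponent, recon = quantize_block(block)
worst = max(abs(a - b) for a, b in zip(block, recon))
print(f"shared exponent: {exponent}, worst-case error: {worst}")
```

Storage per block is 32 × 4 bits for the elements plus 8 bits for the shared scale, about 4.25 bits per weight, which is why the footprint is slightly more than half of FP8.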

Critically, the team used Quantization-Aware Training (QAT) — fine-tuning the model with FP4 quantization simulated in the forward pass (via straight-through estimators in the backward pass). QAT allows the model to learn weight distributions that are intrinsically friendly to FP4 representation, closing the accuracy gap vs. post-training quantization.
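
The straight-through estimator is easier to see on a single scalar weight than in prose. The toy below is a sketch only: it uses a uniform grid with step 0.5 rather than FP4, one weight rather than a network, and hand-written gradients rather than an autograd framework. None of the names or hyperparameters come from the MiMo training setup.

```python
STEP = 0.5  # spacing of a toy uniform grid (real FP4 grids are non-uniform)


def fake_quant(w: float) -> float:
    """Forward-pass quantizer: snap w to the nearest grid point."""
    return round(w / STEP) * STEP


def train(target: float, steps: int = 20, lr: float = 0.1) -> float:
    """Minimize (fake_quant(w) - target)^2 while keeping a latent
    full-precision weight w that receives the updates."""
    w = 0.0
    for _ in range(steps):
        q = fake_quant(w)            # the forward pass only sees q
        grad_q = 2.0 * (q - target)  # d/dq of (q - target)^2
        # Straight-through estimator: treat d(fake_quant)/dw as 1, so the
        # gradient with respect to q is applied directly to w.
        w -= lr * grad_q
    return w


w = train(target=1.5)
print(f"latent weight: {w:.2f}, quantized weight: {fake_quant(w)}")
```

The latent weight settles wherever its quantized value stops producing gradient, which need not be the target itself. That is the sense in which QAT learns weights that are "friendly" to the grid rather than weights that are merely rounded afterwards.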

The o_proj (attention output projection) in every layer is explicitly excluded from FP4 — a small but significant detail that preserves the critical output pathway at full precision.

Benchmark Impact

The results speak for themselves:

Benchmark	FP8 Baseline	MXFP4 (Expert-Only)	Δ (relative)
SWE-Bench Pro	57.2%	58.8%	+2.80%
SWE-Bench Verified	78.9%	77.4%	-1.90%
Claw-Eval (General Agent)	63.8%	67.8%	+6.27%
Humanity's Last Exam	48.0%	47.0%	-2.08%

The headline here is that MXFP4 expert-only quantization achieves near-lossless quality — and on the two most practically important benchmarks for agentic coding use cases (SWE-Bench Pro and Claw-Eval), it actually outperforms the FP8 baseline. This is plausibly a regularization effect: the quantization noise during QAT acts as a mild stochastic perturbation that improves generalization on distribution-shifted evaluation sets.
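
One reading note on the table: the Δ values are relative changes, not percentage-point differences. The short sketch below recomputes both from the scores quoted above so the two are not confused; the helper name is illustrative.

```python
def deltas(fp8: float, fp4: float) -> tuple:
    """Return (change in percentage points, relative change in percent)."""
    points = fp4 - fp8
    return points, points / fp8 * 100.0


scores = {
    "SWE-Bench Pro": (57.2, 58.8),
    "SWE-Bench Verified": (78.9, 77.4),
    "Claw-Eval (General Agent)": (63.8, 67.8),
    "Humanity's Last Exam": (48.0, 47.0),
}

for name, (fp8, fp4) in scores.items():
    points, relative = deltas(fp8, fp4)
    print(f"{name:28s} {points:+.1f} pts   {relative:+.2f}% relative")
```

In absolute terms the swings range from about -1.5 to +4.0 points, which is worth keeping in mind when reading "+6.27%".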

3. Technique 2 — DFlash: Block-Diffusion Speculative Decoding

The Speculative Decoding Premise

Speculative decoding is a well-established technique for reducing the effective number of backbone forward passes needed to generate N tokens. The core idea:

A small, cheap draft model autoregressively generates K candidate tokens.
The large backbone model verifies all K candidates in a single parallel forward pass.
A rejection sampling procedure accepts a prefix of the verified tokens losslessly (no change to output distribution).
On average, each backbone forward pass produces roughly α × K tokens, where α is the mean fraction of the K drafted tokens that are accepted (0 < α ≤ 1).

The throughput gain is α × K vs. 1 token per backbone pass, before subtracting the cost of running the draft model. If your draft model generates K=4 tokens and achieves α=0.8 acceptance, you get ~3.2 effective tokens per backbone pass, a 3.2× reduction in backbone passes.
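
The α × K figure is a convenient approximation. A common alternative model treats α as a per-token acceptance probability and stops at the first rejection, in which case the expected yield is a geometric sum. The sketch below compares the two; it assumes independent acceptances, which real drafters do not satisfy exactly, and the function names are illustrative.

```python
def tokens_per_pass_simple(alpha: float, k: int) -> float:
    """The approximation used in the text: a fraction alpha of K drafts survive."""
    return alpha * k


def tokens_per_pass_geometric(alpha: float, k: int) -> float:
    """Expected tokens per backbone pass if each draft token is accepted
    independently with probability alpha and verification stops at the first
    rejection. Includes the one token the backbone always contributes itself
    (a correction or a bonus token), so the result is at least 1."""
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for alpha in (0.6, 0.8, 0.9):
    simple = tokens_per_pass_simple(alpha, 4)
    geometric = tokens_per_pass_geometric(alpha, 4)
    print(f"alpha={alpha}: simple={simple:.2f}, geometric={geometric:.2f}")
```

The two models use α differently (fraction of the block vs. per-token probability), so they should not be expected to agree; the point is that either way the yield is bounded by K + 1 and falls off quickly as acceptance drops.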

The Traditional Bottleneck

Conventional speculative decoding has a structural problem: the draft model generates tokens autoregressively — one at a time, serially, attending to all prior context. This means:

Draft compute scales linearly with context length (each step is an O(n) attention over the full sequence).
The draft model must be strong enough to achieve high α, but stronger draft models are more expensive, eating into the savings.
For very long contexts (1M token window, as in MiMo-V2.5-Pro), draft model cost becomes prohibitive.

There's a fundamental tension: you need a cheap draft model, but cheap draft models have low acceptance rates.

DFlash: Breaking the Serial Bottleneck

DFlash adopts a radically different drafting paradigm from the research community: block-level masked parallel prediction. Instead of drafting tokens one at a time, the DFlash draft model:

Takes a context with K positions masked (a block of unknown tokens).
Fills all K masked positions simultaneously in a single forward pass, treating the problem as conditional masked language modeling.
The backbone verifies the entire filled block at once via rejection sampling.

This eliminates the serial autoregressive constraint from the draft stage entirely. The drafter's sequential depth is now constant with respect to block size (a single parallel forward pass regardless of K) rather than K dependent passes, although the work inside that one pass still grows with K.
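
The draft-a-block-then-verify flow can be sketched with toy stand-ins. The code below uses greedy verification (accept the longest draft prefix that matches the backbone's own next-token choice, then take the backbone's token) rather than the rejection-sampling procedure described above, and it calls the toy backbone once per position where a real system would score the whole block in one parallel pass. `verify_block` and `toy_target` are hypothetical names, not part of DFlash or SGLang.

```python
from typing import Callable, List

Token = int


def verify_block(context: List[Token], draft: List[Token],
                 target_next: Callable[[List[Token]], Token]) -> List[Token]:
    """Return the tokens committed after verifying one drafted block."""
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # backbone's correction ends the step
            return accepted
        accepted.append(tok)
    # Every draft matched: the backbone still contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def toy_target(ctx: List[Token]) -> Token:
    """Toy backbone: always predicts the previous token plus one, mod 10."""
    return (ctx[-1] + 1) % 10


context = [1, 2, 3]
print(verify_block(context, [4, 5, 9, 7], toy_target))  # [4, 5, 6]
print(verify_block(context, [4, 5, 6, 7], toy_target))  # [4, 5, 6, 7, 8]
```

In the first call the third draft token is wrong, so two drafts are kept and the backbone supplies the correction; everything after the mismatch is discarded, which is why acceptance length rather than block size determines the speedup.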

The SWA Trick for Long-Context Efficiency

For the DFlash drafter to run efficiently on MiMo's 1M-context window, the team made a crucial architectural choice: the drafter uses Sliding Window Attention (SWA) exclusively — attending only to a local window rather than the full context. Since MiMo-V2.5's backbone already uses SWA in its lower layers, the drafter shares this architectural motif.

The consequence is significant: the drafter's per-prediction compute becomes constant rather than linear in context length. You can serve 1M-token contexts without the draft cost exploding.
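
A quick way to see the "constant rather than linear" claim is to count how many key/value positions one new query touches. The sketch below does only that counting; the window of 128 is borrowed from the backbone SWA figure quoted later in this post and is an assumption here, since the drafter's own window size is not stated.

```python
from typing import Optional


def keys_attended(context_len: int, window: Optional[int] = None) -> int:
    """Number of key/value positions one new query attends to.
    `window=None` means full (global) attention."""
    if window is None:
        return context_len
    return min(context_len, window)


ASSUMED_WINDOW = 128  # assumption: borrowed from the backbone's SWA setting

for context_len in (1_024, 65_536, 1_000_000):
    full = keys_attended(context_len)
    local = keys_attended(context_len, ASSUMED_WINDOW)
    print(f"context={context_len:>9,}: full={full:>9,}  sliding-window={local}")
```

With full attention the per-query work grows with the context; with a fixed window it flattens out as soon as the context exceeds the window.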

Training with Muon and Self-Distillation

The 5-layer BF16 DFlash drafter is trained using:

Muon optimizer (which orthogonalizes momentum-based weight updates using a Newton-Schulz iteration) for better sample efficiency on the small drafter network.
Model self-distillation — the backbone itself provides soft training targets, allowing the drafter to learn the backbone's token distribution rather than hard ground-truth labels.
GPU-local mask sampling — during training, mask positions are sampled independently on each GPU shard, so a single long sequence generates tens of thousands of independent training signals in one step, maximizing data efficiency without cross-device communication.

Acceptance Length Results

With block size capped at 8 (a deliberate choice to balance acceptance rate vs. verification overhead), DFlash achieves:

Scenario	Mean Accepted Tokens/Step	Max Observed
WebDev / Coding	6.30	7.14
Math / Reasoning	5.56	—
SWE-Bench (Agent)	4.29	—
MT-Bench (General)	3.18	—

In coding scenarios, the primary use case, 6.3 out of 8 draft tokens are accepted on average. The backbone effectively generates ~6.3 tokens per forward pass instead of 1. Combined with the roughly halved expert weight traffic from FP4, you're compounding two multiplicative speedup factors.
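
As a rough sanity check on how the two factors compound, the sketch below multiplies an estimated weight-traffic reduction by the mean acceptance length. It is an idealized upper bound under loud assumptions: it applies the ~95% expert share to the weights read per token, treats all non-expert weights as FP8, and ignores drafter cost, verification overhead, and scale-factor storage. The function name is illustrative.

```python
def weight_traffic_gain(expert_fraction: float,
                        expert_bytes: float = 0.5,    # FP4 experts
                        other_bytes: float = 1.0,     # assumed FP8 elsewhere
                        baseline_bytes: float = 1.0) -> float:
    """Ratio of baseline bytes per weight to mixed-precision bytes per weight."""
    mixed = expert_fraction * expert_bytes + (1.0 - expert_fraction) * other_bytes
    return baseline_bytes / mixed


fp4_gain = weight_traffic_gain(expert_fraction=0.95)
combined = fp4_gain * 6.30  # mean accepted tokens per pass in coding workloads

print(f"weight-traffic gain from expert-only FP4: {fp4_gain:.2f}x")
print(f"ideal combined multiplier: {combined:.1f}x")
```

Because only the expert weights shrink, the bandwidth factor comes out slightly below 2×, and the combined figure should be read as a ceiling rather than a prediction.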

4. Technique 3 — TileRT: Persistent Engine Kernels & Warp Specialization

The Microsecond War

This is where the engineering gets truly deep. Even with FP4 halving memory load times and DFlash multiplying effective tokens per pass, achieving >1,000 TPS requires the underlying GPU execution engine to sustain operations at microsecond cadence without interruption.

At 1,000 TPS, the lifespan of an individual token's computation is measured in microseconds. At this resolution, operations that would be invisible in a 30-TPS system become catastrophic bottlenecks:

Host-side kernel launch overhead: each CUDA kernel launch from the CPU costs ~5–15 µs of scheduling. With dozens of operators per layer, that adds up to hundreds of microseconds of potential dead time per layer, and milliseconds across a full forward pass, unless launches are hidden behind running work.
Hardware synchronization barriers: cudaDeviceSynchronize or event-based barriers between operators stall the execution stream.
Global memory round-trips: writing intermediate results to HBM between operators and reading them back for the next operator doubles effective memory traffic.
Auxiliary operator overhead: RMSNorm, RoPE, KV cache writes, and LM Head computation for Multi-Token Prediction (MTP) are individually cheap (tens of µs each), but at 1,000 TPS their aggregate contribution is substantial.

Traditional inference frameworks — even highly optimized ones — launch operators one by one. At 30–80 TPS, the compute work dominates and these gaps are invisible. At 1,000 TPS, the gaps are the bottleneck.
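
To put numbers on the launch-overhead item, the sketch below compares a worst-case launch budget against the time available per backbone pass. The 70-layer count and the 5 µs launch cost come from this post; 24 operators per layer is my stand-in for "dozens", and the model assumes launches are fully serialized, which asynchronous launches and CUDA Graphs partly avoid in practice.

```python
def launch_overhead_ms(layers: int, ops_per_layer: int, launch_us: float) -> float:
    """Total host launch time per forward pass if no launch is overlapped."""
    return layers * ops_per_layer * launch_us / 1000.0


def pass_budget_ms(target_tps: float, tokens_per_pass: float) -> float:
    """Wall-clock time available per backbone pass at a target throughput."""
    return 1000.0 * tokens_per_pass / target_tps


overhead = launch_overhead_ms(layers=70, ops_per_layer=24, launch_us=5.0)
budget_fast = pass_budget_ms(target_tps=1000, tokens_per_pass=6.3)
budget_slow = pass_budget_ms(target_tps=50, tokens_per_pass=1.0)

print(f"serialized launch overhead per pass: {overhead:.1f} ms")
print(f"budget per pass at 1,000 TPS (6.3 tokens/pass): {budget_fast:.1f} ms")
print(f"budget per pass at 50 TPS (1 token/pass): {budget_slow:.1f} ms")
```

Under these assumptions the launch overhead alone would exceed the whole per-pass budget at 1,000 TPS, while remaining a manageable fraction of the budget at 50 TPS.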

The Persistent Engine Paradigm

TileRT's central innovation is replacing the per-operator-launch model with a Persistent Engine Kernel: a single, monolithic CUDA kernel that encapsulates the entire inference pipeline and runs continuously resident on the GPU without ever returning control to the host between tokens.

Key properties of the persistent engine:

End-to-end continuous prefetching. Because the compute pipeline is persistent, TileRT can pipeline data movement against computation at the tile level. While the current tile computes on Tensor Cores, the next tile's data is already staged from HBM → L2 cache → shared memory → registers. There is no gap between "kernel ends, data loads, kernel starts".

Tile-level pipeline decomposition. The computation is broken into fine-grained Tiles. Within each Tile, data movement, tensor computation, and cross-device communication are physically overlapped at the warp level.

Warp Specialization & Heterogeneous Workers

Within the persistent engine, TileRT implements Warp Specialization — assigning dedicated warp groups to distinct pipeline roles:

Compute warps execute the GEMM and attention Tensor Core operations.
Data movement warps ("DMA warps") handle the memory hierarchy prefetching and staging.
Communication warps manage AllReduce/AllGather operations across GPUs.

Rather than every warp in the kernel running the same code path (the way most CUDA kernels are written), different warp groups execute different code paths simultaneously, coordinated via lightweight shared-memory semaphores. This heterogeneous execution model turns the GPU from a device that runs one operator at a time into a continuously flowing pipeline with multiple specialized execution lanes.
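
The producer/consumer structure behind warp specialization can be illustrated with a CPU-side analogy. The sketch below is not GPU code and says nothing about TileRT's implementation: a loader thread stands in for the data-movement warps, a compute thread for the compute warps, and a bounded queue for the shared-memory staging buffers and their semaphores. All names are hypothetical.

```python
import queue
import threading


def run_pipeline(tiles, depth: int = 2):
    """Process tiles with a loader and a compute worker running concurrently.
    `depth` bounds how many staged tiles may be in flight at once."""
    staged = queue.Queue(maxsize=depth)
    results = []

    def loader():  # plays the role of the data-movement warps
        for tile in tiles:
            staged.put(list(tile))  # "copy" the tile into the staging buffer
        staged.put(None)            # sentinel: no more tiles

    def compute():  # plays the role of the compute warps
        while True:
            tile = staged.get()     # blocks until a tile has been staged
            if tile is None:
                break
            results.append(sum(x * x for x in tile))

    workers = [threading.Thread(target=loader), threading.Thread(target=compute)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


print(run_pipeline([[1, 2], [3, 4], [5]]))  # [5, 25, 25]
```

The bounded queue is the important part of the analogy: the loader can run ahead by at most `depth` tiles, which mirrors how a fixed number of staging buffers lets data movement overlap with compute without unbounded memory use.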

This pattern is then extended beyond a single SM via Heterogeneous Workers — the same specialization strategy is applied across the full GPU execution domain, allowing the entire device to behave as one integrated heterogeneous pipeline.

The result: execution gaps from operator boundaries are eliminated at the hardware level, not papered over with runtime tricks.

5. The Full Stack: Why Hardware-Software Co-Design Was Non-Negotiable

Reading the above, you might think these three techniques are independently applicable optimizations. They are not. Each one required the others to exist in their current form.

FP4 required TileRT's quantization-aware kernels. The MXFP4 format with block-size-32 scaling is not natively supported in standard CUDA libraries. TileRT had to write custom GEMM kernels that handle the mixed-precision computation (FP4 experts, higher-precision attention) within the persistent engine pipeline. The quantization scheme was chosen specifically because TileRT's team could build efficient kernels for it.

DFlash required SWA alignment, which required the backbone's architecture. The drafter's SWA design only works efficiently because MiMo-V2.5-Pro's backbone itself uses SWA in its lower layers. The backbone architecture was designed with this future co-design in mind. You cannot drop DFlash onto an arbitrary transformer and expect the same gains.

TileRT's microsecond optimizations required knowing the model's operator graph at design time. The persistent engine isn't a general JIT compiler — it's a purpose-built execution plan for MiMo-V2.5-Pro's specific layer structure (70 layers, 128 attention heads, 8 GQA KV heads, SWA window size 128, etc.). Custom operator fusion decisions were made jointly by the TileRT and MiMo model teams.

At 1,000 TPS, the system is operating so close to the physical limits of H100 hardware that any mismatch between model architecture and runtime assumptions manifests as measurable throughput degradation. The co-design wasn't a nice-to-have — it was the only path to breaking the barrier.

6. Deploying It Yourself: SGLang Setup & Code Examples

The MiMo-V2.5-Pro-FP4-DFlash checkpoint is fully open-sourced on HuggingFace and runs via SGLang, which has native support for the DFLASH speculative decoding algorithm. Here's how to deploy it.

Prerequisites

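2× nodes, each with 8× H100 80GB GPUs (or equivalent HBM3 node). Note that this SGLang recipe spans two nodes, whereas the headline TileRT result was measured on a single 8-GPU node.
SGLang ≥ 0.4 with FlashAttention 3 support
~1,280 GB total GPU HBM across both nodes, 640 GB per node (FP4 model fits in ~512 GB; allocate headroom for KV cache)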

Server Launch (Multi-Node)

# Launch the SGLang server with DFlash speculative decoding
# Run on both nodes with appropriate MASTER_ADDR, WORLD_SIZE, and RANK env vars

python3 -m sglang.launch_server \
    --model XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash/dflash \
    --speculative-num-draft-tokens 8 \
    --ep-size 16 \
    --tensor-parallel-size 16 \
    --data-parallel-size 2 \
    --enable-dp-attention \
    --enable-dp-lm-head \
    --quantization fp8 \
    --attention-backend fa3 \
    --moe-dense-tp-size 1 \
    --dtype bfloat16 \
    --mem-fraction-static 0.65 \
    --context-length 65536 \
    --page-size 1 \
    --trust-remote-code \
    --disable-overlap-schedule \
    --skip-server-warmup \
    --dist-init-addr ${MASTER_ADDR}:20000 \
    --nnodes ${WORLD_SIZE} \
    --node-rank ${RANK} \
    --host 0.0.0.0 \
    --port 29999

Key flags explained:

Flag	Purpose
`--speculative-algorithm DFLASH`	Enables the DFlash block-diffusion speculative decoder
`--speculative-num-draft-tokens 8`	Block size for masked prediction (matches training config)
`--ep-size 16`	Expert parallelism across 16 GPUs (distributes MoE routing)
`--tensor-parallel-size 16`	Tensor parallelism for attention layers across 16 GPUs
`--enable-dp-attention`	Data-parallel attention for throughput under batch serving
`--attention-backend fa3`	FlashAttention 3 kernel for maximum attention throughput
`--page-size 1`	Fine-grained KV cache paging for long-context efficiency
`--disable-overlap-schedule`	Required for DFlash; turns off SGLang's overlap scheduler, which otherwise overlaps CPU-side scheduling with GPU execution
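
Loading a checkpoint of this size takes a while, so it is handy to poll the server before sending traffic. The sketch below is one way to do that using only the standard library and the OpenAI-compatible `/v1/models` route. It assumes the server from the launch command above is reachable on port 29999; `fetch_models` and `wait_until_ready` are hypothetical helper names, and the retry counts are arbitrary.

```python
import json
import time
import urllib.request
from typing import Callable, List, Optional


def fetch_models(base_url: str, timeout_s: float = 5.0) -> dict:
    """GET {base_url}/models and return the decoded JSON payload."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout_s) as resp:
        return json.load(resp)


def wait_until_ready(base_url: str, attempts: int = 60, delay_s: float = 10.0,
                     fetch: Callable[[str], dict] = fetch_models,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[List[str]]:
    """Poll until the server answers; return the served model ids, or None
    if it never came up within `attempts` tries."""
    for _ in range(attempts):
        try:
            payload = fetch(base_url)
            return [model["id"] for model in payload.get("data", [])]
        except OSError:  # covers URLError, connection refused, and timeouts
            sleep(delay_s)
    return None


# Usage (requires a running server):
#   ids = wait_until_ready("http://localhost:29999/v1")
#   print(ids)
```

The returned ids are also the safest values to pass as `model` in client requests, since the served name depends on how the server was launched.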

Python Client (OpenAI-Compatible API)

Once the server is running, you can call it via the standard OpenAI SDK — SGLang exposes an OpenAI-compatible endpoint:

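from openai import OpenAI
import time

# SGLang exposes an OpenAI-compatible endpoint
client = OpenAI(
    base_url="http://localhost:29999/v1",
    api_key="none",  # SGLang doesn't require auth by default
)

# --- Streaming inference with TPS measurement ---
def generate_with_tps(prompt: str, max_tokens: int = 512) -> None:
    """
    Stream a response and report end-to-end and decode-only tokens per second.
    Demonstrates the throughput achievable with DFlash + FP4.
    """
    start = time.perf_counter()
    first_token_at = None
    chunk_count = 0
    completion_tokens = None

    stream = client.chat.completions.create(
        model="MiMo-V2.5-Pro-FP4-DFlash",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.0,  # Greedy decoding for repeatable benchmarking
        stream=True,
        # Ask for exact token counts in the final chunk (if the server honors it)
        stream_options={"include_usage": True},
    )

    print("Response: ", end="", flush=True)
    for chunk in stream:
        if chunk.usage is not None:
            completion_tokens = chunk.usage.completion_tokens
        if not chunk.choices:  # The usage-only chunk carries no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # Marks the end of prefill
            print(delta, end="", flush=True)
            chunk_count += 1

    end = time.perf_counter()
    # With speculative decoding one chunk can carry several tokens, so the
    # chunk count is only a fallback when the server returns no usage data.
    tokens = completion_tokens if completion_tokens is not None else chunk_count
    elapsed = end - start
    print(f"\n\n--- {tokens} tokens in {elapsed:.2f}s = {tokens / elapsed:.1f} TPS end-to-end ---")
    if first_token_at is not None and end > first_token_at:
        print(f"--- decode only: {tokens / (end - first_token_at):.1f} TPS ---")


# Example: coding task that benefits most from DFlash
generate_with_tps(
    prompt="""Implement a high-performance LRU cache in Python using OrderedDict.
    Include O(1) get and put operations, thread safety with RWLock,
    and a decorator API for memoization. Add comprehensive docstrings.""",
    max_tokens=1024,
)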

Best-of-N Parallel Sampling (Exploiting 1000 TPS)

At 1,000 TPS, you can afford to run Best-of-N sampling — generating multiple candidate responses and selecting the best — within wall-clock latency budgets that previously only allowed a single response:

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:29999/v1", api_key="none")

async def best_of_n_sample(
    prompt: str,
    n: int = 8,
    verifier=None,
    max_tokens: int = 512,
) -> str:
    """
    Generate N independent responses in parallel and return the best one.
    If the server sustains ~1,000 TPS in aggregate across streams, 8 parallel
    candidates each decode at ~125 TPS, so Best-of-8 finishes in about the
    wall-clock time a single ~125 TPS stream would need.
    """
    tasks = [
        client.chat.completions.create(
            model="MiMo-V2.5-Pro-FP4-DFlash",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            temperature=0.7,  # Diversity for sampling
        )
        for _ in range(n)
    ]

    # Fire all N requests concurrently
    responses = await asyncio.gather(*tasks)
    candidates = [r.choices[0].message.content for r in responses]

    if verifier:
        # Score each candidate with your verifier (unit tests, reward model, etc.)
        scores = [verifier(c) for c in candidates]
        return candidates[scores.index(max(scores))]

    # Fallback: return longest response (proxy for completeness)
    return max(candidates, key=len)


# Usage
async def main():
    result = await best_of_n_sample(
        prompt="Write a Python function that finds the kth largest element "
               "in an unsorted array in O(n) average time.",
        n=8,
    )
    print(result)

asyncio.run(main())

7. Benchmarks & Performance Numbers

Here is the full performance picture from the official MiMo and TileRT blog posts:

Throughput:

Peak decode speed: ~1,200 tokens/second (measured on single 8-GPU node)
Sustained speed: >1,000 tokens/second for standard coding and reasoning tasks
Hardware: Single 8-GPU node with commodity H100s (no custom ASICs, no Groq, no Cerebras)

Speculative Decoding Acceptance Lengths (DFlash, block size=8):

Task	Mean Accepted Tokens	Backbone Passes Saved (vs. no spec dec, 1 − 1/mean)
WebDev / Coding	6.30	~84% reduction
Math / Reasoning	5.56	~82% reduction
SWE-Bench (Agent)	4.29	~77% reduction
MT-Bench (General)	3.18	~69% reduction
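
If each step's accepted tokens come out of a single backbone pass, as the DFlash section assumes, the fraction of backbone passes saved follows directly from the mean acceptance length as 1 − 1/mean. The sketch below recomputes that column from the acceptance lengths; the helper name is illustrative.

```python
def passes_saved(mean_accepted: float) -> float:
    """Fraction of backbone passes avoided when each pass yields
    `mean_accepted` tokens instead of one."""
    return 1.0 - 1.0 / mean_accepted


acceptance = {
    "WebDev / Coding": 6.30,
    "Math / Reasoning": 5.56,
    "SWE-Bench (Agent)": 4.29,
    "MT-Bench (General)": 3.18,
}

for task, mean in acceptance.items():
    print(f"{task:20s} {mean:.2f} tokens/pass -> {passes_saved(mean):.0%} fewer backbone passes")
```

The relationship is strongly sublinear: going from 3.18 to 6.30 accepted tokens roughly doubles throughput but moves the "passes saved" figure by only about 15 points.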

Quality (FP4 vs FP8):

Benchmark	FP8	MXFP4	Δ (relative)
SWE-Bench Pro	57.2%	58.8%	+2.80% ✅
SWE-Bench Verified	78.9%	77.4%	-1.90%
Claw-Eval Agent	63.8%	67.8%	+6.27% ✅
HLE (w/ tool)	48.0%	47.0%	-2.08%
HLE (no tool)	34.0%	33.0%	-2.94%

The quality-throughput tradeoff is remarkably favorable. The worst degradation is ~3% (relative) on a single benchmark, while the throughput gain over the 30–80 TPS baseline cited earlier is ~12–33×.

8. What 1,000 TPS Means for Agentic AI System Design

This section is for engineers building systems, not just running benchmarks. The 1,000 TPS barrier has concrete architectural implications.

1. Best-of-N Becomes Standard Practice

Previously, generating 8 independent reasoning paths and selecting the best required accepting 8× the latency or 8× the cost. At 1,000 TPS serving 8 parallel streams, you can run Best-of-8 sampling within the latency budget that used to afford only Best-of-1. For coding agents, this means automatically running multiple solution attempts and selecting the one that passes the most tests — a pattern that directly improves end-to-end task success rates.

2. Test-Time Compute Scales Differently

Test-Time Scaling (think o1/o3-style extended reasoning) involves generating many tokens of "thinking" before producing an answer. At 100 TPS, a 10,000-token thinking budget costs 100 seconds — user-hostile for interactive applications. At 1,000 TPS, the same 10,000-token budget costs 10 seconds. Extended reasoning becomes viable in near-interactive loops, changing the design space for agents that need to plan before acting.
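
The latency arithmetic above generalizes to any thinking budget. The sketch below tabulates decode time for a few budgets and speeds; it counts decode only and ignores prefill, queueing, and network time. The function name and the chosen budgets are illustrative.

```python
def generation_seconds(tokens: int, tps: float) -> float:
    """Decode time for `tokens` output tokens at a steady `tps`."""
    return tokens / tps


for budget in (1_000, 10_000, 50_000):
    row = ", ".join(
        f"{tps} TPS: {generation_seconds(budget, tps):.0f}s"
        for tps in (50, 100, 1000)
    )
    print(f"{budget:>6,} thinking tokens -> {row}")
```

At 1,000 TPS even a 50,000-token budget stays under a minute of decode time, which is the regime where extended reasoning starts to fit inside an interactive loop.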

3. Multi-Agent Coordination Loops Close Faster

In multi-agent architectures (orchestrator → sub-agents → tool calls → synthesis), the bottleneck is often the synthesis step — a large model aggregating results from many sub-agent calls. At 1,000 TPS, synthesis latency compresses enough that you can run more iterations of the orchestration loop within a fixed time budget, enabling deeper planning horizons.

4. Real-Time Decision Loops for 1T Models

Previously, deploying a 1T-parameter model in latency-sensitive scenarios (fraud detection, real-time bidding, surgical assist) was impractical: the per-token latency was too high for tight decision loops. At 1,000 TPS, a 100-token decision output is generated in ~100 ms of decode time (prefill and network latency come on top). This brings frontier-scale reasoning within reach of sub-second decision loops, though not yet single-digit-millisecond ones.

5. The Developer Experience Shift

The Hacker News discussion captured this best: when an AI coding agent completes a task before you finish reading the prompt back, the human-computer interaction model changes. The bottleneck shifts from model generation to human review. Systems need to be redesigned around a human-in-the-loop pattern where the human is now the slower component — code review UIs, diff approval flows, and asynchronous batch task queues become more important than low-latency streaming displays.

9. Conclusion & What's Next

1,000 tokens per second on a 1-trillion-parameter model. Three months ago, you would have needed a Groq or Cerebras system — purpose-built ASICs that cost orders of magnitude more than commodity GPUs — to approach this number. Today, it runs on a standard 8-GPU node, the kind that sits in virtually every major cloud provider's catalog.

The three techniques that made this possible — Expert-Only MXFP4 Quantization, DFlash Block-Diffusion Speculative Decoding, and TileRT's Persistent Engine with Warp Specialization — each address a distinct layer of the inference stack. But their real power comes from deep hardware-software co-design: model architecture, quantization scheme, draft model design, and GPU execution engine were all engineered together, not composed from independently developed components.

This is the new template for pushing LLM inference optimization closer to the hardware's limits. The lesson isn't "use FP4" or "use speculative decoding"; those are tactics. The lesson is: break down the wall between model design and systems engineering.

For engineers building production systems today:

The open-source MiMo-V2.5-Pro-FP4-DFlash checkpoint is available now on HuggingFace at XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash.
SGLang deployment with DFLASH support is production-ready today.
The UltraSpeed API (via platform.xiaomimimo.com/ultraspeed) is open for application through June 23, 2026 — if you have a high-throughput agentic workload, apply now.

What's on the horizon? The MiMo team has signaled that UltraSpeed support for MiMo-V2.5 (not just V2.5-Pro) is coming, and the DFlash acceptance rates in general-conversation scenarios are still suboptimal — expect further drafter improvements. More broadly, as the industry digests this result, expect FP4 QAT + block-diffusion speculative decoding to become standard components of the LLM inference optimization stack, just as FP8 and vanilla speculative decoding are today.

The sound barrier of LLM inference just got broken. The race to the next threshold starts now.

References

MiMo-V2.5-Pro-UltraSpeed Official Blog — Xiaomi MiMo Team, June 8, 2026
TileRT: Breaking 1000 TPS — TileRT Engineering Blog, June 8, 2026
MiMo-V2.5-Pro-FP4-DFlash HuggingFace Model Card
Hacker News Discussion: 515 points, 374 comments
SGLang Documentation
MX Microscaling Specification (MXFP4)

Found this useful? Star the MiMo HuggingFace repo, drop a comment with your TPS numbers once you deploy, and share this with any engineer building agentic systems who still thinks 1T models are too slow to serve.
