Xiaomi has entered the frontier AI race with MiMo-V2-Flash, a 309B parameter MoE model that achieves state-of-the-art open-source performance on software engineering benchmarks while running at 150 tokens per second.
Key Statistics
| Metric | Value |
|---|---|
| Total Parameters | 309B |
| Active Parameters | 15B |
| SWE-Bench Verified | 73.4% |
| Inference Speed | 150 tok/s |
Key Takeaways
309B MoE with 15B Active Parameters: MiMo-V2-Flash uses a Mixture-of-Experts architecture in which only 15B parameters activate per token, delivering frontier-class capability at dramatically lower inference cost than dense models.
150 Tokens/Second Inference: Optimized for speed with Hybrid Sliding Window Attention and multi-token prediction, achieving inference speeds that enable real-time coding assistance and agentic workflows.
73.4% SWE-Bench Verified: State-of-the-art open-source performance on real-world software engineering tasks, beating DeepSeek-V3.2 (671B) while using a fraction of the compute.
256K Context Window: Long-context capability enables processing entire codebases, documentation sets, and extended conversations without context truncation.
Free on OpenRouter: Available for free (limited time) through OpenRouter, with day-0 SGLang support for optimized serving and speculative decoding.
Introduction
MiMo-V2-Flash represents Xiaomi's ambitious entry into frontier AI development. The 309B parameter Mixture-of-Experts model achieves 73.4% on SWE-Bench Verified—state-of-the-art for open-source models—while activating only 15B parameters per token. This architecture enables inference speeds of 150 tokens per second, making it practical for real-time coding assistance and agentic workflows where latency directly impacts productivity.
The model's technical innovations include Hybrid Sliding Window Attention (SWA) that outperformed linear attention variants, 3-layer multi-token prediction enabling ~2.5x speedup through speculative decoding, and a 256K context window for processing entire codebases. Perhaps most significantly for developers, MiMo is available free on OpenRouter with day-0 SGLang support for optimized serving.
Surprise Entry: Xiaomi—known for smartphones—has shipped a frontier-class coding model that beats DeepSeek-V3.2 on SWE-Bench while being dramatically faster. The phone maker is now an AI player.
MiMo-V2-Flash Technical Specifications
| Specification | Value | Notes |
|---|---|---|
| Total Parameters | 309B | MoE architecture |
| Active Parameters | 15B per token | Sparse activation |
| Context Window | 256K tokens | Long-context support |
| Inference Speed | 150 tok/s | With MTP speculation |
| SWE-Bench Verified | 73.4% | SOTA open-source |
| License | Open-Weight | Commercial use allowed |
Tags: OpenRouter, SGLang Day-0, Hybrid SWA, Multi-Token Prediction, MOPD Training
What is MiMo-V2-Flash
MiMo-V2-Flash is Xiaomi's flagship large language model, released December 2025. The "MiMo" name reflects Xiaomi's internal AI research division, while "V2-Flash" indicates this is the speed-optimized second-generation variant. The model targets agentic coding workflows where inference speed and cost directly impact productivity.
The 309B MoE architecture means the model contains 309 billion total parameters distributed across expert networks, but only 15 billion activate for any given token. This sparse activation pattern enables frontier-class capability at a fraction of the inference cost of a comparably sized dense model. The efficiency gains compound over long conversations and complex agentic loops.
Why MoE Architecture Matters
- Cost Efficiency: Only 15B of 309B parameters compute per token, roughly a 20x reduction in per-token compute versus an equivalent dense model
- Speed: Smaller active parameter count enables 150 tok/s inference with speculative decoding
- Capability: Total 309B parameters provide frontier-level knowledge and reasoning
- Scalability: Router learns which experts to activate per task, enabling specialization
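To make the sparse-activation idea concrete, here is a minimal top-k MoE layer in PyTorch. The sizes, expert count, and routing details are illustrative only, not MiMo's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Minimal top-k MoE layer: a router picks a few experts per token,
    so only a fraction of the layer's parameters run for any given token."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: (tokens, d_model)
        gate_logits = self.router(x)             # (tokens, num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e         # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = SparseMoELayer()
print(layer(torch.randn(8, 512)).shape)          # torch.Size([8, 512])
```

Only the selected experts ever run a forward pass for a given token, which is where the cost savings of sparse activation come from.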
Architecture Innovations
MiMo's technical report details several architectural innovations that emerged from extensive ablation studies. These aren't incremental improvements but fundamental design choices that differentiate MiMo from other MoE models.
Hybrid Sliding Window Attention
Combines sparse local windows with global attention layers for efficient long-context processing.
- A 128-token window outperformed 512 in post-training ablations
- Outperformed linear attention variants
- Attention sinks critical for stability
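A rough sketch of what the hybrid masking looks like in practice. The 128-token window and the use of attention sinks come from the report; the sink count and the local/global interleave ratio below are assumptions for illustration:

```python
import torch

def sliding_window_mask(seq_len, window=128, num_sinks=4):
    """Causal mask where each query attends to the previous `window` tokens
    plus a few fixed "attention sink" tokens at the start of the sequence."""
    q = torch.arange(seq_len).unsqueeze(1)       # query positions (rows)
    k = torch.arange(seq_len).unsqueeze(0)       # key positions (columns)
    causal = k <= q
    in_window = (q - k) < window
    is_sink = k < num_sinks
    return causal & (in_window | is_sink)        # True = attention allowed

def global_causal_mask(seq_len):
    """Full causal mask used on the sparse set of global-attention layers."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Hypothetical interleave: most layers attend locally, every 6th layer goes global.
masks = [global_causal_mask(1024) if (i + 1) % 6 == 0 else sliding_window_mask(1024)
         for i in range(24)]
```

Local layers keep attention cost roughly linear in sequence length, while the occasional global layer preserves long-range coherence across the 256K context.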
Multi-Token Prediction (MTP)
Predicts multiple future tokens simultaneously for speculative decoding speedup.
- 3-layer MTP architecture
- Average accept length above 3 tokens
- ~2.5x speedup on coding tasks
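The sketch below shows the basic draft-and-verify idea behind speculative decoding with greedy verification. In MiMo the draft comes from the built-in MTP heads and frameworks like SGLang handle this internally; the Hugging-Face-style `target_model` and the `draft_tokens_fn` stand-in here are illustrative assumptions:

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_tokens_fn, input_ids, k=3):
    """One greedy draft-and-verify step. `draft_tokens_fn` stands in for the
    MTP heads: it cheaply proposes `k` future tokens; the full model then
    scores context + draft in a single forward pass and keeps the longest
    prefix that matches its own greedy choices."""
    proposal = draft_tokens_fn(input_ids, k)                  # (k,) drafted token ids
    candidate = torch.cat([input_ids, proposal])              # 1-D: context + draft
    logits = target_model(candidate.unsqueeze(0)).logits[0]   # (len, vocab)
    accepted = []
    for i in range(k):
        # Logits at position len(context) - 1 + i predict the token at len(context) + i.
        pred = logits[input_ids.shape[0] - 1 + i].argmax()
        accepted.append(pred)
        if pred != proposal[i]:                               # first mismatch: keep the
            break                                             # model's own token and stop
    return torch.cat([input_ids, torch.stack(accepted)])
```

Each step therefore emits between 1 and k tokens for a single full-model forward pass, which is why an average accept length above 3 translates into a ~2.5x wall-clock speedup.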
Training Innovation: MOPD (multi-teacher on-policy distillation) achieved teacher-quality outputs at less than 1/50th the typical SFT+RL compute cost, a significant efficiency breakthrough.
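Xiaomi has not published the full MOPD recipe, but the core of on-policy distillation can be sketched as follows. This is a generic sketch assuming Hugging-Face-style causal LMs; the single-teacher setup and the reverse-KL loss are illustrative choices, not MiMo's documented method (MOPD additionally mixes multiple teachers):

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, prompt_ids, max_new_tokens=64):
    """Illustrative on-policy distillation step: the student samples its own
    continuation, then is trained to match the teacher's next-token
    distribution on that self-generated text."""
    # 1. Generate on-policy data with the student (no grad through sampling).
    with torch.no_grad():
        rollout = student.generate(prompt_ids, max_new_tokens=max_new_tokens,
                                   do_sample=True)
        t_logits = teacher(rollout).logits        # teacher scores the rollout
    s_logits = student(rollout).logits            # student scores it with grad

    # 2. Compare distributions only on the generated region.
    gen = slice(prompt_ids.shape[1] - 1, rollout.shape[1] - 1)  # positions predicting new tokens
    s_logp = F.log_softmax(s_logits[:, gen], dim=-1)
    t_logp = F.log_softmax(t_logits[:, gen], dim=-1)

    # 3. Reverse KL(student || teacher), a common choice for on-policy distillation.
    kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)   # per generated position
    return kl.mean()
```

Because the student learns on its own rollouts rather than on fixed teacher transcripts, it is corrected exactly where its own behavior diverges, which is what makes this style of distillation so compute-efficient compared with large-scale SFT+RL.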
Benchmark Performance
MiMo-V2-Flash achieves state-of-the-art open-source performance on software engineering benchmarks, competing with models many times its effective size.
| Benchmark | MiMo-V2-Flash | DeepSeek-V3.2 | GPT-4 |
|---|---|---|---|
| SWE-Bench Verified | 73.4% | ~70% | ~65% |
| SWE-Bench Multilingual | 71.7% | ~68% | ~62% |
| LiveCodeBench v5 | Top tier | Comparable | Strong |
| Inference Speed | 150 tok/s | ~30 tok/s | ~40 tok/s |
Key Insight: MiMo matches DeepSeek-V3.2's capability while being ~5x faster at inference. For coding tasks requiring many iterations, this speed advantage compounds significantly.
MiMo vs DeepSeek: Detailed Comparison
Both MiMo-V2-Flash and DeepSeek-V3.2 represent the frontier of open-weight coding models, but they make different architectural tradeoffs.
| Aspect | MiMo-V2-Flash | DeepSeek-V3.2 |
|---|---|---|
| Architecture | 309B MoE (15B active) | 671B MoE |
| Speed | 150 tok/s (faster) | ~30 tok/s |
| Context | 256K tokens | 128K tokens |
| SWE-Bench | 73.4% (higher) | ~70% |
| Best For | Speed-critical coding | Complex reasoning |
Choose MiMo When
- Speed is critical for your workflow
- You need 256K context for large codebases
- Running many agentic iterations
- Cost optimization is a priority
Choose DeepSeek When
- Maximum reasoning depth needed
- You have existing DeepSeek integrations
- Broader general knowledge required
- Speed is less critical than quality
Getting Started
MiMo-V2-Flash is accessible through multiple channels, from zero-setup cloud APIs to self-hosted deployments.
OpenRouter (Easiest)
Free access, no setup required
- Visit openrouter.ai
- Select MiMo-V2-Flash model
- Free for limited time
- OpenAI-compatible API
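Because the endpoint is OpenAI-compatible, the standard `openai` Python client works out of the box. The model slug below is a guess; confirm the exact identifier on the OpenRouter model page:

```python
# pip install openai -- OpenRouter exposes an OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)

resp = client.chat.completions.create(
    # Hypothetical model slug -- check OpenRouter for the exact identifier.
    model="xiaomi/mimo-v2-flash",
    messages=[{"role": "user", "content": "Write a Python function that parses RFC 3339 timestamps."}],
)
print(resp.choices[0].message.content)
```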
SGLang (Self-Hosted)
Optimized inference with MTP
- Day-0 SGLang support
- Speculative decoding enabled
- Full speed optimization
- Requires GPU infrastructure
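A minimal self-hosting sketch: launch the SGLang server, then talk to its OpenAI-compatible endpoint. The launch flags and model name below are assumptions; consult the SGLang docs and Xiaomi's model card for the recommended MTP/speculative-decoding settings:

```python
# Assumed launch command (verify flags against the SGLang docs, especially the
# speculative-decoding options that enable MTP):
#   python -m sglang.launch_server --model-path <path-to-mimo-v2-flash> --tp 8 --port 30000
#
# Once the server is up, it exposes an OpenAI-compatible API:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="default",  # SGLang serves the loaded model; the expected name can vary by version
    messages=[{"role": "user", "content": "Refactor this recursive function to be iterative."}],
)
print(resp.choices[0].message.content)
```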
Best Use Cases
Agentic Coding
- Fast iteration loops with tool use
- Multi-step code generation
- SWE-Bench validated performance
Codebase Analysis
- 256K context for entire projects
- Cross-file understanding
- Documentation generation
Cost-Sensitive Deployments
- 15B active params = lower inference cost
- Free on OpenRouter (limited time)
- Self-hosting option available
Real-Time Assistance
- 150 tok/s enables responsive UX
- IDE integration viable
- Interactive coding sessions
When NOT to Use MiMo-V2-Flash
Avoid MiMo For
- Non-coding tasks: Optimized for code; use general models for other tasks
- Mission-critical production (yet): New model; evaluate thoroughly before deployment
- Regulatory-constrained environments: Chinese origin may have compliance implications
Use MiMo For
- Speed-critical coding: 150 tok/s makes iteration loops fast
- Open-source requirements: Open-weight with commercial license
- Cost-conscious deployments: MoE architecture reduces inference costs
Common Mistakes to Avoid
Assuming Dense Model Behavior
Mistake: Expecting MiMo to behave like a 309B dense model.
Fix: Understand that only 15B parameters activate per token; effective capability sits between that of a 15B and a 309B dense model.
Not Using Speculative Decoding
Mistake: Running MiMo without MTP, missing the ~2.5x speed advantage.
Fix: Use SGLang or compatible frameworks that enable multi-token prediction.
Ignoring Context Window Benefits
Mistake: Truncating context when 256K is available.
Fix: Leverage full context for codebase understanding and complex tasks.
Using for Non-Coding Tasks
Mistake: Expecting strong performance on general knowledge tasks.
Fix: MiMo is optimized for coding; use general models for other tasks.
Frequently Asked Questions
What is MiMo-V2-Flash and who created it?
MiMo-V2-Flash is a 309B parameter Mixture-of-Experts (MoE) language model created by Xiaomi, the Chinese smartphone manufacturer. Released in December 2025, it represents Xiaomi's entry into frontier AI models. Despite the 309B total parameters, only 15B activate per token, enabling fast inference while maintaining strong capability. The model is open-weight under a permissive license, making it available for commercial use.
How does MiMo-V2-Flash compare to DeepSeek-V3.2?
MiMo-V2-Flash matches or exceeds DeepSeek-V3.2 on many benchmarks while being significantly faster. On SWE-Bench Verified, MiMo achieves 73.4% versus roughly 70% for DeepSeek; on SWE-Bench Multilingual, MiMo scores 71.7%. The key advantage is latency: MiMo activates only 15B parameters per token, enabling much faster and cheaper inference than DeepSeek's 671B-parameter model. For coding tasks where speed matters, MiMo is often the better choice.
What is Hybrid Sliding Window Attention (SWA)?
Hybrid SWA is MiMo's attention mechanism that combines sparse local windows with a small set of global attention layers. Local windows (size 128 tokens) handle most computation efficiently, while global layers maintain long-range coherence. This hybrid approach outperformed pure linear attention variants in ablations. Post-training, window size 128 proved better than 512 for long-context tasks, and attention sinks were found to be critical for stability.
What is multi-token prediction (MTP) and why does it matter?
Multi-token prediction enables MiMo to predict multiple future tokens simultaneously rather than one at a time. With 3-layer MTP, MiMo achieves an average accept length above 3 tokens, providing roughly a 2.5x speedup on coding tasks through speculative decoding. This is particularly valuable for agentic workflows where speed directly impacts productivity. MTP is integrated with SGLang for day-0 optimized serving.
Where can I access MiMo-V2-Flash?
MiMo-V2-Flash is available through multiple channels: OpenRouter (free for limited time at openrouter.ai), SGLang (day-0 support with optimized inference), and direct model weights from Xiaomi's release. The model works with standard inference frameworks and can be self-hosted. For most users, OpenRouter provides the easiest starting point with no setup required.
What are the best use cases for MiMo-V2-Flash?
MiMo excels at: (1) Agentic coding workflows requiring fast iteration and tool use, (2) Long-context analysis of codebases up to 256K tokens, (3) Real-time coding assistance where latency matters, (4) Cost-sensitive deployments needing frontier capability, and (5) Open-source/self-hosted scenarios requiring model access. It's particularly strong on software engineering benchmarks like SWE-Bench and LiveCodeBench.
How was MiMo-V2-Flash trained?
MiMo used MOPD (multi-teacher on-policy distillation) for post-training, achieving teacher-quality outputs at less than 1/50th the typical SFT+RL compute cost. The training emphasized coding and agentic capabilities. Key architectural decisions validated through ablations: Hybrid SWA over linear attention, window size 128 over 512, attention sinks for stability, and 3-layer MTP for speculative decoding.
Is MiMo-V2-Flash suitable for production use?
Yes, with caveats. MiMo is production-ready for coding and technical tasks where it benchmarks well. Day-0 SGLang support enables optimized serving. However, being newly released (December 2025), it lacks the deployment track record of models like GPT-4 or Claude. For mission-critical production, evaluate thoroughly on your specific use cases before full deployment. The open-weight nature allows inspection and customization.