Xiaomi's MiMo Code gets better as tasks get harder. Here's how.

#ai #llm #machinelearning #devops

Xiaomi's MiMo team just open-sourced MiMo Code — a terminal coding agent built on top of OpenCode, MIT licensed. The pitch isn't raw benchmark numbers. It's something more specific: what happens when a task takes 50, 100, or 200+ steps?

Most coding agents fall apart at scale. Context fills up, the model loses the thread, and it either halts or confidently completes the wrong thing. MiMo Code is explicitly engineered around those failure modes.

"A single step error rate gets amplified across long-horizon runs, with no human correction signal in the loop."

That framing shapes every architectural decision they made.

What actually changed (vs. standard coding agents)

MiMo Code organises its design around three themes: Compute, Memory, and Evolution.

Compute — spending more tokens on reliability, not just output:

Max Mode runs 5 parallel candidates each step (temperature=1), then uses a low-temperature judge to pick the best — before executing anything. 10–20% improvement on SWE-Bench Pro, at 4–5x token cost. Experimental, opt-in.
Goal is an independent verifier that fires when the agent tries to call "done." A separate model call reviews the full conversation history and checks whether the completion condition is actually met. If not, it tells the agent what's missing. Loop rate under 0.5%.
Dynamic Workflow converts orchestration from prompt text to executable JavaScript in an isolated sandbox. parallel(), pipeline(), barrier() — no more natural language step sequences that the model can skip or misinterpret.

Memory — keeping the logical session alive past the context window:

The key concept is a Cycle: the runtime inserts checkpoints at 20%, 45%, and 70% of the context budget (not when the window is almost full — deliberately early, when the model can still reason clearly). A background writer subagent reads the conversation and writes a structured 11-field checkpoint file. When the window eventually fills, the runtime opens a fresh window and rebuilds state from those files — under 65K tokens, covering tasks, session state, project memory, recent messages verbatim, and a tail reminder.

The main agent never writes its own memory. Only the writer touches those files. Single-writer, no ambiguity.

Evolution — not starting fresh every session:

Project-level memory (a MEMORY.md per repo) persists architecture decisions, user rules, verified facts. Every 7 days, a Dream agent consolidates and deduplicates it. Every 30 days, a Distill agent scans historical sessions for repeated patterns and promotes them into reusable skills and SOPs.

The benchmark that matters

On standard offline benchmarks (single-repo, one-shot problems), MiMo Code + MiMo-V2.5-Pro beats Claude Code + Claude Sonnet 4.6 across three evals. But Xiaomi flags this themselves: those benchmarks don't reflect what MiMo Code is actually optimised for.

Their human A/B test is more telling: 576 developers, 474 real private repos, 1,213 matched pairs, same task run against both agents blind. Under 200 steps: roughly 50/50. Over 200 steps: MiMo Code wins 65%+ of the time.

The architecture is doing exactly what it was designed to do.

What to do

Evaluating coding agents for long tasks? This is worth a look. Install is one line: npm install -g @mimo-ai/cli or curl -fsSL https://mimo.xiaomi.com/install | bash. Free tier uses MiMo-V2.5 with 1M token context.
Running Claude Code on short tasks? Probably no urgency — the under-200-step performance is comparable.
Building your own agent harness? The checkpoint-early insight (extract state at 20/45/70%, not 95%) and the separate writer subagent pattern are worth stealing regardless of what runtime you're using.
Interested in the model? MiMo-V2.5-Pro is what powers the competitive benchmark results. GitHub.