pueding

Posted on Jun 4 • Originally published at learnaivisually.com

Microsoft MAI-Code-1-Flash: Adaptive Solution-Length Control

#ai #machinelearning #llm

What: Microsoft's first in-house coding model, MAI-Code-1-Flash (launched at Build 2026 alongside the MAI-Thinking-1 reasoner), ships adaptive solution-length control — the model decides how many reasoning tokens to spend based on how hard the task is.

Why: Reasoning tokens are the dominant cost of a thinking model: every token is a decode step you pay for in latency and dollars. Spending the same long chain on a one-line fix as on a cross-file refactor wastes most of that budget — adaptive length spends it only where it buys accuracy.

vs prior: A fixed-budget reasoner thinks to roughly the same length on every prompt; adaptive control stops at the point the answer is reached, so Microsoft reports the model hitting its scores with up to 60% fewer tokens than a flat budget would burn.

Think of it as

A test-taker who budgets minutes by how hard each question looks.

              SAME EXAM: easy · medium · hard
                            │
            FIXED BUDGET    │    ADAPTIVE CONTROL
            same slot/Q     │    slot sized to Q
                            │
      easy   ██████████ 10m │ easy   █ 0.5m
      medium ██████████ 10m │ medium █████ 6m
      hard   ██████████ 10m │ hard   ██████████ 10m
                            │
       ✗ burns minutes on   │  ✓ same score,
         the easy ones      │    far fewer minutes

task = one exam question
reasoning tokens = minutes spent working it out
fixed budget = giving every question the same long time slot
adaptive control = sizing the minutes to each question's difficulty
answer reached = the moment you have it, so you move on

Quick glossary

Reasoning tokens — The intermediate "thinking" tokens a model generates one at a time before its final answer (the chain-of-thought). More reasoning tokens = more decode cost.

Solution length — How long that reasoning chain runs before the model commits to an answer. Adaptive solution-length control lets the model choose this length per task instead of using a fixed cap.

Test-time compute — Compute spent at inference (not training) — chiefly by generating more reasoning tokens. Spending more usually helps hard problems and is wasted on easy ones.

SWE-Bench Pro / Verified — Benchmarks of real GitHub issues a model must resolve with working code. Microsoft reports MAI-Code-1-Flash at 51.2% on SWE-Bench Pro vs Claude Haiku 4.5's 35.2%, using up to 60% fewer tokens on SWE-Bench Verified.

Sparse MoE — Mixture-of-Experts: each token is routed through a small subset of "expert" sub-networks. MAI-Thinking-1, the reasoner alongside the coding model, is a sparse MoE with 35B active parameters and a 256K context.

Underthinking — Stopping the chain too early and committing to a wrong answer — the failure mode a fixed-minimum budget risks, and the reason a good stop signal is the hard part of adaptive length.

The news. On June 2, 2026, at Build 2026, Microsoft introduced its first in-house frontier models: MAI-Thinking-1, a 35B-active sparse MoE reasoner with a 256K context that Microsoft says was trained from scratch on licensed data with no distillation from third-party models, and MAI-Code-1-Flash, a small, inference-efficient coding model built end-to-end by Microsoft and rolling out to GitHub Copilot users in VS Code. Microsoft reports MAI-Code-1-Flash leading Claude Haiku 4.5 by 16 points on SWE-Bench Pro (51.2% vs 35.2%) while using up to 60% fewer tokens on SWE-Bench Verified, a saving it credits to adaptive solution-length control.

Picture the test-taker for a second. Two students sit the same exam. The first was told to spend exactly ten minutes per question — so she burns the full ten on "2 + 2," sits there second-guessing a settled answer, and runs short on the proof at the end. The second reads each question, sizes up the effort, and moves on the moment she's sure — thirty seconds on the arithmetic, the full ten on the proof. Same paper, same score, far less time. Adaptive solution-length control is the second student: the model spends its reasoning where difficulty actually demands it, instead of paying a flat tax on every task.

Under the hood, the "minutes" are reasoning tokens. A thinking model generates its chain-of-thought one token at a time before answering, and every one of those tokens is a decode step you pay for in latency and dollars. A fixed budget sets one length for all prompts; adaptive control instead decides how long to keep thinking and, crucially, when to stop. Microsoft hasn't disclosed the exact controller: the length could be predicted up front, set by a learned stop signal mid-chain, or chosen some other way. Treat the mechanism as undisclosed; what's reported is the outcome, the same benchmark scores with up to 60% fewer tokens.
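To make "latency and dollars" concrete, here is a minimal sketch of a linear decode-cost model. The price and throughput are hypothetical placeholders, not figures from Microsoft or any provider; the only point is that both costs scale with the number of reasoning tokens.

```python
from dataclasses import dataclass


@dataclass
class DecodeCost:
    """Toy cost model: every reasoning token is one paid decode step."""

    price_per_million_tokens: float  # hypothetical output-token price, in USD
    tokens_per_second: float         # hypothetical decode throughput

    def dollars(self, reasoning_tokens: int) -> float:
        return reasoning_tokens * self.price_per_million_tokens / 1_000_000

    def seconds(self, reasoning_tokens: int) -> float:
        return reasoning_tokens / self.tokens_per_second


# Made-up numbers, chosen only to show the linear scaling.
cost = DecodeCost(price_per_million_tokens=2.0, tokens_per_second=100.0)

for tokens in (200, 2_000):  # a short chain vs a long chain
    print(f"{tokens} tokens: {cost.seconds(tokens):.1f}s, ${cost.dollars(tokens):.4f}")
```

Under these assumptions a chain ten times longer takes ten times the wall-clock time and ten times the money, which is why trimming chain length is the lever that matters.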

Where the tokens actually go

A back-of-envelope walk-through (illustrative numbers; the 60% figure is Microsoft's). Take three Copilot tasks: an easy one-line fix, a medium multi-step bug, and a hard cross-file refactor. A fixed budget of ~2,000 reasoning tokens spends the same on all three → ~6,000 tokens total, even though the easy fix had its answer after ~200. Adaptive control stops each chain at its answer, so it spends about 200 + 650 + 1,650 = 2,500 tokens for the same result. That's ~58% fewer tokens in this toy mix, in line with the up to 60% fewer Microsoft reports. The hard task changes least (2,000 → 1,650); about 90% of the savings come from not over-thinking the easy and medium ones.
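The same arithmetic as a few lines of Python, for anyone who wants to swap in their own task mix. The token counts are the toy numbers from the walk-through, not measurements.

```python
# Illustrative token counts from the walk-through (not measured data).
FIXED_BUDGET = 2_000  # reasoning tokens spent on every task under a flat budget

# Tokens each task needs before its answer is reached (toy numbers).
tokens_to_answer = {"easy": 200, "medium": 650, "hard": 1_650}

fixed_total = FIXED_BUDGET * len(tokens_to_answer)
adaptive_total = sum(tokens_to_answer.values())
savings = 1 - adaptive_total / fixed_total

print(f"fixed:    {fixed_total} tokens")
print(f"adaptive: {adaptive_total} tokens")
print(f"saved:    {savings:.1%}")

# Share of the total saving contributed by each task.
saved_per_task = {name: FIXED_BUDGET - used for name, used in tokens_to_answer.items()}
total_saved = sum(saved_per_task.values())
for name, saved in saved_per_task.items():
    print(f"{name:<6} saves {saved} tokens ({saved / total_saved:.0%} of the total)")
```

Changing the mix changes the headline percentage: a workload made mostly of hard tasks would save far less, which is one reason to read "up to 60%" as a ceiling.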

Three ways to set the reasoning length

Strategy	Easy task	Hard task	Main risk
Fixed-max budget	thinks far past the answer	fits — has room	over-thinking: burns tokens it doesn't need
Fixed-min budget	fits — short is fine	cut off too early	underthinking: commits to wrong answers
Adaptive control	short chain	long chain	needs a reliable stop signal
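The table can be played out as a toy simulation. Everything here is an assumption for illustration: the task sizes, the 2,000 and 300 token budgets, and above all the adaptive policy, which is an idealised oracle that stops exactly at the answer. A real controller has to estimate that point, and this sketch does not model it getting the estimate wrong.

```python
from typing import Callable

# Toy tasks: reasoning tokens each needs before its answer is reached (made-up numbers).
TASKS = {"easy": 200, "hard": 1_650}


def fixed_budget(budget: int) -> Callable[[int], int]:
    """Policy that always spends exactly `budget` tokens, whatever the task needs."""
    return lambda needed: budget


def adaptive(needed: int) -> int:
    """Idealised policy: stops exactly when the answer is reached."""
    return needed


POLICIES = {
    "fixed-max": fixed_budget(2_000),  # hypothetical generous cap
    "fixed-min": fixed_budget(300),    # hypothetical tight cap
    "adaptive": adaptive,
}


def run(policy: Callable[[int], int]) -> dict[str, tuple[int, bool]]:
    """Return (tokens spent, solved?) per task for a given policy."""
    results = {}
    for name, needed in TASKS.items():
        spent = policy(needed)
        results[name] = (spent, spent >= needed)
    return results


for label, policy in POLICIES.items():
    print(label, run(policy))
```

In this toy setup fixed-max solves both tasks but overspends on the easy one, fixed-min is cheap but fails the hard one, and the oracle solves both at minimum cost.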

The catch lives in that last cell. A generous fixed budget is wasteful but safe; adaptive length is only as good as its sense of when it's done. Stop too early on a hard task and you get underthinking: a confident wrong answer that's worse than a slow right one. That's plausibly why the headline number comes from a coding model: in software, a test or verifier can often tell the model whether it's actually done, giving the stop signal something concrete to lean on. The win is real and specific (fewer reasoning tokens for the same accuracy), and it rides entirely on getting that stop right.


Related explainers

Compute Where It Counts — Per-token compute controller — the other axis of adaptive compute: how much work each token gets, vs how many tokens the chain runs
LongTraceRL — Rubric reward (process supervision) — how reasoning chains get trained, where a good stop signal would come from
Gemini 3.5 Flash — Agent-first model design — a related angle: building a model for the agent loop rather than retrofitting chat

FAQ

What is adaptive solution-length control?

It's a model's ability to scale the length of its reasoning chain to the difficulty of the task. Instead of a fixed cap on reasoning tokens for every prompt, the model spends a short chain on easy tasks and a long one only on hard tasks, stopping when it has reached an answer. Microsoft's MAI-Code-1-Flash uses it to hit its benchmark scores with up to 60% fewer tokens than a flat budget would use.

Why does it save so much without losing accuracy?

Because the savings come from tasks that were over-thought, not under-thought. On an easy fix, a long reasoning chain reaches the answer early and then generates tokens past it — those extra tokens cost latency and money but don't change the result. Trimming the chain to the point the answer was reached removes pure waste. Hard tasks, which genuinely need a long chain, are barely affected.

How is it different from a per-token compute controller?

They tune different dials. A per-token compute controller (as in the "Compute Where It Counts" paper) changes how much compute each individual token gets — attention sparsity, layer pruning, bit-width. Adaptive solution-length control changes how many reasoning tokens the chain runs in total. One sizes the work per token; the other sizes the number of tokens. They're complementary.
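One simplified way to write the two dials down, as bookkeeping rather than anything taken from the paper or from Microsoft: let N be the number of reasoning tokens in a chain and c_t the compute spent on token t (both symbols are introduced here for illustration).

```latex
C_{\text{total}} = \sum_{t=1}^{N} c_t
```

A per-token controller shrinks the individual c_t terms; adaptive solution-length control shrinks N, the number of terms in the sum. Either one lowers the total, and the two can be combined.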

