
Jovan Chan

Posted on • Originally published at aicoderscope.com

Microsoft MAI-Code-1-Flash in GitHub Copilot 2026: 137B/5B MoE, Sub-Second Latency, and Whether It Beats Claude Haiku 4.5 on Your Bill


TL;DR: Microsoft's first in-house coding model, MAI-Code-1-Flash, hit general availability for Copilot Business and Enterprise on June 26, 2026. It is a 137B-total / 5B-active sparse MoE built for speed, and Microsoft's own numbers put it ahead of Claude Haiku 4.5 on every coding benchmark it published, while using up to 60% fewer tokens. For the cheap-and-fast lane of your Copilot stack, it is now the model to beat. It is not a replacement for a frontier model on hard multi-file refactors.

| | MAI-Code-1-Flash | Claude Haiku 4.5 | A frontier model (Sonnet 4.6 / GPT-5.5) |
| --- | --- | --- | --- |
| Best for | Fast inline edits, autocomplete, small functions, cheap agent loops | Same lane, slightly weaker on agentic tasks | Deep multi-file refactors, architecture, planning |
| Copilot token price (in / out per 1M) | $0.75 / $4.50 | $1.00 / $5.00 | Several times higher |
| SWE-Bench Pro | 51.2 | 35.2 | Higher, at much higher cost |
| The catch | 5B active params: shallow on complex reasoning | Pricier and slower for the same tier | Burns AI Credits fast under usage billing |

Honest take: If your daily driver is the cheap-tier model in Copilot's picker, switch it to MAI-Code-1-Flash today — it is faster, scores higher, and bills lower than Claude Haiku 4.5. Keep a frontier model one click away for the hard problems; this is a workhorse, not a brain.

Microsoft has been licensing other people's models inside GitHub Copilot for years. MAI-Code-1-Flash is the first coding model it trained end-to-end itself, and the timing is not subtle: the June 26 GA lands less than four weeks into Copilot's usage-based billing era, where every token you spend shows up on a bill. A model that is cheaper per token and also uses fewer tokens is a direct answer to the billing backlash that followed the June 1 AI Credits switch.

What MAI-Code-1-Flash actually is

It is a sparse Mixture-of-Experts transformer: 137 billion total parameters, but only 5 billion active per token. That split is the whole design. The 5B active count is why it responds fast and bills cheap — the model only lights up a small slice of its weights for any given request. The 137B total is why it still has enough knowledge to score respectably on real benchmarks instead of collapsing the way a dense 5B model would.
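To make the total-versus-active split concrete, here is a minimal sketch of top-k expert routing, the general technique behind sparse MoE layers. Microsoft has not published the expert count or routing scheme in anything quoted here, so the 8 experts, the top-2 choice, and the gate scores below are purely hypothetical; only the 137B and 5B figures come from the article.

```python
import math

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected experts only, so their weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return [(i, exps[i] / total) for i in top]

def moe_forward(x, experts, gate_logits, k):
    """Mix the outputs of only the k routed experts; the rest are never called."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(gate_logits, k))

# Hypothetical toy layer: 8 experts, each a stand-in function, 2 active per token.
experts = [lambda x, scale=scale: scale * x for scale in range(1, 9)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # made-up gate scores

routed = top_k_route(gate_logits, k=2)
output = moe_forward(3.0, experts, gate_logits, k=2)

# Figures from the article: 5B active parameters out of 137B total.
active_fraction = 5 / 137

print("routed experts:", [i for i, _ in routed])
print(f"share of weights active per token: {active_fraction:.1%}")
```

In a real model the always-on parts (attention, embeddings) count toward the active total too, so the 5B figure is not purely expert weights. The point of the sketch is only that the unrouted experts cost nothing for that token.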

Context window is 256K tokens, which is generous for a model in this class and enough to hold a meaningful slice of a repo in an agent loop. Microsoft says it was trained from March to May 2026 on "clean and appropriately licensed data," and — the more interesting claim — trained directly against the GitHub Copilot harness developers use in production, not against synthetic benchmark suites. Whether that holds up in your codebase is the open question, but it is at least a different bet than "train on everything and hope."
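If you want a rough feel for how much of a repo fits in 256K tokens, a back-of-the-envelope estimate is enough. The sketch below assumes roughly 4 characters per token, which is a common rule of thumb and not this model's real tokenizer, and reserves a hypothetical 16K tokens for instructions and the reply. The file names and sizes are invented.

```python
CONTEXT_WINDOW_TOKENS = 256_000  # figure quoted in the article

def estimate_tokens(num_chars, chars_per_token=4.0):
    """Rough token estimate; chars_per_token is a rule-of-thumb assumption."""
    return int(num_chars / chars_per_token)

def files_that_fit(file_sizes, reserved_tokens=16_000):
    """Greedily pick files, smallest first, until the prompt budget is used up.

    file_sizes maps path -> size in characters. reserved_tokens is a
    hypothetical allowance for instructions and the model's reply.
    """
    budget = CONTEXT_WINDOW_TOKENS - reserved_tokens
    chosen, used = [], 0
    for path, size in sorted(file_sizes.items(), key=lambda item: item[1]):
        cost = estimate_tokens(size)
        if used + cost > budget:
            break
        chosen.append(path)
        used += cost
    return chosen, used

# Hypothetical repo slice, sizes in characters.
repo = {
    "api.py": 40_000,
    "models.py": 120_000,
    "schema.sql": 300_000,
    "vendor.js": 900_000,
}
chosen, used = files_that_fit(repo)
print(chosen, used)
```

Copilot's own harness decides what actually goes into context, so treat this as a sizing intuition, not a description of how the agent selects files.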

The other architectural note worth knowing is adaptive thinking. The model allocates almost no reasoning budget to a simple autocomplete and expands to multi-step reasoning only when the task looks like a refactor or an architecture question. That is how it avoids the latency tax of always-on chain-of-thought while keeping quality up on the harder requests. In practice it means tab completion stays instant and the agent does not stall on trivial edits.

The benchmark numbers (and the asterisk)

Here are Microsoft's published results, MAI-Code-1-Flash vs Claude Haiku 4.5:

| Benchmark | MAI-Code-1-Flash | Claude Haiku 4.5 | Delta |
| --- | --- | --- | --- |
| SWE-Bench Verified | 71.6 | 66.6 | +5.0 |
| SWE-Bench Pro | 51.2 | 35.2 | +16.0 |
| Terminal Bench 2 | 54.8 | 41.6 | +13.2 |
| Token use (SWE-Bench Verified) | up to 60% fewer | baseline | |

The asterisk: these are Microsoft's own numbers, comparing its new model against one specific competitor it chose. That is not the same as an independent leaderboard, and the comparison is pointedly against Claude Haiku 4.5 — Anthropic's small, cheap model — not against Sonnet 4.6 or any frontier tier. The honest reading is "MAI-Code-1-Flash is now the strongest model in the budget lane," not "MAI-Code-1-Flash beats Claude." On Terminal-Bench 2's public leaderboard, frontier models like GPT-5.5 sit far above 54.8. Keep the comparison in its weight class.

That said, +16 points on SWE-Bench Pro over the model it is replacing is a large gap for a same-tier swap, and the token-efficiency claim is the part that actually shows up on your invoice.
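To put those gaps in proportion, the published scores can be turned into relative gains over the Haiku baseline. This is a small sketch using only the numbers in Microsoft's table; the `compare` helper is just an illustration.

```python
# Scores copied from Microsoft's published table: (MAI-Code-1-Flash, Claude Haiku 4.5).
scores = {
    "SWE-Bench Verified": (71.6, 66.6),
    "SWE-Bench Pro": (51.2, 35.2),
    "Terminal Bench 2": (54.8, 41.6),
}

def compare(mai, haiku):
    """Return (absolute gap in points, relative gain over the Haiku score in %)."""
    delta = round(mai - haiku, 1)
    relative = round(100 * (mai - haiku) / haiku, 1)
    return delta, relative

results = {name: compare(*pair) for name, pair in scores.items()}
for name, (delta, relative) in results.items():
    print(f"{name}: +{delta} points, {relative}% relative")
```

The relative view is why the SWE-Bench Pro result stands out: 16 points on a 35.2 baseline is roughly a 45% improvement, while 5 points on SWE-Bench Verified is closer to 7.5%.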

The billing angle: why "60% fewer tokens" matters now

Before June 1, 2026, the model you picked in Copilot barely affected your bill — you paid a flat Premium Request multiplier. After the usage-based billing switch, you burn AI Credits against each model's per-token list price, and the model choice is now a line item.

Verified Copilot list prices, per 1M tokens:

| Model | Input | Cached input | Output |
| --- | --- | --- | --- |
| MAI-Code-1-Flash | $0.75 | $0.075 | $4.50 |
| Claude Haiku 4.5 | $1.00 | not stated | $5.00 |

MAI-Code-1-Flash is 25% cheaper on input and 10% cheaper on output before you count token efficiency. Stack the "up to 60% fewer tokens" claim on top and the gap compounds.
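Here is a quick sketch of how the two effects multiply. It uses the list prices above and treats "up to 60% fewer tokens" as a best case of 40% of the tokens, applied equally to input and output. That uniform split is an assumption; Microsoft's claim is tied to SWE-Bench Verified, not to your workload.

```python
# List prices per 1M tokens, as quoted in the article.
MAI = {"input": 0.75, "output": 4.50}
HAIKU = {"input": 1.00, "output": 5.00}

def percent_cheaper(new_price, old_price, token_ratio=1.0):
    """Percent saved when paying new_price for token_ratio times as many tokens."""
    return round(100 * (1 - (new_price * token_ratio) / old_price), 1)

# Price difference alone, same token count.
price_only = {kind: percent_cheaper(MAI[kind], HAIKU[kind]) for kind in MAI}

# Best case from the "up to 60% fewer tokens" claim: 40% of the tokens.
# Applying it uniformly to input and output is an assumption.
best_case = {kind: percent_cheaper(MAI[kind], HAIKU[kind], token_ratio=0.4) for kind in MAI}

print("price only:", price_only)
print("price + best-case token savings:", best_case)
```

Under those assumptions the savings go from 25% and 10% to about 70% on input and 64% on output, which is what "compounds" means here.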

A worked example. Take a typical agentic edit session — roughly 50K input tokens (the model reads your files) and 8K output tokens (it writes the diff):

  • Claude Haiku 4.5: 50K × $1.00/M + 8K × $5.00/M = $0.050 + $0.040 = $0.090/session
  • MAI-Code-1-Flash, same token count: 50K × $0.75/M + 8K × $4.50/M = $0.0375 + $0.036 = $0.0735 ≈ $0.074/session
  • MAI-Code-1-Flash, if it hits the 60%-fewer-tokens claim: ~20K × $0.75/M + ~3.2K × $4.50/M ≈ $0.029/session

At a few hundred sessions a month, the gap between $0.09 and $0.03 per session can be the difference between staying inside your credit allotment and getting an overage bill. The 60% figure is benchmark-derived, not a guarantee for your repo: treat $0.074 as the conservative per-session cost you can count on, and anything below it as upside.
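If you want to rerun the worked example with your own session sizes, a few lines are enough. The sketch below uses the article's list prices and token counts; the 300 sessions per month is a hypothetical stand-in for "a few hundred", and `session_cost` is an illustrative helper, not anything Copilot exposes.

```python
PRICES = {  # USD per 1M tokens (input, output), from the article's list prices
    "MAI-Code-1-Flash": (0.75, 4.50),
    "Claude Haiku 4.5": (1.00, 5.00),
}

def session_cost(model, input_tokens, output_tokens):
    """Cost in USD of one session at per-token list prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

haiku = session_cost("Claude Haiku 4.5", 50_000, 8_000)
mai_same = session_cost("MAI-Code-1-Flash", 50_000, 8_000)
# Best case: 60% fewer tokens on both input and output (an assumption).
mai_best = session_cost("MAI-Code-1-Flash", 20_000, 3_200)

SESSIONS_PER_MONTH = 300  # hypothetical volume

for label, cost in [
    ("Claude Haiku 4.5", haiku),
    ("MAI-Code-1-Flash, same tokens", mai_same),
    ("MAI-Code-1-Flash, 60% fewer tokens", mai_best),
]:
    print(f"{label}: ${cost:.4f}/session, ${cost * SESSIONS_PER_MONTH:.2f}/month")
```

Cached-input pricing is left out on purpose: the article only gives a cached rate for one of the two models, so including it would skew the comparison.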

On the legacy request-based billing that annual Pro and Pro+ plans can still be on, MAI-Code-1-Flash carries a 0.33 model multiplier — a promotional rate at the time of writing. Multipliers are a legacy-billing concept and do not apply under the new usage-based system; if you are on usage billing, the per-token prices above are what count.
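For anyone still on the legacy plan, the multiplier arithmetic looks like this. The sketch assumes a multiplier works the usual way, with each request counting as that fraction of a premium request. The 300-request monthly allowance is a hypothetical number, so substitute whatever your plan actually includes.

```python
def premium_requests_used(requests, multiplier):
    """Premium requests consumed on legacy request-based billing."""
    return requests * multiplier

def requests_covered(allowance, multiplier):
    """How many model requests a premium-request allowance covers."""
    return int(allowance / multiplier)

MAI_MULTIPLIER = 0.33     # promotional rate quoted in the article
MONTHLY_ALLOWANCE = 300   # hypothetical allowance; check your own plan

used = premium_requests_used(100, MAI_MULTIPLIER)
covered = requests_covered(MONTHLY_ALLOWANCE, MAI_MULTIPLIER)

print(f"100 requests consume about {used:.0f} premium requests")
print(f"a {MONTHLY_ALLOWANCE}-request allowance covers about {covered} requests")
```

Because the 0.33 rate is promotional, rerun the numbers if the multiplier changes.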

Availability: who can actually pick it today

This rolled out in stages, and the staging matters because "GA" did not mean "everyone, everywhere" at any single point:

  • May 29, 2026 — initial rollout to Copilot Free, Pro, Pro+, and Max, VS Code first.
  • June 18, 2026 — expanded to more surfaces: Copilot CLI, the Copilot cloud agent, the GitHub Copilot app, Copilot Chat on GitHub, Visual Studio, GitHub Mobile, JetBrains IDEs, Eclipse, and Xcode.
  • June 26, 2026 — general availability for Copilot Business and Copilot Enterprise.

The Business/Enterprise GA has a gotcha that tripped up a lot of teams: an admin has to flip the policy on first.

The problem you'll actually hit: "it's not in my model picker"

The single most common complaint in the GitHub community thread comes from developers on Business or Enterprise plans who read the GA announcement, open VS Code, and find no MAI-Code-1-Flash in the dropdown. Three things cause it, in order of likelihood:

  1. Admin policy not enabled. For Copilot Business and Enterprise, an administrator must enable the MAI-Code-1-Flash policy in Copilot settings before any seat can select it. This is the big one — GA for the plan does not auto-enable the model for the org.
  2. Gradual rollout. Microsoft confirmed it is still rolling the model out gradually even within enabled plans. If the policy is on and you still don't see it, you may simply be in a later wave.
  3. Stale client. Older VS Code or extension builds won't surface it. Update both.

The fix sequence:


```bash
# 1. Confirm your Copilot extension and
```
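The three causes listed earlier can also be written down as a small triage helper. This is only a sketch of the decision order: it does not query GitHub or VS Code, and every input is something you check by hand. Gradual rollout is the fallback because it is the one cause you cannot verify directly.

```python
def diagnose_missing_model(plan, policy_enabled, client_up_to_date):
    """Return the likeliest reason MAI-Code-1-Flash is absent from the picker.

    plan: "free", "pro", "pro+", "max", "business", or "enterprise".
    policy_enabled: whether an admin has turned on the model policy.
    client_up_to_date: whether VS Code and the Copilot extension are current.
    All three are supplied by you; nothing here talks to GitHub.
    """
    managed_plan = plan.lower() in {"business", "enterprise"}
    if managed_plan and not policy_enabled:
        return "Ask an admin to enable the MAI-Code-1-Flash policy in Copilot settings."
    if not client_up_to_date:
        return "Update VS Code and the Copilot extension."
    return "Likely a later rollout wave; wait and check again."

print(diagnose_missing_model("Business", policy_enabled=False, client_up_to_date=True))
```

The admin policy check only applies to Business and Enterprise seats, which is why individual plans skip straight to the client check.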
