Maksim Danilchenko

Posted on Jun 10 • Originally published at danilchenko.dev

Gemini 3.5 Flash vs Claude Haiku 4.5 vs MAI-Code-1-Flash for Coding

#gemini35flash #claudehaiku #maicode1flash #aicoding

TL;DR

Three flash-tier coding models are competing for your API budget right now: Google's Gemini 3.5 Flash (May 19, 2026), Anthropic's Claude Haiku 4.5 (the reigning budget pick since October 2025), and Microsoft's MAI-Code-1-Flash (June 2, 2026). Haiku has the lower output price of the two API-accessible models at $5/M tokens, plus the most reliable structured output. Gemini 3.5 Flash leads on agentic benchmarks (76.2% Terminal-Bench 2.1), posts the top SWE-Bench Pro score in this comparison (55.1%), and offers a 1M-token context window. MAI-Code-1-Flash beats Haiku on SWE-Bench Pro by 16 points (51.2% vs 35.2%), but you can only use it inside GitHub Copilot. Pick based on where you actually build: Copilot users get MAI-Code-1-Flash for free, API builders choose between Haiku's cost and Flash's context, and anyone running agent loops with tool calls should benchmark Flash first.

Three Flash Models, Three Different Bets

I've spent the last three weeks routing coding tasks through all three of these models: code reviews in Copilot, agent loops via API, and batch refactors across a 40-file Python project. The experience taught me something the benchmark tables don't show: each model was built with a different definition of "coding" in mind.

Google optimized Gemini 3.5 Flash for agents that run in terminals, call tools, and iterate. Anthropic built Haiku 4.5 for developers who need a cheap, fast model that follows instructions precisely and returns clean JSON. Microsoft trained MAI-Code-1-Flash end-to-end inside the GitHub Copilot harness, so it knows how VS Code works, what diffs look like, and how to stay concise in inline completions.

Each model answers a different version of the same question: "What should a small coding model be good at?"

Benchmark Comparison

Benchmarks don't capture everything, but they're measurable. Start here.

Benchmark	Gemini 3.5 Flash	Claude Haiku 4.5	MAI-Code-1-Flash
SWE-Bench Verified	—	73.3% (Anthropic) / 66.6% (Microsoft's eval)	71.6%
SWE-Bench Pro	55.1%	35.2%	51.2%
Terminal-Bench 2.1	76.2%	41.6%*	54.8%*
MCP Atlas (tool use)	83.6%	—	—
IF-Bench (instruction following)	—	—	+28.9 pts over Haiku

*Terminal-Bench numbers for Haiku and MAI-Code-1-Flash are from Microsoft's evaluation (Terminal-Bench 2, not 2.1). Direct comparison to Flash's 76.2% on v2.1 should be taken with a grain of salt.

A few things jump out from this table.

The SWE-Bench Verified discrepancy for Haiku is real and worth flagging. Anthropic reports 73.3%, Microsoft reports 66.6% when benchmarking against MAI-Code-1-Flash. The difference probably comes down to evaluation setup: system prompts, tool availability, and retry policies all shift SWE-Bench scores. I wouldn't treat either number as gospel. The relative ranking across SWE-Bench Pro, where the gap is enormous (51.2% vs 35.2%), is more informative.
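The gaps in this paragraph are plain subtractions over the table, so they are easy to recheck. The sketch below only copies figures from the benchmark table above; the variable and function names are illustrative, not from any vendor tooling.

```python
# Scores copied from the benchmark table above (percent).
SWE_BENCH_PRO = {
    "Gemini 3.5 Flash": 55.1,
    "Claude Haiku 4.5": 35.2,
    "MAI-Code-1-Flash": 51.2,
}
# Haiku's SWE-Bench Verified score as reported by two different evaluators.
HAIKU_VERIFIED = {"Anthropic": 73.3, "Microsoft": 66.6}


def point_gap(a: float, b: float) -> float:
    """Difference in percentage points, rounded to one decimal."""
    return round(a - b, 1)


reporting_spread = point_gap(HAIKU_VERIFIED["Anthropic"], HAIKU_VERIFIED["Microsoft"])
mai_vs_haiku_pro = point_gap(
    SWE_BENCH_PRO["MAI-Code-1-Flash"], SWE_BENCH_PRO["Claude Haiku 4.5"]
)
# Models ordered from highest to lowest SWE-Bench Pro score.
ranking = sorted(SWE_BENCH_PRO, key=SWE_BENCH_PRO.get, reverse=True)

print(f"Haiku Verified spread between evaluators: {reporting_spread} pts")
print(f"MAI-Code-1-Flash lead over Haiku on Pro: {mai_vs_haiku_pro} pts")
print("SWE-Bench Pro ranking:", " > ".join(ranking))
```

The evaluator spread on Haiku's Verified score is less than half the size of the SWE-Bench Pro gap, which is why the Pro comparison is the sturdier signal.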

Gemini 3.5 Flash dominates the agentic benchmarks. Terminal-Bench 2.1 simulates a real engineer working in a sandboxed terminal with a 5-hour timeout: planning, iterating, and coordinating across tools. Flash's 76.2% puts it above Gemini 3.1 Pro and close to GPT-5.5 territory. If your coding model runs inside an agent loop with tool calls, this is the number most likely to predict real-world behavior.

MAI-Code-1-Flash's instruction following is the other number worth reading. The +28.9 point lead over Haiku on IF-Bench shows Microsoft's harness-native training paid off. The model knows how to handle structured requests ("edit only lines 14-22", "don't touch the imports", "return a unified diff") because it learned from Copilot's actual production request patterns.

Pricing: What You'll Actually Pay

Flash models live or die on cost. If price didn't matter, you'd use Claude Opus 4.7 or GPT-5.5. Per-million-token pricing:

	Gemini 3.5 Flash	Claude Haiku 4.5	MAI-Code-1-Flash
Input (per 1M tokens)	$1.50	$1.00	$0.75
Output (per 1M tokens)	$9.00	$5.00	$4.50
Cached input (per 1M)	$0.15	$0.10	$0.075
Context window	1,000,000	200,000	Not disclosed
Output limit	65,536	64,000	Not disclosed
Availability	API, Google AI Studio	API, Anthropic Console	GitHub Copilot only

The output price gap is the one that bites you. Code generation is output-heavy. A typical agent loop generating a 200-line file produces 8-12K output tokens per turn. At those volumes:

Haiku: $0.04-0.06 per turn
MAI-Code-1-Flash: $0.036-0.054 per turn
Gemini 3.5 Flash: $0.072-0.108 per turn

Across a full day of heavy coding (say 200 agent turns at about 10K output tokens each), that's roughly $10 for Haiku, $9 for MAI-Code-1-Flash, and $18 for Gemini Flash. The gap compounds fast.
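The per-turn and per-day figures follow directly from the output prices in the table. Here is a minimal sketch of that arithmetic, assuming 10K output tokens per turn (the midpoint of the 8-12K range) and ignoring input cost entirely; the function names are made up for illustration.

```python
# Output prices in USD per million tokens, copied from the pricing table above.
OUTPUT_PRICE_PER_M = {
    "Claude Haiku 4.5": 5.00,
    "MAI-Code-1-Flash": 4.50,
    "Gemini 3.5 Flash": 9.00,
}


def output_cost(model: str, output_tokens: int) -> float:
    """USD cost of generating `output_tokens` tokens with `model`."""
    return OUTPUT_PRICE_PER_M[model] * output_tokens / 1_000_000


def daily_cost(model: str, turns: int = 200, tokens_per_turn: int = 10_000) -> float:
    """USD output cost for a day of agent turns (input cost ignored)."""
    return turns * output_cost(model, tokens_per_turn)


for model in OUTPUT_PRICE_PER_M:
    low, high = output_cost(model, 8_000), output_cost(model, 12_000)
    print(f"{model}: ${low:.3f}-${high:.3f} per turn, ${daily_cost(model):.2f} per day")
```

Swap in your own turn count and tokens per turn; the ordering between models stays the same because only the per-token price differs.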

But MAI-Code-1-Flash has a catch: those prices are from GitHub's model picker listing. You can't hit the model through a standalone API endpoint. It only runs inside Copilot. If you're building your own agent framework, your choices are Haiku or Flash.

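And Flash has its own cost lever: cached input at $0.15/M. If your agent loop sends the same system prompt and codebase context on every turn (most do), you're paying 90% less for that repeated input after the first call. In long-running, input-heavy agent sessions, that discount can offset much of Flash's higher output price. Haiku's cached rate in the table ($0.10/M) is the same 90% discount, though, so caching lowers Flash's absolute bill more than it closes the gap with Haiku.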
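To see how much the cached rate matters, it helps to price the input side of a whole session. The sketch below makes a simplifying assumption that is mine, not either provider's billing rule: the first turn pays the full input rate, every later turn pays the cached rate, and there are no cache-write or storage fees. Prices come from the table above; `session_input_cost` is a hypothetical helper.

```python
# Input prices in USD per million tokens, copied from the pricing table above.
INPUT_PRICE_PER_M = {"Gemini 3.5 Flash": 1.50, "Claude Haiku 4.5": 1.00}
CACHED_PRICE_PER_M = {"Gemini 3.5 Flash": 0.15, "Claude Haiku 4.5": 0.10}


def session_input_cost(model: str, context_tokens: int, turns: int,
                       cached: bool = True) -> float:
    """USD input cost when the same context is resent on every turn.

    Assumption: turn 1 pays the full rate and each later turn pays the
    cached rate, with no cache-write or storage fee.
    """
    full = INPUT_PRICE_PER_M[model] * context_tokens / 1_000_000
    if not cached or turns <= 1:
        return full * turns
    repeat = CACHED_PRICE_PER_M[model] * context_tokens / 1_000_000
    return full + repeat * (turns - 1)


# Example: 100K tokens of shared context resent across 50 turns.
for model in INPUT_PRICE_PER_M:
    without = session_input_cost(model, 100_000, 50, cached=False)
    with_cache = session_input_cost(model, 100_000, 50)
    print(f"{model}: ${without:.2f} uncached vs ${with_cache:.3f} cached")
```

Under these assumptions the input bill for a 50-turn session drops by close to 90% on either model, so check each provider's actual cache pricing and expiry rules before relying on the number.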

Token Efficiency: MAI-Code-1-Flash's 60% Claim

Microsoft claims MAI-Code-1-Flash "solves harder problems with up to 60% fewer tokens" on SWE-Bench Verified. That's a big number. If it holds, the model both costs less per token and uses fewer tokens to reach the same solution.

I tested this informally on my own codebase. I asked all three models to add input validation to a FastAPI endpoint. Same prompt, same context, same expected output.

# Prompt: Add Pydantic validation to this endpoint.
# Validate: name (str, 2-50 chars), email (valid format), age (18-120)

@app.post("/users")
async def create_user(request: Request):
    data = await request.json()
    # ... existing logic

The results:

Haiku 4.5: 847 output tokens. Clean solution, used EmailStr from Pydantic, added a proper error handler. Correct on first try.
Gemini 3.5 Flash: 1,241 output tokens. Added validation plus a lengthy explanation of each field constraint, a usage example, and a curl command. The code was correct but I didn't ask for the tutorial.
MAI-Code-1-Flash (via Copilot): 512 output tokens. Returned only the modified function with a minimal Pydantic model. No explanation, no example. Correct and concise.

This single test isn't a benchmark. But it matches the pattern Microsoft describes: MAI-Code-1-Flash learned from Copilot interactions where conciseness is the default. It doesn't explain unless you ask.

Flash's verbosity isn't always a downside. If you're prototyping and want the model to think aloud, that extra context helps. But for batch operations and agent loops where you're parsing structured output, fewer tokens means faster iteration and lower cost.
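To put the three token counts from my test in relative terms, a quick calculation helps. This is a sketch over a single informal run, so the percentages describe that one prompt and nothing more; `percent_fewer` is an illustrative helper name.

```python
# Output token counts from the informal FastAPI test above.
OBSERVED_TOKENS = {
    "Claude Haiku 4.5": 847,
    "Gemini 3.5 Flash": 1241,
    "MAI-Code-1-Flash": 512,
}


def percent_fewer(baseline: int, candidate: int) -> float:
    """Percent fewer tokens `candidate` used than `baseline`, to one decimal."""
    return round(100 * (baseline - candidate) / baseline, 1)


mai = OBSERVED_TOKENS["MAI-Code-1-Flash"]
vs_haiku = percent_fewer(OBSERVED_TOKENS["Claude Haiku 4.5"], mai)
vs_flash = percent_fewer(OBSERVED_TOKENS["Gemini 3.5 Flash"], mai)

print(f"MAI-Code-1-Flash used {vs_haiku}% fewer tokens than Haiku")
print(f"MAI-Code-1-Flash used {vs_flash}% fewer tokens than Gemini Flash")
```

On this one prompt the saving is about 40% against Haiku and just under 60% against Flash, which is consistent with an "up to 60%" claim without confirming it.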

Context Windows: The 1M Advantage

This is where Gemini 3.5 Flash separates itself from the other two.

Model	Context Window	Output Limit
Gemini 3.5 Flash	1,000,000 tokens	65,536 tokens
Claude Haiku 4.5	200,000 tokens	64,000 tokens
MAI-Code-1-Flash	Not disclosed	Not disclosed

A million-token context window means you can feed Flash an entire mid-sized codebase (50-80 files of typical Python or TypeScript) in a single prompt. Haiku's 200K is generous by historical standards but won't hold the same volume. If you're doing codebase-wide analysis, architecture reviews, or cross-file refactors, Flash is the only flash-tier option that won't force you to chunk.

Both Haiku and Flash now support large output limits (64K and 65K respectively), so you won't hit output ceilings on most tasks. I've pushed both models through full-file rewrites of 300-line modules without truncation. The context input limit is the real differentiator: Flash's 1M lets you include far more codebase context per request.
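Before deciding whether you need to chunk, you can estimate how many tokens a codebase will take. The sketch below uses a rough 4-characters-per-token heuristic, which is an assumption and not either provider's tokenizer, and it assumes the output reserve counts against the same window. The function names and the default file suffixes are illustrative.

```python
from pathlib import Path

# Context windows from the table above, in tokens.
CONTEXT_WINDOW = {"Gemini 3.5 Flash": 1_000_000, "Claude Haiku 4.5": 200_000}
CHARS_PER_TOKEN = 4  # rough heuristic, not a real tokenizer


def estimate_tokens(root: str, suffixes=(".py", ".ts")) -> int:
    """Approximate token count of all matching source files under `root`."""
    total_chars = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            total_chars += len(path.read_text(encoding="utf-8", errors="ignore"))
    return total_chars // CHARS_PER_TOKEN


def fits(model: str, prompt_tokens: int, reserve_for_output: int = 8_000) -> bool:
    """True if the prompt plus an output reserve fits the model's window."""
    return prompt_tokens + reserve_for_output <= CONTEXT_WINDOW[model]
```

A typical use is `fits("Claude Haiku 4.5", estimate_tokens("src"))`. For a hard limit, count tokens with the provider's own tooling, since the heuristic can be off by a wide margin on dense code.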

For Copilot workflows where MAI-Code-1-Flash operates, the context window is less of an issue. Copilot manages the context for you, feeding relevant files and recent edits. You don't directly control the prompt size.

Where Each Model Wins

After three weeks of testing, I'd route tasks like this:

Gemini 3.5 Flash — agent loops and long-context analysis

Flash is the model I'd pick for any workflow that involves iterating with tools. Write a failing test, run it, read the error, fix the code, run again. Flash handles that loop better than the other two. Its Terminal-Bench scores reflect a model that was built for multi-turn tool coordination, not just static code generation. The 1M context window makes it the default choice for "analyze this whole codebase" tasks.

Claude Haiku 4.5 — structured output, code review, and high-volume batch work

Haiku returns the cleanest structured output of the three. If you're calling the model 10,000 times a day for code review comments, PR summaries, or JSON-formatted analysis, Haiku's combination of reliable instruction following and the cheapest output tokens you can buy through an API makes it the rational choice. It's also the model I trust most for diff generation and structured editing tasks.

MAI-Code-1-Flash — inline completions and Copilot-native workflows

If you live in VS Code and use GitHub Copilot, MAI-Code-1-Flash is the model that feels most native. It knows the environment: it judges when to suggest a single line versus a full function, handles diffs cleanly, and stays concise. Microsoft's "up to 60% fewer tokens" claim roughly matched what I saw on the kind of tasks Copilot handles: inline edits, small refactors, and completion suggestions.
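The routing above can be written down as a small lookup table. This is a sketch of my own preferences, not a recommendation from any vendor: the task labels are hypothetical, and the fallback to Haiku when you are outside Copilot is an assumption based on MAI-Code-1-Flash having no standalone API.

```python
from enum import Enum


class Task(Enum):
    AGENT_LOOP = "agent_loop"
    WHOLE_CODEBASE_ANALYSIS = "whole_codebase_analysis"
    STRUCTURED_OUTPUT = "structured_output"
    BATCH_REVIEW = "batch_review"
    INLINE_COMPLETION = "inline_completion"


# Task-to-model routing that mirrors the three headings above.
ROUTES = {
    Task.AGENT_LOOP: "Gemini 3.5 Flash",
    Task.WHOLE_CODEBASE_ANALYSIS: "Gemini 3.5 Flash",
    Task.STRUCTURED_OUTPUT: "Claude Haiku 4.5",
    Task.BATCH_REVIEW: "Claude Haiku 4.5",
    Task.INLINE_COMPLETION: "MAI-Code-1-Flash",
}


def pick_model(task: Task, inside_copilot: bool) -> str:
    """Return the preferred model for `task`, respecting Copilot lock-in."""
    model = ROUTES[task]
    # MAI-Code-1-Flash has no standalone API, so fall back outside Copilot.
    if model == "MAI-Code-1-Flash" and not inside_copilot:
        return "Claude Haiku 4.5"
    return model


print(pick_model(Task.AGENT_LOOP, inside_copilot=False))
print(pick_model(Task.INLINE_COMPLETION, inside_copilot=True))
```

Keeping the routes in one dictionary makes it cheap to rerun your own benchmarks and change a single entry when the results shift.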

Availability and Integration

This is the practical differentiator most comparisons skip. It doesn't matter how good a model is if you can't access it from your stack.

	Gemini 3.5 Flash	Claude Haiku 4.5	MAI-Code-1-Flash
Standalone API	Yes (Gemini API)	Yes (Anthropic API)	No
Google AI Studio	Yes	No	No
AWS Bedrock	No	Yes	No
GitHub Copilot	No	No	Yes
VS Code (direct)	Via extension/API	Via extension/API	Built-in via Copilot
OpenRouter	Yes	Yes	No
Self-hosting	No	No	No

MAI-Code-1-Flash's Copilot lock-in is the biggest caveat in this comparison. Microsoft has signaled plans for Azure Foundry and third-party provider access, but as of early June 2026 the model is still rolling out primarily through Copilot. If you're building custom agents, pipelines, or CI/CD integrations, MAI-Code-1-Flash isn't an option today.

For API access, both Flash and Haiku work through OpenRouter too, so you can swap between them without changing your client code. If you're also evaluating open-source alternatives, DeepSeek V4 Pro punches above its weight at a fraction of the cost.
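Swapping models through OpenRouter comes down to changing one string in an OpenAI-style chat-completions request. The sketch below builds that request with the standard library and does not send it. The two model slugs are placeholders I have not verified, so look up the exact identifiers in OpenRouter's model list; `build_request` is a hypothetical helper.

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

# Placeholder slugs (assumptions): confirm the real IDs in OpenRouter's model list.
MODEL_SLUGS = {
    "flash": "google/gemini-3.5-flash",
    "haiku": "anthropic/claude-haiku-4.5",
}


def build_request(model_key: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request for OpenRouter."""
    body = {
        "model": MODEL_SLUGS[model_key],
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Only the model key changes between providers; the client code stays the same.
req = build_request("haiku", "Summarize this diff.", api_key="YOUR_KEY")
print(req.get_method(), req.full_url)
```

To actually send it, pass `req` to `urllib.request.urlopen` and parse the JSON response.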

API Quick Start

Calling each model from Python:

Gemini 3.5 Flash:

from google import genai

client = genai.Client(api_key="YOUR_KEY")

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Refactor this function to use list comprehension instead of a for loop:\n\ndef filter_active(users):\n    result = []\n    for u in users:\n        if u.is_active:\n            result.append(u.name)\n    return result"
)
print(response.text)

Output:

def filter_active(users):
    return [u.name for u in users if u.is_active]

Claude Haiku 4.5:

import anthropic

client = anthropic.Anthropic(api_key="YOUR_KEY")

message = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Refactor this function to use list comprehension instead of a for loop:\n\ndef filter_active(users):\n    result = []\n    for u in users:\n        if u.is_active:\n            result.append(u.name)\n    return result"
    }]
)
print(message.content[0].text)

Output:

def filter_active(users):
    return [u.name for u in users if u.is_active]

MAI-Code-1-Flash (via Copilot — no standalone API):

Open the file in VS Code with GitHub Copilot enabled, select the function, and run Copilot Chat with the prompt. MAI-Code-1-Flash activates through the model picker when available, or via the "auto" selector that routes to it for coding tasks. If you're working with Gemini's broader toolchain instead, the Gemini CLI tutorial covers the terminal setup.

FAQ

Which is better for coding, Gemini 3.5 Flash or Claude Haiku 4.5?

It depends on the task shape. Gemini 3.5 Flash outperforms on agentic coding, meaning multi-step workflows with tool calls and terminal interaction (76.2% Terminal-Bench 2.1). Claude Haiku 4.5 reports 73.3% on SWE-Bench Verified (Anthropic's figure; no Flash score is listed for that benchmark) and costs 44% less on output tokens. For high-volume batch code tasks, Haiku's price wins. For agent loops, Flash's quality wins.

How much does Gemini 3.5 Flash cost compared to Claude Haiku 4.5?

Gemini 3.5 Flash charges $1.50/$9.00 per million input/output tokens. Claude Haiku 4.5 charges $1.00/$5.00. Flash is 50% more expensive on input and 80% more on output. But Flash's cached input rate ($0.15/M) can offset the difference in long agent sessions where you repeat the same context.

Is MAI-Code-1-Flash better than Claude Haiku 4.5?

On Microsoft's own benchmarks, yes — particularly SWE-Bench Pro (51.2% vs 35.2%) and instruction following (+28.9 points). But there's a benchmark discrepancy: Microsoft reports Haiku's SWE-Bench Verified score as 66.6%, while Anthropic reports 73.3%. And MAI-Code-1-Flash is only available inside GitHub Copilot, not via API. If you need a standalone API model, Haiku is your pick regardless of benchmark numbers.

Which flash model is cheapest for coding?

MAI-Code-1-Flash has the lowest per-token cost ($0.75/$4.50 per million) and, by Microsoft's claim, uses up to 60% fewer tokens per task. But it's locked to GitHub Copilot. For API users, Claude Haiku 4.5 at $1.00/$5.00 is the cheapest option. Gemini 3.5 Flash is the most expensive at $1.50/$9.00, though its prompt caching drops repeated-context costs to $0.15/M input.

Can I use MAI-Code-1-Flash outside of GitHub Copilot?

Not currently. As of early June 2026, MAI-Code-1-Flash is rolling out primarily through GitHub Copilot's model picker. Microsoft has signaled plans for Azure Foundry and third-party provider access, but there is no standalone API endpoint yet. If you need API access for custom agents or CI/CD, you're limited to Gemini 3.5 Flash and Claude Haiku 4.5.

Sources

Gemini 3.5 Flash Model Card — Google DeepMind — official specs, benchmark numbers, and pricing
Introducing MAI-Code-1-Flash — Microsoft AI — official announcement with SWE-Bench and IF-Bench comparisons
Introducing Claude Haiku 4.5 — Anthropic — official announcement with SWE-Bench Verified score
Microsoft AI on X: benchmark comparison tweet — the SWE-Bench Verified 71.6 vs 66.6 numbers
Gemini 3.5 Flash vs Claude Haiku 4.5: Pricing & Production Fit — Evolink — independent pricing and performance comparison

Bottom Line

The flash-tier coding model race in mid-2026 isn't about finding one winner. It's about matching models to workflows.

If you build custom agents and need a model that handles tool calls and long context, Gemini 3.5 Flash is the leader. If you need the cheapest reliable model for structured output at scale, Claude Haiku 4.5 is the safe bet. And if you code in VS Code with Copilot all day, MAI-Code-1-Flash is quietly the best inline coding model available — you just can't take it anywhere else.

The lock-in question matters more than the benchmarks. Google and Anthropic sell tokens; Microsoft sells a workflow. Right now, MAI-Code-1-Flash's Copilot exclusivity makes it a non-starter for anyone building outside that stack. If Microsoft opens API access — and the GitHub Copilot AI credits system suggests they're heading that direction — the pricing math changes for everyone.
