Gemini 3.5 Flash beat 3.1 Pro on coding and agents

#ai #llm #gemini #news

Gemini 3.5 Flash scored 76.2% on Terminal-Bench 2.1. Gemini 3.1 Pro — the tier above it in Google's own lineup — scored 70.3%.

Google shipped 3.5 Flash at I/O 2026 on May 19. It costs $2.50 per million input tokens and $15 per million output tokens, which is 40% cheaper than 3.1 Pro on both sides. Google also reports that it generates output tokens at roughly four times the rate of comparable frontier models. The headline most people will see is "Flash beats Pro." The more useful thing to know is where it beats Pro, and where it doesn't.
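To make the pricing concrete, here is a minimal sketch of the per-request arithmetic. The two prices are the ones quoted above; the token counts for the example turn are made up for illustration, and the Pro figure is only what the "40% cheaper" claim implies, not a quoted list price.

```python
# Prices quoted in the article for Gemini 3.5 Flash, in USD per million tokens.
FLASH_INPUT_PER_M = 2.50
FLASH_OUTPUT_PER_M = 15.00


def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float = FLASH_INPUT_PER_M,
                 output_per_m: float = FLASH_OUTPUT_PER_M) -> float:
    """Return the USD cost of one request at the given per-million-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


# Hypothetical agent turn: 20,000 tokens of context in, 2,000 tokens out.
flash_cost = request_cost(20_000, 2_000)
print(f"Flash: ${flash_cost:.4f} per turn")  # Flash: $0.0800 per turn

# Assumption: "40% cheaper" means Flash costs 0.6x the Pro price on both sides,
# so the implied Pro cost for the same turn is the Flash cost divided by 0.6.
implied_pro_cost = flash_cost / (1 - 0.40)
print(f"Pro (implied): ${implied_pro_cost:.4f} per turn")
```

Output tokens cost six times as much as input tokens at these prices, so turns with long generations are more expensive than the input size alone suggests.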

A quick orientation for anyone new to Google's lineup: Flash is the speed/cost tier, Pro sits above it, Ultra sits above Pro. Until last week the rule was simple — pick Flash for cheap, fast, "good enough" work, then escalate to Pro when the task needed real intelligence. The 3.5 release blurs that rule for one specific kind of work.

Where the Flash tier now leads

The wins are concentrated in coding and agentic work — the tasks that look most like an LLM being plugged into a tool loop.

Terminal-Bench 2.1 is a benchmark for agents driving a terminal: opening files, running shell commands, debugging real codebases. Flash scores 76.2%, Pro 70.3%. That is a 5.9-point lead on the benchmark that comes closest to asking "is this thing useful inside Cursor or Aider?"

MCP Atlas measures tool-calling correctness against MCP servers — whether the model picks the right tool, fills the right arguments, and recovers from errors. Flash scores 83.6%, Pro 78.2%. On this one Flash also leads every other model Google reports against, including Claude Opus 4.7 and GPT-5.5.

Finance Agent v2 is a long-horizon agent eval where the model has to research a financial question end-to-end across many calls. Flash scores 57.9%, Pro 43.0%. The 14.9-point gap is the largest in the suite. The benchmark rewards staying coherent across many tool calls, and losing that coherence is exactly the failure mode that bites agent stacks in production.

GDPval-AA, Artificial Analysis's Elo-scored evaluation built on OpenAI's GDPval set of real-world knowledge-work tasks, has Flash at 1656 and Pro at 1314. Flash also tops Google's own table for Toolathlon (56.5%), CharXiv Reasoning (84.2%), and MMMU-Pro (83.6%). On the OSWorld desktop-agent benchmark, Flash sits at 78.4%, within half a point of GPT-5.5 at 78.7% and Claude Opus 4.7 at 78.0%.

The pattern is consistent. When the task involves picking tools, calling them, reading the output, and trying again, the new Flash model is ahead of the older Pro model and competitive with the current frontier from OpenAI and Anthropic.

The two places Pro still holds

Two benchmarks went the other way. Both are intelligence-ceiling tests, not agent tests.

Humanity's Last Exam is a curated set of expert-level questions designed to resist the patterns LLMs learn. Pro scores 44.4%, Flash 40.2% — a 4.2-point gap in Pro's favor.

ARC-AGI-2 is the abstract-reasoning benchmark where models scored in the single digits when it launched. Pro scores 77.1%, Flash 72.1%. That is a 5.0-point gap, again in Pro's favor.

These aren't agent benchmarks. They're "can the model think hard about a novel problem with no tools" benchmarks. On those, Flash's speed and price advantage still comes with a penalty: the bigger Pro model retains a measurable edge.
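The head-to-head numbers quoted so far are easier to compare side by side. The sketch below only restates the percentage scores given in this article and computes Flash minus Pro for each. GDPval-AA is left out because it is reported in Elo rather than percent.

```python
# (benchmark, Flash score, Pro score) in percent, as quoted in the article.
SCORES = [
    ("Terminal-Bench 2.1", 76.2, 70.3),
    ("MCP Atlas", 83.6, 78.2),
    ("Finance Agent v2", 57.9, 43.0),
    ("Humanity's Last Exam", 40.2, 44.4),
    ("ARC-AGI-2", 72.1, 77.1),
]


def gaps(scores):
    """Return (benchmark, Flash minus Pro) pairs, rounded to one decimal."""
    return [(name, round(flash - pro, 1)) for name, flash, pro in scores]


# Print the gaps from Flash's biggest lead down to Pro's biggest lead.
for name, gap in sorted(gaps(SCORES), key=lambda g: g[1], reverse=True):
    leader = "Flash" if gap > 0 else "Pro"
    print(f"{name:<22} {gap:+5.1f}  ({leader} leads)")
```

The three agent benchmarks land on the positive side and the two tool-free reasoning benchmarks on the negative side, which is the split the rest of this piece relies on.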

That's the shape of the trade Google made. Flash got better at doing, not at thinking from scratch. If the workload is an agent picking tools and recovering from errors, Flash is now the right default. If the workload is one-shot reasoning on a novel problem with no tools, Pro is still preferred.

The decision rule that falls out of this is concrete. Any production stack routing coding-agent or tool-calling work through Gemini 3.1 Pro can probably switch to Gemini 3.5 Flash and get a 40% lower per-token price, generation that Google reports at roughly four times the rate of comparable frontier models, and a measurable benchmark improvement on the agentic side. The one-shot reasoning calls, the ones in Humanity's Last Exam-style territory, keep going to Pro. The 3.5 release doesn't retire the Pro tier; it raises the floor underneath it.
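As a rough illustration, the rule can be written as a tiny router. This is a sketch under assumptions, not a recommended implementation: the workload categories are a simplification of the split described above, and the model identifier strings are hypothetical placeholders rather than confirmed API names.

```python
from enum import Enum


class Workload(Enum):
    CODING_AGENT = "coding_agent"
    TOOL_CALLING = "tool_calling"
    ONE_SHOT_REASONING = "one_shot_reasoning"


# Hypothetical model identifiers; check the provider's documentation for real names.
FLASH_MODEL = "gemini-3.5-flash"
PRO_MODEL = "gemini-3.1-pro"


def pick_model(workload: Workload) -> str:
    """Send tool-free, one-shot reasoning to Pro and everything agentic to Flash."""
    if workload is Workload.ONE_SHOT_REASONING:
        return PRO_MODEL
    return FLASH_MODEL


print(pick_model(Workload.CODING_AGENT))        # gemini-3.5-flash
print(pick_model(Workload.ONE_SHOT_REASONING))  # gemini-3.1-pro
```

In a real stack the hard part is classifying the request, not the lookup. A reasonable starting point is to tag the workload at the call site, where the caller already knows whether tools are attached.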

Source: Gemini 3.5: frontier intelligence with action (Google blog, May 19, 2026). Benchmark numbers from the same launch post and the llm-stats roundup.