pueding

Posted on May 29 • Originally published at learnaivisually.com

Gemini 3.5 Flash: Agent-First Model Design

#ai #machinelearning #llm #agents

What: Gemini 3.5 Flash, announced by Google DeepMind on May 25, 2026, is positioned as an agent-first model — a Flash-class LLM whose product story is the Antigravity multi-step harness rather than chat.

Why: Every production agent lives inside a loop (call a tool, observe, decide, possibly fail, retry). A model trained only on chat treats that loop as foreign territory, while an agent-first model treats it as native habitat.

vs prior: The previous default was a chat-tuned base model wrapped in a tool-use harness (parsers, retry wrappers, validators); agent-first models bake those harness behaviours into training so the harness has less work to do.

Think of it as

A native speaker vs a tourist with a phrasebook.

                      THE TOOL-USE LOOP
                             │
               ┌─────────────┴─────────────┐
               │                           │
       ┌───────▼────────┐        ┌─────────▼────────┐
       │  Agent-first   │        │  Chat model +    │
       │ native speaker │        │  tool harness    │
       │                │        │ tourist+phrasebk │
       └───────┬────────┘        └─────────┬────────┘
               │                           │
      speaks tool-calls          looks each phrase up,
      natively; recovers         re-prompts the user
      from errors in-loop        when a tool errors
               │                           │
               ▼                           ▼
        ✓ ~5 turns total          ✗ ~8-11 turns total
          thin harness              heavy harness

agent-first model = a native speaker who learned the language of tool-calling from childhood
chat model with tools bolted on = a tourist working through a phrasebook one phrase at a time
tool call = a single phrase the model has to speak correctly
hallucinated tool call = a mistranslated phrase that does not exist in the language
error recovery = what the speaker does next when the listener says "I do not understand"

Quick glossary

Agent loop — The basic control flow of any agent: the model emits an action (typically a tool call), the harness executes it, the model reads the observation, and the cycle repeats until the model emits a final answer. See AI Agents → The Agent Loop.

Tool-use harness — The runtime wrapper around an LLM that turns model output into real side effects: a parser that extracts function calls, a dispatcher that runs them, retry / timeout logic, and error formatting back into the prompt. See harness anatomy.

Function calling — The protocol where an LLM emits a structured function name + JSON arguments instead of free-text. The model is given a tool schema in the prompt; the harness validates and executes. Schema design is where most failures originate.

Hallucinated tool call — The model emits a call to a function name that does not exist in the tool schema, or with the wrong argument shape. The harness must catch it and feed an error back. Frequency drops sharply when tool-call traces are part of post-training.

Compounding error rate — Per-turn accuracy compounds multiplicatively across the loop: a 95%-per-turn model has roughly a 60% chance (0.95¹⁰ ≈ 0.60) of running a 10-turn task without a single error. Agent-first training targets this compounding directly. See Evals → Compounding.

Antigravity harness — Google's agent runtime that Gemini 3.5 Flash is positioned around — collaborative subagents, multi-step workflow execution. Described publicly alongside the model launch. The harness is a product, not a model feature.

MCP Atlas — A benchmark that scores how well an LLM operates over a real MCP server stack — multi-tool, multi-turn, with realistic schemas. Used by Google to position Gemini 3.5 Flash against other frontier-class models.
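
The agent loop, harness, and hallucinated-tool-call entries above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's runtime: `TOOLS`, `Action`, `scripted_model`, and `run_agent` are hypothetical names, and the "model" is a scripted stand-in rather than a real LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical tool registry: tool name -> plain Python function.
TOOLS: Dict[str, Callable[..., str]] = {
    "get_user": lambda user_id: f"user:{user_id}",
}

@dataclass
class Action:
    kind: str                      # "tool_call" or "final"
    name: str = ""                 # tool name when kind == "tool_call"
    args: Optional[dict] = None    # tool arguments when kind == "tool_call"
    text: str = ""                 # final answer when kind == "final"

def scripted_model(history: List[str]) -> Action:
    """Stand-in for an LLM: one tool call, then a final answer."""
    if not any(h.startswith("observation:") for h in history):
        return Action(kind="tool_call", name="get_user", args={"user_id": 7})
    return Action(kind="final", text=f"done after {history[-1]}")

def run_agent(model, task: str, max_turns: int = 10) -> str:
    """Emit action -> execute -> observe, until a final answer or the turn budget."""
    history = [f"task: {task}"]
    for _ in range(max_turns):
        action = model(history)
        if action.kind == "final":
            return action.text
        tool = TOOLS.get(action.name)
        if tool is None:
            # Hallucinated tool call: feed the error back as an observation.
            history.append(f"observation: error: unknown tool {action.name!r}")
            continue
        try:
            result = tool(**(action.args or {}))
        except TypeError as exc:
            # Wrong argument shape for a real tool.
            history.append(f"observation: error: {exc}")
            continue
        history.append(f"observation: {result}")
    return "gave up: turn budget exhausted"

print(run_agent(scripted_model, "look up user 7"))
```

Every `continue` branch costs one extra turn, which is why per-turn reliability shows up directly in turn count.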

The news. On May 25, 2026, Google DeepMind announced Gemini 3.5 Flash, the first model in the Gemini 3.5 series. The framing is explicitly agent-first: the blog pairs the model with the Antigravity harness for collaborative subagents and the Frontier Safety Framework for interpretability-based safety checks. Headline scores include Terminal-Bench 2.1 at 76.2%, MCP Atlas at 83.6%, and GDPval-AA at 1656 Elo, with a claim of roughly 4× the output tokens / second of competing frontier models. Architectural details — parameter count, context window, training recipe — are not disclosed.

Picture the metaphor for a moment. A tourist with a phrasebook can order coffee. They flip to the right page, read the phrase slowly, get the syllables roughly right. When the barista says something back, they look it up. When the response is unexpected — "what size?" — they fumble. They get there, but every interaction is a discrete look-up, and the cost of failure is a re-look-up. A native speaker doesn't translate; they hear the question and respond in the same beat, and when something unexpected lands they handle it without dropping the thread. That is the gap between a chat model with a tool-use harness and an agent-first model. The chat model speaks the language of tool calls one phrase at a time through a wrapper; the agent-first model speaks it natively because it learned in that environment.

What changes in training — at least in the version of the story the public framing implies — is the substrate. A chat-tuned base model has seen billions of words of human dialogue and roughly nothing of tool-call traces; function-calling is typically taught at fine-tune time as a structured-output discipline. Agent-first models, by contrast, are characterized by tool-call traces — call, observation, next call, error, recovery, success — appearing in heavy post-training (and in some cases earlier). The model's prior on "what happens next" after a 500 response is no longer "be a helpful chatbot," it's "retry with backoff or pick a different tool." Google does not disclose Gemini 3.5 Flash's training recipe, so the loss-shaping argument here is an interpretation of the agent-first product positioning (Antigravity, MCP Atlas benchmarks, Frontier Safety Framework framing), not a quoted architectural claim.
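
To make "tool-call trace" concrete, here is one way such a trajectory could be represented as data. The format is purely hypothetical (Google has not published one); `Step`, `Trajectory`, and `count_recoveries` are illustrative names, and the example trace is invented for the sketch.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    role: str      # "call" or "observation"
    content: str

@dataclass
class Trajectory:
    task: str
    steps: List[Step] = field(default_factory=list)
    succeeded: bool = False

# Invented example: call, error, recovery (retry), success.
trace = Trajectory(
    task="fetch the weather for Paris",
    steps=[
        Step("call", 'get_weather({"city": "Paris"})'),
        Step("observation", "HTTP 500: upstream timeout"),
        Step("call", 'get_weather({"city": "Paris"})'),
        Step("observation", '{"temp_c": 18}'),
    ],
    succeeded=True,
)

def count_recoveries(t: Trajectory) -> int:
    """Count calls that directly follow a 5xx-style error observation."""
    n = 0
    for prev, cur in zip(t.steps, t.steps[1:]):
        is_error = prev.role == "observation" and prev.content.startswith("HTTP 5")
        if is_error and cur.role == "call":
            n += 1
    return n

print(count_recoveries(trace))
```

The point of the sketch is only that error and recovery steps are part of the example itself, rather than something a wrapper handles out of band.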

The under-appreciated piece is the harness architecture on the other side. A chat-with-tools deployment needs an aggressive harness: JSON-schema validators that reject hallucinated function names, retry wrappers that catch tool errors and rephrase them as user messages, and a planner module that re-prompts on stuck loops. Each of those layers exists because the model itself does not natively know it is inside a loop; the harness has to keep telling it. As models move agent-first, harness mass tends to shift back into the model: fewer parsers, fewer retry wrappers, and simpler observability spans, because each turn is shorter and the per-turn error rate (the quantity that compounds) is lower. The harness becomes thin: it shuttles inputs and outputs, it does not police behaviour.
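
As a sketch of the validator layer described above, the following checks a raw tool call for a known function name and the expected argument shape. It uses only the standard library rather than a real JSON Schema validator; `TOOL_SCHEMA` and `validate_tool_call` are hypothetical names, and the schema format is an assumption made for brevity.

```python
import json

# Hypothetical tool schema: tool name -> {argument name: expected Python type}.
TOOL_SCHEMA = {
    "get_user": {"user_id": int},
    "write_audit_log": {"user_id": int, "message": str},
}

def validate_tool_call(raw: str):
    """Return (ok, message). On failure, the message is what a harness
    might feed back to the model as an error observation."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return False, "tool call must be a JSON object"

    name = call.get("name")
    if not isinstance(name, str) or name not in TOOL_SCHEMA:
        # Hallucinated function name.
        return False, f"unknown tool {name!r}; available: {sorted(TOOL_SCHEMA)}"

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return False, "arguments must be a JSON object"

    expected = TOOL_SCHEMA[name]
    missing = sorted(set(expected) - set(args))
    extra = sorted(set(args) - set(expected))
    if missing or extra:
        # Wrong argument shape.
        return False, f"bad argument shape: missing={missing}, unexpected={extra}"

    for key, typ in expected.items():
        if not isinstance(args[key], typ):
            return False, f"argument {key!r} should be {typ.__name__}"
    return True, "ok"

print(validate_tool_call('{"name": "delete_user", "arguments": {}}'))
```

In the thin-harness picture this check does not disappear; it simply fires less often, so the retry path behind it carries less of the load.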

What "agent-first" actually changes — line by line

Behaviour	Chat model + tool-use harness	Agent-first model
Function name correctness	Hallucinated names appear; harness rejects them and re-prompts (setup-dependent, illustrative)	Function names are part of the training distribution — closer to the corpus, hallucinated less
Argument-shape correctness	JSON schema violations on first attempts — harness catches and retries (setup-dependent, illustrative)	Structured outputs are native; shape errors fall off — see structured outputs
Tool-error recovery	Treats 500s as conversational surprise; may re-ask the user	Treats 500s as in-loop signal: backoff, alternative tool, fail-task
Multi-step planning horizon	Plans 1–3 turns ahead; long horizons drift	Trained against trajectories of 10+ tool calls; horizon stays coherent
Harness complexity	Heavy: parsers, validators, retry wrappers, planner modules	Thin: dispatch tool calls, format observations
Headline pitch	"Use this chat model for agents (with these wrappers)"	"This model is for agents" — e.g. Gemini 3.5 Flash, Claude computer-use models
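
The "backoff, alternative tool, fail-task" row can be written out as explicit control flow. This is a sketch of the policy only, not of how any model implements it internally; `ToolError` and `call_with_recovery` are hypothetical names, and the retry counts and delays are arbitrary.

```python
import time

class ToolError(Exception):
    """Hypothetical error raised when a tool returns a 5xx-style failure."""

def call_with_recovery(tools, args, max_retries=2, base_delay=0.01, sleep=time.sleep):
    """Try each tool in order, retrying each with exponential backoff.

    Raises ToolError (the fail-task outcome) once every tool is exhausted.
    """
    last_error = None
    for tool in tools:                      # alternative tool
        for attempt in range(max_retries + 1):
            try:
                return tool(**args)
            except ToolError as exc:
                last_error = exc
                if attempt < max_retries:   # backoff before the next retry
                    sleep(base_delay * 2 ** attempt)
    raise ToolError(f"all tools failed: {last_error}")

def primary(x):
    raise ToolError("500 from primary")

def fallback(x):
    return x + 1

print(call_with_recovery([primary, fallback], {"x": 1}))
```

In a chat-plus-harness setup this logic lives in a retry wrapper; the agent-first claim is that the model increasingly chooses these moves itself, so the wrapper can shrink.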

Where the per-turn savings actually come from

A back-of-envelope walk-through (illustrative numbers; substitute your own task for a real plan). Suppose a task needs 4 distinct tool calls to complete: fetch a user, read a permission policy, write an audit log, return a response. A chat-with-tools model with a per-turn tool-call accuracy of ~85% on a complex schema will, by compounding, succeed on the 4-step trajectory ~52% of the time on the first attempt (0.85⁴ ≈ 0.52). Every miss triggers a harness-level retry, which costs an extra 2–4 turns to get back on the rails. The expected turn count balloons to roughly 8–11 turns.

Now the agent-first version. If post-training on tool-call traces lifts per-turn accuracy to ~95%, the 4-step success rate rises to ~81% (0.95⁴ ≈ 0.81). Expected turn count drops to roughly 5 turns, about half as many turns per task. Combine that with Google's reported ~4× output tokens / second at the serving layer and the end-to-end wall-clock improvement is multiplicative (fewer turns and faster turns), even though the per-turn accuracy gain looks modest on its own. That is the agentic-throughput story the Antigravity framing is pointing at, not raw single-shot benchmark wins.
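
The arithmetic in this walk-through is easy to reproduce. The sketch below computes first-attempt trajectory success from per-turn accuracy, then multiplies the turn-count ratio by the throughput ratio. The function names are illustrative, and the wall-clock figure assumes that latency is dominated by output generation and that tokens per turn stay the same; tool execution time, which a faster model does not shorten, is ignored.

```python
def trajectory_success(per_turn_accuracy: float, steps: int) -> float:
    """Probability that every one of `steps` turns succeeds on the first attempt."""
    return per_turn_accuracy ** steps

def wall_clock_speedup(turns_before: float, turns_after: float,
                       tokens_per_sec_ratio: float) -> float:
    """Fewer turns times faster turns, under the assumptions stated above."""
    return (turns_before / turns_after) * tokens_per_sec_ratio

print(round(trajectory_success(0.85, 4), 2))    # 0.52
print(round(trajectory_success(0.95, 4), 2))    # 0.81
print(round(trajectory_success(0.95, 10), 2))   # 0.6

# Turn counts taken from the illustrative ranges in the text (8-11 vs ~5).
print(round(wall_clock_speedup(8, 5, 4.0), 1))  # 6.4
print(round(wall_clock_speedup(11, 5, 4.0), 1)) # 8.8
```

Substituting measured per-turn accuracy and turn counts from your own traces will give a more trustworthy estimate than these illustrative inputs.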

The catch, and the reason agent-first is not a free win: training on trajectories of 10+ tool calls is expensive. The traces need to be either synthesized in a closed-loop sandbox or harvested from a deployed harness, and either path adds infrastructure that pure chat post-training did not need. The serving cost story is also fragile: the ~4× tokens / second claim is not paired with public benchmark methodology in the Google blog, and the architecture that delivers it is not disclosed. Treat the throughput number as a directional headline rather than a guaranteed contract, the same way Jetson Thor's "7.5× compute" framing crossed precisions in the edge Blackwell explainer.

Goes deeper in: AI Agents → The Agent Loop & State → Harness anatomy

Related explainers

Tool-router contextual bandit — what the harness can still do for an agent-first model: choose the cheapest viable tool per turn, rather than burning the model's planning budget
Pantheon-bench — HITL vs autonomous coding — the eval side of agent-first: trajectories matter more than single-turn scores
MCP SEP-2663 — async task handles — what the transport layer looks like when the model on the other side actually expects to be in a loop

FAQ

What does agent-first actually mean, in one paragraph?

Agent-first means a model whose post-training mixture includes a large fraction of multi-turn tool-call trajectories — call, observation, error, recovery, success — and whose loss is shaped against the loop, not against single-turn chat. The model develops a prior over "what happens next inside a tool-use loop" rather than just "what a helpful assistant would say next." Google DeepMind's Gemini 3.5 Flash, announced May 25, 2026, is positioned as an agent-first Flash-class model paired with the Antigravity harness. Anthropic's computer-use models and OpenAI's o-series tool-reasoning models sit on the same trajectory.

Why does this matter for production agent systems?

Production agent latency, cost, and reliability are all dominated by turn count and per-turn error rate. A chat model with tools bolted on hallucinates function names, fumbles structured outputs, and treats tool errors as conversational surprises that need a re-prompt — each of those failures adds turns. An agent-first model trained on tool-call traces drops per-turn error rates and lets the harness shed retry and validation layers. The effect compounds: even modest per-turn accuracy gains translate to large end-to-end wins because trajectories compose multiplicatively. The harness gets simpler, observability gets quieter, and the SLO budget gets cheaper.

How is this different from just fine-tuning a chat model for tool use?

Fine-tuning a chat model for tool use teaches it to emit structured output and call functions, but it does not change the base prior. After fine-tuning, the model still treats a 500 response or an unfamiliar tool error as something a chat assistant would react to (apologetically, conversationally) rather than as an in-loop signal to retry or switch tools. Agent-first models include tool-call traces in the pretraining-adjacent mixture (or in heavy post-training) so the prior itself is loop-shaped. Function-name hallucinations drop, multi-step planning horizons stretch, and recovery behaviour stops looking like apology and starts looking like control flow. The architectural specifics for Gemini 3.5 Flash are not publicly disclosed.


Top comments (1)

Harjot Singh • May 31

Agent-first model design is the right framing for where this is going, models tuned for tool-calling, long-horizon tasks, and cheap fast loops rather than one-shot Q&A. Flash-class models optimized for the agent loop matter more than raw benchmark IQ, because inside an agent you make dozens of cheap calls, not one expensive genius call, so speed and tool-calling reliability beat marginal reasoning gains. That's exactly why I route to fast cheap models for the bulk of Moonshift's loop and reserve the expensive one for the hard steps. Has Flash's tool-calling been reliable enough to trust unattended yet, or are you still babysitting it?