Anup Karanjkar

Posted on May 20 • Originally published at wowhow.cloud

Gemini 3.5 Flash: Complete Developer Guide — API, Benchmarks, Pricing & Migration (2026)

#gemini35 #migrategemini

Google shipped Gemini 3.5 Flash on May 19, 2026 at Google I/O, and it immediately replaced Gemini 3.1 Pro as the default choice for production developers. At $1.50 per million input tokens and $9.00 per million output tokens, it is 25% cheaper than 3.1 Pro, which was priced at $2.00 and $12.00. It runs 4x faster. It scores 76.2% on Terminal-Bench 2.1, 83.6% on MCP Atlas for tool-use reliability, and 1656 Elo on real-world agentic tasks, outperforming every prior Flash-series model and matching frontier-class performance on coding benchmarks. If you are still routing production traffic through Gemini 3.1 Pro, this guide gives you everything you need to migrate today.

This guide covers the three breaking API changes in 3.5 Flash, benchmarks versus prior Gemini models and competitors, a pricing calculation for common usage patterns, a complete migration checklist, and TypeScript examples for basic generation, multimodal input, and agentic function calling.

What Gemini 3.5 Flash Is and Why It Matters

Gemini 3.5 Flash is Google DeepMind's newest and fastest frontier model, released to general availability via the Gemini API on May 19, 2026. It is the first model in the 3.5 family and is positioned as the recommended model for agentic workflows, coding tasks, and high-volume production API usage. Gemini Spark, Google's persistent 24/7 agent platform, runs on it. Google AI Mode, now serving one billion monthly users, uses it as its default. If you have interacted with either product since I/O 2026, you have already used Gemini 3.5 Flash.

The model supports a 1M-token context window, 65k max output tokens, multimodal input (text, image, audio, video, PDF), thinking, function calling, code execution, and grounding with Google Search. It is a drop-in upgrade from gemini-3-flash-preview in most respects — with three API-level breaking changes that require attention before deploying to production.

Benchmarks: Where 3.5 Flash Lands

Google published benchmark scores across four dimensions most relevant for developers evaluating the model:

Terminal-Bench 2.1 (coding): 76.2%. This benchmark tests end-to-end coding tasks in a real terminal environment, including file I/O, package installation, test execution, and multi-step debugging. Gemini 3.1 Pro scored 71.4% on the same benchmark. Flash 3.5 is now Google's strongest coding model across the entire lineup.

GDPval-AA (real-world agentic tasks): 1656 Elo. GDPval-AA measures performance on real-world agentic tasks — research, booking, multi-step tool use, document handling — with human judges evaluating output quality. Gemini 3.1 Pro rated 1598 Elo on the same scale. The 58-point gap is significant at the upper end of the distribution.
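To put the 58-point gap in perspective, the snippet below converts an Elo difference into an expected head-to-head preference rate. It assumes GDPval-AA follows the standard logistic Elo model with a 400-point scale, which the published figures do not confirm, so treat the result as a rough reading rather than an official statistic. The function name is illustrative.

```typescript
// Expected score of A against B under the standard Elo model.
// ASSUMPTION: GDPval-AA uses this model; the guide does not say so.
function eloExpectedScore(ratingA: number, ratingB: number): number {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// Ratings quoted above: 3.5 Flash at 1656, 3.1 Pro at 1598.
const flashVsPro = eloExpectedScore(1656, 1598);
console.log(flashVsPro.toFixed(3)); // prints "0.583"
```

Under that assumption, 3.5 Flash output would be preferred in roughly 58% of direct comparisons with 3.1 Pro.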

MCP Atlas (scaled tool-use reliability): 83.6%. MCP Atlas measures how reliably a model calls tools correctly across sequences of dependent tool calls. For developers building multi-step agents with MCP servers, this score directly predicts how often your agent completes a task without stalling on a malformed or out-of-order tool call. This is the benchmark that matters most for production agent pipelines.

CharXiv Reasoning (multimodal understanding): 84.2%. CharXiv tests chart, graph, and document understanding in a visual context. Flash 3.5 is the first Flash-series model to approach frontier-level performance on this benchmark, making it viable for document analysis use cases previously reserved for Pro or Ultra tiers.

On third-party benchmarks, RD World reported that Gemini 3.5 Flash "scores within two points of Anthropic's flagship at a third of the price." Seeking Alpha confirmed it "surpasses GPT-5.5 in agentic benchmarks." The headline metric holds: near-flagship performance at mid-tier cost.

Pricing Breakdown

Gemini 3.5 Flash pricing is structured as follows:

Input tokens: $1.50 per million (standard regions); $1.65/M non-global
Output tokens: $9.00 per million (standard); $9.90/M non-global
Cached input tokens: $0.15 per million — 90% cheaper than uncached input
Thinking tokens: Billed at the output rate when thinking_level is set; thought preservation adds tokens across multi-turn conversations

To compare directly: Gemini 3.1 Pro was priced at $2.00 input / $12.00 output per million tokens. Switching to 3.5 Flash cuts both input and output costs by 25%, while delivering better benchmark scores. For applications using context caching (storing system prompts, reference documents, or shared instructions), the $0.15/M cached input rate cuts the cost of repeated-context reads by a further 90%.

A concrete cost example: a production application making 10,000 API calls per day, averaging 2,000 input tokens and 500 output tokens per call, spends roughly $30/day on input and $45/day on output using 3.5 Flash. The same workload on 3.1 Pro cost $40/day input and $60/day output. That is $75/day versus $100/day, a difference of approximately $750 over a 30-day month, with no code changes beyond the model string for code that is unaffected by the breaking changes covered in the migration guide.
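The arithmetic for this workload can be checked with a small calculator. This is a minimal sketch that uses only the per-million rates quoted in this guide; the type and function names are made up for illustration, and it deliberately ignores caching, thinking tokens, and the non-global surcharge.

```typescript
// Prices in USD per million tokens, taken from the figures quoted in this guide.
interface ModelPricing {
  inputPerM: number;
  outputPerM: number;
}

const FLASH_3_5: ModelPricing = { inputPerM: 1.5, outputPerM: 9.0 };
const PRO_3_1: ModelPricing = { inputPerM: 2.0, outputPerM: 12.0 };

interface Workload {
  callsPerDay: number;
  inputTokensPerCall: number;
  outputTokensPerCall: number;
}

// Daily cost in USD, ignoring caching, thinking tokens, and regional surcharges.
function dailyCost(pricing: ModelPricing, w: Workload): number {
  const inputM = (w.callsPerDay * w.inputTokensPerCall) / 1_000_000;
  const outputM = (w.callsPerDay * w.outputTokensPerCall) / 1_000_000;
  return inputM * pricing.inputPerM + outputM * pricing.outputPerM;
}

const workload: Workload = {
  callsPerDay: 10_000,
  inputTokensPerCall: 2_000,
  outputTokensPerCall: 500,
};

const flashDaily = dailyCost(FLASH_3_5, workload); // 75
const proDaily = dailyCost(PRO_3_1, workload); // 100
const monthlySavings = (proDaily - flashDaily) * 30; // 750
console.log({ flashDaily, proDaily, monthlySavings });
```

Swapping in your own call volume and token averages gives a first-order estimate; real bills will differ once thinking tokens and cached reads are included.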

Migration Guide: Three Breaking API Changes

Gemini 3.5 Flash introduces three API-level changes that affect existing code targeting Gemini 3.1 Pro or the gemini-3-flash-preview model. Each is documented below with the exact change required.

1. Model Name Update

The stable GA identifier replaces the preview suffix. Update your model string:

// Before: preview or 3.1 Pro
const model = genAI.getGenerativeModel({ model: "gemini-3-flash-preview" });

// After: Gemini 3.5 Flash stable GA
const model = genAI.getGenerativeModel({ model: "gemini-3.5-flash" });

2. Thinking Configuration: Enum Replaces Integer

The thinking_budget integer parameter has been replaced with a thinking_level string enum. The default changed from high to medium. Remove temperature, top_p, and top_k from your generation config — they are no longer recommended and are silently ignored in Flash 3.5.

// Before: 3.1 Pro thinking config
const result = await model.generateContent({
  contents: [{ role: "user", parts: [{ text: prompt }] }],
  generationConfig: {
    temperature: 0.7,
    top_p: 0.95,
    thinking_budget: 8192,
  },
});

// After: 3.5 Flash thinking config
const result = await model.generateContent({
  contents: [{ role: "user", parts: [{ text: prompt }] }],
  generationConfig: {
    thinking_level: "high",   // "low" | "medium" | "high"
    maxOutputTokens: 65000,
  },
});

Important: Thought preservation is on by default in 3.5 Flash. In multi-turn conversations, the model carries thinking tokens forward across turns, which increases total billed token usage. Audit your multi-turn flows after migration and add explicit context truncation if costs rise unexpectedly on long sessions.
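If you have many legacy configs to convert, a small helper can apply this change in one place. The sketch below is an assumption-heavy illustration: the interfaces are local stand-ins rather than SDK types, and the thresholds that map a thinking_budget integer onto a thinking_level are invented for the example, not an official mapping. Tune them against your own quality and cost measurements.

```typescript
type ThinkingLevel = "low" | "medium" | "high";

// Local stand-in for a pre-migration config (not an SDK type).
interface LegacyGenerationConfig {
  temperature?: number;
  top_p?: number;
  top_k?: number;
  thinking_budget?: number;
  maxOutputTokens?: number;
}

// Local stand-in for the 3.5 Flash config described in this guide.
interface FlashGenerationConfig {
  thinking_level: ThinkingLevel;
  maxOutputTokens?: number;
}

// ASSUMPTION: these thresholds are illustrative only.
function budgetToLevel(budget: number | undefined): ThinkingLevel {
  // No budget set: fall back to "medium", the new default stated above.
  // Pass "high" explicitly if you want to keep the old default behavior.
  if (budget === undefined) return "medium";
  if (budget <= 1024) return "low";
  if (budget <= 4096) return "medium";
  return "high";
}

// Drops the sampling parameters and replaces thinking_budget with thinking_level.
function migrateGenerationConfig(legacy: LegacyGenerationConfig): FlashGenerationConfig {
  const migrated: FlashGenerationConfig = {
    thinking_level: budgetToLevel(legacy.thinking_budget),
  };
  if (legacy.maxOutputTokens !== undefined) {
    migrated.maxOutputTokens = legacy.maxOutputTokens;
  }
  return migrated;
}

console.log(
  migrateGenerationConfig({ temperature: 0.7, top_p: 0.95, thinking_budget: 8192 })
);
// logs an object whose only key is thinking_level, set to "high"
```

Centralizing the conversion also gives you a single place to log which configs still carried sampling parameters before they were dropped.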

3. Function Calling: Add id and name to FunctionResponse

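Function responses now require explicit id and name fields matching the originating function call from the model. In the example below, id is the field that gets added; name was already present, but it must now match the name in the model's call exactly: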

// Before: 3.1 Pro function response
const functionResponse = {
  role: "function",
  parts: [{
    functionResponse: { name: "get_weather", response: { temp: 22 } }
  }],
};

// After: 3.5 Flash function response — id and name required
const functionResponse = {
  role: "function",
  parts: [{
    functionResponse: {
      id: functionCall.id,    // must match the id from the model FunctionCall part
      name: "get_weather",    // must match the function name exactly
      response: { temp: 22 },
    },
  }],
};

If you use multimodal function responses (returning images or structured data from a function call), move the multimodal content inside the functionResponse part, not alongside it as a separate part. If you append inline instructions to a function response, append them to the response text field separated by two newlines — not as a separate content part.
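One way to avoid mismatched fields is to build every response from the originating call instead of retyping the name. The helper below is a minimal sketch under that idea: the interfaces are simplified local shapes rather than the SDK's own types, and buildFunctionResponse is a hypothetical name.

```typescript
// Simplified local shapes for illustration; the SDK's own types may differ.
interface FunctionCallLike {
  id?: string;
  name: string;
  args: Record<string, unknown>;
}

interface FunctionResponsePart {
  functionResponse: {
    id: string;
    name: string;
    response: Record<string, unknown>;
  };
}

// Copies id and name straight from the model's call so they cannot drift.
function buildFunctionResponse(
  call: FunctionCallLike,
  response: Record<string, unknown>
): FunctionResponsePart {
  if (!call.id) {
    // Fail early instead of sending a response the API would reject.
    throw new Error(`Function call "${call.name}" has no id; cannot build a matching response.`);
  }
  return { functionResponse: { id: call.id, name: call.name, response } };
}

const part = buildFunctionResponse(
  { id: "call_1", name: "get_weather", args: { city: "Pune" } },
  { temp: 22 }
);
console.log(JSON.stringify(part));
```

Throwing on a missing id surfaces the problem in your own code, which is usually easier to debug than an API error several layers down.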

TypeScript Code Examples

Basic Generation

import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

const model = genAI.getGenerativeModel({
  model: "gemini-3.5-flash",
  generationConfig: {
    thinking_level: "medium",
    maxOutputTokens: 8192,
  },
});

const result = await model.generateContent(
  "Explain the trade-offs between RSA and ECC for TLS certificates in production."
);

console.log(result.response.text());

Multimodal Input: PDF Analysis

Flash 3.5 handles PDF, image, audio, and video natively in a single API call — matching the full multimodal input now available in the Google Search box:

import fs from "fs";
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-3.5-flash" });

const pdfBuffer = fs.readFileSync("./contract.pdf");
const base64Pdf = pdfBuffer.toString("base64");

const result = await model.generateContent({
  contents: [{
    role: "user",
    parts: [
      {
        text: "List all termination clauses and auto-renewal provisions in this contract.",
      },
      {
        inlineData: {
          mimeType: "application/pdf",
          data: base64Pdf,
        },
      },
    ],
  }],
  generationConfig: { thinking_level: "high", maxOutputTokens: 4096 },
});

console.log(result.response.text());

Agentic Function Calling with id Matching

import {
  GoogleGenerativeAI,
  type FunctionDeclaration,
  type Tool,
} from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-3.5-flash" });

const tools: Tool[] = [{
  functionDeclarations: [{
    name: "search_database",
    description: "Search the internal product database.",
    parameters: {
      type: "object",
      properties: {
        query: { type: "string", description: "Search query" },
        limit: { type: "number", description: "Max results to return" },
      },
      required: ["query"],
    },
  } as FunctionDeclaration],
}];

const chat = model.startChat({ tools });
const response = await chat.sendMessage(
  "Find the top 5 products in the developer tools category."
);

const candidate = response.response.candidates?.[0];
const fnCallPart = candidate?.content.parts.find((p) => p.functionCall);

if (fnCallPart?.functionCall) {
  const { name, args, id } = fnCallPart.functionCall;
  // searchDatabase is your own application function, not part of the SDK
  const dbResults = await searchDatabase(
    args as { query: string; limit?: number }
  );

  // id and name are required in 3.5 Flash — different from 3.1 Pro
  const followUp = await chat.sendMessage([{
    functionResponse: {
      id,
      name,
      response: { results: dbResults },
    },
  }]);

  console.log(followUp.response.text());
}

When to Use Flash vs Pro vs Ultra

With Flash 3.5 outperforming 3.1 Pro on most benchmarks at lower cost, the model routing decision has simplified significantly:

Use Gemini 3.5 Flash for the large majority of production workloads: chat interfaces, document analysis, coding assistance, agentic pipelines with MCP tool use, batch processing, and any task where throughput or cost matters. The model now handles tasks that previously required Pro. This is the new default for new projects starting today.

Use Gemini 3.1 Pro for Computer Use tasks only. Flash 3.5 does not support Computer Use at launch. If your pipeline includes browser automation, desktop interaction, or screen reading via computer use, 3.1 Pro remains the required model until Google adds support to the Flash series.

Use Gemini 3.5 Pro (expected approximately June 2026) for tasks requiring maximum reasoning depth: complex multi-step research, long legal or financial document synthesis requiring deep inference, or tasks where token budget is not a constraint.

Use Gemini 3.5 Ultra for the highest-stakes reasoning when it ships. Flash handles the throughput tier; Pro and Ultra handle the capability ceiling.
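The routing rule for models that are available today reduces to a single branch. The sketch below assumes the identifiers gemini-3.5-flash and gemini-3.1-pro used elsewhere in this guide; TaskProfile and pickModel are illustrative names, and 3.5 Pro and Ultra are left out because their identifiers have not been given here.

```typescript
// Hypothetical description of a task, reduced to the one property that matters today.
interface TaskProfile {
  needsComputerUse: boolean;
}

const FLASH_MODEL = "gemini-3.5-flash";
const COMPUTER_USE_MODEL = "gemini-3.1-pro";

// Computer Use stays on 3.1 Pro; everything else goes to 3.5 Flash.
function pickModel(task: TaskProfile): string {
  return task.needsComputerUse ? COMPUTER_USE_MODEL : FLASH_MODEL;
}

console.log(pickModel({ needsComputerUse: false })); // prints "gemini-3.5-flash"
console.log(pickModel({ needsComputerUse: true })); // prints "gemini-3.1-pro"
```

Keeping the decision in one function means a later tier can be added without touching call sites.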

Agentic Use Cases: Where Flash 3.5 Was Optimized

The benchmark composition signals where Google spent optimization effort. MCP Atlas at 83.6% means Flash 3.5 reliably completes multi-step tool chains, the class of tasks where earlier models frequently stalled on malformed or out-of-order tool calls. GDPval-AA at 1656 Elo, 58 points above Gemini 3.1 Pro, means human judges rated its output on real-world agentic tasks above that of Gemini 3.1 Pro.

The model is the backend for two of Google's largest-scale agent deployments: Gemini Spark — the persistent 24/7 agent platform covered in depth in the Gemini Spark and Antigravity 2.0 developer guide — and Google AI Mode, which crossed one billion monthly users as covered in the Google AI Mode analysis from I/O 2026. Both require the model to maintain goal state across extended interactions and complete tool-use chains reliably at scale.

For developers building production agents: migrating an existing agentic pipeline from 3.1 Pro to 3.5 Flash should produce measurable improvement in task completion rates and reduce tool-call error recovery overhead. If your current agent has retry logic specifically to handle malformed function calls from the model, test whether that logic is still exercised after migration — with MCP Atlas at 83.6%, it often will not be.
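To find out whether your retry path is still exercised, count malformed calls before and after the switch. The sketch below is one possible way to do that: it checks each tool call against a list of required arguments and tracks the malformed rate. ToolCall, ToolSpec, and MalformedCallStats are hypothetical local types, and the check is far simpler than full schema validation.

```typescript
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface ToolSpec {
  name: string;
  required: string[];
}

// Returns a list of problems; an empty list means the call looks well formed.
function validateToolCall(call: ToolCall, specs: ToolSpec[]): string[] {
  const spec = specs.find((s) => s.name === call.name);
  if (!spec) return [`unknown tool: ${call.name}`];
  return spec.required
    .filter((key) => !(key in call.args))
    .map((key) => `missing required argument: ${key}`);
}

// Tracks how often the model emits a call that would trigger retry logic.
class MalformedCallStats {
  total = 0;
  malformed = 0;

  record(call: ToolCall, specs: ToolSpec[]): string[] {
    const problems = validateToolCall(call, specs);
    this.total += 1;
    if (problems.length > 0) this.malformed += 1;
    return problems;
  }

  get malformedRate(): number {
    return this.total === 0 ? 0 : this.malformed / this.total;
  }
}

const specs: ToolSpec[] = [{ name: "search_database", required: ["query"] }];
const stats = new MalformedCallStats();
stats.record({ name: "search_database", args: { query: "developer tools" } }, specs);
stats.record({ name: "search_database", args: {} }, specs);
console.log(stats.malformedRate); // prints 0.5
```

Running the same traffic sample against both models and comparing the two rates gives a direct answer for your own tools.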

Context Window: Using 1M Tokens Responsibly

The 1M-token context window and 65k output token limit are available in Flash 3.5 at GA. At $1.50/M input with $0.15/M cached, large context becomes economically viable at scale. A 200-page technical document, a complete application codebase, or six months of customer support transcripts can fit in a single context window and be reasoned over in a single call.

Two things to keep in mind. First, thought preservation across multi-turn conversations adds thinking tokens to the context accumulated over a session. A ten-turn conversation with thinking_level: "high" can accumulate substantially more tokens than the raw user/assistant turn count implies. Monitor token usage on long sessions before they reach production. Second, thinking tokens do not count against the 65k output limit but are billed at the output token rate. Choose your thinking_level with that cost in mind.
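A rough estimate shows how quickly preserved thoughts can add up. The sketch below assumes that the full conversation history is resent as input on every turn and that preserved thinking tokens travel with that history; this guide does not spell out the exact accounting, so the model is an approximation. The per-turn token counts are invented for the example.

```typescript
interface TurnEstimate {
  userTokens: number;
  replyTokens: number;
  thinkingTokens: number;
}

// ASSUMPTION: every turn resends the whole history as input, and preserved
// thinking tokens are part of that history. Actual billing may differ.
function cumulativeInputTokens(turns: TurnEstimate[], preserveThoughts: boolean): number {
  let history = 0;
  let totalInput = 0;
  for (const turn of turns) {
    // Input for this turn: everything so far plus the new user message.
    totalInput += history + turn.userTokens;
    history +=
      turn.userTokens + turn.replyTokens + (preserveThoughts ? turn.thinkingTokens : 0);
  }
  return totalInput;
}

// Ten identical turns with made-up sizes.
const tenTurns: TurnEstimate[] = Array.from({ length: 10 }, () => ({
  userTokens: 200,
  replyTokens: 500,
  thinkingTokens: 2_000,
}));

console.log(cumulativeInputTokens(tenTurns, false)); // prints 33500
console.log(cumulativeInputTokens(tenTurns, true)); // prints 123500
```

With these example numbers the session's input volume is more than three times larger when thoughts are carried forward, which is the kind of gap worth measuring on real traffic before launch.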

Migration Checklist

Update model identifier from gemini-3-flash-preview or gemini-3.1-pro to gemini-3.5-flash
Replace thinking_budget: number with thinking_level: "low" | "medium" | "high"
Remove temperature, top_p, and top_k from all generation configs
Add id and name fields to every FunctionResponse part
Move multimodal content inside functionResponse parts if using multimodal function responses
Audit multi-turn conversation token usage — thought preservation is now on by default
Remove Computer Use logic from Flash paths; route those calls to Gemini 3.1 Pro separately
Update SDK to @google/generative-ai@0.21.0 or later for the updated type definitions
Run your existing test suite against the new model identifier before promoting to production
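Several checklist items can be spotted mechanically before you start. The sketch below scans source text for the legacy strings named in this guide; it is a plain pattern match under the assumption that your code uses the snake_case parameter names shown above, so it will miss other spellings and will flag some legitimate uses. findLegacyUsages and the pattern list are illustrative.

```typescript
interface LegacyPattern {
  label: string;
  pattern: RegExp;
}

// Strings this guide says to replace or remove.
const LEGACY_PATTERNS: LegacyPattern[] = [
  { label: "old model identifier", pattern: /gemini-3-flash-preview|gemini-3\.1-pro/ },
  { label: "thinking_budget (use thinking_level)", pattern: /\bthinking_budget\b/ },
  { label: "sampling parameter (temperature, top_p, top_k)", pattern: /\b(temperature|top_p|top_k)\b/ },
];

// Returns one finding per matching pattern per line, with 1-based line numbers.
function findLegacyUsages(source: string): string[] {
  const findings: string[] = [];
  source.split("\n").forEach((line, index) => {
    for (const { label, pattern } of LEGACY_PATTERNS) {
      if (pattern.test(line)) findings.push(`line ${index + 1}: ${label}`);
    }
  });
  return findings;
}

const sample = [
  'const model = genAI.getGenerativeModel({ model: "gemini-3-flash-preview" });',
  "const cfg = { temperature: 0.7, thinking_budget: 8192 };",
].join("\n");

console.log(findLegacyUsages(sample));
```

A gemini-3.1-pro hit is not always a problem: Computer Use paths are meant to stay on that model, so review each finding rather than replacing blindly.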

Getting Started Today

Gemini 3.5 Flash is generally available via the Gemini API as of May 19, 2026. The model string is gemini-3.5-flash. Update your SDK first:

npm install @google/generative-ai@latest

Google AI Studio at ai.google.dev has been updated to use Flash 3.5 as the default model in the playground. If you have been testing prompts in AI Studio since I/O and noticed faster responses, that is the model change. The API pricing page at ai.google.dev/gemini-api/docs/pricing reflects the current Flash 3.5 rates.

For teams starting new agentic projects: the Antigravity 2.0 developer platform, released alongside Flash 3.5 at I/O, provides a full agent harness, multi-agent orchestration, and a managed MCP gateway that runs Flash 3.5 as its default. The Gemini Spark and Antigravity 2.0 guide covers the platform in full if you are evaluating whether to build your orchestration layer from scratch or adopt the Google-native stack.

Developer tools, starter kits, and API integration templates for Gemini and other frontier models are available at wowhow.cloud.
