This article was originally published on aicoderscope.com
TL;DR: Gemini 3.5 Flash went GA on May 19, 2026 and costs 50% less than Claude Sonnet 4.6 on input tokens ($1.50 vs $3.00/M). It generates code at ~284 tokens per second — roughly 4.7× faster than Sonnet 4.6. Cursor already lists it natively; Cline needs one extra config step. The trap: Flash's default thinking level is "medium," which is slower and pricier than "low," the setting Google specifically tuned for coding and tool-use loops.
| | Gemini 3.5 Flash | Claude Sonnet 4.6 | DeepSeek V4-Flash |
|---|---|---|---|
| Best for | Fast agent loops, context-heavy analysis | Complex refactors, instruction fidelity | Cost-capped high-volume tasks |
| Input / Output per 1M tokens | $1.50 / $9.00 | $3.00 / $15.00 | $0.14 / $0.28 |
| Context window | 1M tokens | 200K tokens | 1M tokens |
| Terminal-Bench 2.1 | 76.2% | — | — |
| Output speed | ~284 t/s | ~60 t/s | — |
| Max output per request | 65,536 tokens | 64K tokens | 64K tokens |
| The catch | Output at $9/M erodes savings on code-gen | ~1.7× pricier output than Gemini 3.5 Flash | No vision, MIT-licensed |
Honest take: Use Gemini 3.5 Flash with Cline for multi-step agent tasks where round-trip latency compounds and context windows run large. Stay on Claude Sonnet 4.6 when you need a hard refactor to land perfectly on the first try — Sonnet's 79.6% SWE-bench Verified score still leads Flash's on correctness benchmarks.
The cost math that does and doesn't work
Gemini 3.5 Flash charges $1.50 per million input tokens and $9.00 per million output tokens. Against Claude Sonnet 4.6 at $3.00/$15.00, the input side is a genuine 2× saving. The output side is almost the same story: $9 vs $15 is 40% cheaper per generated token.
Run the numbers on a typical Cline coding session: 8 tool calls, reading 12 files (roughly 20,000 context tokens), generating 500 lines of code output (~7,000 output tokens).
- Sonnet 4.6: (20K × $3 + 7K × $15) / 1,000,000 = $0.165/session
- Gemini 3.5 Flash: (20K × $1.50 + 7K × $9) / 1,000,000 = $0.093/session
That's 44% cheaper per session. At 50 sessions a month, a realistic figure for a regular Cline user, the $0.072 per-session difference saves about $3.60/month when switching from Sonnet.
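The per-session arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the prices are the ones quoted in this article, and the function name `session_cost` and the model keys are illustrative choices, not part of any SDK.

```python
# USD per 1M tokens, as quoted in this article's comparison table.
PRICES = {
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
    "gemini-3.5-flash": {"input": 1.50, "output": 9.00},
}


def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one session for the given token counts."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # The example session: ~20,000 context tokens in, ~7,000 tokens out.
    sonnet = session_cost("claude-sonnet-4.6", 20_000, 7_000)
    flash = session_cost("gemini-3.5-flash", 20_000, 7_000)
    saving = 1 - flash / sonnet
    print(f"Sonnet: ${sonnet:.3f}  Flash: ${flash:.3f}  saving: {saving:.0%}")
    # prints: Sonnet: $0.165  Flash: $0.093  saving: 44%
```

Swapping in your own token counts from a real Cline session is the quickest way to see whether your workload is input-heavy or output-heavy.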
Where the savings shrink: if you're running long autonomous coding agents that produce 30,000+ output tokens per run (full files, multiple rounds of test generation), Flash's $9/M output adds up. At 100K output tokens in one session you're paying $0.90 in output costs alone. Sonnet at $15/M would be $1.50, so Flash is still cheaper, but the discount on output-dominated runs is 40% rather than the 50% you get on input.
For latency-sensitive agentic loops, where an agent makes 40 small tool calls across a 20-minute session, Flash's 4.7× speed advantage is the bigger win. The generation step of every code suggestion, context lookup, and diff round-trip streams nearly five times faster, and that compounds into sessions that feel responsive rather than sluggish.
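To get a feel for how throughput compounds over a session, here is a rough sketch that estimates time spent streaming output. The throughput figures are the approximate ones quoted in this article; the 150 tokens per tool call is a made-up illustration, and the estimate deliberately ignores network latency, time to first token, and thinking tokens.

```python
FLASH_TPS = 284.0   # approximate output speed quoted in this article
SONNET_TPS = 60.0   # approximate output speed quoted in this article


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Seconds spent streaming output, ignoring network and queueing delay."""
    return output_tokens / tokens_per_second


def session_generation_seconds(calls: int, tokens_per_call: int, tps: float) -> float:
    """Total streaming time for a session of equally sized tool calls."""
    return calls * generation_seconds(tokens_per_call, tps)


if __name__ == "__main__":
    # Hypothetical session: 40 tool calls, 150 output tokens each.
    flash = session_generation_seconds(40, 150, FLASH_TPS)
    sonnet = session_generation_seconds(40, 150, SONNET_TPS)
    print(f"Flash: {flash:.1f}s  Sonnet: {sonnet:.1f}s  ratio: {sonnet / flash:.1f}x")
```

Because the ratio depends only on the two throughput numbers, it stays near 4.7× regardless of session size; what changes is how many seconds that ratio represents.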
Google also offers cached input pricing at $0.15/M for repeated prefixes like your .clinerules system prompt or a large code context you're reusing. Once cached, those tokens cost one-tenth of fresh input tokens.
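The effect of caching on the input bill can be sketched the same way. The two rates come from this article; the split between cached and fresh tokens is a hypothetical example, and the sketch ignores any cache storage fees or minimum cacheable sizes that may apply.

```python
FRESH_INPUT_PER_M = 1.50    # USD per 1M fresh input tokens (from this article)
CACHED_INPUT_PER_M = 0.15   # USD per 1M cached input tokens (from this article)


def input_cost(total_input_tokens: int, cached_tokens: int) -> float:
    """USD input cost when `cached_tokens` of the input hit the cache."""
    if not 0 <= cached_tokens <= total_input_tokens:
        raise ValueError("cached_tokens must be between 0 and total_input_tokens")
    fresh_tokens = total_input_tokens - cached_tokens
    total = fresh_tokens * FRESH_INPUT_PER_M + cached_tokens * CACHED_INPUT_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical: 20,000 input tokens, 15,000 of them a reused prefix.
    uncached = input_cost(20_000, 0)
    cached = input_cost(20_000, 15_000)
    print(f"no cache: ${uncached:.5f}  with cache: ${cached:.5f}")
```

The larger the reused prefix relative to the fresh part of each request, the closer the input bill gets to the one-tenth floor.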
What Gemini 3.5 Flash actually is
Google shipped Gemini 3.5 Flash to GA on May 19, 2026 at Google I/O, positioning it explicitly as their strongest model for coding agents and agentic tool use. Notably, it outperforms Gemini 3.1 Pro on coding benchmarks — not just compared to older Flash models.
On Terminal-Bench 2.1, it scores 76.2%, second only to GPT-5.5 (78.2%). On MCP Atlas, the benchmark for multi-step tool-call chains, it hits 83.6% — a score that reflects why you can actually trust it in 30-step Cline loops where older Flash models would derail.
Thinking mode is built in, with four levels: minimal, low, medium (default), and high. This is where most configurations go wrong, covered below.
Context window: 1,048,576 input tokens. Maximum output: 65,536 tokens. Multimodal inputs — text, image, audio, video — are all supported, which matters when you want Cline to read a screenshot of an error dialog or analyze a UI mockup alongside code.
The thinking-level trap
Every gemini-3.5-flash integration you copy from a May 2026 blog post or GitHub gist likely has a silent configuration error: it doesn't set thinking_level.
Flash's default thinking level is medium — not high, not low. Medium is the balanced setting for general-purpose tasks. For coding and tool-calling workflows, Google specifically retuned the low level: it's faster, cheaper (lower thinking token overhead), and on coding benchmarks it performs comparably to medium.
If you're porting config from gemini-2.5-flash or gemini-3-flash-preview, those models had different defaults. Copying the model ID without setting thinking_level: "low" for a coding workload means you're paying for unnecessary reasoning overhead on every tool call.
When would you use high? Multi-file architecture decisions, debugging obscure errors that require chained logic, or writing complex algorithms. For "read this file, add a null check, run the test" loops — that's low.
The full thinking level behavior:
- minimal: fastest, lowest cost, skips most reasoning steps
- low: tuned for code and agentic tasks; Google's recommendation for coding workflows
- medium: default; general reasoning tasks
- high: full reasoning chains; use for hard algorithmic or architecture problems
In Cline, you set this via the system prompt or through the provider-specific config. In Cursor's built-in Gemini integration, the thinking level is managed by Cursor — you don't control it directly. If precise control matters, the custom API path (below) gives you the parameter.
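If you drive the model from your own scripts, the guidance in this section can be encoded as a small lookup. This is only a sketch: the four level names come from this article, while the task categories and the `pick_thinking_level` helper are hypothetical names invented for illustration. How the chosen level is actually passed to the API depends on the client you use, so check its documentation for the exact parameter.

```python
from enum import Enum


class ThinkingLevel(str, Enum):
    MINIMAL = "minimal"
    LOW = "low"
    MEDIUM = "medium"   # the model's default, per this article
    HIGH = "high"


# Hypothetical task labels; rename these to match your own workflow.
HARD_TASKS = {"architecture", "obscure_debugging", "complex_algorithm"}
ROUTINE_TASKS = {"file_edit", "tool_call", "run_tests"}


def pick_thinking_level(task_kind: str) -> ThinkingLevel:
    """Map a task label to a thinking level following the article's guidance."""
    if task_kind in HARD_TASKS:
        return ThinkingLevel.HIGH
    if task_kind in ROUTINE_TASKS:
        return ThinkingLevel.LOW
    # Unknown tasks fall back to the model's default level.
    return ThinkingLevel.MEDIUM


if __name__ == "__main__":
    for kind in ("file_edit", "architecture", "write_docs"):
        print(kind, "->", pick_thinking_level(kind).value)
```

Making the choice explicit in code means a copied config can no longer fall back to medium silently on routine tool calls.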
Cursor: native support, no config needed
Cursor added Gemini 3.5 Flash to its native model list and has official documentation for it at cursor.com/docs/models/gemini-3-5-flash. It appears in the model dropdown in both Chat and Composer.
When you select it in Cursor:
- Token billing draws from Cursor's API pool at Google's rates: $1.50/M input, $9.00/M output
- Full agent tool access: codebase search (by semantic meaning and exact match), file reads, grep, directory traversal
- Context window: 1,048,576 tokens in scope
- Tab autocomplete: not available — same as all API-pool models in Cursor; tab runs only on Cursor's own served models
To enable it: open Cursor → Settings → Models → toggle gemini-3.5-flash to on. No API key required when billing through Cursor's API pool.
If you want to bring your own Google AI Studio key (to bill directly to your Google account and avoid the Cursor API pool):
- Settings → Models → Custom Models → Add Model
- Model Name: gemini-3.5-flash
- OpenAI Base URL: https://generativelanguage.googleapis.com/v1beta/openai
- API Key: your Google AI Studio key (get one free at aistudio.google.com)
- Click Verify
Expected verify output:
Model verification successful
gemini-3.5-flash — available
Do not append /v1 to the base URL. The Gemini-to-OpenAI compatibility layer at /v1beta/openai handles routing internally; the extra path causes a 404 on verification.
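If you generate or share editor configs programmatically, a small guard can catch the trailing /v1 mistake before it reaches the Verify step. A minimal sketch follows; the helper names are hypothetical, and the /chat/completions suffix is the path OpenAI-style clients conventionally append to the base URL, shown here only to illustrate what the final request URL looks like.

```python
from urllib.parse import urlsplit

# Base URL as given in this article.
GEMINI_OPENAI_BASE = "https://generativelanguage.googleapis.com/v1beta/openai"


def check_base_url(url: str) -> str:
    """Return the base URL without a trailing slash, rejecting an extra /v1."""
    path = urlsplit(url).path.rstrip("/")
    if path.endswith("/v1"):
        raise ValueError("base URL must not end in /v1; use .../v1beta/openai")
    return url.rstrip("/")


def chat_completions_url(base_url: str) -> str:
    """Build the URL an OpenAI-style client would call for chat completions."""
    return check_base_url(base_url) + "/chat/completions"


if __name__ == "__main__":
    print(chat_completions_url(GEMINI_OPENAI_BASE + "/"))
    try:
        check_base_url(GEMINI_OPENAI_BASE + "/v1")
    except ValueError as err:
        print("rejected:", err)
```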
Cline: Google Gemini provider or OpenAI-compatible
Cline has a native Google Gemini provider. In the Cline sidebar, click the settings gear → API Provider → select Google Gemini. Enter your Google AI Studio key and set the model to gemini-3.5-flash.
As of Cline's May 2026 builds, the model dropdown may not yet list gemini-3.5-flash explicitly (GitHub issue #10944 tracks this). If it's absent from the dropdown, type the model ID directly into the model name field — Cline passes it to Google's API verbatim and the call works correctly.
If you prefer the OpenAI-compatible path (for portability, or if you're routing through OpenRouter):
- API Provider → OpenAI Compatible
- Base URL: https://generativelanguage.googleapis.com/v1beta/openai
- Model: gemini-3.5-flash
- API Key: your Google AI Studio key
Via OpenRouter, use