Background
Gemini 3.5 Flash launched at Google I/O 2026 with a bold claim: it beats Gemini 3.1 Pro on coding and agentic benchmarks — while running 4x faster.
At the same time, X (formerly Twitter) is full of posts saying it hallucinates constantly and doesn't even reach Claude Sonnet level.
So which is it? I ran a real benchmark using code from my actual dev stack to find out.
Who I am
- Solo indie Mac app developer (Tauri + Rust + Swift stack)
- I use Gemini daily as part of my coding workflow
- Built 13 macOS utilities, mostly Android connectivity tools
The Test
Models compared
- Gemini 3.1 Pro
- Gemini 3.5 Flash (new)
What I tested
I gave both models a ~200-line Rust file (ADB device manager) with 14 intentional bugs and asked them to find and fix everything.
Why 200 lines? Because in my experience:
- Under 50 lines: any model gets lucky sometimes
- Over 100 lines: older Flash models produce near-unusable code
- 200 lines: a realistic production task that separates real understanding from pattern matching
Bug breakdown
| Category | Count | Examples |
|---|---|---|
| Logic bugs (main) | 7 | Post-execution timeout check, APK install failure not detected |
| Async bugs | 2 | CPU-spinning busy loop, potential data race |
| ADB-specific traps | 3 | \r leftover in line endings, temp file not cleaned up |
| Missing tests | 2 | Edge cases not covered |
The ADB-specific bugs were key — you need domain knowledge to catch them, not just Rust syntax awareness.
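To make the "\r leftover" trap concrete, here is a minimal sketch of CRLF-tolerant parsing. It is not code from the test file: the helper name `parse_devices_output` is hypothetical, and it assumes the usual `adb devices` layout of a header line followed by one `serial<TAB>state` line per device.

```rust
/// Hypothetical helper (not from the article's file): parse `adb devices`-style
/// output into (serial, state) pairs while tolerating CRLF line endings.
pub fn parse_devices_output(raw: &str) -> Vec<(String, String)> {
    raw.lines() // `lines()` splits on "\n" and drops the "\r" of a "\r\n" pair
        .map(|line| line.trim()) // also removes any stray "\r" or padding
        .filter(|line| !line.is_empty() && !line.starts_with("List of devices"))
        .filter_map(|line| {
            // Assumed layout: serial, whitespace, state.
            let mut parts = line.split_whitespace();
            let serial = parts.next()?;
            let state = parts.next()?;
            Some((serial.to_string(), state.to_string()))
        })
        .collect()
}
```

Splitting with `split('\n')` and no trimming would leave the state as `"device\r"`, so a comparison like `state == "device"` fails without any error being raised.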
Prompt (no hints)
The following Rust code contains several bugs.
Please identify all bugs and provide the corrected code.
Include an explanation for each bug.
No hints. No scaffolding. Raw capability test.
The Code (buggy version)
adb_device_manager.rs
use std::process::{Command, Stdio};
use std::time::{Duration, Instant};
use std::sync::{Arc, Mutex};
use tokio::time::sleep;
use std::collections::HashMap;
#[derive(Debug, Clone)]
pub struct AdbDevice {
pub serial: String,
pub state: DeviceState,
pub properties: HashMap<String, String>,
}
#[derive(Debug, Clone, PartialEq)]
pub enum DeviceState {
Online, Offline, Unauthorized, Unknown,
}
pub struct AdbManager {
devices: Arc<Mutex<Vec<AdbDevice>>>,
adb_path: String,
command_timeout: Duration,
}
impl AdbManager {
pub fn new(adb_path: String) -> Self {
AdbManager {
devices: Arc::new(Mutex::new(Vec::new())),
adb_path,
// BUG 1: from_millis(5) — should be from_secs(5)
command_timeout: Duration::from_millis(5),
}
}
pub fn execute_command(&self, serial: &str, args: &[&str]) -> Result<String, String> {
let start = Instant::now();
// prepend "-s <serial>" before command args
let mut cmd_args = vec!["-s", serial];
cmd_args.extend_from_slice(args);
let output = Command::new(&self.adb_path)
.args(&cmd_args)
.output()
.map_err(|e| format!("Command failed: {}", e))?;
// BUG 4: timeout check AFTER command completes — completely useless
if start.elapsed() > self.command_timeout {
return Err("timed out".to_string());
}
Ok(String::from_utf8_lossy(&output.stdout).to_string())
}
pub async fn wait_for_device(&self, serial: &str, timeout_secs: u64) -> Result<(), String> {
let deadline = Instant::now() + Duration::from_secs(timeout_secs);
loop {
// BUG 8: no sleep → CPU at 100%
let devices = self.get_connected_devices()?;
if devices.iter().any(|d| d.serial == serial) {
return Ok(());
}
if Instant::now() >= deadline {
return Err("timeout".to_string());
}
}
}
pub fn install_apk(&self, serial: &str, apk_path: &str) -> Result<(), String> {
let result = self.execute_command(serial, &["install", "-r", apk_path])?;
// BUG 11: adb install returns exit code 0 even on failure
// must check stdout for "Success"/"Failure" strings
if result.contains("Failure") {
return Err(format!("Install failed: {}", result));
}
Ok(())
}
pub fn take_screenshot(&self, serial: &str, save_path: &str) -> Result<(), String> {
let temp_path = "/sdcard/screenshot_temp.png";
self.execute_command(serial, &["shell", "screencap", "-p", temp_path])?;
// BUG 12: temp file never deleted from device
self.execute_command(serial, &["pull", temp_path, save_path])?;
Ok(())
}
}
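For Bug 11, one possible stricter check is to require an explicit "Success" line instead of only ruling out "Failure". This is a sketch under assumptions, not either model's fix: the helper name `check_install_output` is hypothetical, and it takes stderr as a second input even though `execute_command` in the listing returns stdout only.

```rust
/// Hypothetical helper: treat an install as successful only when the tool
/// printed a line that is exactly "Success" (after trimming whitespace).
pub fn check_install_output(stdout: &str, stderr: &str) -> Result<(), String> {
    // Assumption: the caller captured both streams; the listing above keeps
    // only stdout, so a failure message on stderr would be lost there.
    let combined = format!("{}\n{}", stdout, stderr);
    let succeeded = combined.lines().any(|line| line.trim() == "Success");
    if succeeded {
        Ok(())
    } else {
        Err(format!("Install failed: {}", combined.trim()))
    }
}
```

With this shape, empty or unexpected output counts as a failure, which is the safer default for an installer.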
Results
Bug detection
| Model | Bugs found | Score |
|---|---|---|
| Gemini 3.1 Pro | 14 | 14/14 ✅ |
| Gemini 3.5 Flash | 14 | 14/14 ✅ |
Both models found every bug. Accuracy: identical.
Speed
| Model | Response time |
|---|---|
| Gemini 3.1 Pro | ~40 seconds |
| Gemini 3.5 Flash | A few seconds (10x+ faster) |
This was the most striking difference by far.
Where the models diverged
Same score, but different approaches on a few interesting bugs:
Bug 4 — timeout check after execution
- 3.1 Pro: rewrote it using tokio::time::timeout (fully async)
- 3.5 Flash: used a spawn() + try_wait() polling loop (sync-leaning approach)
Both are valid fixes. Different style choices.
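A rough sketch of the spawn() + try_wait() pattern, using only the standard library. This is not the code Flash produced: the helper name `run_with_timeout`, the 20 ms poll interval, and the choice to discard the child's output are all assumptions made to keep the example short.

```rust
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: run a command and kill it if it exceeds `timeout`.
/// Returns the exit code on completion (-1 if the process had no exit code).
pub fn run_with_timeout(mut cmd: Command, timeout: Duration) -> Result<i32, String> {
    let mut child = cmd
        .stdin(Stdio::null())
        .stdout(Stdio::null()) // output is discarded in this sketch
        .stderr(Stdio::null())
        .spawn()
        .map_err(|e| format!("spawn failed: {}", e))?;

    let deadline = Instant::now() + timeout;
    loop {
        match child.try_wait() {
            // The process has exited: report its status.
            Ok(Some(status)) => return Ok(status.code().unwrap_or(-1)),
            // Still running: enforce the deadline while the command runs.
            Ok(None) => {
                if Instant::now() >= deadline {
                    let _ = child.kill();
                    let _ = child.wait(); // reap the killed process
                    return Err("timed out".to_string());
                }
                // Pause between polls so the loop does not spin the CPU.
                thread::sleep(Duration::from_millis(20));
            }
            Err(e) => return Err(format!("wait failed: {}", e)),
        }
    }
}
```

Unlike the original post-execution check, the deadline is tested while the command is still running. Capturing stdout would additionally require piping it and draining the pipe on another thread, since a child that fills the pipe buffer blocks until someone reads it.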
Bug 10 — Mutex poison handling
- 3.1 Pro: into_inner() to safely recover the data
- 3.5 Flash: expect() for fail-fast behavior
Opposite design philosophies. Neither is wrong — depends on your error handling strategy.
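The two philosophies look roughly like this in isolation. The function names are hypothetical, and `poisoned_mutex` exists only to create a poisoned lock for demonstration; neither model's actual output is reproduced here.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Demonstration only: poison a mutex by panicking while its guard is held.
pub fn poisoned_mutex() -> Arc<Mutex<Vec<String>>> {
    let shared = Arc::new(Mutex::new(vec!["emulator-5554".to_string()]));
    let clone = Arc::clone(&shared);
    let _ = thread::spawn(move || {
        let _guard = clone.lock().unwrap();
        panic!("simulated failure while holding the lock");
    })
    .join(); // the join error is ignored; only the poisoning matters here
    shared
}

/// Recovery style: take the data even though another thread panicked.
pub fn len_recovering(m: &Mutex<Vec<String>>) -> usize {
    let guard = m.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    guard.len()
}

/// Fail-fast style: treat a poisoned lock as a bug and panic immediately.
pub fn len_fail_fast(m: &Mutex<Vec<String>>) -> usize {
    m.lock().expect("device list mutex poisoned").len()
}
```

Recovery keeps the app running but may expose a half-updated device list; fail-fast avoids that at the cost of taking down the calling thread.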
Bug 6 — spaces in remote path
- 3.1 Pro: correctly noted that Command::new handles args without shell splitting, so no quoting needed. It left the code as-is (accurate ADB knowledge)
- 3.5 Flash: added format!("\"{}\"", remote_path) quoting (technically unnecessary, slight overreach)
3.1 Pro showed deeper understanding of how ADB + Rust process spawning actually works.
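The host-side half of that claim is easy to check without a device. In this sketch the helper names `build_pull_command` and `args_of` are hypothetical; the command is built and inspected but never run.

```rust
use std::ffi::OsStr;
use std::process::Command;

/// Hypothetical helper: build (but do not run) an `adb pull` invocation.
pub fn build_pull_command(adb_path: &str, serial: &str, remote: &str, local: &str) -> Command {
    let mut cmd = Command::new(adb_path);
    // Each element is passed to the process as one argv entry; no shell is
    // involved on the host, so spaces inside `remote` are not split.
    cmd.args(["-s", serial, "pull", remote, local]);
    cmd
}

/// Collect the configured arguments as strings for inspection.
pub fn args_of(cmd: &Command) -> Vec<String> {
    cmd.get_args()
        .map(|a: &OsStr| a.to_string_lossy().into_owned())
        .collect()
}
```

A path such as `/sdcard/My Photos/shot 1.png` stays a single argument here, so wrapping it in literal quote characters would make the quotes part of the file name. This only covers the host process: anything sent through `adb shell` is still interpreted by the shell on the device, where quoting can matter.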
Pricing reality check
App (free plan)
| Model | Cost | Speed |
|---|---|---|
| 3.1 Pro | Free | Slow (~40s) |
| 3.5 Flash | Free | Fast (few seconds) |
API (Pay-as-you-go)
Straight from Google AI Studio's official UI:
| Model | Input / 1M tokens | Output / 1M tokens |
|---|---|---|
| Gemini 3.5 Flash | $1.50 | $9.00 |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 |
| Context Caching | $0.15 | — |
The $9.00 output price is 3x the previous generation (Gemini 3 Flash at $3.00). Google's "half the price of frontier models" pitch compares against competitors — not their own previous Flash tier.
For indie developers:
- Prototyping / testing → Free tier is more than enough
- Production / commercial → $1.50/$9.00. Budget carefully for output-heavy workloads.
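To budget against those numbers, the per-request arithmetic can be sketched as below. The function and constant names are hypothetical; the prices are the ones quoted in the table above, in USD per 1M tokens.

```rust
/// Gemini 3.5 Flash prices in USD per 1M tokens, as quoted in the table above.
pub const FLASH_35_INPUT_PER_M: f64 = 1.50;
pub const FLASH_35_OUTPUT_PER_M: f64 = 9.00;

/// Hypothetical helper: estimate the cost of one request in USD.
pub fn estimate_cost_usd(
    input_tokens: u64,
    output_tokens: u64,
    input_per_m: f64,
    output_per_m: f64,
) -> f64 {
    let input_cost = input_tokens as f64 / 1_000_000.0 * input_per_m;
    let output_cost = output_tokens as f64 / 1_000_000.0 * output_per_m;
    input_cost + output_cost
}
```

As an illustration with made-up token counts, a request with 3,000 input tokens and 2,000 output tokens on 3.5 Flash comes to 0.0045 + 0.018 = 0.0225 USD, with 80% of the cost coming from output.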
Bonus: I asked Gemini about its own price. It hallucinated.
During testing I asked Gemini 3.5 Flash directly: "What's the API pricing for Gemini 3.5 Flash?"
It confidently answered:
"Input: ~$0.50 / Output: ~$3.00 per million tokens!"
That's the old Gemini 3 Flash Preview pricing. The actual price is $1.50/$9.00.
When I told it the real number, it immediately replied:
"I sincerely apologize! The information you provided is 100% correct!"
The model that aced a 14-bug Rust challenge couldn't accurately describe its own pricing.
Ending an article that started from hallucination complaints with an actual hallucination felt appropriate.
Conclusion
| Category | Result |
|---|---|
| Coding accuracy (200-line bug fix) | 3.1 Pro ≈ 3.5 Flash |
| Output speed | 3.5 Flash wins by 10x+ |
| API cost (output) | 3.5 Flash at $9.00/1M tokens (3x previous gen) |
| Free tier usability | 3.5 Flash is the clear winner |
For free tier users: switch to 3.5 Flash immediately.
For API cost-conscious production use: consider 3.1 Flash-Lite at $0.25/$1.50.
On the "doesn't reach Claude Sonnet" criticism: at least for Rust bug-fix tasks, both models performed at a level I'd call genuinely useful. The hallucination complaints may apply more to conversational and knowledge tasks than to structured code review, though in my limited testing with a single task type, I can't say for certain.
I build macOS utilities for Mac×Android workflows. If you're into Tauri, ADB, or MTP on macOS, feel free to follow.