hiyoyo

Posted on May 26

I Used Gemini 3.5 Flash via Direct API for a Week — An Honest Report on Its Speed Trade-offs and Real-World Pitfalls

#gemini #claude #ai #tauri

Introduction

After publishing a piece on debugging Rust code with Gemini 3.5 Flash, the response was bigger than I expected. So I spent the following week integrating it directly into my personal development workflow via the API and documenting the results.

The previous article was a controlled test. This one is about real-world usage. Bottom line: it can be a powerhouse depending on how you use it — but use it without thinking and you'll hit a wall fast.

What Worked Well

Simple tasks are accurate and blazing fast

For straightforward requests like "which files do I need to touch to remove the dock icon display for this app?", the speed is genuinely impressive. My gut feeling, not a measurement, is that it's at least 10x faster than Gemini 3.1 Pro.

Bulk edits are stable — if your instructions are solid

I threw cross-file edits totaling around 300 lines at it in a single prompt, and when the instructions were precise, it handled them cleanly. For developers who can write tight, complete instruction documents, this becomes a seriously fast tool.

Evaluations feel more grounded than 3.1 Pro

When I asked it to "evaluate this critically," the feedback felt more realistic than what I'd get from 3.1 Pro. This is specifically for non-code evaluation — ideas, writing, etc. For use cases where you want "tell me the real situation, not the ideal," there's a noticeable improvement.

What Was Genuinely Painful

Token consumption burns through your budget alarmingly fast

This was the biggest surprise. Even in casual exchanges, I'd watch nearly 10% of my available tokens disappear within 5 minutes. A significant portion of that is thinking tokens: tokens the model spends reasoning internally before it answers, which you pay for even though they never show up in the reply. If you're planning any long sessions, you need to design your cost model carefully or you'll get hit with an unexpected bill.
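To make the cost-model point concrete, here is a minimal sketch of a per-session budget tracker. The prices are placeholders rather than real rates, and the assumption that thinking tokens are billed at the output rate should be checked against your provider's pricing. The names `Usage`, `request_cost`, and `SessionBudget` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens. These are placeholders,
# not real rates: substitute the numbers from your provider's pricing page.
INPUT_USD_PER_M = 0.50
OUTPUT_USD_PER_M = 3.00


@dataclass
class Usage:
    prompt_tokens: int
    answer_tokens: int
    thinking_tokens: int  # assumed here to be billed at the output rate


def request_cost(u: Usage) -> float:
    """Cost of one request in USD under the assumptions above."""
    billed_output = u.answer_tokens + u.thinking_tokens
    return (u.prompt_tokens * INPUT_USD_PER_M
            + billed_output * OUTPUT_USD_PER_M) / 1_000_000


class SessionBudget:
    """Accumulates per-request usage and reports when a cap is reached."""

    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self.thinking = 0
        self.answer = 0

    def record(self, u: Usage) -> None:
        self.spent_usd += request_cost(u)
        self.thinking += u.thinking_tokens
        self.answer += u.answer_tokens

    def thinking_share(self) -> float:
        """Fraction of generated tokens that were thinking tokens."""
        total = self.thinking + self.answer
        return self.thinking / total if total else 0.0

    def exhausted(self) -> bool:
        return self.spent_usd >= self.limit_usd
```

In use, you would fill a `Usage` from whatever token counts your API response reports, call `record` after each request, and stop or switch models once `exhausted()` returns true. Watching `thinking_share()` shows how much of the spend is internal reasoning rather than visible output.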

Under server load, the "blazing fast" reputation disappears

Speed is one of the main selling points, but when the server is under heavy load, it's just... normal speed. Sometimes nearly identical to 3.1 Pro. "Blazing fast" is more accurately "blazing fast when the servers aren't busy."

Complex logic chains start to break down

Simple bug fixes and single-file edits are fine. But in real development, I noticed things going wrong when around five or more pieces of logic were interconnected.

When you have something like "A triggers B triggers C triggers D triggers E, and changing A affects E," Gemini starts missing intermediate connections and making edits that ignore those dependencies. The previous benchmark (200 lines, 14 bugs, all fixed correctly) worked because the bugs were relatively independent. Real-world dependency chains are a different animal.
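One way to avoid relying on the model to trace such a chain is to compute it yourself and list the affected pieces in the prompt. The following is a minimal sketch, assuming you can write the "triggers" relationships down as a dictionary. The function name `affected_by` and the `TRIGGERS` map are hypothetical, mirroring the A to E example above.

```python
from collections import deque


def affected_by(change: str, triggers: dict[str, list[str]]) -> list[str]:
    """Return everything reachable from `change`, in breadth-first order.

    `triggers` maps each piece of logic to the pieces it directly triggers.
    The `seen` set keeps the walk from looping forever on cyclic graphs.
    """
    seen = {change}
    order: list[str] = []
    queue = deque([change])
    while queue:
        node = queue.popleft()
        for nxt in triggers.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# The chain from the example: A triggers B triggers C triggers D triggers E.
TRIGGERS = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["E"]}
```

Calling `affected_by("A", TRIGGERS)` returns `['B', 'C', 'D', 'E']`, which can be pasted into the instructions as "an edit to A must keep B, C, D, and E working."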

Round-trip costs pile up on complex problems

There was a bug I handed off to Gemini where we went through multiple cycles of investigation → fix → re-investigation, and I eventually spent more time reviewing its attempts than I would have spent just fixing it myself.

"Just hand it to AI and it'll be faster" only holds when the problem is simple. On complex problems, the round-trip overhead stacks up, and doing it yourself can be quicker.

Security-related code gets missed

I had Gemini work on a file that mixed front-end and back-end logic, and it moved forward with a fix while quietly missing a gap in security-related processing.

I only caught it because I later handed the code to Claude Opus, which flagged it immediately: "This code is missing its security processing." Without that second check, the gap would have shipped.

Whether Gemini actively removed it or simply missed an existing gap is unclear, but I've since made it a habit to verify any security-related code with a separate model or by hand after Gemini edits it. For complex cases, doing it myself often ends up being faster anyway.
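The "removed or already missing" question can be partly answered mechanically by checking whether an edit deleted any security-looking lines. Below is a rough sketch that scans a unified diff for removed lines matching a keyword list. The keyword pattern and the function name `removed_security_lines` are assumptions for illustration, and a keyword match is only a prompt for manual review, not a substitute for one.

```python
import re

# Hypothetical keyword list; tune it to your own codebase. It will produce
# false positives (for example, "auth" also matches "author").
SECURITY_PATTERN = re.compile(
    r"auth|token|csrf|sanitiz|escape|permission|validate", re.IGNORECASE
)


def removed_security_lines(diff_text: str) -> list[str]:
    """Return removed lines in a unified diff that look security-related."""
    flagged = []
    for line in diff_text.splitlines():
        # "-" marks a removed line; "---" is the old-file header.
        # Simplification: a removed line whose content itself starts with
        # "--" is skipped by this check.
        if line.startswith("-") and not line.startswith("---"):
            if SECURITY_PATTERN.search(line):
                flagged.append(line[1:].strip())
    return flagged
```

Feeding it the output of `git diff` after a model edit gives a short list of deletions to inspect. An empty list suggests the gap already existed before the edit, though it does not prove the code is safe.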

Long-context ingestion is mediocre — but that's not unique to Gemini

Feeding in large amounts of text at once degrades accuracy. This happens with Claude Opus too. Bulk long-context reading is an industry-wide problem, not a Gemini-specific one.

How to Get the Most Out of It

After a week of use, it became clear that Gemini 3.5 Flash is very sensitive to how you use it:

Stick to well-scoped, simple tasks
If you can write a precise instruction document, bulk edits are fair game
For security-related code, verify with a separate model or by hand after any Gemini edit — and for complex cases, just fix it yourself
For complex problems, going straight to Opus or Sonnet is often faster overall

How I Split Work Between Claude and Gemini

Use Case	Better Model
Simple repetitive edits	Gemini 3.5 Flash
Bulk file edits (with a solid instruction doc)	Gemini 3.5 Flash
Security-related code	Claude, no question
Problems with complex interdependent logic	Claude Sonnet / Opus
Core architecture and foundation work	Claude Opus
Long-context bulk ingestion	Neither is great

Gemini's strengths are speed and volume. Claude's are complexity and precision. It's not about which is better — it's about matching the model to the task.
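If you drive both models from scripts, the table above can be written down as a small lookup so the choice is made once rather than per task. This is only a sketch: the `Task` categories restate the table rows, and the strings in `ROUTES` are hypothetical labels, not real API model identifiers.

```python
from enum import Enum
from typing import Optional


class Task(Enum):
    SIMPLE_EDIT = "simple repetitive edits"
    BULK_EDIT = "bulk file edits (with a solid instruction doc)"
    SECURITY = "security-related code"
    COMPLEX_LOGIC = "problems with complex interdependent logic"
    ARCHITECTURE = "core architecture and foundation work"
    LONG_CONTEXT = "long-context bulk ingestion"


# Hypothetical labels; map them to whatever model IDs your clients expect.
ROUTES: dict[Task, Optional[str]] = {
    Task.SIMPLE_EDIT: "gemini-flash",
    Task.BULK_EDIT: "gemini-flash",
    Task.SECURITY: "claude",
    Task.COMPLEX_LOGIC: "claude-sonnet-or-opus",
    Task.ARCHITECTURE: "claude-opus",
    Task.LONG_CONTEXT: None,  # "neither is great": no recommendation
}


def pick_model(task: Task) -> Optional[str]:
    """Look up the preferred model label, or None if neither fits well."""
    return ROUTES[task]
```

A `None` result for `Task.LONG_CONTEXT` is a signal to split the input into smaller pieces instead of sending it to either model in one go.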

Conclusion

Gemini 3.5 Flash is a powerful tool for developers who use it deliberately. But if your approach is "just throw it at the AI and see what happens," expect to be let down.

Understand the token consumption rate and plan your costs accordingly
Invest in writing better instruction documents
Always verify security-related code with a separate model or by hand — complex cases are usually faster to fix yourself
For genuinely complex problems, consider reaching for a different model from the start

If you read this alongside my previous benchmark article, you'll get both sides: what it can do in controlled testing, and what it's actually like in daily development.

Appendix: Music Analysis via URL Doesn't Work

Unrelated to development, but I saw claims that "Gemini 3.5 improved significantly at music analysis," so I tested it by passing a YouTube link and asking it to analyze the track. It started describing a completely different song.

In hindsight, the model evidently wasn't listening to the audio at the URL, so this shouldn't be surprising. Still, there's a meaningful gap between the hype and the reality. For music, my own ears and knowledge are faster and more accurate.

I build Mac × Android utilities as a solo indie developer.
X → @hiyoyok

DEV Community