DEV Community


Gemini 2.5 Flash vs Claude 3.7 Sonnet: 4 Production Constraints That Made the Decision for Me

Dumebi Okolo on March 10, 2026

An evaluation of the Gemini 2.5 Flash and Claude 3.7 Sonnet models for an agentic engine. I had a simple rule when choosing an LLM for Ozigi: don't...
Anmol Baranwal

I was actually writing a blog post several months ago on which model is best for which use case, where I studied 20+ models across 15+ parameters and use cases, so I know how hard it is to decide.

The gap between benchmarks and real results is always there in my experience. I still think Claude sounds the most human, but the Banned Lexicon idea is pretty smart.

One more thing: you can use "help wanted" labels on your issues (just noticed v4-priority). When developers search for open source projects to contribute to, it's a default filter!

Let me go through this over the weekend and I will message you.

Dumebi Okolo

I plan to eventually migrate the engine to Claude, but so far Gemini gives me what I want.
As for the GitHub issues, I guess I was a bit shy. 😅
I wasn't sure whether people would be interested in contributing, or in the project at all, so I set up the issues tab as a sort of Trello board to track progress and updates.

Dumebi Okolo

Thanks, Anmol!

Kai Alder

Really solid ADR writeup. The Banned Lexicon approach is clever — I've been doing something similar with negative constraints in my own prompts and it works way better than just asking the model to "sound natural."

One thing I'm curious about: have you noticed the JSON stability gap changing with newer Claude versions? I ran into the markdown code fence issue constantly with 3.5 Sonnet but it got noticeably better after they shipped tool_use improvements. Still not at Gemini's responseSchema level though, that's basically cheating in the best way possible.
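For anyone who hasn't used it: responseSchema constrains Gemini's decoding so the reply is guaranteed to parse as schema-conforming JSON. A minimal sketch of the request shape, using a hypothetical campaign schema (the field names are illustrative, not the article's actual schema):

```python
import json

# Hypothetical campaign schema -- illustrative field names only.
# Gemini's REST API takes this under generationConfig.responseSchema
# and enforces it at decode time.
campaign_schema = {
    "type": "OBJECT",
    "properties": {
        "headline": {"type": "STRING"},
        "body": {"type": "STRING"},
        "hashtags": {"type": "ARRAY", "items": {"type": "STRING"}},
    },
    "required": ["headline", "body"],
}

request_body = {
    "contents": [{"parts": [{"text": "Draft a campaign for a coffee brand."}]}],
    "generationConfig": {
        # Both fields are set together for constrained JSON output.
        "responseMimeType": "application/json",
        "responseSchema": campaign_schema,
    },
}

# This body would be POSTed to the generateContent endpoint; printed
# here just to show the request shape.
print(json.dumps(request_body, indent=2))
```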

The 40x cost difference is wild. At pre-revenue that basically makes the decision for you regardless of everything else.

Dumebi Okolo

Yes, the cost is the most important thing I looked out for.
Within two weeks of constantly running the engine, I had burnt through my free Google API credits.

Tommy Leonhardsen

If cost is the primary driver, skip the commercial APIs entirely. Qwen3.5 via Ollama Cloud is $20/month flat-rate unlimited. For a structured JSON generation task like yours it's more than capable, and you'll never burn through free credits again because there are none — just a fixed monthly ceiling.

Dumebi Okolo

What about the quality and human-cadence score of the content it produces? That's something to consider.
I used Qwen a lot last year, but for some reason it just never cut it for me.

Plus, it's actually easier to work with $1,300 in free credits than to pay $20 a month, seeing as we are pre-revenue. 😅

Tommy Leonhardsen

The latest Qwen models are very good, as are a lot of the models available in Ollama Cloud.

If you cannot afford $20/month, your company is in dire economic straits.

Anyway, best of luck to you going forward.
I am fairly certain the world does not really NEED more AI-generated ads, but you do you!

Dumebi Okolo • Edited

Thank you very much for your good wishes.

  1. I just launched this product, and I am not making any revenue from it yet. Every investment I have made has been out-of-pocket. [If you are willing to angel-invest in us, I will be very grateful for this gesture.]

  2. I am not required to use Qwen models or the models available in Ollama Cloud. Gemini works very well for me at this time, and I will be moving over to Claude in due course.

  3. The solution Ozigi provides isn't for the world to be filled with more AI-generated ads, as you call them. It's to solve the blank-page syndrome writers like me face when coming up with social media copy. If you had used the app, you'd have noticed there is an edit button. That signals to users not to trust what the AI generates 100%, but to put in their own voice or opinion. The engine simply gives you a place to start from.

I appreciate your concern about safeguarding the internet from AI content, as I am also passionate about this. We are all just trying to build solutions that help people in their workflows.

Tommy Leonhardsen

The JSON stability comparison isn't really Gemini vs Claude — it's Gemini's responseSchema vs prompted Claude. That's not a fair test.

Claude's tool use and structured outputs enforce schema at the decoding layer exactly like responseSchema does. A simple tool definition with your campaign schema and tool_choice: {type: "tool", name: "generate_campaign"} gets you to 100% parse reliability, same guarantee, no middleware.
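Concretely, that forced-tool setup is only a few lines. A sketch of the Messages API request body, with a hypothetical campaign schema (illustrative field names, not the article's actual schema):

```python
import json

# Hypothetical campaign schema -- illustrative fields only. Defining it
# as a tool's input_schema lets Claude enforce it at the decoding layer,
# like Gemini's responseSchema.
campaign_tool = {
    "name": "generate_campaign",
    "description": "Produce a structured social media campaign.",
    "input_schema": {
        "type": "object",
        "properties": {
            "headline": {"type": "string"},
            "body": {"type": "string"},
            "hashtags": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["headline", "body"],
    },
}

request_body = {
    "model": "claude-sonnet-4-5",  # any recent Claude model works here
    "max_tokens": 1024,
    "tools": [campaign_tool],
    # Forcing this specific tool means the reply arrives as a tool_use
    # block whose `input` already conforms to the schema -- no markdown
    # fences to strip, nothing to re-parse.
    "tool_choice": {"type": "tool", "name": "generate_campaign"},
    "messages": [
        {"role": "user", "content": "Draft a campaign for a coffee brand."}
    ],
}

print(json.dumps(request_body, indent=2))
```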

The PDF claim also doesn't hold up. Claude has supported native base64 PDF ingestion directly in the Messages API for a long time — no OCR step, no external service. Same inlineData pattern, different field names.
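The inline document block looks like this (placeholder bytes stand in for a real PDF file; the prompt text is illustrative):

```python
import base64

# Placeholder bytes standing in for a real PDF; in practice you'd read
# them with open("brief.pdf", "rb").read().
pdf_bytes = b"%PDF-1.4 placeholder"
pdf_b64 = base64.standard_b64encode(pdf_bytes).decode("ascii")

# Messages API content: the PDF rides along as a `document` block,
# inline base64 -- no OCR step, no external service.
message_content = [
    {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": pdf_b64,
        },
    },
    {"type": "text", "text": "Summarize this creative brief."},
]
```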

Latency and cost comparisons are legitimate and Gemini Flash wins those on current pricing. But two of your four constraints were implementation gaps, not model limitations.

Also worth noting: Claude 3.7 Sonnet is over a year old. Claude Sonnet 4.6 and Haiku 4.5 are both faster, cheaper and better than any 3.7 model was. The latency and cost numbers in this article were already outdated when it was published.

Dumebi Okolo

You're right on both counts, and I will be honest about it.

On JSON stability: the framing was imprecise and a bit clickbaity. The comparison was Gemini with responseSchema vs Claude with prompted JSON, not Gemini vs Claude's structured output ceiling. Claude's tool use with tool_choice enforcement gets you to the same structural guarantee at the decoding layer. The article notes this but buries it in a footnote when it should have been the headline caveat. I take responsibility for that.

On PDF ingestion: you're correct, and I got this entirely wrong. Claude's Messages API does support native base64 PDF ingestion via the document content block, with no OCR preprocessing required. The "5 steps, 2 additional failure points" claim was based on a misread of the integration path and shouldn't have made it into the article as written. I'll update that section.

On model choice: I was already inside the Vertex AI ecosystem and evaluated models available in the Google Model Garden. Claude 3.7 Sonnet was the version accessible there at the time. But I agree the article should have stated that more explicitly rather than framing it as a general Gemini vs Claude evaluation.

The latency and cost numbers are the ones I'm most confident in because they were measured in my actual production environment (Vercel serverless, non-streaming, same input payload). Those comparisons hold.

Thanks for pushing back, though. This is the kind of correction that makes the article worth more than it was when I published it. I will be revising the article.

Serge Abrosimov | Peppermint

Interesting comparison! Production constraints like latency and JSON stability really matter in real-world use cases. Curious how this evolves as the models keep improving.

Dumebi Okolo

Claude has already improved greatly in JSON stability, but I missed this.

Serge Abrosimov | Peppermint

Good to know! Nice to see Claude improving in JSON stability. 👍

Dumebi Okolo

Awesome. 👌

Swift

This is a super interesting writeup. Mirrors some of my own anecdotal experience as well. Thanks for sharing!

Dumebi Okolo

Thank you so much!
Glad you enjoyed reading it.

Apex Stack

The JSON stability constraint hits different when your schema isn't just 3 objects — it's 8,000+ structured financial profiles across 12 languages. I run a content generation pipeline with Llama 3 locally for a programmatic SEO site, and the structured output problem scales non-linearly. At 500 generations your 88.5% parse rate means ~57 broken pages. At 10,000 generations that's 1,150 silent failures rendering incorrect P/E ratios or missing dividend data in production.

Your Banned Lexicon approach maps directly to something I've been wrestling with on the content side. We built similar negative constraint lists for financial analysis text — banning phrases like "poised for growth" or "investors should note" that make every stock page read like the same AI wrote it. The gap being "engineerable" through prompt constraints is the right framing. The base model matters less than how tightly you can constrain the failure modes that actually hurt your users.

The cost dimension you mentioned is worth expanding on: for pre-revenue projects where every API call eats into runway, running a local model eliminates the per-token anxiety entirely. The tradeoff is you lose the responseSchema guarantee and have to build your own validation layer — but for batch generation where you can retry failures, that's often worth it. Curious whether you've considered a hybrid approach: local model for draft generation, cloud API for final validation pass on the structured output.
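To make the scaling concrete, the failure numbers above work out like this (assuming retries are independent, which real parse failures often aren't, so treat the retry figures as lower bounds):

```python
# Back-of-envelope expected-failure math for the parse rates discussed
# above. Retry independence is an assumption; correlated failures
# (e.g. one pathological input) will do worse.
def expected_failures(n_generations: int, parse_rate: float, retries: int = 0) -> float:
    """Expected number of generations still broken after `retries` retries."""
    fail_probability = (1 - parse_rate) ** (retries + 1)
    return n_generations * fail_probability

print(round(expected_failures(500, 0.885), 1))        # 57.5 broken pages
print(round(expected_failures(10_000, 0.885), 1))     # 1150.0
print(round(expected_failures(10_000, 0.885, 2), 1))  # 15.2 after two retries
```

Two blind retries already cut the 11.5% failure rate to about 0.15%, which is why a validate-and-retry loop is usually the first thing worth building around an unconstrained local model.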

William Wang

Smart approach evaluating on production constraints instead of benchmarks. The rate limit and cost per token math is what most comparison articles skip — they benchmark quality in isolation but ignore that a 20% better model at 3x the cost is a net negative for most production workloads.