Dumebi Okolo

Posted on Mar 10

Gemini 2.5 Flash vs Claude 3.7 Sonnet: 4 Production Constraints That Made the Decision for Me

#nextjs #javascript #showdev #webdev

An evaluation of the Gemini 2.5 flash and Claude 3.7 Sonnet model for an agentic engine.

I had a simple rule when choosing an LLM for Ozigi: don't pick based on benchmark leaderboards. After my v2 launch, in recieving feedback, a user suggested I use the Claude models as they were better for content generation than Gemini. While the suggestion sounded tempting, I had to pick a model based on the four constraints my production pipeline couldn't negotiate around.

Most "Gemini vs Claude" comparisons evaluate general-purpose capabilities like coding, reasoning, and creative writing. That's useful if you're building a general-purpose product.
I wasn't.
Ozigi is a content engine. You feed it a URL, a PDF, or raw notes. It returns a structured 3-day social media campaign as a JSON payload that the frontend maps directly into UI cards.

That specificity made the evaluation easier than I expected: Two models, Four constraints. One clear winner on three of the constraints.

This is the third post in the Ozigi Changelog Series. If you want the backstory on why Ozigi exists, start with how I vibe-coded the internal tool that became it, and the v2 changelog that introduced the modular architecture this decision was built on.

Here's the full Architecture Decision Record.

The Setup: What the Pipeline Actually Does

The core API route in Ozigi does this:

Accepts a multipart/form-data payload containing a URL, raw text, and/or a file (PDF or image)
Constructs a prompt with strict editorial constraints injected at the system level
Sends everything to the LLM via the Vertex AI Node.js SDK
Returns the raw text response directly to the client

The frontend then does this:

const parsed = JSON.parse(responseText);
setCampaign(parsed.campaign);

No middleware. No schema validation. No error recovery in the happy path. Raw parse, straight into React state.

That single line is why model selection mattered.

Constraint 1: Comparing Gemini vs Claude Models for JSON Output Stability

The requirement: The model must return a valid JSON object — every time, without wrapping it in markdown code fences, without adding a conversational preamble, and without hallucinating a trailing comma that breaks JSON.parse().

The target schema looks like this:

{
  "campaign": [
    { "day": 1, "x": "...", "linkedin": "...", "discord": "..." },
    { "day": 2, "x": "...", "linkedin": "...", "discord": "..." },
    { "day": 3, "x": "...", "linkedin": "...", "discord": "..." }
  ]
}

It renders nine posts across three platforms in a span of three days, with every field required.
The UI renders each field into a separate card with edit, copy, and publish actions. A missing key doesn't throw a visible error — it silently renders an empty card.
This comparison is specifically between Gemini with responseSchema enforcement and Claude with prompted JSON, not between each model's structural output ceiling. Claude's tool use with tool_choice: {type: "tool"} enforces schema at the decoding layer and can reach equivalent reliability. The relevant constraint here was which enforcement mechanism was available and practical within my existing stack. More on that below.
I ran 500 automated test generations against both models targeting this schema, measuring the percentage of responses that JSON.parse() accepted without exceptions.

Model	Format Adherence Rate
Gemini 2.5 Flash	99.9%
Claude 3.7 Sonnet (prompted)	~88.5%

The 11.5% gap maps directly to broken UI states for real users. That was not acceptable to me for a core feature.

Using Gemini's responseSchema closes this entirely. According to Google's controlled generation documentation, the feature physically prevents the model from returning output that doesn't conform to your schema. It's not prompt-level guidance, it's enforced at the decoding layer. Here's what the production implementation looks like for Ozigi: the schema is defined once at the top of the route and attached directly to the model config:

const distributionSchema = {
  type: "OBJECT" as const,
  properties: {
    campaign: {
      type: "ARRAY" as const,
      description: "A list of 3 daily social media posts.",
      items: {
        type: "OBJECT" as const,
        properties: {
          day:      { type: "INTEGER" as const, description: "Day number (1, 2, or 3)" },
          x:        { type: "STRING"  as const, description: "Content for X/Twitter." },
          linkedin: { type: "STRING"  as const, description: "Content for LinkedIn." },
          discord:  { type: "STRING"  as const, description: "Content for Discord." },
        },
        required: ["day", "x", "linkedin", "discord"],
      },
    },
  },
  required: ["campaign"],
};

const model = vertex_ai.getGenerativeModel({
  model: "gemini-2.5-flash",
  generationConfig: {
    responseMimeType: "application/json",
    responseSchema: distributionSchema,
  },
});

response.text() is now structurally guaranteed to be valid JSON. JSON.parse() cannot fail on a missing field, trailing comma, or conversational preamble — the model is physically prevented from producing them.
Claude's tool use and function calling can achieve similar guarantees, but it requires a meaningfully different integration architecture. With the Vertex SDK, this is one config block.

Winner: Gemini.

Constraint 2: Comparing Gemini vs Claude on Latency on a Live Public Sandbox

The requirement: Ozigi has a free, unauthenticated sandbox. Anyone can generate a full 3-day campaign without signing up.

That changes the economics of model selection completely. A paying user on a premium plan will tolerate a 20-second wait if the output quality justifies it. An anonymous user who found the product via my whacky marketing efforts will not. They'll close the tab at 10 seconds and probably not come back, sadly.

I benchmarked both models against a standard 10,000-token input payload via Vercel serverless functions (my production environment):

Model	Avg Response Time
Gemini 2.5 Flash	~6.2s
Claude 3.7 Sonnet	~21.5s

Methodology: N=100 requests per model, measured end-to-end from Vercel function invocation to full response. Results are environment-dependent and intended for directional comparison, not as absolute benchmarks.

The gap holds across payload sizes. Gemini Flash consistently comes in under 10-15 seconds. Claude 3.7 Sonnet consistently exceeds 20 seconds on the same inputs, in the same environment.

This gap would narrow significantly with streaming: getting first tokens in front of the user within 2-3 seconds. Streaming changes the perceived wait time for a user entirely. This is, however, a v4 architecture item that is being worked on. For a non-streaming pipeline with a public sandbox, the 3.5x latency difference is a product decision, not just an engineering one.

Winner: Gemini Flash — and it's not close for non-streaming public sandboxes.

Constraint 3: Comparing Gemini vs Claude on Native Multimodal Ingestion

The requirement: Users can upload PDFs and images directly as context. The pipeline needs to process them without an external preprocessing step.

With Gemini via the Vertex AI Node.js SDK, the entire PDF pipeline is:

// /app/api/generate/route.ts
if (file && file.size > 0) {
  const arrayBuffer = await file.arrayBuffer();
  const base64Data = Buffer.from(arrayBuffer).toString("base64");

  parts.push({
    inlineData: {
      data: base64Data,
      mimeType: file.type, // "application/pdf", "image/jpeg", etc.
    },
  });
}

const result = await model.generateContent({
  contents: [{ role: "user", parts: parts }],
});

You can see that the SDK handles the buffer natively. Gemini reads the PDF directly as part of the multipart request alongside the text prompt — no OCR step, no preprocessing, no separate service call. Google's multimodal documentation confirms that Gemini was designed from the ground up to handle PDF and image buffers natively via inlineData.

An earlier version of this article claimed that Claude required an external OCR step for PDF ingestion. That was wrong. Claude's Messages API does support native base64 PDF ingestion directly via a document content block — no OCR preprocessing, no external service. The pattern is structurally similar to Vertex AI's inlineData, just different field names.
The real constraint here was ecosystem, not capability. I evaluated Claude 3.7 Sonnet as available in the Google Model Garden within my existing Vertex AI setup. Switching to Claude's native PDF ingestion would have meant moving to the Anthropic Messages API entirely — a different provider, different SDK, different billing. The Vertex AI path was simpler for the stack I was already running.
Winner: Gemini — for this stack. Both models support native multimodal ingestion without external OCR. The advantage here was ecosystem fit, not a fundamental capability difference.

Constraint 4: Comparing Google Gemini vs Claude on Tone Engineering

The requirement: Generated social media posts must sound like a human wrote them. Specifically, they must pass AI content detection and avoid the predictable cadence patterns that make AI-generated copy immediately identifiable.

This is the constraint where Claude wins cleanly on base performance.
Our internal blind A/B evaluations of 50 technical posts (scored on pragmatic sentence structure and absence of AI terminology) gave Claude 3.7 Sonnet a "human cadence quality score" of 9.5/10. Gemini Flash's base score was 5.5/10.

That's a significant gap. And it's for the feature that is Ozigi's core value proposition.

Why use Gemini for Tone Engineering?

Because the gap is engineerable.

We built the Banned Lexicon — a programmatic constraint injected at the system prompt level that explicitly penalizes the vocabulary patterns that make AI copy detectable. You can read the full implementation in the Ozigi documentation:

THE BANNED LEXICON: You are strictly forbidden from using the 
following words or their variations: delve, testament, tapestry, 
crucial, vital, landscape, realm, unlock, supercharge, revolutionize, 
paradigm, seamlessly, navigate, robust, cutting-edge, game-changer.

Combined with explicit cadence engineering:

BURSTINESS (CADENCE): Write with high burstiness. Do not use 
perfectly balanced, medium-length sentences. Mix extremely short, 
punchy sentences (2-4 words) with longer, detailed explanations.

PERPLEXITY: Avoid predictable adjectives. Use strong, active verbs 
and concrete nouns. Talk like a pragmatic subject matter expert 
explaining a concept to people, not a marketer selling a product.

FORMATTING RESTRAINT: You are limited to a MAXIMUM of 1 emoji per 
post. Use a maximum of 2 highly relevant hashtags per post.

With these constraints active, Gemini's human cadence score jumps from 5.5 to 9.2 — within acceptable range of Claude's base 9.5.

The key insight: Claude's tone advantage is a default advantage, not an absolute one. Gemini's outputs are more malleable under prompt constraints. For a use case where tone control is the entire product, that malleability is worth more than a higher baseline.

Winner: Gemini + engineering constraints. The tone gap is closeable. The latency and JSON stability gaps on the other constraints are not.

Gemini vs Claude Models: The Cost Reality

At this stage where Ozigi is a public sandbox, every anonymous page load that can trigger a generation is a billable API call absorbed by the product. Ozigi is at its pre-revenue stage, so this matters a lot.

Model	Input Cost (per 1M tokens)	Output Cost (per 1M tokens)
Gemini 2.5 Flash	~$0.075	~$0.30
Claude 3.7 Sonnet	~$3.00	~$15.00

Pricing sourced from Google Cloud Vertex AI pricing and Anthropic API pricing.

Pro tip:Verify current rates before production decisions — both have changed multiple times in the past year.

The input cost difference is 40x. The output cost difference is 50x. For a free-tier product with no revenue, the ability to run a public sandbox sustainably is the difference between having a conversion funnel and not having one.

Where Ozigi is Going and How it'd Change My Choice of Model, Moving Foward

This is an honest ADR. Here's what would change my answer.

When Ozigi finally moves behind a paywall, latency and cost become secondary concerns. A signed-in user on a paid plan is more likely waiting 20 seconds for premium output is a different UX calculation than an anonymous user on a free demo. In that context, Claude's base tone quality becomes much more compelling. I'd be trading economics for output baseline, and the trade might be worth it.

When streaming gets implemented, the latency argument against Claude weakens significantly. Claude 3.7 Sonnet's time-to-first-token via streaming is competitive. A user seeing the first post appear in 2-3 seconds experiences the product very differently than a user staring at a progress bar for 21 seconds. Streaming is on the roadmap.

For an in-depth look at how we tested the pipeline that informs these decisions, see how we E2E test AI agents with Playwright in Next.js.

The Decision Matrix

Constraint	Gemini 2.5 Flash	Claude 3.7 Sonnet	Winner
JSON Stability (responseSchema)	99.9% → guaranteed	~88.5% (prompted)	Gemini
Latency (non-streaming)	~6.2s	~21.5s	Gemini
Native PDF/Image ingestion	Native via Vertex SDK	Native via Messages API	Gemini (Eco-system fit)
Base tone quality	5.5/10	9.5/10	Claude
Tone quality (+ constraints)	9.2/10	9.5/10	Near tie
Cost per 1M input tokens	$0.075	$3.00	Gemini

Gemini won on five of six dimensions. Claude won on one — base tone — and that gap was closeable through prompt engineering.

Four Questions To Ask Before Choosing An LLM Model For Your Agentic Project/APP

If you're building something similar to Ozigi, these are the constraints worth looking through before you pick an API and start building:

**1. Does your UI depend on structured output? If your frontend calls JSON.parse() on a raw model response, you need API-level schema enforcement — not prompt instructions asking nicely. responseSchema via Vertex AI, Claude's tool use with forced tool_choice, or structured outputs via OpenAI all enforce at the decoding layer. The question isn't which model supports it — most do — it's which enforcement path fits your existing stack.

2. Do you have a free tier or public sandbox? If yes, latency and cost are product decisions that affect conversion, not just infrastructure decisions that affect margins.

**3. Does your use case require multimodal inputs? Most major models now support native PDF and image ingestion without external preprocessing. Map out what the integration looks like within your existing API provider before assuming you need to switch or add infrastructure

4. Where is the base model weakest, and is that gap engineerable? Claude's tone advantage is real. It's also not the only path to human-sounding copy. Engineering constraints at the prompt level can close gaps that feel insurmountable when you're just looking at base benchmarks.

The best model for your product is rarely the one with the highest aggregate score. It's the one that fails least on the constraints you actually can't work around.

The full Ozigi architecture — including the generate API route, the Banned Lexicon implementation, and the Vertex AI configuration — is open source on GitHub.
The live context engine is at ozigi.app.
The interactive version of this ADR with Chart.js visualisations of each benchmark.
Ozigi is currently looking for User Experience Testers to give honest Feedback on their experience using the product, and areas for improvement.
We have some open issues on Github that is welcome to contribution from the community. ps, this app has been entirely vibe coded so far, therefore we welcome vibe coded contributions too!
Connect With Me On LinkedIn
Send me an email on okolodumebi@gmail.com.
Building osmething cool? Talk about it in the comments!

Top comments (23)

Anmol Baranwal • Mar 11

I was actually writing a blog post several months ago on which model is best for which use case -- where I studied 20+ models on 15+ parameters & use cases so I know how hard it is to decide

real results vs benchmarks gap is always there in my experience. still think claude sounds the most human but the Banned Lexicon idea is pretty smart.

one more thing: you can use "help wanted" labels on your issues (just noticed v4-priority). when developers search for open source projects to contribute to, it's a default filter!

let me go through this over the weekend & I will message you.

Dumebi Okolo • Mar 11

I plan to eventually migrate the engine to Claude.
But, so far, Gemini gives me what I want.
As for the Github issues, I guess I was a bit shy. 😅
I wasn't sure if people would be interested in contributing or the project entirely, so I set the issues tab as a sort of Trello board to track progress and updates.

Dumebi Okolo • Mar 11

Thanks, Anmol!

Kai Alder • Mar 10

Really solid ADR writeup. The Banned Lexicon approach is clever — I've been doing something similar with negative constraints in my own prompts and it works way better than just asking the model to "sound natural."

One thing I'm curious about: have you noticed the JSON stability gap changing with newer Claude versions? I ran into the markdown code fence issue constantly with 3.5 Sonnet but it got noticeably better after they shipped tool_use improvements. Still not at Gemini's responseSchema level though, that's basically cheating in the best way possible.

The 40x cost difference is wild. At pre-revenue that basically makes the decision for you regardless of everything else.

Dumebi Okolo • Mar 10

Yes, the cost is the most important thing I looked out for.
With two weeks of constantly running the engine, I have burnt through my free Google API credits.

Tommy Leonhardsen • Mar 11

If cost is the primary driver, skip the commercial APIs entirely. Qwen3.5 via Ollama Cloud is $20/month flat-rate unlimited. For a structured JSON generation task like yours it's more than capable, and you'll never burn through free credits again because there are none — just a fixed monthly ceiling.

Dumebi Okolo • Mar 11

What of the quality and human cadence score of the content it produces? That's something to consider.
I used Qwen a lot last year, but for some reasons, it never just cut it for me.

Plus, it's actually easier to work with $1300 in free credit, than to pay $20 in credit seeing as we are pre revenue. 😅

Tommy Leonhardsen • Mar 11

Latest Qwen models are very good. So is a lot of the models available in Ollama Cloud.

If you cannot afford 20usd/month your company is in dire economic straits.

Anyways, best of luck to you going forwards.
I am fairly certain the world does not really NEED more AI generated ads, but you do you!

Dumebi Okolo • Mar 11 • Edited

Thank you very much for your good wishes.

I just launched this product, and I am not making any revenue from it yet. Every investment I have made has been out-of-pocket. [If you are willing to angel-invest in us, I will be very grateful for this gesture.]
I am not required to use Qwen models or models available in the Ollama cloud. Gemini works very well for me at this time, and I will be moving over to Claude in due time.
The solution Ozigi provides isn't for the world to be filled with more AI-generated ads as you call them. It's to solve the blank page syndrome writers like me face when coming up with social media content copy. If you had used the app, you'd notice that there is an edit button. This signals to people to not trust 100% what the AI generates but to put in your own voice or opinion. The engine simply gives you a place to start from.

I appreciate your concern on safeguarding the internet from AI content, as I am also passionate about this. We are all just trying to build solutions that help people in their workflows.

Tommy Leonhardsen • Mar 11

The JSON stability comparison isn't really Gemini vs Claude — it's Gemini's responseSchema vs prompted Claude. That's not a fair test.

Claude's tool use and structured outputs enforce schema at the decoding layer exactly like responseSchema does. A simple tool definition with your campaign schema and tool_choice: {type: "tool", name: "generate_campaign"} gets you to 100% parse reliability, same guarantee, no middleware.

The PDF claim also doesn't hold up. Claude has supported native base64 PDF ingestion directly in the Messages API for a long time — no OCR step, no external service. Same inlineData pattern, different field names.

Latency and cost comparisons are legitimate and Gemini Flash wins those on current pricing. But two of your four constraints were implementation gaps, not model limitations.

Also worth noting: Claude 3.7 Sonnet is over a year old. Claude Sonnet 4.6 and Haiku 4.5 are both faster, cheaper and better than any 3.7 model was. The latency and cost numbers in this article were already outdated when it was published.

Dumebi Okolo • Mar 11

You're right on both counts, and I will be honest about it.

On JSON stability: the framing was I'm precise and a bit clickbaity.
The comparison was
Gemini with responseSchema vs Claude with prompted JSON — not
Gemini vs Claude's structured output ceiling. Claude's tool use
with tool_choice enforcement gets you to the same structural
guarantee at the decoding layer. The article notes this but buries
it in a footnote when it should have been the headline caveat.
I take responsibility for that.

On PDF ingestion: you're correct, and I got this entirely wrong. Claude's
Messages API does support native base64 PDF ingestion via the
document content block — no OCR preprocessing required. The
"5 steps, 2 additional failure points" claim was based on a
misread of the integration path and shouldn't have made it into
the article as written. I'll update that section.

On model choice: I was already inside the Vertex AI ecosystem and
evaluated models available in the Google Model Garden. Claude 3.7
Sonnet was the version accessible there at the time. But I agree the
article should have stated that more explicitly rather than framing
it as a general Gemini vs Claude evaluation.

The latency and cost numbers are the ones I'm most confident in
because they were measured in my actual production environment
(Vercel serverless, non-streaming, same input payload). Those
comparisons hold.

Thanks for pushing back though. This is the kind of correction that
makes the article worth more than it was when I published it.
I will be revising the article.

Dumebi Okolo • Mar 11

Claude has greatly, already improved in JSON stability, but I missed this.

Dumebi Okolo • Mar 11

Awesome. 👌

Swift • Mar 10

This is a super interesting writeup. Mirrors some of my own anecdotal experience as well. Thanks for sharing!

Dumebi Okolo • Mar 10

Thank you so much!
Glad you enjoyed reading it.

Apex Stack • Mar 16

The JSON stability constraint hits different when your schema isn't just 3 objects — it's 8,000+ structured financial profiles across 12 languages. I run a content generation pipeline with Llama 3 locally for a programmatic SEO site, and the structured output problem scales non-linearly. At 500 generations your 88.5% parse rate means ~57 broken pages. At 10,000 generations that's 1,150 silent failures rendering incorrect P/E ratios or missing dividend data in production.

Your Banned Lexicon approach maps directly to something I've been wrestling with on the content side. We built similar negative constraint lists for financial analysis text — banning phrases like "poised for growth" or "investors should note" that make every stock page read like the same AI wrote it. The gap being "engineerable" through prompt constraints is the right framing. The base model matters less than how tightly you can constrain the failure modes that actually hurt your users.

The cost dimension you mentioned is worth expanding on: for pre-revenue projects where every API call eats into runway, running a local model eliminates the per-token anxiety entirely. The tradeoff is you lose the responseSchema guarantee and have to build your own validation layer — but for batch generation where you can retry failures, that's often worth it. Curious whether you've considered a hybrid approach: local model for draft generation, cloud API for final validation pass on the structured output.

William Wang • Mar 16

Smart approach evaluating on production constraints instead of benchmarks. The rate limit and cost per token math is what most comparison articles skip — they benchmark quality in isolation but ignore that a 20% better model at 3x the cost is a net negative for most production workloads.

View full discussion (23 comments)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.