An evaluation of the Gemini 2.5 flash and Claude 3.7 Sonnet model for an agentic engine.
I had a simple rule when choosing an LLM for Ozigi: don't pick based on benchmark leaderboards. After my v2 launch, in recieving feedback, a user suggested I use the Claude models as they were better for content generation than Gemini. While the suggestion sounded tempting, I had to pick a model based on the four constraints my production pipeline couldn't negotiate around.
Most "Gemini vs Claude" comparisons evaluate general-purpose capabilities like coding, reasoning, and creative writing. That's useful if you're building a general-purpose product.
I wasn't.
Ozigi is a content engine. You feed it a URL, a PDF, or raw notes. It returns a structured 3-day social media campaign as a JSON payload that the frontend maps directly into UI cards.
That specificity made the evaluation easier than I expected: Two models, Four constraints. One clear winner on three of the constraints.
This is the third post in the Ozigi Changelog Series. If you want the backstory on why Ozigi exists, start with how I vibe-coded the internal tool that became it, and the v2 changelog that introduced the modular architecture this decision was built on.
Here's the full Architecture Decision Record.
The Setup: What the Pipeline Actually Does
The core API route in Ozigi does this:
- Accepts a
multipart/form-datapayload containing a URL, raw text, and/or a file (PDF or image) - Constructs a prompt with strict editorial constraints injected at the system level
- Sends everything to the LLM via the Vertex AI Node.js SDK
- Returns the raw text response directly to the client
The frontend then does this:
const parsed = JSON.parse(responseText);
setCampaign(parsed.campaign);
No middleware. No schema validation. No error recovery in the happy path. Raw parse, straight into React state.
That single line is why model selection mattered.
Constraint 1: Comparing Gemini vs Claude Models for JSON Output Stability
The requirement: The model must return a valid JSON object — every time, without wrapping it in markdown code fences, without adding a conversational preamble, and without hallucinating a trailing comma that breaks JSON.parse().
The target schema looks like this:
{
"campaign": [
{ "day": 1, "x": "...", "linkedin": "...", "discord": "..." },
{ "day": 2, "x": "...", "linkedin": "...", "discord": "..." },
{ "day": 3, "x": "...", "linkedin": "...", "discord": "..." }
]
}
It renders nine posts across three platforms in a span of three days, with every field required.
The UI renders each field into a separate card with edit, copy, and publish actions. A missing key doesn't throw a visible error — it silently renders an empty card.
I ran 500 automated test generations against both models targeting this schema, measuring the percentage of responses that JSON.parse() accepted without exceptions.
| Model | Format Adherence Rate |
|---|---|
| Gemini 2.5 Flash | 99.9% |
| Claude 3.7 Sonnet (prompted) | ~88.5% |
The 11.5% gap maps directly to broken UI states for real users. That was not acceptable to me for a core feature.
Using Gemini's responseSchema closes this entirely. According to Google's controlled generation documentation, the feature physically prevents the model from returning output that doesn't conform to your schema. It's not prompt-level guidance, it's enforced at the decoding layer. Here's what the production implementation looks like for Ozigi: the schema is defined once at the top of the route and attached directly to the model config:
const distributionSchema = {
type: "OBJECT" as const,
properties: {
campaign: {
type: "ARRAY" as const,
description: "A list of 3 daily social media posts.",
items: {
type: "OBJECT" as const,
properties: {
day: { type: "INTEGER" as const, description: "Day number (1, 2, or 3)" },
x: { type: "STRING" as const, description: "Content for X/Twitter." },
linkedin: { type: "STRING" as const, description: "Content for LinkedIn." },
discord: { type: "STRING" as const, description: "Content for Discord." },
},
required: ["day", "x", "linkedin", "discord"],
},
},
},
required: ["campaign"],
};
const model = vertex_ai.getGenerativeModel({
model: "gemini-2.5-flash",
generationConfig: {
responseMimeType: "application/json",
responseSchema: distributionSchema,
},
});
response.text() is now structurally guaranteed to be valid JSON. JSON.parse() cannot fail on a missing field, trailing comma, or conversational preamble — the model is physically prevented from producing them.
Claude's tool use and function calling can achieve similar guarantees, but it requires a meaningfully different integration architecture. With the Vertex SDK, this is one config block.
Winner: Gemini.
Constraint 2: Comparing Gemini vs Claude on Latency on a Live Public Sandbox
The requirement: Ozigi has a free, unauthenticated sandbox. Anyone can generate a full 3-day campaign without signing up.
That changes the economics of model selection completely. A paying user on a premium plan will tolerate a 20-second wait if the output quality justifies it. An anonymous user who found the product via my whacky marketing efforts will not. They'll close the tab at 10 seconds and probably not come back, sadly.
I benchmarked both models against a standard 10,000-token input payload via Vercel serverless functions (my production environment):
| Model | Avg Response Time |
|---|---|
| Gemini 2.5 Flash | ~6.2s |
| Claude 3.7 Sonnet | ~21.5s |
Methodology: N=100 requests per model, measured end-to-end from Vercel function invocation to full response. Results are environment-dependent and intended for directional comparison, not as absolute benchmarks.
The gap holds across payload sizes. Gemini Flash consistently comes in under 10-15 seconds. Claude 3.7 Sonnet consistently exceeds 20 seconds on the same inputs, in the same environment.
This gap would narrow significantly with streaming: getting first tokens in front of the user within 2-3 seconds. Streaming changes the perceived wait time for a user entirely. This is, however, a v4 architecture item that is being worked on. For a non-streaming pipeline with a public sandbox, the 3.5x latency difference is a product decision, not just an engineering one.
Winner: Gemini Flash — and it's not close for non-streaming public sandboxes.
Constraint 3: Comparing Gemini vs Claude on Native Multimodal Ingestion
The requirement: Users can upload PDFs and images directly as context. The pipeline needs to process them without an external preprocessing step.
With Gemini via the Vertex AI Node.js SDK, the entire PDF pipeline is:
// /app/api/generate/route.ts
if (file && file.size > 0) {
const arrayBuffer = await file.arrayBuffer();
const base64Data = Buffer.from(arrayBuffer).toString("base64");
parts.push({
inlineData: {
data: base64Data,
mimeType: file.type, // "application/pdf", "image/jpeg", etc.
},
});
}
const result = await model.generateContent({
contents: [{ role: "user", parts: parts }],
});
You can see that the SDK handles the buffer natively. Gemini reads the PDF directly as part of the multipart request alongside the text prompt — no OCR step, no preprocessing, no separate service call. Google's multimodal documentation confirms that Gemini was designed from the ground up to handle PDF and image buffers natively via inlineData.
The alternative with Claude's standard API would require extracting text from the PDF before passing it to the model, typically via a dedicated OCR service like AWS Textract or an open-source library. That introduces an additional API dependency, an additional failure point, and added latency on every file upload request. For a solo-built product running on Vercel serverless functions like Ozigi, that infrastructure complexity wasn't justified.
Claude does support document uploads via the Messages API with base64 encoding for PDFs, but the Vertex AI SDK's native inlineData handling made Gemini the simpler path for the existing stack.
Winner: Gemini — native multimodal support eliminated an entire infrastructure layer.
Constraint 4: Comparing Google Gemini vs Claude on Tone Engineering
The requirement: Generated social media posts must sound like a human wrote them. Specifically, they must pass AI content detection and avoid the predictable cadence patterns that make AI-generated copy immediately identifiable.
This is the constraint where Claude wins cleanly on base performance.
Our internal blind A/B evaluations of 50 technical posts (scored on pragmatic sentence structure and absence of AI terminology) gave Claude 3.7 Sonnet a "human cadence quality score" of 9.5/10. Gemini Flash's base score was 5.5/10.
That's a significant gap. And it's for the feature that is Ozigi's core value proposition.
Why use Gemini for Tone Engineering?
Because the gap is engineerable.
We built the Banned Lexicon — a programmatic constraint injected at the system prompt level that explicitly penalizes the vocabulary patterns that make AI copy detectable. You can read the full implementation in the Ozigi documentation:
THE BANNED LEXICON: You are strictly forbidden from using the
following words or their variations: delve, testament, tapestry,
crucial, vital, landscape, realm, unlock, supercharge, revolutionize,
paradigm, seamlessly, navigate, robust, cutting-edge, game-changer.
Combined with explicit cadence engineering:
BURSTINESS (CADENCE): Write with high burstiness. Do not use
perfectly balanced, medium-length sentences. Mix extremely short,
punchy sentences (2-4 words) with longer, detailed explanations.
PERPLEXITY: Avoid predictable adjectives. Use strong, active verbs
and concrete nouns. Talk like a pragmatic subject matter expert
explaining a concept to people, not a marketer selling a product.
FORMATTING RESTRAINT: You are limited to a MAXIMUM of 1 emoji per
post. Use a maximum of 2 highly relevant hashtags per post.
With these constraints active, Gemini's human cadence score jumps from 5.5 to 9.2 — within acceptable range of Claude's base 9.5.
The key insight: Claude's tone advantage is a default advantage, not an absolute one. Gemini's outputs are more malleable under prompt constraints. For a use case where tone control is the entire product, that malleability is worth more than a higher baseline.
Winner: Gemini + engineering constraints. The tone gap is closeable. The latency and JSON stability gaps on the other constraints are not.
Gemini vs Claude Models: The Cost Reality
At this stage where Ozigi is a public sandbox, every anonymous page load that can trigger a generation is a billable API call absorbed by the product. Ozigi is at its pre-revenue stage, so this matters a lot.
| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) |
|---|---|---|
| Gemini 2.5 Flash | ~$0.075 | ~$0.30 |
| Claude 3.7 Sonnet | ~$3.00 | ~$15.00 |
Pricing sourced from Google Cloud Vertex AI pricing and Anthropic API pricing.
Pro tip:Verify current rates before production decisions — both have changed multiple times in the past year.
The input cost difference is 40x. The output cost difference is 50x. For a free-tier product with no revenue, the ability to run a public sandbox sustainably is the difference between having a conversion funnel and not having one.
Where Ozigi is Going and How it'd Change My Choice of Model, Moving Foward
This is an honest ADR. Here's what would change my answer.
When Ozigi finally moves behind a paywall, latency and cost become secondary concerns. A signed-in user on a paid plan is more likely waiting 20 seconds for premium output is a different UX calculation than an anonymous user on a free demo. In that context, Claude's base tone quality becomes much more compelling. I'd be trading economics for output baseline, and the trade might be worth it.
When streaming gets implemented, the latency argument against Claude weakens significantly. Claude 3.7 Sonnet's time-to-first-token via streaming is competitive. A user seeing the first post appear in 2-3 seconds experiences the product very differently than a user staring at a progress bar for 21 seconds. Streaming is on the roadmap.
For an in-depth look at how we tested the pipeline that informs these decisions, see how we E2E test AI agents with Playwright in Next.js.
The Decision Matrix
| Constraint | Gemini 2.5 Flash | Claude 3.7 Sonnet | Winner |
|---|---|---|---|
| JSON Stability (responseSchema) | 99.9% → guaranteed | ~88.5% (prompted) | Gemini |
| Latency (non-streaming) | ~6.2s | ~21.5s | Gemini |
| Native PDF/Image ingestion | Native via Vertex SDK | Requires preprocessing | Gemini |
| Base tone quality | 5.5/10 | 9.5/10 | Claude |
| Tone quality (+ constraints) | 9.2/10 | 9.5/10 | Near tie |
| Cost per 1M input tokens | $0.075 | $3.00 | Gemini |
Gemini won on five of six dimensions. Claude won on one — base tone — and that gap was closeable through prompt engineering.
Four Questions To Ask Before Choosing An LLM Model For Your Agentic Project/APP
If you're building something similar to Ozigi, these are the constraints worth looking through before you pick an API and start building:
1. Does your UI depend on structured output? If your frontend calls JSON.parse() on a raw model response, you need API-level schema enforcement — not prompt instructions asking nicely. responseSchema via Vertex AI or structured outputs via OpenAI are the right tools. Prompt-instructed JSON is a liability.
2. Do you have a free tier or public sandbox? If yes, latency and cost are product decisions that affect conversion, not just infrastructure decisions that affect margins.
3. Does your use case require multimodal inputs? If yes, map out the full ingestion pipeline for each model before evaluating output quality. A simpler pipeline is a more reliable pipeline.
4. Where is the base model weakest, and is that gap engineerable? Claude's tone advantage is real. It's also not the only path to human-sounding copy. Engineering constraints at the prompt level can close gaps that feel insurmountable when you're just looking at base benchmarks.
The best model for your product is rarely the one with the highest aggregate score. It's the one that fails least on the constraints you actually can't work around.
- The full Ozigi architecture — including the generate API route, the Banned Lexicon implementation, and the Vertex AI configuration — is open source on GitHub.
- The live context engine is at ozigi.app.
- The interactive version of this ADR with Chart.js visualisations of each benchmark.
- Ozigi is currently looking for User Experience Testers to give honest Feedback on their experience using the product, and areas for improvement.
- We have some open issues on Github that is welcome to contribution from the community. ps, this app has been entirely vibe coded so far, therefore we welcome vibe coded contributions too!
- Connect With Me On LinkedIn
- Send me an email on okolodumebi@gmail.com.
- Building osmething cool? Talk about it in the comments!





Top comments (2)
Really solid ADR writeup. The Banned Lexicon approach is clever — I've been doing something similar with negative constraints in my own prompts and it works way better than just asking the model to "sound natural."
One thing I'm curious about: have you noticed the JSON stability gap changing with newer Claude versions? I ran into the markdown code fence issue constantly with 3.5 Sonnet but it got noticeably better after they shipped tool_use improvements. Still not at Gemini's responseSchema level though, that's basically cheating in the best way possible.
The 40x cost difference is wild. At pre-revenue that basically makes the decision for you regardless of everything else.
This is a super interesting writeup. Mirrors some of my own anecdotal experience as well. Thanks for sharing!