Alan West

Posted on May 20

Gemini 3.5 Flash vs Claude Haiku vs GPT-4o mini: Picking a Small Model

#python #programming #webdev #ai

The Gemini 3.5 Flash announcement made the rounds on Hacker News this week, and I've been getting pinged by teammates asking whether to migrate our internal tools off Claude Haiku or GPT-4o mini. After spending a weekend running informal evals on classification and code-routing tasks, I have some takes.

Upfront caveat: I'm hedging on specific Gemini 3.5 Flash feature claims because marketing benchmarks and actual API behavior rarely match. Check the official Gemini docs before you bet your architecture on any number — including the ones I quote.

Why migrate small models at all?

Most teams I work with use small, fast models for the boring stuff:

Classification (is this ticket billing, bug, or feature-request?)
Summarization (TL;DR a long Slack thread)
Code routing in agent setups
Lightweight extraction (pull dates and amounts out of an invoice)

You don't need a frontier model for any of these. You need something cheap, fast, and consistent. Frontier models are actually worse for tool routing in my experience — they overthink and burn latency on tasks that should be reflexive.

So when a new small model lands, it's worth 30 minutes of evaluation. Not a migration. Just a check.

The contenders

Three models I'd consider for a new project today:

Gemini 3.5 Flash — Google's latest fast model. Reportedly faster and cheaper than the 2.5 generation, with a very long context window. I'll caveat the specifics below.
Claude Haiku 4.5 — Anthropic's small model. I've used Haiku in production for two years; it's my default for anything that needs structured outputs or tool calling.
GPT-4o mini — OpenAI's small model. Still solid, still has the deepest ecosystem support.

I'm leaving out Llama and Mistral on purpose. Self-hosted open models deserve their own post.

Side-by-side: calling the APIs

Here's what each looks like in practice. All three have official SDKs, and the shape is similar but not identical.

Gemini 3.5 Flash

# pip install google-generativeai
import google.generativeai as genai

genai.configure(api_key="YOUR_KEY")

# verify the exact model ID in the official docs before shipping
model = genai.GenerativeModel("gemini-3.5-flash")

response = model.generate_content(
    "Classify this ticket: 'Cannot log in after password reset.'",
    generation_config={"temperature": 0.2}  # low temp for classification
)
print(response.text)

Claude Haiku 4.5

# pip install anthropic
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from env

msg = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=200,
    messages=[{
        "role": "user",
        "content": "Classify this ticket: 'Cannot log in after password reset.'"
    }]
)
print(msg.content[0].text)

GPT-4o mini

# pip install openai
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Classify this ticket: 'Cannot log in after password reset.'"
    }],
    temperature=0.2
)
print(resp.choices[0].message.content)

Three different shapes for the same operation. If you've been around long enough you know the drill — wrap it in your own interface and stop caring about the cosmetic differences.

Migrating: a thin adapter

I migrated one of our pipelines from GPT-4o mini to Gemini Flash last quarter (the 2.5 generation, not the new one). The pattern that saved me:

# adapter.py — keeps business logic provider-agnostic
class LLMClient:
    def __init__(self, provider: str):
        self.provider = provider
        if provider == "gemini":
            import google.generativeai as genai
            self.model = genai.GenerativeModel("gemini-3.5-flash")
        elif provider == "claude":
            from anthropic import Anthropic
            self.client = Anthropic()
        elif provider == "openai":
            from openai import OpenAI
            self.client = OpenAI()
        else:
            raise ValueError(f"Unknown provider: {provider}")

    def classify(self, prompt: str) -> str:
        # one method, three implementations — callers stay clean
        if self.provider == "gemini":
            return self.model.generate_content(prompt).text
        if self.provider == "claude":
            msg = self.client.messages.create(
                model="claude-haiku-4-5",
                max_tokens=200,
                messages=[{"role": "user", "content": prompt}]
            )
            return msg.content[0].text
        # openai
        resp = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content

Boring? Yes. But this adapter saved me a week when we needed to A/B test the new Gemini model. Flip a config flag, replay the same prompts, compare outputs. Zero business logic changed.

Tradeoffs I'm willing to commit to

Things I feel confident about from actually running these in production:

Claude Haiku is the most predictable for structured outputs and tool calling. If your downstream code parses JSON, start here.
GPT-4o mini has the deepest ecosystem — every framework, every tutorial, every Stack Overflow answer. If your team is junior, the docs alone are worth it.
Gemini Flash has historically had the largest context window of the three. Useful when you need to dump entire codebases or long docs into a prompt.

Things I'm explicitly hedging on for 3.5 Flash:

The benchmark numbers in Google's announcement look great. Benchmarks always look great. Run your own evals.
I haven't tested its tool-calling reliability against Haiku yet. That's next month's project.
Pricing on the official page is the only number I trust. Check it before architecting around cost assumptions.

Which one should you pick?

Honest answer: it depends on what you're already running.

Already on Haiku and it works? Don't migrate. The savings won't pay for the engineering time.
Building something new with massive context needs? Start with Gemini 3.5 Flash. Long-context is its lane.
Need bulletproof tool calling for agents? Claude Haiku, every time.
Just shipping a side project and want zero friction? GPT-4o mini.

The dirty secret of LLM migrations is that model choice matters way less than your prompts and your eval suite. Spend the time on evals first. Then pick whichever model wins your benchmark — not whichever one trended on Hacker News last week.

DEV Community