
Posted on Jun 13

Mistral Large or Mistral Medium? I Ran Both for a Month to Find Out

#ai #programming #api #webdev


Last March, I picked up a side gig that nearly broke me. A startup founder in Berlin needed a customer support chatbot pulling double duty: triaging tickets AND generating draft responses for his human agents. The catch? He was burning €800/month on OpenAI and wanted me to cut that in half without making his support quality tank.

That project sent me down a rabbit hole comparing Mistral Large and Mistral Medium on actual billable client work. I spent roughly 18 hours over four weeks running both models on real production traffic. Here's everything I learned, including the numbers that mattered.

Why This Comparison Even Matters

Here's the thing about being a solo dev: every API call comes straight out of my margin. When I'm building for a client, I'm not just optimizing for their cost — I'm optimizing for my time-to-delivery and my reputation. Get the model choice wrong and I'm either hemorrhaging their budget or shipping garbage responses that get me fired.

The Mistral family caught my attention because Mistral's pricing structure sits in a weird sweet spot. Not as cheap as the DeepSeek options, not as expensive as the top-tier GPT models. For a freelancer trying to balance quality and cost, it's the kind of decision that keeps you up at night.

So I committed. I would run both models side by side, on the same prompts, against the same evaluation rubric, for an entire month. No shortcuts, no cherry-picked examples.

The Pricing Reality Check

Before I show you my findings, let's look at the actual numbers I was working with. Here's the landscape of models I considered for this client's project through Global API:

Model	Input ($/M)	Output ($/M)	Context
DeepSeek V4 Flash	0.27	1.10	128K
DeepSeek V4 Pro	0.55	2.20	200K
Qwen3-32B	0.30	1.20	32K
GLM-4 Plus	0.20	0.80	128K
GPT-4o	2.50	10.00	128K

See that GPT-4o row? That's what my client was paying before I came along. At $10.00 per million output tokens, a busy support bot can rack up serious bills fast. His previous developer had him convinced he needed the "best" model, and honestly, for some workloads, GPT-4o earns that premium. But not for triaging support tickets.
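To make the gap concrete, here is a minimal sketch that turns the per-million-token prices in the table into a monthly bill. The prices come from the table above; the token volumes (50M input, 20M output) are made up for illustration and are not my client's real traffic.

```python
# Prices in USD per million tokens (input, output), copied from the table above.
PRICES = {
    "DeepSeek V4 Flash": (0.27, 1.10),
    "DeepSeek V4 Pro": (0.55, 2.20),
    "Qwen3-32B": (0.30, 1.20),
    "GLM-4 Plus": (0.20, 0.80),
    "GPT-4o": (2.50, 10.00),
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a month's traffic for one model."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical month: 50M input tokens and 20M output tokens.
for name in PRICES:
    print(f"{name}: ${monthly_cost(name, 50_000_000, 20_000_000):.2f}")
```

Plug in your own token counts before drawing conclusions; output-heavy workloads like draft generation are where the output price dominates.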

My First Mistake (And What It Taught Me)

I'll be honest: my first instinct was to just swap in the cheapest model and call it a day. That was a mistake. I ran a week's worth of his historical tickets through DeepSeek V4 Flash ($0.27 input, $1.10 output per million tokens), and the results were... adequate. For a chatbot that just needed to categorize "billing question" vs "technical issue" vs "feature request," the cheap option worked fine.

But the response generation piece? That's where things got ugly. The Flash model kept producing terse, robotic replies that my client's agents had to rewrite from scratch. They were spending MORE time editing the AI output than they had been writing responses from nothing. Defeats the entire purpose.

This is the trap I think a lot of junior devs fall into (myself included, once). Cheap input costs don't matter if the output quality forces human rework. Every minute your client spends fixing bad AI output is a minute you're not delivering value.

Where Mistral Medium Earned Its Keep

After the Flash experiment failed, I pivoted to Mistral Medium. Its pricing on Global API was competitive, sitting in a comfortable middle ground between the DeepSeek options and the top-tier GPT models, and the context window was generous enough for the support use case.

The difference was night and day. Mistral Medium produced responses that needed maybe 20% editing time. My client's agents could take a Mistral draft, tweak a sentence or two, and ship it. That was the breakthrough moment.

For internal comparison workloads (which is what support ticket triage basically is: comparing new tickets against historical patterns and routing them appropriately), Mistral Medium hit the sweet spot of "good enough quality at prices that don't make the CFO cry."

Then I Tried Mistral Large (And Questioned Everything)

Being the obsessive dev I am, I had to test Mistral Large next. If Medium was good, surely Large was better, right? And the cost difference per call wasn't that dramatic when you factor in the quality gains.

I spent an entire week running Large on the same workload. Here's what I found:

For complex customer complaints that required nuance, empathy, and careful phrasing, Large was noticeably better. We're talking about the difference between "I'd be happy to help with that" and actually matching the tone my client's brand guidelines required.

But here's the thing about billable work: 70% of the tickets coming in were routine. Password resets. Order status checks. FAQ-style questions. For those, Large was overkill. I was paying premium prices for the model to do work that a cheaper tier could handle just as well.

This is where the 40-65% cost reduction claims I'd seen started making sense to me. It's not about picking one model and using it everywhere. It's about routing.

The Routing Architecture That Saved My Client €500/Month

Here's the setup I landed on. This is the actual code I deployed for the Berlin startup:

import os
from typing import Literal

import openai

# OpenAI-compatible client pointed at the Global API gateway.
client = openai.OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

Complexity = Literal["simple", "complex"]

def classify_ticket(content: str) -> Complexity:
    """Quick classification using cheap model."""
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V4-Flash",
        messages=[
            {"role": "system", "content": "Classify this support ticket as 'simple' or 'complex'. Simple tickets are routine requests. Complex tickets need nuanced responses."},
            {"role": "user", "content": content}
        ],
        max_tokens=10,
    )
    # The reply can be empty, quoted, or a full sentence, so normalize it
    # before trusting it as a label.
    raw = response.choices[0].message.content or ""
    label = raw.strip().strip("'\".").lower()
    # Anything that is not clearly "simple" is routed to the stronger model.
    return "simple" if label.startswith("simple") else "complex"

def generate_response(ticket: str, complexity: Complexity) -> str:
    """Route to appropriate model based on complexity."""
    if complexity == "simple":
        model = "mistralai/Mistral-Medium-3"
    else:
        model = "mistralai/Mistral-Large-3"

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful customer support agent. Write a clear, friendly response."},
            {"role": "user", "content": ticket}
        ],
    )
    # content is Optional in the SDK types; fall back to an empty draft.
    return response.choices[0].message.content or ""

def handle_ticket(ticket_content: str) -> str:
    """Classify the ticket, then draft a reply with the matching model."""
    classification = classify_ticket(ticket_content)
    return generate_response(ticket_content, classification)

The pattern: use the cheapest model for classification (DeepSeek V4 Flash at $0.27 per million input tokens), then route the actual response generation to either Mistral Medium or Large based on complexity.

In my client's case, the routing broke down roughly 65/35: about 65% of tickets hit the Medium tier, 35% needed Large. The math worked out beautifully. Instead of €800/month on GPT-4o for everything, we landed at around €300/month for the same volume, with equal or better response quality.
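If you want to sanity-check numbers like these for your own project, the arithmetic is short. The sketch below uses the monthly figures quoted above (800 before, about 300 after) and the 65/35 split. The per-ticket costs passed to `blended_cost_per_ticket` are placeholders I made up for illustration, not measured Mistral prices.

```python
def savings_percent(before: float, after: float) -> float:
    """Percentage saved when a monthly bill drops from `before` to `after`."""
    return (before - after) / before * 100


def blended_cost_per_ticket(
    simple_share: float, cost_simple: float, cost_complex: float
) -> float:
    """Average cost per ticket when a share of traffic goes to the cheaper tier."""
    return simple_share * cost_simple + (1 - simple_share) * cost_complex


# Monthly bill before and after routing, as quoted in the post.
print(f"Savings: {savings_percent(800, 300):.1f}%")

# Hypothetical per-ticket costs for the cheaper and pricier tiers.
print(f"Blended: {blended_cost_per_ticket(0.65, 0.002, 0.006):.4f} per ticket")
```

The first number lands inside the 40-65% range mentioned earlier, which is a useful check that the routing split and the bill agree with each other.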

The Benchmarks That Actually Mattered To Me

Global API's documentation showed Mistral Large hitting an 84.6% average benchmark score, and honestly, my real-world testing aligned with that. But the marketing benchmarks are never the full story.

What I actually measured on the client's data:

Response coherence: Large scored 91%, Medium scored 84%, on a human eval scale
Edit time required: Large averaged 8% of response needing edits, Medium averaged 22%
Latency: Large averaged 1.4s, Medium averaged 1.1s — both well within acceptable range
Throughput: Both handled the 200-300 tickets/day volume without breaking a sweat

The latency difference surprised me. I expected Large to be slower, but it was only marginally behind. The throughput of around 320 tokens/sec on both models meant I didn't have to worry about bottlenecks during traffic spikes.
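For anyone who wants to collect latency numbers like these, a rough timing harness is enough to start. This is a sketch, not the exact tooling I used: `measure_latency` and `fake_request` are hypothetical names, and `fake_request` stands in for a real API call so the example runs offline.

```python
import statistics
import time
from typing import Callable


def measure_latency(call: Callable[[], object], runs: int = 5) -> dict:
    """Time `call` several times and summarize wall-clock latency in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return {"mean": statistics.mean(samples), "max": max(samples), "runs": runs}


def fake_request() -> str:
    """Stand-in for a real API request; replace with your own model call."""
    time.sleep(0.01)
    return "ok"


stats = measure_latency(fake_request, runs=3)
print(f"mean {stats['mean']:.3f}s, max {stats['max']:.3f}s over {stats['runs']} runs")
```

Swap `fake_request` for a lambda that calls your model, and run it against the same prompts for each model so the comparison is fair.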

When To Actually Use Each Model

After a month of testing, here's my mental model for picking between these two:

Reach for Mistral Large when:

The output goes directly to end users without human review
Nuanced tone or brand voice matters
The task involves complex reasoning or multi-step logic
You're generating creative content where quality trumps cost
Error tolerance is low (medical, legal, financial contexts)

Stick with Mistral Medium when:

There's a human in the loop who can polish the output
The task is structured enough that consistency matters more than creativity
You're processing high volume and per-call cost adds up
The use case is internal comparison or classification-adjacent
You need a good generalist that handles 80% of tasks well
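The two lists above boil down to a small decision rule. Here is one way to sketch it in code; the function name and flags are hypothetical, and the rule is a simplification of my mental model rather than something to apply blindly.

```python
MEDIUM = "mistralai/Mistral-Medium-3"
LARGE = "mistralai/Mistral-Large-3"


def pick_model(
    human_review: bool,
    needs_brand_voice: bool = False,
    complex_reasoning: bool = False,
    low_error_tolerance: bool = False,
) -> str:
    """Pick Large when any of the 'reach for Large' conditions holds, else Medium."""
    if (
        not human_review
        or needs_brand_voice
        or complex_reasoning
        or low_error_tolerance
    ):
        return LARGE
    return MEDIUM


# A drafted reply that an agent will polish before sending.
print(pick_model(human_review=True))
# A reply that goes straight to the customer.
print(pick_model(human_review=False))
```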

The Billable Hours Math Nobody Talks About

Here's something the typical API comparison article won't tell you: model selection affects YOUR time as a developer, not just the client's API bill.

I spent roughly 18 hours on this comparison project. About 4 hours were setting up the initial integration, 8 hours were running the two models side by side and analyzing results, and 6 hours were building the routing architecture and deployment.

At my freelance rate, that's real money invested in research. But because I documented everything, the next time a client asks me to optimize their LLM costs, I can reference this work and shave hours off the project. The compounding effect of side-hustle learning.

If you're a freelancer reading this, my advice is simple: treat every model comparison as an investment in your future hourly rate. The careful-budgeting mindset isn't just about saving the client money. It's about building a portfolio of decisions that makes you more valuable.

Things I Wish I'd Known On Day One

Looking back, there are a few mistakes I made that cost me billable hours:

First, I should have set up proper evaluation from the start. I was eyeballing response quality for the first week before I built a proper scoring rubric. Don't be like me. Set up your evaluation pipeline before you start testing models.
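One cheap metric to start with is how much of each AI draft the agents actually changed. The sketch below is an assumption about how you might measure that, using word-level diffs from Python's standard `difflib`; it is not the exact rubric I used, and `edit_fraction` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def edit_fraction(draft: str, final: str) -> float:
    """Rough share of words that differ between the AI draft and the sent reply.

    Returns 0.0 for identical text and 1.0 when nothing matches.
    """
    draft_words, final_words = draft.split(), final.split()
    if not draft_words and not final_words:
        return 0.0
    return 1.0 - SequenceMatcher(None, draft_words, final_words).ratio()


def average_edit_fraction(pairs: list) -> float:
    """Mean edit fraction over (draft, final) pairs."""
    return sum(edit_fraction(d, f) for d, f in pairs) / len(pairs)


# Made-up example pairs, not real client tickets.
pairs = [
    ("Your order ships on Monday", "Your order ships on Tuesday"),
    ("Please reset your password", "Please reset your password"),
]
print(f"Average edit fraction: {average_edit_fraction(pairs):.2f}")
```

Log the draft and the final reply for every ticket from day one, and you can compute this per model without relying on anyone's memory of how good the drafts felt.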

Second, I underestimated the value of caching. The 40% hit rate mentioned in Global API's docs is achievable, but only if you architect for it. I ended up adding a simple Redis cache for common queries in week three, and that alone probably saved another 15% on costs.
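The caching idea looks roughly like this. To keep the sketch runnable without a server, I've swapped Redis for an in-memory dict; in production you would store the same key in Redis with an expiry. `ResponseCache` and `fake_generate` are hypothetical names, and the normalization rule (lowercase, collapse whitespace) is an assumption about what counts as the "same" query.

```python
import hashlib
from typing import Callable


class ResponseCache:
    """In-memory stand-in for a Redis cache of generated replies."""

    def __init__(self) -> None:
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Normalize so trivially different phrasings share one cache entry.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def get_or_generate(
        self, model: str, prompt: str, generate: Callable[[str], str]
    ) -> str:
        """Return the cached reply, or call `generate` once and store the result."""
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        reply = generate(prompt)
        self._store[key] = reply
        return reply


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call."""
    return f"Draft reply to: {prompt}"


cache = ResponseCache()
cache.get_or_generate("mistralai/Mistral-Medium-3", "How do I reset my password?", fake_generate)
cache.get_or_generate("mistralai/Mistral-Medium-3", "how do I  reset my password?", fake_generate)
print(f"hits={cache.hits} misses={cache.misses}")
```

Exact-match caching only pays off for repetitive queries like FAQ-style tickets, so track the hit rate before assuming it will match any number from the docs.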

Third, streaming responses was a game-changer for UX. Even though latency was similar, streaming made the chatbot feel more responsive to my client's users. Small thing, but it mattered for user satisfaction.

Here's a quick streaming example for anyone implementing this:

import os

import openai

client = openai.OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

def stream_response(prompt: str, model: str = "mistralai/Mistral-Medium-3") -> str:
    """Stream a response chunk by chunk for better UX."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )

    full_response = ""
    for chunk in stream:
        # Some providers send chunks with an empty choices list
        # (for example a trailing usage chunk), so skip those.
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content is not None:
            full_response += content
            # In production, you'd yield this to your frontend
            print(content, end="", flush=True)

    return full_response

The Bottom Line For Freelancers

If you're charging by the hour, model selection is one of those decisions that can make or break a project's profitability. Pick the wrong model and you're either eating the cost of bad outputs or spending hours cleaning up after the AI.

The Mistral Large vs Mistral Medium question doesn't have a universal answer. It depends on your specific workload, quality requirements, and volume. But the approach I outlined — route based on complexity — is probably the right default for most freelance projects.

For my Berlin client, this setup has now been running for three months. His support team is happier, his costs are predictable, and I've earned repeat business from the founder. That's the trifecta every freelancer is chasing.

If you're looking to test these models yourself without committing to a full integration, Global API lets you access 184 different models through a single API key. The setup took me about 10 minutes, and you can grab some free credits to start experimenting. Check out global-apis.com if you want to poke around — I was skeptical of unified APIs at first, but the time savings on integration alone made it worth the switch for my freelance workflow.

DEV Community