"What's the best model?" used to be a meaningful question. Today it has the wrong shape.
The useful version is: best for which workload?
In the pipeline I migrated last month, three workloads ended up on three different models. None of them would have been my answer to "what's the best model overall?" Each one is the right answer to a more specific question.
Field extraction wants reliability
We pull structured fields from customer support tickets and route them into a workflow tool. Accuracy matters more than anything because a wrong field silently corrupts a downstream system. Speed barely matters; cost matters somewhat.
Landed on GPT-4o-mini. Boring, reliable, structured-output friendly. Cheap enough that we don't think about it. Trying to use a smarter model here would be a category error — the failure mode isn't "the model isn't smart enough," it's "the model occasionally hallucinates a field name that breaks the schema validator." Smaller, more boring models hallucinate less.
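Here is what that call looks like in practice — a minimal sketch, assuming a hypothetical ticket schema (ticket_id, product, severity) and assuming the gateway described later in the post passes response_format through to OpenAI unchanged:

import json
from openai import OpenAI

client = OpenAI(api_key="th-...", base_url="https://your-gateway/v1")

# Hypothetical stand-in for our real ticket schema.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "ticket_id": {"type": "string"},
        "product": {"type": "string"},
        "severity": {"type": "string", "enum": ["low", "medium", "high"]},
    },
    "required": ["ticket_id", "product", "severity"],
    "additionalProperties": False,
}

def extract_fields(ticket_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract the schema fields from the support ticket."},
            {"role": "user", "content": ticket_text},
        ],
        # Strict structured output: the model can only emit keys the schema
        # allows, which is exactly the failure mode we're guarding against.
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "ticket_fields", "strict": True, "schema": TICKET_SCHEMA},
        },
    )
    return json.loads(response.choices[0].message.content)

The strict schema does the work the model's "smartness" would otherwise have to do: invented field names get rejected before they reach the workflow tool.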
Long-context summarization wants context and price
We summarize PR diffs that range from 5K to 80K tokens and post the summaries to Slack. Quality matters, but the difference between 91% and 94% accuracy on "summarize this diff" is invisible to the human reading the Slack message; the difference between $0.49/day and $17.50/day shows up on the next finance review.
Landed on DeepSeek-V3. The accuracy hit is real but small; the cost reduction is large; the context window covers what we need. The fact that DeepSeek speaks the OpenAI wire format meant the migration was a configuration change.
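The call itself is unremarkable, which is the point — a sketch under the assumption that deepseek-chat is exposed through the same gateway client, with the model name being the only thing that changed in the migration:

from openai import OpenAI

client = OpenAI(api_key="th-...", base_url="https://your-gateway/v1")

def summarize_diff(diff_text: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-chat",  # the migration was this string, nothing else
        messages=[
            {"role": "system", "content": "Summarize this PR diff in 3-5 bullets for a Slack post."},
            {"role": "user", "content": diff_text},
        ],
        max_tokens=400,
    )
    return response.choices[0].message.content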
Tool-use agents want strong instruction following
We have an agent that calls 4-6 tools in sequence to draft and send a follow-up email. Each retry is expensive (in latency, in user trust, in tokens spent debugging). The model that gets it right on the first try the most often is the one we want, almost regardless of per-call cost.
Landed on Claude 3.5 Sonnet. The per-call price is higher; the retries-per-task ratio is lower; the net cost comes out similar to GPT-4o's, and the reliability ceiling is meaningfully higher.
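One step of that agent looks roughly like this — a sketch, assuming the gateway translates the OpenAI tools format for Anthropic models, and with draft_email as a hypothetical stand-in for our real tools:

from openai import OpenAI

client = OpenAI(api_key="th-...", base_url="https://your-gateway/v1")

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "draft_email",  # hypothetical; the real agent has 4-6 tools
            "description": "Draft a follow-up email to a customer.",
            "parameters": {
                "type": "object",
                "properties": {
                    "recipient": {"type": "string"},
                    "body": {"type": "string"},
                },
                "required": ["recipient", "body"],
            },
        },
    },
]

def agent_step(messages: list[dict]):
    # One turn: the model either answers or requests a tool call; the caller
    # executes the tool, appends the result, and calls agent_step again.
    response = client.chat.completions.create(
        model="claude-3-5-sonnet",
        messages=messages,
        tools=TOOLS,
    )
    return response.choices[0].message

The metric that decided the choice is how often that loop finishes without a retry, not what one call costs.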
The architectural prerequisite
This only works if your application code doesn't care which model serves a given call. We wired everything through the OpenAI Python SDK with a configurable base_url, then put a gateway in front that routes per request:
from openai import OpenAI

client = OpenAI(
    api_key="th-...",
    base_url="https://your-gateway/v1",
)

# Same client, different model per call
extraction = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[...],
)

summary = client.chat.completions.create(
    model="deepseek-chat",
    messages=[...],
)

agent_step = client.chat.completions.create(
    model="claude-3-5-sonnet",
    messages=[...],
)
Switching a workload to a different model is now a one-line config change in production.
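Concretely, the "one line" lives in a mapping from workload to model. A sketch under the assumption that the config is a plain dict (ours is loaded from the environment, but the shape is the same) and reusing the client defined above:

MODEL_BY_WORKLOAD = {
    "extraction": "gpt-4o-mini",
    "summarization": "deepseek-chat",
    "agent": "claude-3-5-sonnet",
}

def complete(workload: str, messages: list[dict], **kwargs):
    # Every call site goes through here; swapping a workload's model
    # touches only the mapping above.
    return client.chat.completions.create(
        model=MODEL_BY_WORKLOAD[workload],
        messages=messages,
        **kwargs,
    )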
Without that, picking three models means writing three integrations, debugging three sets of edge cases, and watching one of them rot every time a provider changes their API. With it, the marginal cost of trying a new model on a workload is near zero, which is also the marginal cost of moving off a model that stops being the right answer.
Why this matters more in 2026
The market keeps shipping new models at new price points. Last quarter alone: DeepSeek dropped V3 prices, Anthropic shipped a cheaper Haiku, OpenAI introduced 4o-mini, Mistral updated their open-weight lineup. Each of those releases changed which model was the right answer for at least one workload in our system.
If switching costs a sprint, you ignore most of those announcements. If it costs a config change, you re-evaluate routinely and your costs trend down while the market keeps shipping.
The shape of the question stays the same: which model is the best answer for this specific workload right now? You want code that can give a different answer next week without a sprint of refactoring.
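"Re-evaluate routinely" is less grand than it sounds. A minimal sketch, assuming a hypothetical list of (prompt, expected) cases for a workload, the gateway client from earlier, and a score() grader you already trust:

CANDIDATES = ["gpt-4o-mini", "deepseek-chat", "claude-3-5-sonnet"]

def evaluate(cases: list[tuple[str, str]]) -> dict[str, float]:
    results = {}
    for model in CANDIDATES:
        scores = []
        for prompt, expected in cases:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            # score() is a hypothetical workload-specific grader
            # (exact match, rubric, whatever you already use).
            scores.append(score(response.choices[0].message.content, expected))
        results[model] = sum(scores) / len(scores)
    return results

Read the mean score next to the per-call price, and the "right answer for this workload" usually falls out on its own.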
TL;DR
Don't pick the best model. Pick the best model per workload, and keep picking — the answer changes every few weeks. The architectural prerequisite is wiring everything through one OpenAI-compatible API surface so swapping is free.
I used TokenHub for the gateway — 40+ models behind one API key, route per call. LiteLLM self-hosted is the same architectural pattern if you'd rather run it yourself. The point, again, is the pattern more than the vendor.