ocean xu

Posted on Jul 1 • Originally published at oceanxu.hashnode.dev

Stop Using GPT-4o for Everything: A Developer's Guide to Model Routing

#ai #api #python #programming

Disclosure: I work on Barq, an API gateway for AI models. The benchmark tool mentioned is open source — you can run it yourself without signing up for anything.

I had a problem. Actually, a lot of developers have this problem. We pick one model — usually GPT-4o — and send every single request through it. Summaries, translations, code generation, chatbot responses, classification tasks. Doesn't matter. model="gpt-4o". Ship it.

Then the bill arrives.

1. The One-Model Trap: $180/Month for a Side Project

Let me show you what this looks like at a scale you can feel.

A side project with a few hundred DAU, serving ~500 AI chat conversations a day:

Task	% of requests	Tokens/day	Model	Cost/day (@$3.00/M)
Chat Q&A	40%	800K	GPT-4o	$2.40
Summarization	25%	500K	GPT-4o	$1.50
Code generation	15%	300K	GPT-4o	$0.90
Translation	10%	200K	GPT-4o	$0.60
Classification	10%	200K	GPT-4o	$0.60
Total	100%	2M	—	~$6.00/day

That's $180/month. For a side project. With no revenue.

Now here's the same workload, but routing each task to the right model:

Task	%	Tokens/day	Routed Model	Cost/day
Chat Q&A	40%	800K	DeepSeek V4 Pro	$0.52
Summarization	25%	500K	DeepSeek V4 Flash	$0.11
Code generation	15%	300K	DeepSeek V4 Pro	$0.20
Translation	10%	200K	Qwen 3.6 Plus	$0.24
Classification	10%	200K	DeepSeek V4 Flash	$0.04
Total	100%	2M	—	~$1.11/day

$33/month. The $147/month difference is a year of Vercel Pro. Or multiple .com domains. Or just money that stays in your pocket instead of OpenAI's.
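The arithmetic behind both tables is easy to reproduce. Here is a minimal sketch, using the token volumes above and the blended prices listed in Section 2. The `daily_cost` helper, the lowercase model keys, and the 30-day month are assumptions for illustration only.

```python
# Blended prices in USD per 1M tokens, copied from the pricing table in Section 2.
PRICE_PER_M = {
    "gpt-4o": 3.00,
    "deepseek-v4-pro": 0.65,
    "deepseek-v4-flash": 0.21,
    "qwen-3.6-plus": 1.20,
}

# (task, tokens per day, routed model) from the two tables above.
WORKLOAD = [
    ("chat_qa", 800_000, "deepseek-v4-pro"),
    ("summarization", 500_000, "deepseek-v4-flash"),
    ("code_generation", 300_000, "deepseek-v4-pro"),
    ("translation", 200_000, "qwen-3.6-plus"),
    ("classification", 200_000, "deepseek-v4-flash"),
]


def daily_cost(tokens: int, model: str) -> float:
    """Cost in USD of sending `tokens` tokens to `model` at its blended price."""
    return tokens / 1_000_000 * PRICE_PER_M[model]


# Everything on GPT-4o versus each task on its routed model.
single_model = sum(daily_cost(tokens, "gpt-4o") for _, tokens, _ in WORKLOAD)
routed = sum(daily_cost(tokens, model) for _, tokens, model in WORKLOAD)

print(f"single model: ${single_model:.2f}/day, ${single_model * 30:.0f}/month")
print(f"routed:       ${routed:.2f}/day, ${routed * 30:.0f}/month")
print(f"saving:       {(1 - routed / single_model) * 100:.0f}%")
```

The unrounded routed total comes out slightly below the table's ~$1.11/day, because the table sums per-row figures that were already rounded to the cent.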

This isn't theory. I benchmarked it. The quality difference on these task types? Negligible. I'll show you the data.

2. A Task Is Not a Task — The Capability Spectrum

Not all AI requests are created equal. Some need PhD-level reasoning. Some need "translate this button text to Arabic." Treating them the same is like using a cargo truck for grocery runs — it works, it's just expensive and unnecessary.

Here's my framework. Six task types, four models, three rounds of testing. Scores are out of 10 based on accuracy, relevance, and format compliance.

Task Type	DeepSeek V4 Pro	GPT-4o	Claude Sonnet 4.6	Gemini 3.1 Pro
Summarization (news articles)	8.7	9.0	8.9	8.3
Translation (EN→AR, EN→ZH)	8.2	8.8	8.0	8.5
Code generation (CRUD, regex)	9.1	9.2	8.8	8.0
Classification / sentiment	9.3	9.1	8.7	8.4
Creative writing	6.8	8.5	9.1	7.2
Multi-step agent chain	7.0	9.0	8.3	7.5

Now add cost to the picture:

Model	Price per 1M tokens (USD, blended input/output)
DeepSeek V4 Flash	$0.21
DeepSeek V4 Pro	$0.65
Qwen 3.6 Plus	$1.20
Gemini 3.1 Pro	$2.50
GPT-4o	$3.00
Claude Sonnet 4.6	$3.60

The pattern is clear: for summarization, classification, basic code generation, and translation, DeepSeek V4 Pro scores within about 7% of GPT-4o (and slightly ahead on classification) while costing 78% less. For creative writing and complex agent chains, the premium models earn their price: the gap is real and I'm not going to pretend otherwise.
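The gap and savings figures can be recomputed directly from the two tables. A small sketch, with scores and prices copied from above; the `gap_pct` helper and the dictionary layout are assumptions for illustration.

```python
# Scores out of 10, copied from the benchmark table (DeepSeek V4 Pro vs GPT-4o).
SCORES = {
    "summarization": {"deepseek-v4-pro": 8.7, "gpt-4o": 9.0},
    "translation": {"deepseek-v4-pro": 8.2, "gpt-4o": 8.8},
    "code_generation": {"deepseek-v4-pro": 9.1, "gpt-4o": 9.2},
    "classification": {"deepseek-v4-pro": 9.3, "gpt-4o": 9.1},
    "creative_writing": {"deepseek-v4-pro": 6.8, "gpt-4o": 8.5},
    "agent_chain": {"deepseek-v4-pro": 7.0, "gpt-4o": 9.0},
}

# Blended USD price per 1M tokens, copied from the pricing table.
PRICE = {"deepseek-v4-pro": 0.65, "gpt-4o": 3.00}


def gap_pct(task: str) -> float:
    """How far DeepSeek V4 Pro trails GPT-4o on `task`, as a percentage of GPT-4o's score.

    A negative value means DeepSeek V4 Pro scored higher.
    """
    scores = SCORES[task]
    return (scores["gpt-4o"] - scores["deepseek-v4-pro"]) / scores["gpt-4o"] * 100


gaps = {task: round(gap_pct(task), 1) for task in SCORES}
cost_saving_pct = round((1 - PRICE["deepseek-v4-pro"] / PRICE["gpt-4o"]) * 100, 1)

for task, gap in gaps.items():
    print(f"{task:18s} gap vs GPT-4o: {gap:+.1f}%")
print(f"cost saving: {cost_saving_pct}%")
```

On these numbers the four "simple" task types stay in single digits, while creative writing and agent chains land at 20% or more, which is the split the routing matrix in the next section is built on.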

But here's the thing: 60-70% of a typical app's AI requests are the first kind. Simple, standardized tasks where model choice barely affects output quality. Those requests are bleeding your wallet dry.

3. The Routing Matrix — A Decision Table You Can Steal

I turned the benchmark data into a practical reference table. This isn't theoretical — it's what I use.

Task Type	Primary Model	Cost/1M	Fallback Model	Switch When...
Code generation	DeepSeek V4 Pro	$0.65	GPT-4o	Complex architecture design
Summarization	DeepSeek V4 Flash	$0.21	DeepSeek V4 Pro	>50K token context
Translation	Qwen 3.6 Plus	$1.20	GPT-4o	Legal/medical precision
Classification / sentiment	DeepSeek V4 Flash	$0.21	DeepSeek V4 Pro	Multi-label with nuanced categories
Creative writing	Claude Sonnet 4.6	$3.60	GPT-4o	Technical documentation
Agent chains	GPT-4o	$3.00	Claude Sonnet 4.6	Cost-sensitive batch jobs
RAG / embeddings	DeepSeek V4 Pro	$0.65	GPT-4o	Multilingual retrieval

A few notes from actually running this in production:

DeepSeek V4 Flash at $0.21/M tokens is absurdly good at structured output tasks. If your task is "classify this support ticket into one of 5 categories," don't even think about GPT-4o. Flash handles it just as well.
Qwen 3.6 Plus punches above its weight on translation, particularly EN↔AR and EN↔ZH. Better than Gemini, close to GPT-4o, at 60% less.
Claude Sonnet 4.6 is the creative writing king. If tone, voice, and style matter more than speed, it's worth every cent.
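The "Switch When..." column can be encoded as data rather than left as a comment. Below is a minimal sketch for two rows of the matrix. The `Request` dataclass, the `high_stakes` flag, and the `pick_model` function are hypothetical names invented for this example; only the model names and the 50K-token threshold come from the table.

```python
from dataclasses import dataclass


@dataclass
class Request:
    task_type: str
    context_tokens: int = 0
    high_stakes: bool = False  # hypothetical flag, e.g. legal or medical text


# task type -> (primary model, fallback model, predicate that triggers the switch)
MATRIX = {
    "summarization": (
        "deepseek-v4-flash",
        "deepseek-v4-pro",
        lambda req: req.context_tokens > 50_000,  # ">50K token context"
    ),
    "translation": (
        "qwen-3.6-plus",
        "gpt-4o",
        lambda req: req.high_stakes,  # "Legal/medical precision"
    ),
}


def pick_model(req: Request, default: str = "gpt-4o") -> str:
    """Return the primary model unless the row's switch condition applies."""
    entry = MATRIX.get(req.task_type)
    if entry is None:
        return default  # unknown task types go to the default model
    primary, fallback, should_switch = entry
    return fallback if should_switch(req) else primary


print(pick_model(Request("summarization", context_tokens=12_000)))
print(pick_model(Request("summarization", context_tokens=80_000)))
print(pick_model(Request("translation", high_stakes=True)))
```

Keeping the condition next to the model pair means a change to the matrix is a one-line data edit, not a new branch in the routing code.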

4. Implementation — 40 Lines of Python

Before I show the code, an honest admission: this router is 40 lines because the hard part is already handled.

Without a unified API layer, you'd need:

5 different Python SDKs (openai, anthropic, google-genai, plus custom HTTP clients for DeepSeek and Qwen)
5 API key rotation strategies
5 error-handling paths (each provider throws different exceptions)
5 billing dashboards to check when you're running low

That's easily 400+ lines of integration code before you write your first route rule. But if you're using an OpenAI-compatible unified endpoint, every provider collapses into one SDK, one key, one interface. The 40 lines handle routing logic. The platform handles everything else.

from openai import OpenAI

class ModelRouter:
    """
    40-line model router. Works because the API layer unifies:
    - Multi-provider auth (one key → all models)
    - SSE streaming compatibility
    - Error normalization across providers

    Without this unification layer: ~400 lines of per-provider boilerplate.
    """

    ROUTING_MAP = {
        "code_generation":   ("deepseek-v4-pro", "gpt-4o"),
        "summarization":     ("deepseek-v4-flash", "deepseek-v4-pro"),
        "translation":       ("qwen-3.6-plus", "gpt-4o"),
        "classification":    ("deepseek-v4-flash", "deepseek-v4-pro"),
        "creative_writing":  ("claude-sonnet-4.6", "gpt-4o"),
        "agent_chain":       ("gpt-4o", "claude-sonnet-4.6"),
        "rag":               ("deepseek-v4-pro", "gpt-4o"),
    }

    def __init__(self, api_key: str, base_url: str):
        self.client = OpenAI(api_key=api_key, base_url=base_url)

    def route(self, task_type: str, messages: list, **kwargs):
        primary, fallback = self.ROUTING_MAP.get(
            task_type, ("gpt-4o", "gpt-4o")
        )
        for model in [primary, fallback]:
            try:
                return self.client.chat.completions.create(
                    model=model,
                    messages=messages,
                    timeout=30,
                    **kwargs
                )
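            except Exception as exc:  # broad on purpose for the demo; narrow this in production
                last_error = exc  # remember why this model failed
                continue
        # Both attempts failed: surface the last underlying error instead of hiding it.
        raise RuntimeError(f"All models failed for task {task_type!r}.") from last_error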


# Usage — one key, any model, same SDK:
router = ModelRouter(
    api_key="***",
    base_url="https://api.barqapi.com/v1"
)

# Route a code gen request → hits DeepSeek V4 Pro
code = router.route("code_generation", [
    {"role": "user", "content": "Write a Python function to parse ISO 8601 dates"}
])

# Route a summarization → hits DeepSeek V4 Flash ($0.21/M tokens)
summary = router.route("summarization", [
    {"role": "user", "content": "Summarize this article: ..."}
])

# Route a creative task → hits Claude Sonnet 4.6
story = router.route("creative_writing", [
    {"role": "user", "content": "Write a short story about a robot learning to garden"}
])

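This is a starting point. A production version would add response quality validation, per-task timeout configs, structured logging, and probably a circuit breaker. But even this 40-line version captures most of the savings: on the Section 1 workload, routing cut the daily bill from about $6.00 to about $1.11, roughly 80% less than sending everything to GPT-4o. One caveat: the map above has no entry for chat Q&A, so those requests fall through to the GPT-4o default until you add one.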
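Of the production additions listed above, the circuit breaker is the least obvious to write. Here is a minimal, stdlib-only sketch of one possible shape. The `CircuitBreaker` class, its method names, and the default thresholds are all assumptions for illustration and are not part of the router above.

```python
import time


class CircuitBreaker:
    """Skip a model for `cooldown` seconds after `max_failures` consecutive errors."""

    def __init__(self, max_failures: int = 3, cooldown: float = 60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock      # injectable so the breaker can be tested without sleeping
        self.failures = {}      # model -> consecutive failure count
        self.opened_at = {}     # model -> time at which the breaker opened

    def allow(self, model: str) -> bool:
        """Return True if `model` may be called right now."""
        opened = self.opened_at.get(model)
        if opened is None:
            return True
        if self.clock() - opened >= self.cooldown:
            # Cooldown elapsed: close the breaker and give the model another chance.
            del self.opened_at[model]
            self.failures[model] = 0
            return True
        return False

    def record_success(self, model: str) -> None:
        self.failures[model] = 0
        self.opened_at.pop(model, None)

    def record_failure(self, model: str) -> None:
        self.failures[model] = self.failures.get(model, 0) + 1
        if self.failures[model] >= self.max_failures:
            self.opened_at[model] = self.clock()


# Example with a fake clock so the behavior is visible without waiting.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, cooldown=10.0, clock=lambda: now[0])
breaker.record_failure("deepseek-v4-pro")
breaker.record_failure("deepseek-v4-pro")
print(breaker.allow("deepseek-v4-pro"))  # breaker is open, model is skipped
now[0] = 10.0
print(breaker.allow("deepseek-v4-pro"))  # cooldown elapsed, model is retried
```

Inside `route`, the idea would be to check `allow(model)` before each call and to call `record_success` or `record_failure` afterwards, so that a provider outage stops costing a 30-second timeout on every request.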

The principle: smart routing is not about the code — it's about knowing which model to use for which job. The code is the easy part. The benchmark data in the next section is what makes the routing decisions correct.

5. The Benchmark Data — Run It Yourself

I don't want you to trust my routing matrix. I want you to verify it.

I built a small CLI tool called barq-bench that runs the same 6 task types across 4 models and outputs a comparison table. It's open source and takes about 2 minutes to run:

npx barq-bench

Or clone and inspect:

git clone https://github.com/Barq-Api/barq-bench
cd barq-bench && npm install && npm start

It sends identical prompts to each model, evaluates the responses against a scoring rubric, and spits out a table. You can add your own tasks, your own models, your own evaluation criteria. The numbers in Section 2 came from running this on my machine.

If you get different results, tell me. The routing matrix should evolve as models improve and new ones launch. This is a living thing, not a static recommendation.

6. When NOT to Route — The Edge Cases

Routing saves money. Routing is not always the right call. Let me be specific about where it breaks.

6.1 The Prompt Tax

Swapping models isn't a pure drop-in replacement. Every model has quirks:

JSON mode inconsistency: GPT-4o will silently fix minor JSON formatting issues. Claude will throw a parse error. If your pipeline expects lenient JSON parsing, a model swap can break your downstream code.
System prompt behavior: DeepSeek V4 Pro follows system prompts more literally than GPT-4o. A prompt tuned over months on GPT-4o might produce a different tone or structure on another model.
Output length variance: Gemini 3.1 Pro interprets "be concise" differently. I've seen it generate 3x the output for the same "conciseness" prompt compared to GPT-4o.

In other words: if you've spent weeks tuning a 200-line system prompt specifically for GPT-4o, don't expect it to work flawlessly on another model without adjustment. Route by task type, not by prompt. If your prompt is a work of art, keep it on the model it was crafted for.
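One way to soften the JSON part of the prompt tax is to make the parsing step tolerant of formatting differences, so that a model swap does not break downstream code. A minimal sketch, assuming the most common variation is a Markdown code fence wrapped around otherwise valid JSON. The `parse_model_json` helper is a hypothetical name, and this only handles fences, not malformed JSON.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def parse_model_json(text: str):
    """Parse JSON from a model reply, tolerating a surrounding Markdown code fence.

    Raises json.JSONDecodeError (a ValueError) if the payload is not valid JSON.
    """
    cleaned = text.strip()
    if cleaned.startswith(FENCE) and cleaned.endswith(FENCE):
        # Drop both fences, then drop the rest of the opening line (e.g. a "json" tag).
        body = cleaned[len(FENCE):-len(FENCE)]
        first_newline = body.find("\n")
        cleaned = body[first_newline + 1:] if first_newline != -1 else body
    return json.loads(cleaned)


plain = '{"label": "billing", "confidence": 0.92}'
fenced = FENCE + "json\n" + plain + "\n" + FENCE

print(parse_model_json(plain))
print(parse_model_json(fenced))
```

Validating the parsed object against the fields your pipeline expects is still worth doing after this step, since a tolerant parser only fixes packaging, not content.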

6.2 Where Routing Works (and Where It Doesn't)

✅ Routing works well for:

Summarization — near-universal model agreement on what "summarize" means
Translation — standardized task with objective quality benchmarks
Basic classification / sentiment — deterministic, structured outputs
Simple code generation (CRUD, boilerplate, regex) — most modern models are competent
RAG augmentation — retrieval quality is more about your embeddings than your generation model

⚠️ Routing requires caution for:

Complex agent chains with multi-turn, nuanced system prompts
Creative writing where tone consistency matters across sessions
User-facing chat where response style consistency affects UX
Financial or medical compliance scenarios that mandate specific model certifications

6.3 When Routing Adds Complexity Without Enough Benefit

If your project spends less than $50/month on API calls, model routing might not be worth the cognitive overhead. Just use DeepSeek V4 Pro for everything — it's good enough for most tasks and costs less than a coffee. Routing pays off when your API bill hits triple digits.

6.4 Latency-Sensitive Workloads

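Adding a routing layer can add ~50-100ms, mostly from the extra network hop through a gateway or from classifying the task before dispatch; the dictionary lookup itself is negligible. If you're building real-time voice AI or a sub-200ms response time product, that overhead matters. In those cases, hardcode the fastest model and optimize for speed, not cost.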
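Before giving up on routing for latency reasons, it helps to measure which part of the path is actually slow. The sketch below times only the in-process lookup, using a trimmed copy of the routing map from Section 4. The `lookup_overhead_us` function is a hypothetical helper; a gateway hop or a task classifier would need its own measurement with real requests.

```python
import time

# Trimmed copy of the routing map from Section 4.
ROUTING_MAP = {
    "code_generation": ("deepseek-v4-pro", "gpt-4o"),
    "summarization": ("deepseek-v4-flash", "deepseek-v4-pro"),
    "translation": ("qwen-3.6-plus", "gpt-4o"),
}


def lookup_overhead_us(task_type: str, iterations: int = 100_000) -> float:
    """Average time of one routing-map lookup, in microseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        ROUTING_MAP.get(task_type, ("gpt-4o", "gpt-4o"))
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1_000_000


print(f"lookup: {lookup_overhead_us('summarization'):.3f} microseconds per request")
```

If the number printed here is tiny compared with your latency budget, the overhead to optimize is in the network path or the classification step, not in the routing table.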

7. The Bigger Picture

In 2024, GPT-4 was still listed at $30 per million input tokens. In 2026, DeepSeek V4 Pro is $0.65 blended. If this trend holds, by 2027 the cost of inference might not be a decision variable anymore.

But that doesn't make routing obsolete — it changes what routing optimizes for.

When every model is cheap, the differentiator isn't price. It's capability fit. Some models will be better at reasoning, some at creativity, some at following instructions precisely, some at handling non-English languages. Smart routing shifts from cost optimization to quality optimization — from "which model is cheapest" to "which model is best for this exact task."

Model routing today saves you money. Model routing tomorrow saves you from mediocrity. Start building that muscle now.

Further reading:

GPT API Pricing Comparison 2026 — a deeper dive into pricing across 13 providers
One-Line Fix for AI API Failover — what to do when your primary model goes down
barq-bench on GitHub — the benchmarking tool used in this article

I work on Barq, an API gateway that unifies AI model access. The benchmark tool is open source. Run it yourself.
