DEV Community

Devon

Posted on • Originally published at kalibr.systems

Using GPT-4o-mini for Simple Tasks and GPT-4o for Complex Ones - Automatically

You are paying gpt-4o prices for tasks gpt-4o-mini handles just as well. If your application sends every request to your most capable model, you are not being safe - you are leaving money on the table and paying a reliability tax for headroom you rarely need.

This post shows how to use gpt-4o-mini for simple tasks and gpt-4o for complex ones automatically, with three working approaches ranked by sophistication.

The Cost Math

First, the numbers. As of early 2025:

  • gpt-4o-mini: ~$0.15 per 1M input tokens, ~$0.60 per 1M output tokens
  • gpt-4o: ~$2.50 per 1M input tokens, ~$10.00 per 1M output tokens

That is roughly a 17x difference on input and a 17x difference on output.

Now model a realistic workload:

# Classification task: label an email as spam/not-spam
classification_input_tokens = 200
classification_output_tokens = 10

# Synthesis task: summarize a 10-page document into executive memo
synthesis_input_tokens = 2000
synthesis_output_tokens = 400

# Cost per request (in dollars)
mini_input_rate = 0.15 / 1_000_000
mini_output_rate = 0.60 / 1_000_000
gpt4o_input_rate = 2.50 / 1_000_000
gpt4o_output_rate = 10.00 / 1_000_000

classification_cost_mini = (classification_input_tokens * mini_input_rate) + (classification_output_tokens * mini_output_rate)
classification_cost_gpt4o = (classification_input_tokens * gpt4o_input_rate) + (classification_output_tokens * gpt4o_output_rate)

synthesis_cost_mini = (synthesis_input_tokens * mini_input_rate) + (synthesis_output_tokens * mini_output_rate)
synthesis_cost_gpt4o = (synthesis_input_tokens * gpt4o_input_rate) + (synthesis_output_tokens * gpt4o_output_rate)

print(f"Classification - mini: ${classification_cost_mini:.6f}, gpt-4o: ${classification_cost_gpt4o:.6f}")
print(f"Synthesis      - mini: ${synthesis_cost_mini:.6f}, gpt-4o: ${synthesis_cost_gpt4o:.6f}")
print(f"Ratio (synthesis gpt-4o vs classification mini): {synthesis_cost_gpt4o / classification_cost_mini:.1f}x")

Output:

Classification - mini: $0.000036, gpt-4o: $0.000600
Synthesis      - mini: $0.000540, gpt-4o: $0.009000
Ratio (synthesis gpt-4o vs classification mini): 250.0x

The gap between "cheap model, cheap task" and "expensive model, expensive task" is 250x. The gap between sending a classification task to gpt-4o vs gpt-4o-mini is about 17x. At scale, that is not rounding error.

Approach 1: Rule-Based Heuristics

The simplest approach. Inspect the request before sending it, and route based on observable properties.

from openai import OpenAI

client = OpenAI()

def classify_complexity(prompt: str, task_type: str = "general") -> str:
    """Returns 'simple' or 'complex' based on heuristics."""

    simple_signals = 0

    # Short input
    if len(prompt.split()) < 100:
        simple_signals += 1

    # Extractive task types
    if task_type in ("classification", "extraction", "yes_no", "label"):
        simple_signals += 2

    # Short expected output (keywords suggest brief answers)
    brief_keywords = ["classify", "label", "extract", "identify", "is this", "yes or no", "true or false"]
    if any(kw in prompt.lower() for kw in brief_keywords):
        simple_signals += 1

    # No multi-step reasoning required
    reasoning_keywords = ["analyze", "synthesize", "compare", "evaluate", "explain why", "write a", "generate"]
    if not any(kw in prompt.lower() for kw in reasoning_keywords):
        simple_signals += 1

    return "simple" if simple_signals >= 3 else "complex"


def route_completion(prompt: str, task_type: str = "general", **kwargs):
    complexity = classify_complexity(prompt, task_type)
    model = "gpt-4o-mini" if complexity == "simple" else "gpt-4o"

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs
    )

    return response, model


# Example
response, model_used = route_completion(
    "Classify this email as spam or not spam: 'You won $1,000,000! Click here!'",
    task_type="classification"
)
print(f"Used: {model_used}")
print(response.choices[0].message.content)

This works and costs nothing extra. The downside: you write the rules at deploy time, and they encode your assumptions. When traffic patterns shift, the rules do not.

Approach 2: Lightweight Classifier Call

Use a cheap model to judge whether the task needs the expensive model. The classifier call itself costs almost nothing.

import json
from openai import OpenAI

client = OpenAI()

CLASSIFIER_PROMPT = """You are a task complexity classifier. Given a user prompt, determine whether it requires:
- SIMPLE: Short, factual, extractive, or classification tasks. Single-step. Verifiable output.
- COMPLEX: Multi-step reasoning, synthesis, generation, analysis, or tasks requiring deep knowledge.

Respond with JSON only: {"complexity": "SIMPLE" | "COMPLEX", "reason": "one sentence"}"""


def classify_with_llm(user_prompt: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Always use cheap model to classify
        messages=[
            {"role": "system", "content": CLASSIFIER_PROMPT},
            {"role": "user", "content": user_prompt}
        ],
        response_format={"type": "json_object"},
        max_tokens=100
    )
    return json.loads(response.choices[0].message.content)


def route_with_classifier(prompt: str, **kwargs):
    classification = classify_with_llm(prompt)

    if classification["complexity"] == "SIMPLE":
        model = "gpt-4o-mini"
    else:
        model = "gpt-4o"

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs
    )

    return response, model, classification


# Example
response, model_used, classification = route_with_classifier(
    "Write a detailed analysis of how transformer attention mechanisms enable in-context learning, "
    "with specific reference to the induction head hypothesis."
)
print(f"Classification: {classification}")
print(f"Used: {model_used}")

The economics work as long as your classifier saves more than it costs. A gpt-4o-mini classification call (~100 input tokens plus a short JSON reply) costs roughly $0.00003. If it correctly routes one synthesis request away from gpt-4o (saving ~$0.008), it pays for itself more than 250 times over.
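You can sanity-check that break-even arithmetic directly. The sketch below uses the early-2025 rates from the cost-math section and assumes a ~100-input-token classifier call with a ~30-token JSON reply (both illustrative assumptions), then asks what fraction of requests the classifier must correctly downgrade to gpt-4o-mini before it covers its own cost:

```python
# Break-even sketch for the classifier approach (early-2025 rates, illustrative).
MINI_IN, MINI_OUT = 0.15 / 1e6, 0.60 / 1e6
GPT4O_IN, GPT4O_OUT = 2.50 / 1e6, 10.00 / 1e6

# Cost of one routing decision: ~100 input tokens plus a short JSON reply on mini
classifier_cost = 100 * MINI_IN + 30 * MINI_OUT

# Saving when a 200-in/10-out classification task runs on mini instead of gpt-4o
saving = (200 * GPT4O_IN + 10 * GPT4O_OUT) - (200 * MINI_IN + 10 * MINI_OUT)

# Fraction of requests that must be correctly downgraded to cover classifier cost
break_even = classifier_cost / saving
print(f"classifier: ${classifier_cost:.6f}, saving: ${saving:.6f}, break-even: {break_even:.1%}")
```

Even charging the classifier for its output tokens, downgrading roughly 6% of traffic covers its cost; everything beyond that is savings.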

The problem: the classifier is also static. It learns nothing from whether your users were actually satisfied with the routed response.

Approach 3: Outcome-Based Routing with Thompson Sampling

The most robust approach. Instead of encoding rules or running a classifier, the router observes what actually works and shifts allocations based on real outcomes.

Thompson Sampling is a Bayesian bandit algorithm. Each model gets a Beta distribution representing its estimated success rate. The router samples from those distributions and picks the model that looks most promising - balancing exploitation (use what works) with exploration (try the other option occasionally to keep learning).

The key difference from approaches 1 and 2: the router updates its beliefs every time you report an outcome. It learns your specific workload.
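The mechanics fit in a few lines. This is a minimal sketch of the idea, not Kalibr's actual implementation: each model keeps a Beta posterior over its success rate, routing samples from each posterior, and reported outcomes update the counts.

```python
import random

# Minimal Thompson Sampling sketch - illustrative only. Each model keeps a
# Beta(successes + 1, failures + 1) posterior over its success rate; routing
# draws one sample per model and picks the model with the highest draw.
class ThompsonRouter:
    def __init__(self, models):
        # alpha counts successes + 1, beta counts failures + 1 (uniform prior)
        self.stats = {m: {"alpha": 1, "beta": 1} for m in models}

    def pick(self) -> str:
        # Models with few observations have wide posteriors, so they still
        # win a draw occasionally - that is the exploration.
        samples = {
            m: random.betavariate(s["alpha"], s["beta"])
            for m, s in self.stats.items()
        }
        return max(samples, key=samples.get)

    def report(self, model: str, success: bool) -> None:
        # Feeding outcomes back is what makes the router adapt over time.
        self.stats[model]["alpha" if success else "beta"] += 1

router = ThompsonRouter(["gpt-4o-mini", "gpt-4o"])
model = router.pick()       # route one request
router.report(model, True)  # report how it went
```

After enough outcomes, the posterior for the consistently successful model concentrates near its true success rate, and routing converges toward it without ever fully abandoning the alternative.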

Using Kalibr for Approach 3

Kalibr implements Thompson Sampling routing out of the box. You define your models, set a success condition, and call router.completion(). The SDK handles the rest.

import kalibr  # Must import first
from kalibr import Router

router = Router(
    paths=[
        {"model": "openai/gpt-4o-mini", "weight": 0.8},
        {"model": "openai/gpt-4o",      "weight": 0.2},
    ],
    success_when="response.choices[0].finish_reason == 'stop' and len(response.choices[0].message.content) > 10",
    goal_id="email_classification"
)

def classify_email(email_body: str) -> str:
    response = router.completion(
        messages=[
            {"role": "system", "content": "Classify as SPAM or NOT_SPAM. Reply with one word only."},
            {"role": "user", "content": email_body}
        ]
    )
    return response.choices[0].message.content.strip()


result = classify_email("Congratulations! You've been selected for a free iPhone. Claim now!")
print(result)  # SPAM

The starting weights (0.8 mini, 0.2 gpt-4o) reflect your prior belief that mini handles most cases. As you accumulate outcomes, Thompson Sampling shifts those weights based on actual success rates per model.

You can also report explicit quality signals:

from kalibr import Router, Outcome

router = Router(
    paths=[
        {"model": "openai/gpt-4o-mini"},
        {"model": "openai/gpt-4o"},
    ],
    goal_id="customer_support_routing"
)

def handle_support_ticket(ticket: str, user_id: str) -> dict:
    response, request_id = router.completion(
        messages=[{"role": "user", "content": ticket}],
        return_request_id=True
    )

    answer = response.choices[0].message.content

    # Report outcome based on your quality check
    # (e.g., did the customer escalate? did they mark resolved?)
    router.report_outcome(
        request_id=request_id,
        outcome=Outcome.SUCCESS  # or Outcome.FAILURE
    )

    return {"answer": answer, "request_id": request_id}

Comparing the Three Approaches

| Approach | Setup cost | Adapts over time | Extra latency | Requires labeling |
| --- | --- | --- | --- | --- |
| Rule-based heuristics | Low | No | None | No |
| Classifier call | Medium | No | ~200ms | No |
| Thompson Sampling (Kalibr) | Low | Yes | None | Optional |

For a greenfield system with no traffic data, start with heuristics. They are good enough and cost nothing.

For a system with clear task types and a budget for a second API call, the classifier approach is more accurate and still simple.

For production systems where you care about long-term cost efficiency and your traffic mix changes over time, outcome-based routing with Thompson Sampling is the right answer. It requires no rule maintenance and gets better with use.

Summary

Using gpt-4o-mini for simple tasks and gpt-4o for complex ones automatically is not a one-time config change. It is a routing problem. The tools exist to solve it properly:

  1. Heuristics if you want something working in 30 minutes
  2. Classifier call if your task types are well-defined and stable
  3. Thompson Sampling via Kalibr if you want the router to learn and maintain itself

The cost difference between getting this right and sending everything to gpt-4o is real. At 10,000 requests per day, the gap between full gpt-4o and a well-routed mix is often $200-500/month or more - and it compounds as you scale.
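That monthly figure is easy to sanity-check. The sketch below applies the early-2025 rates to 10,000 requests/day using only the small classification task from the cost-math section (200 input / 10 output tokens), so treat the result as a floor:

```python
# Rough monthly cost at 10,000 requests/day over 30 days, classification
# workload only (200 input / 10 output tokens, early-2025 rates).
requests = 10_000 * 30

mini_monthly = requests * (200 * 0.15 / 1e6 + 10 * 0.60 / 1e6)
gpt4o_monthly = requests * (200 * 2.50 / 1e6 + 10 * 10.00 / 1e6)

print(f"all-mini: ${mini_monthly:.2f}, all-gpt-4o: ${gpt4o_monthly:.2f}")
```

For this tiny task alone the gap is about $170/month; longer prompts and synthesis-heavy traffic push it well into the hundreds, which is where the $200-500 range comes from.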
