Srinivas Jayesh
How We Cut AI Inference Costs 6x With Runtime Model Routing

Every query through the most powerful model. That was our default.

It was also burning money on problems that didn't need it.

Here's how we fixed it with runtime model routing — and what the numbers looked like after.

The Problem With One-Size-Fits-All Models

When you're building an AI agent, the easiest thing is to pick one model and use it for everything. GPT-4, Claude, Llama 70B — whatever feels most capable.

The problem: a P3 alert about stale search results doesn't need the same model as a P1 payment failure. Routing both through your most powerful model is like calling a surgeon to treat a papercut.

We needed intelligence in the routing layer itself.

What cascadeflow Does

We integrated cascadeflow — a runtime intelligence layer that decides which model handles each request based on what the request actually needs.

Setup is straightforward:

from cascadeflow import CascadeAgent, ModelConfig

models = [
    ModelConfig(name="llama-3.1-8b-instant", provider="groq", cost_per_token=0.0000001),
    ModelConfig(name="llama-3.3-70b-versatile", provider="groq", cost_per_token=0.0000008),
]
cascade = CascadeAgent(models=models, verbose=True)

Two models. One cheap and fast. One powerful and expensive. cascadeflow decides which one handles each request.

The Routing Logic

We route based on incident severity:

def route_model(severity):
    if severity == "P1":
        model = "llama-3.3-70b-versatile"
        reason = "P1 incident — routing to powerful model"
    else:
        model = "llama-3.1-8b-instant"
        reason = "Low severity — routing to fast cheap model"
    # Separate the reason from the model name so the log line stays readable
    print(f"[CASCADEFLOW] {reason} → {model}")
    return model

P1 incidents — payment failures, auth outages, data pipeline crashes — go to the powerful model. P2 and P3 incidents go to the fast cheap model.
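As the number of severity levels grows, a lookup table can be easier to extend than an if/else. The sketch below is one possible variant, not part of cascadeflow: `ROUTES` and `pick_model` are hypothetical names. Sending unrecognised severities to the powerful model is an assumption of ours; the function above sends everything that is not P1 to the cheap model.

```python
# Hypothetical table-driven variant of route_model; not a cascadeflow API.
POWERFUL = "llama-3.3-70b-versatile"
CHEAP = "llama-3.1-8b-instant"

# Severity label -> model name. Add rows here as the policy evolves.
ROUTES = {"P1": POWERFUL, "P2": CHEAP, "P3": CHEAP}


def pick_model(severity: str) -> str:
    """Return the model name for a severity label."""
    key = severity.strip().upper()
    # Assumption: an unrecognised label is treated as potentially serious,
    # so it falls back to the powerful model rather than the cheap one.
    return ROUTES.get(key, POWERFUL)
```

Whether unknown labels should fail safe (powerful model) or fail cheap is a policy decision; the table makes that choice visible in one place.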

The logic is simple. The savings are not.

The Numbers

After adding cascadeflow routing:

| Severity | Model | Cost per query |
|---|---|---|
| P1 | llama-3.3-70b-versatile | $0.000271 |
| P3 | llama-3.1-8b-instant | $0.000038 |

That's better than a 6x cost difference: $0.000271 / $0.000038 is roughly 7x on these two sample queries. On a system handling hundreds of alerts per day, that compounds fast.
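How much you save overall depends on how much of your traffic is P1. Here is a rough sketch of that arithmetic using the two per-query costs quoted above. The traffic mixes in the loop are hypothetical, not measurements from our system, and `blended_cost` and `savings_factor` are illustrative names.

```python
# Per-query costs quoted above (USD).
COST_P1 = 0.000271   # llama-3.3-70b-versatile
COST_LOW = 0.000038  # llama-3.1-8b-instant


def blended_cost(p1_share: float) -> float:
    """Average cost per query when p1_share of traffic is P1 and the rest is routed cheap."""
    if not 0.0 <= p1_share <= 1.0:
        raise ValueError("p1_share must be between 0 and 1")
    return p1_share * COST_P1 + (1.0 - p1_share) * COST_LOW


def savings_factor(p1_share: float) -> float:
    """How many times cheaper routing is than sending every query to the 70B model."""
    return COST_P1 / blended_cost(p1_share)


if __name__ == "__main__":
    for share in (0.0, 0.05, 0.20):  # hypothetical traffic mixes
        print(f"P1 share {share:.0%}: {savings_factor(share):.1f}x cheaper")
```

With no P1 traffic the factor is about 7.1x; at a 5% P1 share it is about 5.5x, and at 20% about 3.2x. The rarer your P1 incidents, the closer you get to the full per-query gap.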

And the quality on P3 incidents? Identical. A stale search index doesn't need a 70B parameter model to tell you to force a refresh.

What cascadeflow Logs

One of the most useful things cascadeflow gives you is visibility. Every routing decision is logged:

[CASCADEFLOW] P1 incident — routing to powerful model → llama-3.3-70b-versatile
[CASCADEFLOW] Tokens: 339 | Cost: $0.000271 | Latency: 0.59s
[CASCADEFLOW] Low severity — routing to fast cheap model → llama-3.1-8b-instant
[CASCADEFLOW] Tokens: 398 | Cost: $0.000038 | Latency: 0.33s

You can see exactly which model handled each request, how many tokens it used, what it cost, and how long it took. That audit trail is invaluable for understanding where your budget is going.
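To turn that audit trail into per-model totals, you could parse the log lines. The sketch below assumes the exact line format shown above (a routing line ending in `→ <model>`, followed by its `Tokens | Cost | Latency` line). `summarize` and both regexes are our own illustrative helpers, not something cascadeflow ships.

```python
import re
from collections import defaultdict

# Assumed formats, copied from the sample log lines above.
ROUTE_RE = re.compile(r"\[CASCADEFLOW\] .* → (?P<model>\S+)$")
STATS_RE = re.compile(
    r"\[CASCADEFLOW\] Tokens: (?P<tokens>\d+) \| "
    r"Cost: \$(?P<cost>[\d.]+) \| Latency: (?P<latency>[\d.]+)s$"
)


def summarize(lines):
    """Aggregate query count, tokens, cost and latency per model from log lines."""
    totals = defaultdict(
        lambda: {"queries": 0, "tokens": 0, "cost": 0.0, "latency": 0.0}
    )
    current_model = None
    for raw in lines:
        line = raw.strip()
        route = ROUTE_RE.match(line)
        if route:
            # Remember which model the next stats line belongs to.
            current_model = route.group("model")
            continue
        stats = STATS_RE.match(line)
        if stats and current_model is not None:
            entry = totals[current_model]
            entry["queries"] += 1
            entry["tokens"] += int(stats.group("tokens"))
            entry["cost"] += float(stats.group("cost"))
            entry["latency"] += float(stats.group("latency"))
            current_model = None  # stats consumed; wait for the next routing line
    return dict(totals)
```

Blank lines and unrelated output are skipped, and a stats line with no preceding routing line is ignored rather than guessed at.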

Combining With Memory

We used cascadeflow alongside Hindsight for persistent agent memory. Hindsight stores every resolved incident as a memory. When a new alert fires, the agent recalls relevant past incidents as context.

The combination is powerful: memory makes the answers better, routing makes them cheaper. Together they make the agent production-ready.
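To make the store-and-recall pattern concrete, here is a toy stand-in. It is not Hindsight's API, and it is far cruder than a real memory layer: `IncidentMemory` is a hypothetical class that ranks past incidents by simple word overlap with the new alert.

```python
from dataclasses import dataclass, field


def _tokens(text: str) -> set[str]:
    """Lowercased whitespace-separated words of a piece of text."""
    return set(text.lower().split())


@dataclass
class IncidentMemory:
    """Toy stand-in for a memory layer; NOT the Hindsight API."""

    records: list[tuple[str, str]] = field(default_factory=list)

    def store(self, alert: str, resolution: str) -> None:
        """Remember how a resolved alert was fixed."""
        self.records.append((alert, resolution))

    def recall(self, alert: str, top_k: int = 2) -> list[str]:
        """Return resolutions of the past alerts that share the most words."""
        query = _tokens(alert)
        scored = []
        for past_alert, resolution in self.records:
            past = _tokens(past_alert)
            union = query | past
            # Jaccard similarity: shared words divided by all words.
            score = len(query & past) / len(union) if union else 0.0
            if score > 0:
                scored.append((score, resolution))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [resolution for _, resolution in scored[:top_k]]
```

The recalled resolutions would be added to the prompt as context before the routed model is called, so even the cheap model starts from what worked last time.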

Before and After

Before cascadeflow:
Every incident query goes through llama-3.3-70b-versatile. Cost per query: $0.000271. P3 alerts cost the same as P1. Budget burns fast.

After cascadeflow:
P1 incidents escalate to the powerful model. P2/P3 route to the fast cheap model. Cost per low-severity query drops by more than 6x ($0.000271 to $0.000038). Budget goes further. Latency on low-severity alerts drops by 44% (0.59s to 0.33s).

What I Learned

Route by complexity, not by default. Most queries don't need your best model. Defaulting to the most powerful option is a lazy decision that costs real money.

Visibility matters as much as routing. Knowing which model handled each request, at what cost, with what latency — that's the data you need to optimize further.

Start with severity, refine later. Severity-based routing is the simplest starting point. As you collect data, you can add complexity — token budget enforcement, quality thresholds, automatic escalation.

Free models go further with smart routing. We used Groq's free tier throughout. cascadeflow's routing meant we could stay within free tier limits while handling more queries.
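As a pointer for the "refine later" step, automatic escalation could look roughly like this: try the cheap model first and only pay for the powerful one when the draft fails a quality check. This is a sketch under assumptions, not how cascadeflow implements it. `call_model` and `is_good_enough` are hypothetical callables you would supply yourself.

```python
from typing import Callable

CHEAP = "llama-3.1-8b-instant"
POWERFUL = "llama-3.3-70b-versatile"


def answer_with_escalation(
    prompt: str,
    call_model: Callable[[str, str], str],
    is_good_enough: Callable[[str], bool],
) -> tuple[str, str]:
    """Try the cheap model first; escalate if the draft fails the check.

    call_model(model_name, prompt) and is_good_enough(answer) are
    hypothetical callables supplied by the caller; neither is a
    cascadeflow or Groq API. Returns (model_used, answer).
    """
    draft = call_model(CHEAP, prompt)
    if is_good_enough(draft):
        return CHEAP, draft
    # The draft was rejected, so this request pays for both calls.
    return POWERFUL, call_model(POWERFUL, prompt)
```

The trade-off is that an escalated request costs more than going straight to the powerful model, so this only pays off when most drafts pass the check.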
