gentlenode

How I Cut Our AI Bill by 65%: A CTO's Model Selection Playbook

Three months ago I opened our monthly infra invoice and nearly dropped my coffee. We were burning close to $18k/month on LLM API calls, and the worst part? Most of those calls were doing work that a model costing 1/10th the price could have handled. That invoice was the moment I stopped being a "just use GPT-4o for everything" engineer and started being a real CTO.

If you're shipping AI features in production, this post is for you. I'm going to walk you through exactly how I rethought our model strategy, the benchmark data that drove the decision, the architecture changes that made it production-ready, and the vendor lock-in playbook I use to sleep at night.

The Wake-Up Call

We'd launched a deep-dive research feature six months earlier. For the uninitiated, "deep dive" in our context means long-form analysis — we take a user query, run it through a multi-step reasoning pipeline, and produce a structured report. The workflow uses retrieval, summarization, and a final synthesis pass. Token volume adds up fast.

The first month? $4,200. Reasonable. By month three we'd onboarded paying customers, and the bill had grown to $18k. Our pricing page hadn't changed. The usage pattern hadn't changed that dramatically. What changed was that I hadn't been paying attention to the per-token economics at scale.

Here's the math that hurt: GPT-4o at $2.50 per million input tokens and $10.00 per million output tokens sounds cheap when you're prototyping. When you're processing 800M output tokens a month, it stops being cheap. It becomes a second salary.

That's when I started shopping around.

Mapping the Landscape (Without Locking In)

I'm paranoid about vendor lock-in. Anyone who lived through the AWS S3 outage of 2017, or watched their favorite cloud provider raise prices 30% with 90 days' notice, knows the feeling. Single-vendor dependency is a tax you pay in both money and optionality.

So I needed a unified API surface that gave me access to multiple model providers without forcing me to rewrite integration code every time I wanted to A/B test. Global API fits that bill — they expose 184 models through a single OpenAI-compatible endpoint. Same SDK, same interface, swap the model string and you're done.

Here are the five models that made my shortlist after two weeks of evaluation:

Model              Input ($/M)  Output ($/M)  Context
DeepSeek V4 Flash  0.27         1.10          128K
DeepSeek V4 Pro    0.55         2.20          200K
Qwen3-32B          0.30         1.20          32K
GLM-4 Plus         0.20         0.80          128K
GPT-4o             2.50         10.00         128K

The pricing spread is wild. The cheapest model in the catalog is $0.01 per million tokens. The most expensive is $3.50. That's a 350x range, and it tells you everything you need to know about the assumption that "all LLMs are roughly the same price." They're not. They are wildly different commodities, and treating them as interchangeable is leaving money on the table.
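To make the spread concrete, here is a back-of-the-envelope sketch that prices the 800M output tokens a month from earlier against the output rates in the table. It deliberately leaves input tokens out, so treat each figure as a floor on the bill, not a forecast. The function and dictionary names are mine, purely for illustration.

```python
# Output price in dollars per million tokens, copied from the table above.
OUTPUT_PRICE_PER_M = {
    "DeepSeek V4 Flash": 1.10,
    "DeepSeek V4 Pro": 2.20,
    "Qwen3-32B": 1.20,
    "GLM-4 Plus": 0.80,
    "GPT-4o": 10.00,
}


def monthly_output_cost(output_tokens_millions: float, price_per_million: float) -> float:
    """Dollar cost of output tokens only; input tokens are ignored on purpose."""
    return round(output_tokens_millions * price_per_million, 2)


# Print the monthly output-token cost for each model at 800M output tokens.
for model, price in OUTPUT_PRICE_PER_M.items():
    print(f"{model:<18} ${monthly_output_cost(800, price):>9,.2f}")
```

At that volume the GPT-4o output line alone comes to $8,000 a month, while the same tokens on DeepSeek V4 Flash come to $880.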

The Benchmark That Changed My Mind

I'd been guilty of the same mistake I see in every startup CTO's first AI architecture doc: picking the model with the best vibes. "GPT-4o is the safe choice. Everyone knows it. We'll just use that." That's not architecture. That's vibes.

I ran a proper evaluation suite. Took two weeks off and on, but I ran our actual production prompts through each model on the shortlist and graded them on a quality rubric our PM team built. The result: 84.6% average benchmark score across the top five contenders, with the cheaper models scoring within 1-2 points of GPT-4o on the tasks that actually mattered for our deep-dive feature.

The 40-65% cost reduction number in our case study isn't marketing fluff. It's what fell out of the spreadsheet when I plugged in our real workload.

Let me show you what the actual integration looks like, because if it's painful, none of this matters:

import openai
import os

client = openai.OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[
        {"role": "system", "content": "You are a research analyst..."},
        {"role": "user", "content": "Analyze the following market data..."}
    ],
    temperature=0.3,
    max_tokens=4000,
)

print(response.choices[0].message.content)

That's it. That's the migration. Drop in the new base URL, change the model string, point the API key at the new provider. If you've integrated with the OpenAI SDK once, you've integrated with every model in the catalog. The unified interface is what makes fast iteration possible — I can A/B test three different models in a single afternoon without touching infrastructure.
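For the afternoon A/B test, something like the sketch below is enough to get side-by-side outputs. It assumes an OpenAI-compatible client like the one above; `compare_models` is a hypothetical helper name, and grading the replies is still a separate step.

```python
from typing import Any


def compare_models(client: Any, models: list[str], messages: list[dict]) -> dict[str, str]:
    """Send identical messages to each model and collect the replies by model name."""
    replies = {}
    for model in models:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.3,  # same setting for every model so outputs are comparable
        )
        replies[model] = response.choices[0].message.content
    return replies
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before you spend real tokens on it.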

Production Architecture: Routing, Not Picking

Here's the architecture decision that actually moved the needle: stop picking one model. Start routing.

Not every prompt is created equal. A simple classification call doesn't need a 200K context window and frontier reasoning. A multi-document synthesis does. Building a router that matches prompt complexity to model tier cut our bill by another 20% on top of the model swap.

Here's a simplified version of the routing layer we ended up shipping:

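def route_request(prompt, estimated_tokens, task_type):
    """Return a (model, temperature) pair; prompt itself is currently unused."""
    if task_type == "classification" or estimated_tokens < 500:
        # Cheap tier: short or simple requests
        return "deepseek-ai/DeepSeek-V4-Flash", 0.3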
    elif task_type == "summarization" and estimated_tokens < 8000:
        # Mid-tier — good quality, reasonable price
        return "Qwen3-32B", 0.5
    elif task_type == "deep_dive":
        # The big guns — only when the task earns it
        return "deepseek-ai/DeepSeek-V4-Pro", 0.7
    else:
        # Default fallback
        return "GLM-4 Plus", 0.5

def call_llm(prompt, task_type, estimated_tokens):
    model, temp = route_request(prompt, estimated_tokens, task_type)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
    )
    return response.choices[0].message.content

That routing function is the single most valuable piece of code I shipped last quarter. It's not sophisticated. It's not even particularly clever. But it forces us to make conscious decisions about cost vs. quality on a per-request basis, and it gives me a clean place to plug in new models as they emerge.
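The router takes an `estimated_tokens` argument, so something has to produce that number before the call. One cheap way to get it, sketched here under the assumption of roughly four characters per token for English text, is a length-based guess. `estimate_tokens` is a hypothetical helper, and the ratio is a rule of thumb, not a tokenizer; real counts vary by model.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count from character length; an approximation, not a tokenizer."""
    if not text:
        return 0
    # Never report zero tokens for non-empty text.
    return max(1, round(len(text) / chars_per_token))
```

A guess like this is good enough for picking a tier. For anything that feeds billing, count with the tokenizer of the model you are actually calling.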

The performance numbers that fell out of this architecture: 1.2 seconds average latency, 320 tokens per second sustained. Production-ready isn't a slide deck word in my world. It's a property of the system you can measure on a Tuesday morning.

Caching and Streaming: The Boring Wins

Everyone wants to talk about model selection. Nobody wants to talk about the 40% hit rate I get from prompt caching, or how streaming responses changed our user-perceived latency profile. These are the unglamorous wins that compound.

Caching is simple: if a user asks the same question twice, don't pay for it twice. We use a Redis layer in front of the model calls with a semantic similarity threshold. 40% of our deep-dive requests now hit cache. That's not 40% off our entire bill, because cache hits are still served through our infrastructure, but it is roughly 40% fewer deep-dive calls on the LLM line item, which is the only one that scales unboundedly with usage.
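The simplest form of the idea is an exact-match cache. The sketch below is deliberately a weaker technique than the setup described above: it uses an in-memory dict where we use Redis, and it matches on a normalized prompt string, not on semantic similarity. All names are hypothetical.

```python
import hashlib
from typing import Callable


class ExactMatchCache:
    """In-memory stand-in for a cache in front of model calls (exact match only)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Normalize case and whitespace so trivial variations share one entry.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str, call: Callable[[str, str], str]) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call(model, prompt)  # only pay for the model call on a miss
        self._store[key] = result
        return result

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Keying on the model as well as the prompt matters once you route: an answer cached from a cheap tier should not be served for a request that was meant to go to a stronger one.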

Streaming is even simpler. Flip a parameter, get tokens as they're generated. The user sees the first token in 200ms instead of waiting 1.2 seconds for the full response. Time-to-first-token is a UX metric, but perceived speed also affects retention. I'm not above gaming the human perception of speed if it keeps churn down.
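The parameter in question is `stream=True` on the OpenAI SDK's chat completions call. Here is a minimal sketch of consuming the stream and measuring time-to-first-token, assuming an OpenAI-compatible client is passed in. The helper names are hypothetical, and whether every model behind a given endpoint supports streaming is something to verify for yourself.

```python
import time
from typing import Any, Iterator


def stream_reply(client: Any, model: str, prompt: str) -> Iterator[str]:
    """Yield text fragments as the model produces them."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip them.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


def time_to_first_token(fragments: Iterator[str]) -> tuple[float, str]:
    """Return (seconds until the first fragment arrived, full text)."""
    start = time.perf_counter()
    first = None
    parts = []
    # The generator sends the request on first iteration, so it is inside the timing.
    for fragment in fragments:
        if first is None:
            first = time.perf_counter() - start
        parts.append(fragment)
    return (first if first is not None else 0.0), "".join(parts)
```

In a real handler you would forward each fragment to the user as it arrives. Collecting them here just keeps the measurement self-contained.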

The Vendor Lock-In Playbook

Let me talk about the thing that keeps me up at night, because it should keep you up too. Vendor lock-in is the silent killer of startup AI budgets. You pick a provider, you build tooling around their SDK quirks, you train your team on their dashboard, you build eval pipelines against their specific output formats, and then they raise prices. Or they deprecate the model you depend on. Or they have an outage at the worst possible moment.

The mitigation strategy I use has three layers:

Layer 1: Abstract the call site. Every model interaction in our codebase goes through one function — call_llm(prompt, ...) — and that function resolves to a provider at runtime. If I want to switch providers for an entire feature, I change one config flag. No code changes. No redeploys. No sprint planning.

Layer 2: Maintain eval parity. I run the same evaluation suite against every model on the shortlist every month. If a model degrades, I know. If a new model ships that beats my current default, I know. Eval parity means I can switch on a Tuesday and ship on a Wednesday.

Layer 3: Multi-provider redundancy. For the 10% of requests that absolutely, positively cannot fail, I run a primary and a fallback. The fallback is always a different provider, ideally on a different cloud, ideally with different geopolitical risk exposure. Yes, this is overkill for most startups. No, I don't care. I've been burned before.

Global API makes all three layers easier because the interface is the same regardless of which model sits behind it. I can A/B test two providers in production with a 50/50 traffic split, and the only thing I have to change is the model string in my routing function.
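Layer 3 is the easiest of the three to show in code. This is a minimal sketch, not our production implementation: each provider is assumed to be a plain callable wrapping a client for a different vendor, and `call_with_fallback` is a hypothetical name.

```python
from typing import Callable, Sequence


def call_with_fallback(prompt: str, providers: Sequence[Callable[[str], str]]) -> str:
    """Try each provider callable in order; return the first successful reply."""
    errors = []
    for provider in providers:
        try:
            return provider(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors!r}")
```

In production you would likely narrow the exception types, put a timeout on each attempt, and log which provider actually served the request so the error-rate dashboards stay honest.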

What I Track

If you can't measure it, you can't optimize it. Here's what I look at every Monday:

  • Cost per deep-dive report: my north star. Down 62% since the architecture change.
  • P95 latency by model tier: ensures I'm not trading cost for user pain.
  • Cache hit rate by prompt category: tells me where to invest in better caching.
  • Quality scores by model: my eval pipeline output. Drift is a leading indicator.
  • Provider error rates: 429s, 500s, and the like. If a provider starts misbehaving, I want to know before the dashboard tells me.
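As one example of how little code these metrics need, here is a sketch of the P95-latency-by-tier number computed from request logs. The record fields (`tier`, `latency_s`) are assumptions for illustration, and the percentile uses the nearest-rank method; your metrics stack may interpolate differently.

```python
from collections import defaultdict


def p95(values: list[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not values:
        raise ValueError("p95 of empty list")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def p95_latency_by_tier(records: list[dict]) -> dict[str, float]:
    """Group request records by 'tier' and compute the P95 of 'latency_s' per tier."""
    by_tier = defaultdict(list)
    for record in records:
        by_tier[record["tier"]].append(record["latency_s"])
    return {tier: p95(latencies) for tier, latencies in by_tier.items()}
```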

The ROI on this kind of instrumentation is enormous. The first month we tracked these metrics, we found a single misconfigured batch job that was sending full-context-window prompts to GPT-4o when a 4K-context model would have been fine. That one bug was costing us $2,400/month. One config change. Twenty minutes of work. Two thousand four hundred dollars a month, every month, for the life of the product.

The Speed of Iteration Thing

I want to talk about velocity for a second, because it's the part of "production-ready" that nobody puts on a slide. When I can swap model providers in an afternoon — and I can, because the API surface is stable across all
