DEV Community

loyaldash

Cutting AI API Costs 95% at Scale: A CTO's Field Notes

I almost quit my last role over a single line item in our cloud bill. Our LLM spend had quietly crept past $11k a month, and I was the one who had greenlit the architecture. That moment taught me something most CTOs learn the hard way: picking the "best" model is rarely the right move. Picking the right model for each task is.

After three months of refactoring, I got that same workload down to under $1,200/month. Not by cutting features. Not by throttling users. Just by treating model selection like the engineering decision it actually is. Here's exactly what I did, what worked, and what I'd do differently if I were starting over tomorrow.

The core insight: a 90% reduction comes from model selection alone. Everything else is gravy on top.

Why "Just Use GPT-4o" Is a Trap

When we first shipped, we used GPT-4o for everything. Classification, summarization, even the dumb FAQ bot. It worked. It also cost $10/M output tokens, which sounds reasonable until you multiply it by production traffic.

Here's the table that made me physically flinch when I ran the numbers:

| Task | Expensive Choice | Smart Choice | Savings |
| --- | --- | --- | --- |
| Simple chat | GPT-4o ($10/M) | DeepSeek V4 Flash ($0.25/M) | 97.5% |
| Classification | GPT-4o-mini ($0.60/M) | Qwen3-8B ($0.01/M) | 98.3% |
| Code generation | GPT-4o ($10/M) | DeepSeek Coder ($0.25/M) | 97.5% |
| Summarization | GPT-4o ($10/M) | Qwen3-32B ($0.28/M) | 97.2% |
| Translation | GPT-4o ($10/M) | Qwen-MT-Turbo ($0.30/M) | 97% |

Notice something important: the "smart" models aren't downgrades. They're specialized. DeepSeek Coder beats GPT-4o on a lot of coding benchmarks. Qwen3-8B handles classification tasks with the same accuracy as GPT-4o-mini, at about 1.7% of the cost. The expensive default isn't "better"; it's just a hammer treating everything as a nail.
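If you want to sanity-check the savings column against your own price sheet, the arithmetic is a one-liner. This is a minimal sketch that assumes a single flat per-million-token price per model (the figures are the ones from the table above); `savings_pct` and `PRICES` are names I made up for illustration.

```python
def savings_pct(expensive_per_m: float, cheap_per_m: float) -> float:
    """Percentage saved when moving from one per-million-token price to another."""
    return round((1 - cheap_per_m / expensive_per_m) * 100, 1)


# (expensive price, cheap price) in dollars per million tokens, from the table.
PRICES = {
    "Simple chat": (10.00, 0.25),
    "Classification": (0.60, 0.01),
    "Code generation": (10.00, 0.25),
    "Summarization": (10.00, 0.28),
    "Translation": (10.00, 0.30),
}

if __name__ == "__main__":
    for task, (expensive, cheap) in PRICES.items():
        print(f"{task}: {savings_pct(expensive, cheap)}% cheaper")
```

Real bills split input and output pricing, so treat the result as a rough upper bound rather than a forecast.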

This is the first thing I'd tell any new CTO: build a model map on day one. Don't ship with a single-model default.

The Model Map I Wish I'd Written Sooner

Here's the routing table that runs in production today. It maps task types to specific models, and it's the single piece of code that did 90% of the work for me.

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

MODEL_MAP = {
    "chat":       "deepseek-v4-flash",    # $0.25/M
    "code":       "deepseek-coder",       # $0.25/M
    "simple":     "Qwen/Qwen3-8B",        # $0.01/M
    "reasoning":  "deepseek-reasoner",    # $2.50/M
    "classify":   "Qwen/Qwen3-8B",        # $0.01/M
    "translate":  "Qwen/Qwen-MT-Turbo",   # $0.30/M
    "summarize":  "Qwen/Qwen3-32B",       # $0.28/M
}

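def route_request(user_input: str) -> str:
    # classify_complexity (not shown here) returns one of the MODEL_MAP keys.
    task = classify_complexity(user_input)
    # Fall back to the general chat model if the classifier returns an unknown label.
    return MODEL_MAP.get(task, MODEL_MAP["chat"])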

response = client.chat.completions.create(
    model=route_request(user_input),
    messages=[{"role": "user", "content": user_input}],
)

Notice I'm pointing everything at global-apis.com/v1. That's not an accident. Vendor lock-in is the quiet killer of startup runway. The moment you hardcode openai.com in fifty places, you've given yourself a migration problem you'll never want to solve. Routing through a unified API endpoint meant I could swap Qwen for DeepSeek, or add a brand new provider, by changing one constant. That decision paid for itself the first time we did a 24-hour model bake-off.
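The routing snippet calls `classify_complexity` without showing it. Here is one way it could look: a minimal keyword-based sketch, not the classifier we run in production. The rules, the `TASK_RULES` name, and the `"chat"` default are all assumptions for illustration; a real version would be tuned on your own traffic, or could itself be a call to a cheap model.

```python
import re

# Hypothetical keyword rules, checked in order; the first match wins.
TASK_RULES = [
    ("translate", re.compile(r"\btranslate\b", re.I)),
    ("summarize", re.compile(r"\b(summari[sz]e|tl;dr)\b", re.I)),
    ("code", re.compile(r"\b(function|bug|stack trace|traceback|refactor)\b", re.I)),
    ("classify", re.compile(r"\b(classify|categori[sz]e|label)\b", re.I)),
    ("reasoning", re.compile(r"\b(prove|derive|step by step)\b", re.I)),
]


def classify_complexity(user_input: str, default: str = "chat") -> str:
    """Map a request to a task label; anything unmatched goes to the default."""
    for task, pattern in TASK_RULES:
        if pattern.search(user_input):
            return task
    return default
```

Every label it returns is a key in the model map, so the lookup in `route_request` always resolves.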

Tiered Routing: The 95% Number

Model selection got us to 90% in a week. The next 5% came from a pattern I'm slightly obsessed with: tiered routing.

The idea: don't decide the model in advance. Try the cheap one first, check if the response is good enough, and only escalate if it isn't.

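def smart_generate(prompt: str, max_budget: float = 0.50):
    """
    Try cheap first, escalate if quality insufficient.
    At scale, this is where the ROI gets absurd.

    call_model and quality_check are helpers not shown here;
    max_budget is accepted but not enforced in this sketch.
    """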

    # Tier 1: Ultra-budget ($0.01/M) — handles 80%+ of traffic
    resp = call_model("Qwen/Qwen3-8B", prompt)
    if quality_check(resp) >= 0.8:
        return resp

    # Tier 2: Standard ($0.25/M) — handles ~15% of traffic
    resp = call_model("deepseek-v4-flash", prompt)
    if quality_check(resp) >= 0.9:
        return resp

    # Tier 3: Premium ($0.78–$2.50/M) — only the hard 5%
    return call_model("deepseek-reasoner", prompt)

The customer support chatbot on our platform was the test case. Before tiered routing, it cost $420/month. After, $28/month. Same accuracy on user surveys. The 85% of queries that were "where's my order" or "how do I reset my password" never even touched the expensive models. They got classified and answered by Qwen3-8B for fractions of a cent per call.

At scale, this pattern is the difference between a unit-economics-positive product and one that dies quietly in the "AI features" tab of your dashboard.
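If you want to test the escalation logic without hitting any API, it helps to pass the model caller and the quality scorer in as arguments. The sketch below is a generalized variant of the tiered pattern, not our production code: `tiered_generate`, the `Tier` alias, and the callable signatures are assumptions, and the quality scorer is whatever heuristic or grader you trust.

```python
from typing import Callable, Sequence

Tier = tuple[str, float]  # (model name, minimum acceptable quality score)


def tiered_generate(
    prompt: str,
    tiers: Sequence[Tier],
    call_model: Callable[[str, str], str],
    quality_check: Callable[[str], float],
) -> tuple[str, str]:
    """Return (model_used, response), escalating until a tier clears its threshold.

    The last tier's response is returned even if it scores below its threshold,
    so give the final tier a threshold of 0.0 to make that explicit.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    model, response = "", ""
    for model, threshold in tiers:
        response = call_model(model, prompt)
        if quality_check(response) >= threshold:
            break
    return model, response
```

With the models from this post, the tier list would be `[("Qwen/Qwen3-8B", 0.8), ("deepseek-v4-flash", 0.9), ("deepseek-reasoner", 0.0)]`. Returning the model name alongside the response makes it easy to log how often each tier actually answers.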

Caching: The Thing You Should've Shipped on Day One

I'll be honest: response caching is boring, and that's exactly why it's powerful. I waited four months to implement it, and I regret every one of those months.

import hashlib
import json
import time

cache: dict = {}

def cached_chat(model: str, messages: list, ttl: int = 3600):
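    # sort_keys keeps the cache key stable even if dict key order differs between callers.
    key = hashlib.md5(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()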

    if key in cache:
        entry = cache[key]
        if time.time() - entry["time"] < ttl:
            return entry["response"]  # Cache hit — $0 cost

    response = client.chat.completions.create(
        model=model, messages=messages
    )
    cache[key] = {"response": response, "time": time.time()}
    return response

For our docs chatbot, this turned into a 50–80% hit rate on the first day. FAQ lookups, product specs, onboarding questions — humans ask the same things over and over, and the model doesn't care that it answered it before. The savings layer on top of model selection, not instead of it. Expect another 20–50% off whatever you're already spending.

Production-ready version: swap the in-memory dict for Redis with a sliding TTL. Same logic, doesn't lose cache on deploys.
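"Sliding TTL" means every hit pushes the expiry forward, so hot entries stay cached and cold ones age out. The sketch below shows that behavior in memory so it runs anywhere; `SlidingTTLCache`, `cache_key`, and the injectable `clock` are hypothetical names of mine. With Redis, the same idea would roughly map to a GET followed by an EXPIRE on hit and a SETEX on miss, with responses serialized before storing.

```python
import hashlib
import json
import time
from typing import Any, Callable, Optional


def cache_key(model: str, messages: list) -> str:
    """Stable key: sort_keys makes the hash independent of dict key order."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


class SlidingTTLCache:
    """In-memory cache where each hit resets the entry's time to live."""

    def __init__(self, ttl: float = 3600, clock: Callable[[], float] = time.time):
        self.ttl = ttl
        self.clock = clock  # injectable so tests can control time
        self._store: dict[str, tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        self._store[key] = (self.clock() + self.ttl, value)  # slide the expiry
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self.clock() + self.ttl, value)
```

One caveat with sliding expiry: a popular but stale answer can live forever, so consider capping total lifetime for content that changes.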

Prompt Compression: The Hidden Multiplier

This one surprised me. I assumed input tokens were "the cheap side" of the bill. That stopped being true once we started sending long system prompts.

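For our RAG pipeline, we were sending 2,000-token context blocks with every query. After compression, those blocks were 400 tokens, a cut of 1,600 input tokens per request. Run the numbers, assuming DeepSeek V4 Flash's $0.25/M rate applies to input:

  • Savings per request: $0.0004
  • Daily volume: 10,000 requests
  • Daily savings: $4
  • Annualized: $1,460

That looks small on a $0.25/M model, but it scales linearly with price and volume. The same 1,600-token cut on a model priced at $10/M is $160 a day.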
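The result depends entirely on your per-token price and request volume, so it's worth recomputing for your own pipeline before committing engineering time. A minimal calculator, assuming a flat price per million input tokens and a constant daily volume; `compression_savings` is a name I made up for this sketch.

```python
def compression_savings(
    tokens_before: int,
    tokens_after: int,
    price_per_m: float,
    requests_per_day: int,
) -> dict[str, float]:
    """Dollar savings from trimming input tokens, per request, per day, and per year."""
    saved_tokens = tokens_before - tokens_after
    per_request = saved_tokens * price_per_m / 1_000_000
    daily = per_request * requests_per_day
    return {"per_request": per_request, "daily": daily, "annual": daily * 365}


if __name__ == "__main__":
    # 2,000-token blocks compressed to 400 tokens, 10,000 requests a day.
    for label, amount in compression_savings(2000, 400, 0.25, 10_000).items():
        print(f"{label}: ${amount:,.4f}")
```

Remember to subtract the cost of the compression call itself when you compare the two numbers.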

Here's the implementation I landed on:

def compress_prompt(text: str, target_ratio: float = 0.5) -> str:
    """Compress long prompts before sending to the model."""
    if len(text) < 500:
        return text  # Already short — no point

    # Use a cheap model to summarize the context
    summary = call_model(
        "Qwen/Qwen3-8B",
        f"Summarize this in {int(len(text) * target_ratio)} chars: {text}",
    )
    return summary

The trick is using Qwen3-8B to do the compression. At $0.01/M, the cost of summarizing is rounding error compared to what you save on the downstream call. The ROI is one of those numbers that doesn't feel real until you see it on a dashboard.

Batching: The Underrated Win

Batching is the strategy nobody talks about because it's not as sexy as "we cut our AI bill 95%." But at scale, it's the difference between a clean architecture diagram and a firefighting Slack channel.

The pattern: instead of N separate API calls, send one batched call.

questions = ["Q1?", "Q2?", "Q3?"]

# Before: 3 separate calls, so 3x request overhead (and any shared system prompt is sent 3 times)
for q in questions:
    client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[{"role": "user", "content": q}],
    )

# After: 1 batched call — shared system prompt, lower overhead
batched_prompt = "\n".join(f"{i+1}. {q}" for i, q in enumerate(questions))
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{
        "role": "system",
        "content": "Answer each question on its own line.",
    }, {
        "role": "user",
        "content": batched_prompt,
    }],
)

The savings are 10–20% per batch, but the real win is latency and reliability. Fewer round trips means fewer chances for a timeout to wreck your user's experience.
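The part the snippet above skips is getting individual answers back out of one batched reply. Here is a minimal sketch that assumes you also instruct the model to prefix each answer with its question number (the system prompt shown earlier does not ask for that, so you would need to add it). `build_batched_prompt` and `split_batched_answers` are hypothetical helper names.

```python
import re


def build_batched_prompt(questions: list[str]) -> str:
    """Number the questions so the answers can be matched back up."""
    return "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))


def split_batched_answers(text: str, expected: int) -> list[str]:
    """Split a numbered reply ("1. ...", "2) ...") into per-question answers."""
    answers: dict[int, str] = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+)[.)]\s*(.+)", line)
        if match:
            answers[int(match.group(1))] = match.group(2).strip()
    missing = [n for n in range(1, expected + 1) if n not in answers]
    if missing:
        # Fail loudly so the caller can retry those questions individually.
        raise ValueError(f"no answer found for question(s): {missing}")
    return [answers[n] for n in range(1, expected + 1)]
```

Raising on a missing answer matters more than it looks: a batched call that silently drops one reply is harder to debug than three separate calls ever were.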

The Order I'd Implement These In

If I were starting over, here's the order I'd ship:

  1. Model map (day one). Build the routing table before you write a single prompt. This alone gets you 90% of the savings and it takes an afternoon.
  2. Tiered routing (week one). Add the quality-check escalator once you have a model map. This is the 95% number.
  3. Caching (week two). Boring, easy, and it stacks on top of everything else.
  4. Prompt compression (week three). Profile your input tokens first. Most teams are shocked at what they find.
  5. Batching (week four). Last because it requires the most refactoring, but worth doing.

Each step compounds. None of them require new vendors. None of them require new models. They require treating your LLM calls like any other production system with an SLA and a budget.

The Vendor Lock-In Talk

I want to be blunt about this. If your codebase is hardcoded to api.openai.com, you have a problem. Not today, maybe. But the day OpenAI raises prices, or has an outage, or ships a worse model than a competitor, you're stuck. The refactor will eat a quarter of engineering time. You'll do it during a launch. It'll be miserable.

Routing everything through global-apis.com/v1 means I can swap providers in an afternoon. That's not theoretical — I've done it twice this year. Once when we A/B tested Qwen3-32B against DeepSeek V4 Flash for our summarization pipeline, and once when we needed a fallback region during a provider outage. Both times, the swap was a config change. The production-ready thing isn't picking the best provider. It's making sure you can change your mind cheaply.
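The outage fallback can be as simple as an ordered list of models behind the same endpoint. This is a sketch under assumptions, not our exact setup: `call_with_fallback` is a hypothetical helper, and `call_model` stands in for whatever function wraps your chat completion call.

```python
from typing import Callable, Sequence


def call_with_fallback(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try each model in order; return (model_used, response) from the first success."""
    errors = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # broad on purpose in a sketch; narrow this in real code
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Because the model list is plain data, changing the fallback order is a config change rather than a code change.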

What "Production-Ready" Actually Means for AI

I hate the term, but I use it constantly. "Production-ready" for an LLM pipeline means:

  • Observability. Per-model cost, per-route latency, per-task accuracy. If you can't see it, you can't optimize it.
  • Bounded variance. Tiered routing gives you a cost ceiling. Caching gives you a latency floor. Use both.
  • Graceful degradation. When the premium model is down, does the cheap one carry the load? Or does your product break? Design for the former and you sleep better.
  • Portability. One URL, many providers. No vendor lock-in. This is the part I can't stress enough.
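Per-model cost tracking doesn't need a vendor to get started. A minimal sketch, assuming a flat price per million tokens per model and that you feed it token counts from each response (OpenAI-compatible chat completions report these in `response.usage.total_tokens`, when the provider populates it). `CostTracker` is a name I made up.

```python
from collections import defaultdict


class CostTracker:
    """Accumulates token usage per model and converts it to dollars."""

    def __init__(self, price_per_m: dict[str, float]):
        self.price_per_m = price_per_m  # dollars per million tokens, by model
        self.tokens: defaultdict[str, int] = defaultdict(int)
        self.calls: defaultdict[str, int] = defaultdict(int)

    def record(self, model: str, total_tokens: int) -> None:
        self.tokens[model] += total_tokens
        self.calls[model] += 1

    def cost(self, model: str) -> float:
        return self.tokens[model] * self.price_per_m[model] / 1_000_000

    def report(self) -> dict[str, float]:
        """Dollar cost for every model that has recorded usage."""
        return {model: self.cost(model) for model in self.tokens}
```

Call `record` right after each completion and dump `report()` to your metrics system on a timer; that is enough to see which route is eating the budget.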

My Actual Monthly Bill, Then vs Now

| Component | Before | After |
| --- | --- | --- |
| Customer support chatbot | $420 | $28 |
| Document summarization | $1,800 | $112 |
| Code review assistant | $2,400 | $190 |
| RAG pipeline | $3,100 | $340 |
| Misc / experimentation | $3,400 | $510 |
| Total | $11,120 | $1,180 |

That's an 89% reduction, and I haven't even fully implemented batching yet. Once we ship the batch refactor for our analytics pipeline, we'll be under $900/month for the same product surface.
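If you keep a similar before/after table, a few lines will total it and show which component moved the most. The figures below are the ones from the table above; `BILL` and `reduction_pct` are names I picked for this sketch.

```python
# Monthly cost in dollars: component -> (before, after), from the table above.
BILL = {
    "Customer support chatbot": (420, 28),
    "Document summarization": (1800, 112),
    "Code review assistant": (2400, 190),
    "RAG pipeline": (3100, 340),
    "Misc / experimentation": (3400, 510),
}


def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction from before to after, rounded to one decimal."""
    return round((1 - after / before) * 100, 1)


total_before = sum(before for before, _ in BILL.values())
total_after = sum(after for _, after in BILL.values())

if __name__ == "__main__":
    for name, (before, after) in BILL.items():
        print(f"{name}: {reduction_pct(before, after)}%")
    print(f"Total: ${total_before:,} -> ${total_after:,} "
          f"({reduction_pct(total_before, total_after)}%)")
```

The per-component view is the useful part: it tells you where the next round of optimization should go.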

ROI on the engineering time? About four weeks of one engineer, and we've been running this configuration for six months. The math is not subtle.

If You're Starting From Zero

Three things, in order:

  1. Build the model map today. It's a dictionary, not a platform decision. Start with the table above and adjust.
  2. Route through a single endpoint. I use Global API because it gives me OpenAI-compatible calls against dozens of models, and I can swap providers without touching application code. The vendor lock-in avoidance alone is worth it.
  3. Measure per-task accuracy. Don't just route to cheap models. Route to cheap models that pass your quality bar. The tiered routing pattern above shows you how.

The goal isn't to spend the least on AI. The goal is to spend the least while shipping the best product. Those are different problems, and the second one is the one that keeps startups alive.


If any of this resonates and you want to try the routing pattern without wiring up five different provider accounts, Global API is worth a look. It's the unified endpoint I used in all the code samples above, and it's what made the vendor lock-in problem disappear for us. Check it out at global-apis.com if you want — no pitch, just a tool that solved a real problem for me.
