DEV Community

eagerspark

How I Cut Our AI API Bill by 95% — A CTO's Field Notes

Last quarter I opened our infra bill and nearly choked. We were burning through AI tokens like a hedge fund burns through Series A cash — casually, and without anyone really tracking it. I'm a CTO at a small startup, which means I'm also the person who writes the code, picks the models, signs the bills, and explains to the board why our gross margin looks like Swiss cheese.

This is the playbook I wish someone had handed me six months earlier. Every number below is real. Every line of code is running in production. And if you're a startup founder or eng lead reading this, my hope is that you skip the expensive learning curve I tripped through.

The $14,000 Mistake That Started Everything

Our initial AI stack looked like everyone else's in 2024: GPT-4o for everything. Customer support? GPT-4o. Internal summarization? GPT-4o. Document classification? You guessed it — GPT-4o. We told ourselves we'd "optimize later" once we had product-market fit.

Two months in, our invoice hit $14,000. Product-market fit was still a rumor. That's when I sat down with our usage logs and realised we were running thousands of trivial requests through the most expensive model on the market. It was like using a Lamborghini to fetch groceries.

The fix wasn't fancy ML engineering. It was architecture. And the ROI was immediate — within three weeks our bill dropped to under $700/month for the same product surface. That's the 95% number I'll keep referencing. Let me walk you through the actual stack.

Pick Your Battles: Model Selection Is Your Biggest Lever

When I tell founders "model selection is the single biggest cost lever," they nod politely and then keep using GPT-4o for everything. I get it — the convenience factor is real. But at scale, "convenient" becomes "bankrupting."

Here's the exact map I built for our system. These aren't theoretical comparisons; they're the production assignments I committed to after weeks of benchmarking:

Prices are per million tokens ($/M).

Task type       | What we used to use    | What we use now               | Savings
----------------|------------------------|-------------------------------|--------
Simple chat     | GPT-4o at $10/M        | DeepSeek V4 Flash at $0.25/M  | 97.5%
Classification  | GPT-4o-mini at $0.60/M | Qwen3-8B at $0.01/M           | 98.3%
Code generation | GPT-4o at $10/M        | DeepSeek Coder at $0.25/M     | 97.5%
Summarization   | GPT-4o at $10/M        | Qwen3-32B at $0.28/M          | 97.2%
Translation     | GPT-4o at $10/M        | Qwen-MT-Turbo at $0.30/M      | 97%

The pattern: every "smart" choice here is roughly 33-60× cheaper than the convenient default. When you're processing millions of tokens, those ratios compound into real cash.
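The savings column and the price ratios can be recomputed from the per-million-token prices in the table. This is a small sketch for checking the arithmetic; the dictionary layout and function names are illustrative, not part of the production stack.

```python
# (old price, new price) in dollars per million tokens, taken from the table above.
PRICE_MAP = {
    "Simple chat":     (10.00, 0.25),
    "Classification":  (0.60, 0.01),
    "Code generation": (10.00, 0.25),
    "Summarization":   (10.00, 0.28),
    "Translation":     (10.00, 0.30),
}

def savings_pct(old_price, new_price):
    """Percentage saved by moving from old_price to new_price."""
    return (old_price - new_price) / old_price * 100

def price_ratio(old_price, new_price):
    """How many times cheaper the new model is."""
    return old_price / new_price

for task, (old, new) in PRICE_MAP.items():
    print(f"{task:16s} {savings_pct(old, new):5.1f}% saved, "
          f"{price_ratio(old, new):4.1f}x cheaper")
```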

One thing I want to flag: this isn't just about cost. Vendor lock-in is a real risk. If I'd built everything around GPT-4o APIs, I'd be one pricing change away from a margin crisis. Spreading inference across multiple providers — DeepSeek for code, Qwen for translation, with a premium model in reserve — gives us negotiating leverage and resilience. If one provider raises prices or has an outage, we reroute in hours, not weeks.
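Rerouting on an outage can be as simple as walking an ordered list of models until one answers. The helper below is a hedged sketch, not the production router: `call_with_fallback` is a hypothetical name, and `call_fn` stands in for whatever function actually sends the request.

```python
def call_with_fallback(models, messages, call_fn):
    """Try each model in order; return (model, response) from the first success."""
    last_error = None
    for model in models:
        try:
            return model, call_fn(model, messages)
        except Exception as exc:  # in production, narrow this to network/HTTP errors
            last_error = exc
    # Every model failed (or the list was empty): surface the last error as the cause.
    raise RuntimeError("all models in the fallback chain failed") from last_error
```

With the `call_model` function shown later in this post, a call might look like `call_with_fallback(["deepseek-v4-flash", "Qwen/Qwen3-8B"], messages, call_model)`. Catching a broad `Exception` keeps the sketch short; a real router would likely distinguish retryable failures (timeouts, 5xx responses) from bad requests.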

Tiered Routing: The 80/15/5 Split That Changed Everything

Once you accept that not every request needs the same brain, the next architecture decision is routing. We settled on a three-tier system that I'll show you in code shortly, but first the philosophy:

  • Tier 1 handles the long tail of easy queries. Around 80% of traffic.
  • Tier 2 handles the medium stuff. Maybe 15% of traffic.
  • Tier 3 is reserved for the genuinely hard 5%.
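To see what that split is worth, here is a rough estimate of the blended price per million tokens, using the tier prices quoted in this post ($0.01, $0.25, $2.50). It assumes every request uses about the same number of tokens at every tier, which is a simplification. The second function models a cascade, where an escalated request also pays for the cheaper attempts that came before it.

```python
TIER_PRICES = {"ultra_budget": 0.01, "standard": 0.25, "premium": 2.50}   # $/M tokens
TRAFFIC_SHARE = {"ultra_budget": 0.80, "standard": 0.15, "premium": 0.05}
TIER_ORDER = ["ultra_budget", "standard", "premium"]

def blended_price(prices, shares):
    """Average $/M if each request is billed only at the tier that answers it."""
    return sum(prices[tier] * shares[tier] for tier in prices)

def cascade_price(prices, shares, order):
    """Average $/M if every request starts at the cheapest tier and escalates."""
    total = 0.0
    reaching = 1.0  # fraction of traffic that reaches the current tier
    for tier in order:
        total += reaching * prices[tier]
        reaching -= shares[tier]  # requests answered here go no further
    return total

print(f"direct routing: ${blended_price(TIER_PRICES, TRAFFIC_SHARE):.4f}/M")
print(f"cascade:        ${cascade_price(TIER_PRICES, TRAFFIC_SHARE, TIER_ORDER):.4f}/M")
```

Under these assumptions the cascade costs a little more than direct routing, because failed cheap attempts are still billed, but both land far below a $10/M default.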

I built this as a wrapper around the OpenAI-compatible API. We use Global API as our routing layer because it gives us one endpoint for many models — less vendor lock-in, one bill, one place to monitor. Here's the actual code that powers our production routing:

import os  # needed for os.environ below

import httpx

BASE_URL = "https://global-apis.com/v1"
API_KEY = os.environ["GLOBAL_API_KEY"]

MODEL_MAP = {
    "ultra_budget": "Qwen/Qwen3-8B",        # $0.01/M
    "standard":     "deepseek-v4-flash",    # $0.25/M
    "premium":      "deepseek-reasoner",    # $2.50/M
}

def call_model(model, messages, max_tokens=512):
    response = httpx.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,
            "messages": messages,
            "max_tokens": max_tokens,
        },
        timeout=30.0,
    )
    response.raise_for_status()
    return response.json()

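def quality_check(response):
    """Crude length-based score in [0, 1]; the caller escalates when it is too low."""
    text = response["choices"][0]["message"]["content"]
    if not text or len(text) < 5:
        return 0.0
    # Length is only a proxy for quality: 160+ characters clears a 0.8 bar,
    # 180+ clears 0.9. Short but correct answers will still be escalated.
    return min(1.0, len(text) / 200)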

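def smart_generate(prompt, max_budget_per_call=0.50):
    # Note: max_budget_per_call is accepted but not enforced in this version.
    messages = [{"role": "user", "content": prompt}]

    # Tier 1: ultra-budget model, handles the easy majority
    resp = call_model(MODEL_MAP["ultra_budget"], messages)
    if quality_check(resp) >= 0.8:
        return resp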

    # Tier 2 — standard tier
    resp = call_model(MODEL_MAP["standard"], messages)
    if quality_check(resp) >= 0.9:
        return resp

    # Tier 3 — premium, only the genuinely hard stuff
    return call_model(MODEL_MAP["premium"], messages, max_tokens=2048)

The real-world result: our customer support chatbot went from $420/month to $28/month. Same product, same users, same answer quality (we ran blind evals). That's a 93% reduction just from routing, before we applied anything else I'm about to show you.

Caching: The Free Lunch Nobody Takes

Caching is one of those things every senior engineer knows they should do, and somehow almost nobody actually implements at scale. The reason is laziness — it's easier to fire off another API call than to think about whether you've already seen this prompt. But at scale, laziness is expensive.

Our setup is deliberately simple. We hash the prompt + model combination and store the response. If we see the same hash within the TTL window, we return the cached response at zero cost. Here's the production version:

import hashlib
import json
import time

_cache = {}

def cached_call(model, messages, ttl=3600):
    key = hashlib.md5(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()

    entry = _cache.get(key)
    if entry and time.time() - entry["time"] < ttl:
        return entry["response"]  # cache hit — zero API cost

    response = call_model(model, messages)
    _cache[key] = {"response": response, "time": time.time()}
    return response

For us, the win was massive because support queries repeat. "How do I reset my password?" gets asked thousands of times. "What's your refund policy?" same. Once we cached common queries, we saw cache hit rates between 50% and 80%, depending on the product surface. Every hit is a request we don't pay for, so on those surfaces that cut our API spend by at least half again on top of the model routing savings.

A word of caution: caching only works if you have predictable, repeatable traffic. If every prompt is unique (creative generation, personalized recommendations), this won't help. But for anything FAQ-shaped, document Q&A, or classification-heavy, it's production-ready and basically free.
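A plain dict works for a demo, but it grows without bound and treats "Reset my password" and "reset my  password" as different prompts. Below is a sketch of a bounded TTL cache plus a prompt normalizer, under my own assumptions: `TTLCache` and `normalize_prompt` are hypothetical names, and lowercasing is only safe where case does not change the meaning of a query.

```python
import time
from collections import OrderedDict

class TTLCache:
    """Small in-memory cache with a time-to-live and least-recently-used eviction."""

    def __init__(self, max_entries=10_000, ttl=3600, clock=time.time):
        self._data = OrderedDict()
        self._max_entries = max_entries
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._data[key]  # expired: drop it and report a miss
            return None
        self._data.move_to_end(key)  # mark as recently used
        return value

    def set(self, key, value):
        self._data[key] = (self._clock(), value)
        self._data.move_to_end(key)
        while len(self._data) > self._max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry

def normalize_prompt(text):
    """Collapse whitespace and case so trivially different prompts share a key."""
    return " ".join(text.lower().split())
```

This is still per-process memory. Sharing hits across several workers would need an external store, which is outside the scope of this sketch.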

Prompt Compression: The Hidden Token Tax

Input tokens are the silent killer. Most teams optimize their prompts for clarity and completeness, which is the right call for quality — but devastating for cost when those prompts get sent millions of times.

The trick: use a cheap model to compress long context before sending it to an expensive one. We use Qwen3-8B at $0.01/M to summarize long documents, then send the summary to whichever model actually does the work. Here's the function:

def compress_prompt(text, target_ratio=0.5):
    if len(text) < 500:
        return text  # already short — don't waste a call

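    target_chars = int(len(text) * target_ratio)
    instruction = (
        f"Summarize the following text in roughly "
        f"{target_chars} characters, "
        f"preserving all key facts and entities:\n\n{text}"
    )
    summary = call_model(
        MODEL_MAP["ultra_budget"],
        [{"role": "user", "content": instruction}],
        # call_model defaults to max_tokens=512, which can cut long summaries short.
        # Budget about one token per 3 characters (a rough assumption).
        max_tokens=max(512, target_chars // 3),
    )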
    return summary["choices"][0]["message"]["content"]

Let's do the math that made me a believer. We had a 2,000-token system prompt that we were prepending to every single request. After compression, it became 400 tokens, so each request carries 1,600 fewer input tokens. At $0.25/M (DeepSeek V4 Flash rates), that's a saving of $0.0004 per request, or about $4/day at 10,000 requests/day (roughly $1,460/year). At the $10/M we used to pay for GPT-4o, the same 1,600 tokens cost $0.016 per request: $160/day, or $58,400/year.

Fifty-eight thousand dollars a year at our old rates. From one prompt. That was the moment I stopped treating input tokens as "free" and started treating them as the largest line item on our AI invoice.
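Compression is not automatically a win, because the compression call itself costs tokens. The sketch below estimates the net saving per downstream request. Its assumptions are mine: the compressor bills input and output at the same per-million rate, and `reuse_count` is how many downstream requests share one compressed result (1 for a per-request document, large for a system prompt compressed once). The example figures are illustrative, not measurements.

```python
def net_saving_per_request(original_tokens, compressed_tokens,
                           target_price_per_m, compressor_price_per_m,
                           reuse_count=1):
    """Dollars saved per downstream request after paying for compression."""
    saved = (original_tokens - compressed_tokens) * target_price_per_m / 1_000_000
    # The compressor reads the original text and writes the compressed text.
    compression_cost = (
        (original_tokens + compressed_tokens) * compressor_price_per_m / 1_000_000
    )
    return saved - compression_cost / reuse_count

# An 8,000-token document halved before going to a $2.50/M model: clearly worth it.
print(net_saving_per_request(8000, 4000, 2.50, 0.01))
# The same document sent to a $0.01/M model: compression costs more than it saves.
print(net_saving_per_request(8000, 4000, 0.01, 0.01))
```

The takeaway is that compression pays off when the downstream model is much more expensive than the compressor, or when one compressed result is reused many times.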

Batching: The Engineering Instinct That Pays Off

The last lever is one I borrowed from traditional systems engineering: don't make three API calls when one will do. Every API call has fixed overhead — request setup, response parsing, network latency. When you're sending a list of similar prompts, batching them into one call saves tokens AND reduces overhead.

The contrast is straightforward:

# ❌ Before — three calls, three overheads
answers = []  # parse_answer below is a helper that is not shown here
for question in user_questions:
    response = call_model(
        "deepseek-v4-flash",
        [{"role": "user", "content": question}],
    )
    answers.append(parse_answer(response))

# ✅ After — one call, one overhead, one bill
batch_prompt = (
    "Answer each question below on its own line, "
    "numbered to match the question:\n\n"
    + "\n".join(f"{i+1}. {q}" for i, q in enumerate(user_questions))
)
response = call_model(
    "deepseek-v4-flash",
    [{"role": "user", "content": batch_prompt}],
)
answers = parse_numbered_answers(response)

We've seen 10-20% savings on batched workloads, and the throughput improvement is even bigger because we're not waiting on sequential network calls. For our nightly data processing jobs, this cut wall-clock time in half. Free velocity.
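The batched version depends on getting the numbered answers back out reliably. The post doesn't show `parse_numbered_answers`, so the version below is a hypothetical sketch of one way to write it, assuming an OpenAI-style response dict and answers formatted as "1. ..." or "1) ...". The `expected_count` parameter is my addition so that skipped questions show up as `None` instead of silently shifting the list.

```python
import re

NUMBERED_LINE = re.compile(r"^\s*(\d+)[.)]\s*(.*)$")

def parse_numbered_answers(response, expected_count=None):
    """Split a numbered reply into a list, with None for any missing number."""
    text = response["choices"][0]["message"]["content"]
    answers = {}
    current = None
    for line in text.splitlines():
        match = NUMBERED_LINE.match(line)
        if match:
            current = int(match.group(1))
            answers[current] = match.group(2).strip()
        elif current is not None and line.strip():
            # Continuation line: append it to the answer being built.
            answers[current] += " " + line.strip()
    if expected_count is None:
        expected_count = max(answers, default=0)
    return [answers.get(i) for i in range(1, expected_count + 1)]
```

Passing `expected_count=len(user_questions)` makes gaps visible, so unanswered questions can be retried one at a time. A continuation line that itself starts with a number and a period would be misread as a new answer, so this is best suited to short, single-line answers.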

The Production-Ready Checklist I Wish I'd Had

Let me consolidate what I learned into a checklist for anyone building AI infrastructure at a startup:

  1. Instrument your costs per feature, not just per model. If you can't tell which product surface is spending $X, you can't optimize it.
  2. Default to cheap models. Only escalate when quality evals prove you need to. The cost of "we picked the expensive one because it felt safer" is enormous at scale.
  3. Build a routing layer, not a hardcoded model call. Every AI feature should go through a function that decides which model to use. This is your future flexibility.
  4. Cache aggressively, expire carefully. A 1-hour TTL on common queries is a no-brainer. Longer TTLs need freshness guarantees.
  5. Compress before you send. Run a cheap summarizer over long context. The savings dwarf the cost of the compression call.
  6. Batch everything that can be batched. Sequential loops of model calls are usually a code smell.
  7. Avoid vendor lock-in. Use an OpenAI-compatible aggregator (we use Global API) so you can swap providers in hours, not weeks. This single decision saved us during a major provider outage last month.
  8. Measure quality, not just cost. Every time you swap a model, run a blind eval. A 95% cost reduction that breaks user experience is not a win.
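For the first item on the checklist, per-feature instrumentation can start very small. The ledger below is a sketch under stated assumptions: it relies on the response carrying an OpenAI-style `usage` object with `prompt_tokens` and `completion_tokens`, it applies one flat price per model (the figures quoted in this post), and `CostLedger` with its feature names is hypothetical.

```python
from collections import defaultdict

# Flat $/M token prices quoted in this post; real pricing may split input and output.
PRICE_PER_M = {
    "Qwen/Qwen3-8B": 0.01,
    "deepseek-v4-flash": 0.25,
    "deepseek-reasoner": 2.50,
}

class CostLedger:
    """Accumulates estimated spend per product feature."""

    def __init__(self, prices):
        self.prices = prices
        self.spend = defaultdict(float)

    def record(self, feature, model, response):
        """Add the estimated cost of one response to the feature's total."""
        usage = response.get("usage", {})
        tokens = usage.get("prompt_tokens", 0) + usage.get("completion_tokens", 0)
        cost = tokens * self.prices[model] / 1_000_000
        self.spend[feature] += cost
        return cost

    def report(self):
        """Features sorted from most to least expensive."""
        return sorted(self.spend.items(), key=lambda item: item[1], reverse=True)
```

Calling `ledger.record("support_chat", model, resp)` right after each model call is enough to answer "which surface is spending the money?" before any optimization work begins.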

What This Actually Looks Like in Dollars

Let me put it all together. Our monthly AI spend before optimization: roughly $14,000. After all five levers applied: around $650. The customer support chatbot specifically went from $420 to $28. Document summarization dropped 92%. Code review tooling dropped 96%.

Total monthly savings: ~$13,350. Annualized: $160,200. For a 4-person startup, that's the difference between hiring our next engineer and not.

And here's the part that doesn't show up in the invoice: we now ship faster. When I tell my team "use the cheap model first," they iterate faster because they're not anxious about every test run costing real money. Cost-effectiveness and velocity are the same conversation, not opposing ones.

My Closing Pitch

If you take one thing from this field report, take this: most AI API bills are a reflection of architectural choices, not model prices. The models are getting cheaper every quarter. Your architecture choices are the thing you control today.

If you want a fast way to test these ideas without committing to ten different provider relationships, take a look at Global API. Their endpoint at global-apis.com/v1 speaks the OpenAI protocol, so the code I showed above drops into most existing stacks with zero changes. That's been our router for the past quarter and it's how we keep our options open without sacrificing developer ergonomics.

The best part of optimizing AI costs isn't the savings — it's the freedom. When each query costs fractions of a cent, you stop rationing experiments. You start building the product you actually want to build. That's the real ROI, and it's the reason I keep preaching this stuff to every founder who'll listen.

Now go rewrite that client.chat.completions.create call. Your future self — and your CFO — will thank you.
