gentleforge

Posted on Jun 13

How I Cut Our AI API Bill by 95% — A Practical Guide for 2026

#webdev #programming #tutorial #deepseek

I'll be honest with you: when I first looked at our monthly AI bill, I almost choked on my coffee. We were burning through thousands of dollars every month on a product that, frankly, didn't need to cost that much. The worst part? I was the one who approved the architecture.

That wake-up call sent me down a rabbit hole. Over the next six months, I rebuilt our LLM layer from scratch, dropped our costs by about 95%, and — surprise — actually improved quality in several places. This is everything I learned, written the way I wish someone had written it for me.

If you're a CTO or technical founder running AI features at any kind of scale, this should save you real money.

The Audit Nobody Wants to Do

Before you touch anything, you need to know what you're actually spending money on. I started by tagging every single LLM call in our system with four metadata fields: model used, input token count, output token count, and a human-readable task label ("classification", "summarization", "chat", etc.).

What jumped out was brutal. Roughly 70% of our requests were simple tasks — intent classification, FAQ answering, basic entity extraction — running through GPT-4o at $10.00/M output tokens. We were paying premium prices for work a $0.01/M model could do in its sleep.

That visibility is non-negotiable. You can't optimise what you can't measure. Build the instrumentation first. Everything else flows from here.
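A minimal sketch of what that instrumentation could look like, assuming you can capture token counts from each response. The `CallRecord` and `spend_by_task` names are hypothetical, and the input prices in `PRICES` are placeholders (only the output prices come from this post), so substitute your provider's real rates.

```python
from collections import defaultdict
from dataclasses import dataclass

# Illustrative USD prices per million tokens. Output prices follow the post;
# input prices are placeholders, so substitute your provider's real rates.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "Qwen/Qwen3-8B": {"input": 0.01, "output": 0.01},
}


@dataclass
class CallRecord:
    model: str
    input_tokens: int
    output_tokens: int
    task: str  # human-readable label, e.g. "classification"


def cost_usd(record, prices=PRICES):
    """Cost of one call from its token counts and per-million prices."""
    rate = prices[record.model]
    return (
        record.input_tokens * rate["input"]
        + record.output_tokens * rate["output"]
    ) / 1_000_000


def spend_by_task(records, prices=PRICES):
    """Total spend per task label, largest first."""
    totals = defaultdict(float)
    for record in records:
        totals[record.task] += cost_usd(record, prices)
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


records = [
    CallRecord("gpt-4o", 1000, 500, "classification"),
    CallRecord("gpt-4o", 3000, 800, "summarization"),
    CallRecord("Qwen/Qwen3-8B", 1000, 500, "classification"),
]
report = spend_by_task(records)
for task, dollars in report.items():
    print(f"{task}: ${dollars:.6f}")
```

Sorting by spend puts the most expensive task labels at the top, which is usually where the first routing decisions pay off.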

Match the Model to the Job

This is the single biggest lever, and the one with the highest ROI. The "convenient" default model is almost always the wrong one for most calls in your pipeline.

Here's the matrix I landed on after weeks of benchmarking:

Task	What We Were Using	What We Switched To	Savings
Simple chat	GPT-4o ($10.00/M)	DeepSeek V4 Flash ($0.25/M)	97.5%
Classification	GPT-4o-mini ($0.60/M)	Qwen3-8B ($0.01/M)	98.3%
Code generation	GPT-4o ($10.00/M)	DeepSeek Coder ($0.25/M)	97.5%
Summarization	GPT-4o ($10.00/M)	Qwen3-32B ($0.28/M)	97.2%
Translation	GPT-4o ($10.00/M)	Qwen-MT-Turbo ($0.30/M)	97%

Just routing the right request to the right model — no fancy infrastructure changes — saves around 90% on the affected workloads. That's the floor. Everything below is upside.
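The savings column is just one minus the price ratio. A small sketch that recomputes it from the per-million prices in the table, so you can plug in your own numbers (the `savings_pct` helper is a hypothetical name):

```python
def savings_pct(old_price, new_price):
    """Percentage saved when moving from old_price to new_price (same units)."""
    return round((1 - new_price / old_price) * 100, 1)


# (old price, new price) in USD per million tokens, copied from the table above
TABLE = {
    "Simple chat": (10.00, 0.25),
    "Classification": (0.60, 0.01),
    "Code generation": (10.00, 0.25),
    "Summarization": (10.00, 0.28),
    "Translation": (10.00, 0.30),
}

for task, (old, new) in TABLE.items():
    print(f"{task}: {savings_pct(old, new)}%")
```

This only compares list prices for the same number of tokens. If the cheaper model produces longer outputs or needs retries, the realised saving will be lower.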

The trick was wrapping our LLM client so that every call goes through a single routing function keyed on task type. Here's the simplified version:

import requests

API_BASE = "https://global-apis.com/v1"
API_KEY = "your-api-key"

MODEL_MAP = {
    "chat": "deepseek-v4-flash",          # $0.25/M
    "code": "deepseek-coder",             # $0.25/M
    "classification": "Qwen/Qwen3-8B",    # $0.01/M
    "summarization": "Qwen/Qwen3-32B",    # $0.28/M
    "translation": "Qwen-MT-Turbo",       # $0.30/M
    "reasoning": "deepseek-reasoner",     # $2.50/M
}

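def route_call(task_type, user_input):
    # Unknown task types fall back to the general chat model instead of raising KeyError
    model = MODEL_MAP.get(task_type, MODEL_MAP["chat"])
    response = requests.post(
        f"{API_BASE}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": user_input}]
        },
        timeout=30,  # never wait indefinitely on a stalled provider
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return response.json()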

Notice what I'm doing here: I'm hitting a unified endpoint instead of pinning to one provider. That's the architectural decision that protects you from vendor lock-in. More on that in a minute.

Tiered Routing: Cheap First, Escalate Only When Needed

Once you have multiple models in play, the next move is to stop guessing which one to use and start letting the system figure it out. The pattern I landed on is "try cheap, escalate on failure."

Roughly 80% of our traffic is now handled cleanly by a $0.01/M model. About 15% escalates to a $0.25/M model. The remaining 5%, the genuinely hard stuff, gets the premium reasoning model at $2.50/M. For our customer support chatbot, the result was a drop from $420/month to $28/month, and our CSAT scores actually went up because responses got faster.

Here's the core of that router as we run it in production (call_model and quality_check are our own helpers, not shown here):

def smart_generate(prompt, max_budget=0.50):
    """Try cheap first, escalate if quality insufficient"""
    # Note: max_budget is accepted but not enforced in this snippet

    # Tier 1: Cheap ($0.01/M)
    resp = call_model("Qwen/Qwen3-8B", prompt)
    if quality_check(resp) >= 0.8:
        return resp  # 80%+ of requests handled here

    # Tier 2: Standard ($0.25/M)
    resp = call_model("deepseek-v4-flash", prompt)
    if quality_check(resp) >= 0.9:
        return resp  # 15% of requests

    # Tier 3: Premium ($0.78-$2.50/M)
    return call_model("deepseek-reasoner", prompt)  # 5% of requests

The quality_check function is the secret sauce. For us, it's a tiny classifier that evaluates whether the response actually addresses the user's intent. You can also use embedding similarity against gold-standard answers. Don't over-engineer this — a heuristic works fine in production.
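To make the heuristic option concrete, here is one possible sketch. It is not the classifier described above: the refusal phrases, the scoring weights, and the optional `prompt_text` argument are all assumptions for illustration, and it expects the reply as plain text rather than a raw API response.

```python
import re

# Hypothetical phrases that usually signal a non-answer; tune for your own traffic
REFUSAL_MARKERS = ("i'm not sure", "i cannot", "i can't", "as an ai")


def _words(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def quality_check(response_text, prompt_text="", min_words=3):
    """Return a rough score in [0, 1]; higher means the reply looks usable."""
    if len(response_text.split()) < min_words:
        return 0.0  # empty or near-empty reply
    lowered = response_text.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return 0.2  # looks like a refusal or a hedge
    prompt_words = _words(prompt_text)
    if not prompt_words:
        return 1.0  # nothing to compare against; basic checks passed
    # Share of the prompt's vocabulary that the reply picks up
    overlap = len(prompt_words & _words(response_text)) / len(prompt_words)
    return min(1.0, 0.5 + overlap)


score = quality_check(
    "To reset your password open settings",
    "how do I reset my password",
)
print(round(score, 2))
```

Word overlap is a crude proxy for "addresses the user's intent", so treat the thresholds as something to calibrate against a labelled sample before trusting them.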

The compounding savings here are wild. Going from "everything on GPT-4o" to "80% on Qwen3-8B, 15% on Flash, 5% on Reasoner" pushes us past 95% total reduction.
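A back-of-the-envelope check of that split, under two simplifying assumptions: every call uses the same number of tokens, and an escalated request also pays for the cheaper attempts that failed before it. The cost of running the quality check itself is ignored. `blended_price` is a hypothetical helper name.

```python
def blended_price(tiers):
    """Expected price per million tokens for a try-cheap-then-escalate router.

    tiers: list of (price_per_million, share_of_traffic_resolved_here),
    ordered cheapest first. Every request that reaches a tier pays for it.
    """
    expected = 0.0
    reaching = 1.0  # fraction of traffic that reaches the current tier
    for price, share in tiers:
        expected += reaching * price
        reaching -= share
    return expected


TIERS = [(0.01, 0.80), (0.25, 0.15), (2.50, 0.05)]  # prices and split from the post
BASELINE = 10.00  # everything on GPT-4o

blended = blended_price(TIERS)
reduction = 1 - blended / BASELINE
print(f"blended: ${blended:.3f}/M, reduction: {reduction:.1%}")
```

Note how the 5% of traffic on the premium tier accounts for most of the blended price, so the escalation rate is the number to watch.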

Cache the Obvious Stuff

I cannot tell you how much money we were leaving on the table by not caching. FAQ queries, documentation lookups, common error messages — these all hit our API dozens of times per minute with identical or near-identical prompts.

A simple TTL cache cut a meaningful slice off our bill, and the implementation took about an hour:

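import hashlib, json, time
import requests

cache = {}  # in-process only; entries are overwritten on refresh but never evicted

def cached_chat(model, messages, ttl=3600):
    # sort_keys keeps the hash stable regardless of dict key order
    key = hashlib.md5(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()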

    if key in cache:
        entry = cache[key]
        if time.time() - entry["time"] < ttl:
            return entry["response"]  # Cache hit — $0 cost

    response = requests.post(
        f"{API_BASE}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "messages": messages}
    ).json()

    cache[key] = {"response": response, "time": time.time()}
    return response

For workloads where users hit predictable questions (think "how do I reset my password?"), we saw 50-80% cache hit rates. That's pure margin: zero API cost and near-instant responses.

The architectural lesson here: caching is a layer, not an afterthought. Put it in front of your routing logic, not after.
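One way to put the cache in front of the router, sketched under a few assumptions: the key is built from the task type rather than the model (because the model is not chosen yet), prompts are normalised so trivially different phrasings share an entry, and `route_fn` stands in for whatever routing function you use. All names here are hypothetical.

```python
import hashlib
import json
import re
import time


def normalize_prompt(text):
    """Lowercase, trim, and collapse whitespace so trivial variants share a key."""
    return re.sub(r"\s+", " ", text.strip().lower())


def cache_key(task_type, messages):
    """Stable hash of the task type plus normalised messages."""
    canonical = [
        {"role": m["role"], "content": normalize_prompt(m["content"])}
        for m in messages
    ]
    payload = json.dumps({"task": task_type, "messages": canonical}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(task_type, user_input, route_fn, cache, ttl=3600, clock=time.time):
    """Check the cache first; only call the router on a miss or an expired entry."""
    key = cache_key(task_type, [{"role": "user", "content": user_input}])
    entry = cache.get(key)
    if entry is not None and clock() - entry["time"] < ttl:
        return entry["response"]
    response = route_fn(task_type, user_input)
    cache[key] = {"response": response, "time": clock()}
    return response
```

Lowercasing is a deliberate trade-off: it raises hit rates for FAQ-style traffic but would merge prompts where case matters (code, identifiers), so you may want to skip normalisation for those task types.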

Compress Prompts Before They Ship

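Here's a number that should get your attention: compressing a 2,000-token system prompt to 400 tokens removes 1,600 tokens from every request. On a premium model billed at $15/M, that's $0.024 per request. At 10,000 requests per day, that's $240/day, or $87,600/year. From one prompt. On DeepSeek V4 Flash at $0.25/M the same cut is worth about $0.0004 per request, or roughly $4/day, so the payoff scales with the price of the model behind the prompt.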
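Figures like these are easy to sanity-check from token counts and a per-million price. The sketch below assumes the saved tokens are billed at a single flat rate and uses a hypothetical `prompt_savings` helper. It evaluates the same 2,000-to-400-token cut at $0.25/M and at $15/M; the $240/day figure corresponds to the $15/M rate.

```python
def prompt_savings(tokens_before, tokens_after, price_per_million, requests_per_day):
    """Dollar savings from shortening a prompt, at a flat per-million-token price."""
    saved_tokens = tokens_before - tokens_after
    per_request = saved_tokens * price_per_million / 1_000_000
    per_day = per_request * requests_per_day
    return {
        "per_request": per_request,
        "per_day": per_day,
        "per_year": per_day * 365,
    }


flash = prompt_savings(2000, 400, 0.25, 10_000)     # budget-model rate
premium = prompt_savings(2000, 400, 15.00, 10_000)  # premium-model rate

for name, result in (("at $0.25/M", flash), ("at $15/M", premium)):
    print(name, {k: round(v, 4) for k, v in result.items()})
```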

We had several system prompts that had grown over months of feature creep. Stuff that was important in v1 but irrelevant by v3. Stuff that was duplicated. Stuff that was verbose for no reason.

The fix is mechanical: run your bloated context through a cheap summarizer before sending it to the expensive model.

def compress_prompt(text, target_ratio=0.5):
    """Compress long prompts before sending"""
    if len(text) < 500:
        return text  # Already short

    # Use a cheap model to summarize the context
    summary = call_model("Qwen/Qwen3-8B",
        f"Summarize this in {int(len(text)*target_ratio)} chars: {text}"
    )
    return summary

In practice, the 15-30% per-request savings compound across your whole system. It's not as dramatic as model selection, but it's also nearly free to implement.

One caveat: don't compress prompts that are inherently precise — function definitions, few-shot examples, structured outputs. The summarizer will eat those.
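A rough guard for that caveat might look like the sketch below. The patterns are assumptions about what "precise" content tends to look like (fenced code, function definitions, JSON-style key/value pairs, few-shot labels), and `should_compress` is a hypothetical name, so expect to adjust both for your own prompts.

```python
import re

# Heuristic markers of content a summarizer is likely to damage
PRECISE_MARKERS = [
    re.compile(re.escape("`" * 3)),                        # fenced code blocks
    re.compile(r"^\s*(def|class|function)\s", re.M),       # function definitions
    re.compile(r'"\s*:\s*[\[{"]'),                         # JSON-style key/value
    re.compile(r"^\s*(example|input|output)\s*\d*\s*:", re.I | re.M),  # few-shot labels
]


def should_compress(text, min_chars=500):
    """True only for long text with no sign of code, schemas, or few-shot examples."""
    if len(text) < min_chars:
        return False  # already short, same cutoff as compress_prompt
    return not any(pattern.search(text) for pattern in PRECISE_MARKERS)
```

Calling this before `compress_prompt` errs on the side of sending the original text, which costs a few tokens but avoids silently dropping a schema or an example.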

Batch Like You Mean It

The last technique I want to talk about is batching. If you're processing lists of similar items — say, classifying 500 support tickets — you're paying for 500 round trips and 500 sets of system prompt tokens. That's insane.

The fix is to stuff all 500 items into a single prompt and parse the structured output:

def batch_classify(items, model="deepseek-v4-flash"):
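    # json_object mode returns a JSON object, so ask for the labels under a key
    prompt = 'Classify each item as positive/neutral/negative. Return a JSON object like {"labels": [...]} with one label per item, in order.\n'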
    for i, item in enumerate(items):
        prompt += f"{i+1}. {item}\n"

    response = requests.post(
        f"{API_BASE}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "response_format": {"type": "json_object"}
        }
    ).json()

    return parse_results(response)

Typical savings: 10-20%, plus a huge latency win because you're doing one network round trip instead of N.

The trade-off is that batching adds complexity to error handling — one bad item can poison the whole batch if you're not careful. We isolate items with delimiters and validate outputs against the expected count. Production-ready, but not free.
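The count validation can be sketched as follows. This is a hypothetical `parse_results`, not the one we run: it assumes an OpenAI-style chat completion response, accepts either a bare JSON array or an object with a `labels` key, and takes an optional `expected_count` that the snippet above does not pass.

```python
import json

VALID_LABELS = {"positive", "neutral", "negative"}


def parse_results(response, expected_count=None):
    """Extract and validate labels from a chat completion response dict."""
    content = response["choices"][0]["message"]["content"]
    data = json.loads(content)
    # Accept {"labels": [...]} or a bare [...] array
    labels = data.get("labels") if isinstance(data, dict) else data
    if not isinstance(labels, list):
        raise ValueError("expected a list of labels")
    labels = [str(label).strip().lower() for label in labels]
    if expected_count is not None and len(labels) != expected_count:
        raise ValueError(f"expected {expected_count} labels, got {len(labels)}")
    unknown = [label for label in labels if label not in VALID_LABELS]
    if unknown:
        raise ValueError(f"unexpected labels: {unknown}")
    return labels
```

On a `ValueError`, a reasonable fallback is to split the batch in half and retry each part, so one malformed item does not force all items back to single calls.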

The Vendor Lock-in Question

I want to spend a minute on something that doesn't show up in most cost optimization guides but absolutely should: vendor lock-in.

When I started this project, we were 100% on OpenAI. Every API call, every embedding, every fine-tuned model — all OpenAI. If they had raised prices 3x overnight, we would have been toast. If they had a regional outage, we would have been toast.

That's a terrifying position for a startup. So one of my architectural goals was to make our LLM layer swappable in an afternoon, not a quarter.

Three things made this possible:

A unified API endpoint. We route everything through a single abstraction. The model name is a string parameter, not a hardcoded import. This means I can run a side-by-side benchmark between, say, DeepSeek V4 Flash and a hypothetical new entrant in a day.
Prompts that aren't provider-specific. We standardized on OpenAI's chat completion format early on, which turned out to be a lucky accident. Most providers now support it. If you've locked yourself into Anthropic's Messages API or Google's specific format, you have more friction.
No stateful features we can't reproduce elsewhere. We deliberately avoid features that only one provider offers well — like OpenAI's Assistants API with its persistent threads. The lock-in tax on those is enormous.

This is also why the code samples in this post all hit a unified endpoint. If you build your system right, switching providers — or running multiple providers in parallel — is a config change, not a migration.
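As an illustration of "a config change, not a migration", here is a minimal sketch that keeps provider details in one table and builds the request from it. The `Provider` class, the `PROVIDERS` table, and the second endpoint URL are hypothetical; the sketch assumes both providers accept the OpenAI-style chat completion format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    base_url: str
    api_key: str


# Hypothetical provider table; in practice this would come from config or env vars
PROVIDERS = {
    "primary": Provider("https://global-apis.com/v1", "your-api-key"),
    "fallback": Provider("https://example.com/v1", "another-api-key"),
}


def build_request(provider_name, model, messages, providers=PROVIDERS):
    """Return url, headers, and json body for an OpenAI-style chat completion call."""
    provider = providers[provider_name]
    return {
        "url": f"{provider.base_url}/chat/completions",
        "headers": {"Authorization": f"Bearer {provider.api_key}"},
        "json": {"model": model, "messages": messages},
    }


request = build_request(
    "primary",
    "deepseek-v4-flash",
    [{"role": "user", "content": "hello"}],
)
print(request["url"])
```

The returned dict can be passed straight to `requests.post(**request, timeout=30)`, so switching providers means changing the first argument, not the calling code.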

What the Final Picture Looks Like

Six months in, here's our actual cost breakdown compared to where we started:

Model selection: ~90% reduction
Tiered routing: pushed total to ~95% reduction
Caching: 20-50% additional savings on top
Prompt compression: 15-30% per request where it applies
Batching: 10-20% on our batch workloads

Combined: roughly 95% reduction in our absolute LLM spend. Quality metrics, measured by a held-out eval set we run weekly, are essentially flat, with modest improvements in latency-sensitive paths because smaller models respond faster.

The compounding math is the part that excites me most. If we had stayed on the original architecture and just grown linearly with usage, we'd be paying roughly 20x what we pay today. By restructuring the LLM layer, we essentially bought ourselves 20x headroom for product growth.

That's the real ROI story here. This isn't a cost-cutting exercise — it's an enabler for scale.

What I'd Do Differently

A few things I learned the hard way that might save you some pain:

Don't trust benchmarks blindly. Synthetic benchmarks said Qwen3-8B was great for our classification task. It was

DEV Community
