DEV Community

bolddeck

How I Cut Our AI API Bill in Half While Hitting p99 SLAs

I still remember the morning my phone buzzed with a PagerDuty alert at 4:12 AM. It wasn't an outage. It was worse. Our AI inference bill had crossed the budget threshold for the third month in a row, and my CFO had started using phrases like "unsustainable burn rate" in our weekly sync. So began my six-week deep dive into AI API economics, and I'm here to walk you through what I learned — from a cloud architect's seat, with one eye on latency dashboards and the other on the invoice.

The Stack I Inherited

When I joined the team last year, our inference layer was, to put it charitably, a little naive. We were running everything through GPT-4o because, well, it was the model everyone knew. We had three regions active — us-east-1, eu-west-1, and ap-southeast-1 — each pulling from a single upstream provider. Latency was "fine" in the sense that p95 looked acceptable, but our p99 numbers told a different story. And the bill? The bill was a horror show.

That's when I started treating AI API selection the way I treat any other piece of infrastructure: as a system design problem with reliability constraints, cost budgets, and SLAs to honor.

SLA First, Price Second

Here's something I wish more teams internalized: the cheapest model per token isn't the cheapest model in production. If you swap to a budget model that doubles your p99 latency, you've traded a financial cost for a user experience cost, and that bill eventually comes due in churn metrics.

So my approach became: define the SLA first, then find the cheapest model that meets it. For our workloads — a mix of structured extraction, summarization, and conversational AI — we landed on these targets:

  • p99 latency under 1.5 seconds for chat workloads
  • 99.9% availability across any single calendar month
  • Multi-region failover with automatic rerouting under 800ms
  • Throughput headroom of 3x peak observed load (because auto-scaling should never feel like a panic)

Once those numbers were written down, model selection became a much more pleasant exercise.
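As a rough sketch of what "SLA first, price second" looks like in code: filter candidates by the SLA, then take the cheapest survivor. The `Candidate` and `Sla` types, the model names, and every measurement below are hypothetical placeholders for illustration, not figures from our evaluation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    output_price_per_m: float  # USD per million output tokens
    p99_ms: float              # p99 latency measured on your own traffic
    availability: float        # measured monthly availability, 0..1


@dataclass(frozen=True)
class Sla:
    max_p99_ms: float = 1500.0       # "p99 under 1.5 seconds"
    min_availability: float = 0.999  # "99.9% availability"


def cheapest_meeting_sla(candidates: List[Candidate], sla: Sla) -> Optional[Candidate]:
    """Drop every candidate that misses the SLA, then pick the lowest price."""
    eligible = [
        c for c in candidates
        if c.p99_ms < sla.max_p99_ms and c.availability >= sla.min_availability
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.output_price_per_m)


# Made-up measurements, for illustration only.
candidates = [
    Candidate("model-a", 0.80, 2100.0, 0.9995),   # cheapest, but misses p99
    Candidate("model-b", 1.10, 1200.0, 0.9992),
    Candidate("model-c", 10.00, 1300.0, 0.9999),
]
choice = cheapest_meeting_sla(candidates, Sla())
print(choice.name if choice else "no model meets the SLA")  # model-b
```

The point of the ordering is that price never gets a vote until latency and availability have already disqualified whatever they're going to disqualify.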

The Pricing Table That Changed My Mind

I'll be honest: I'd been dismissive of the newer model families until I actually pulled the data into a spreadsheet. Global API exposes 184 models at this point, with token prices ranging from $0.01 to $3.50 per million tokens depending on tier. Here's the subset that ended up on my shortlist:

Model               Input ($/M tokens)   Output ($/M tokens)   Context window
DeepSeek V4 Flash   0.27                 1.10                  128K
DeepSeek V4 Pro     0.55                 2.20                  200K
Qwen3-32B           0.30                 1.20                  32K
GLM-4 Plus          0.20                 0.80                  128K
GPT-4o              2.50                 10.00                 128K

Look at that table for a moment. I mean really look. GPT-4o's output price of $10.00 per million tokens is 12.5 times what we'd pay on GLM-4 Plus ($0.80) for the same completion. And here's the part that initially gave me hives: our internal quality benchmarks showed GLM-4 Plus hitting 84.6% on our evaluation suite, which was within the noise floor of what GPT-4o scored on the same tasks.

I won't lie, I spent a weekend re-running evals before I believed it. But the numbers were real.
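If you want to redo the arithmetic yourself, here's a small sketch using the output prices from the table above. The monthly volume of 500 million output tokens is an assumption picked for round numbers, not our actual traffic.

```python
# Output prices in USD per million tokens, copied from the shortlist table.
OUTPUT_PRICE_PER_M = {
    "DeepSeek V4 Flash": 1.10,
    "DeepSeek V4 Pro": 2.20,
    "Qwen3-32B": 1.20,
    "GLM-4 Plus": 0.80,
    "GPT-4o": 10.00,
}


def output_cost_usd(model: str, output_tokens: int) -> float:
    """Cost of generating `output_tokens` tokens on `model`."""
    return OUTPUT_PRICE_PER_M[model] * output_tokens / 1_000_000


def price_ratio(expensive: str, cheap: str) -> float:
    """How many times more the first model costs per output token."""
    return OUTPUT_PRICE_PER_M[expensive] / OUTPUT_PRICE_PER_M[cheap]


# Assumed volume, for illustration only.
MONTHLY_OUTPUT_TOKENS = 500_000_000

for model in sorted(OUTPUT_PRICE_PER_M, key=OUTPUT_PRICE_PER_M.get):
    cost = output_cost_usd(model, MONTHLY_OUTPUT_TOKENS)
    print(f"{model:<18} ${cost:>9,.2f}")

print(f"GPT-4o vs GLM-4 Plus: {price_ratio('GPT-4o', 'GLM-4 Plus'):.1f}x")
```

At that assumed volume, output tokens alone come to $5,000 a month on GPT-4o versus $400 on GLM-4 Plus. Input tokens are billed separately and would need the same treatment.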

Building the Multi-Region Failover Layer

Once I had my shortlist, the architecture work began. I needed a layer that could:

  1. Route requests to the right model based on workload type
  2. Fail over to a secondary model if p99 latency on the primary breached threshold
  3. Spread traffic across multiple regions
  4. Cache aggressively (which I'll get to in a bit)

Here's the routing module I ended up writing. It's not fancy, but it's reliable, and that's the only adjective that matters when you're on call:

import logging
import os
import time

import openai

logger = logging.getLogger(__name__)


class InferenceRouter:
    def __init__(self):
        # OpenAI-compatible client pointed at the aggregator endpoint.
        self.client = openai.OpenAI(
            base_url="https://global-apis.com/v1",
            api_key=os.environ["GLOBAL_API_KEY"],
        )
        self.primary = "deepseek-ai/DeepSeek-V4-Pro"
        self.fallback = "deepseek-ai/DeepSeek-V4-Flash"
        self.economy = "THUDM/glm-4-plus"
        # Kept for reference; this class only fails over on errors and timeouts.
        self.p99_breach_threshold_ms = 1500

    def classify(self, prompt: str) -> str:
        """Cheap heuristic: short summarization prompts go economy, the rest go pro."""
        if len(prompt) < 400 and "summarize" in prompt.lower():
            return self.economy
        return self.primary

    def _call(self, model: str, prompt: str, stream: bool):
        return self.client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=stream,
            timeout=10,
        )

    def complete(self, prompt: str, stream: bool = False) -> dict:
        model = self.classify(prompt)
        start = time.monotonic()
        degraded = False
        try:
            response = self._call(model, prompt, stream)
        except openai.OpenAIError as primary_error:
            # Timeouts, rate limits, and API errors land here; retry once on the fallback.
            logger.warning("model %s failed, using fallback: %s", model, primary_error)
            model, degraded = self.fallback, True
            response = self._call(model, prompt, stream)
        # With stream=True this is the time to open the stream, not to finish it.
        elapsed_ms = (time.monotonic() - start) * 1000
        result = {"response": response, "model_used": model, "elapsed_ms": elapsed_ms}
        if degraded:
            result["degraded"] = True
        return result

That try/except block is doing more work than it looks like. In the first three weeks of running this in production, the fallback path triggered exactly seven times. Each time, the user saw no error — they saw a slightly slower response, which is the entire point of an SLA-aware architecture.
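The try/except only covers hard failures such as errors and timeouts. The second requirement on the list, failing over when the primary's p99 breaches the threshold, needs some latency state. Here's a minimal sketch of one way to do it, not the production implementation: the `LatencyFailover` class, the window size, the minimum sample count, and the cooldown are all assumptions chosen for illustration.

```python
import math
import time
from collections import deque


class LatencyFailover:
    """Send traffic to the fallback for a cooldown once the primary's p99 breaches."""

    def __init__(self, primary, fallback, threshold_ms=1500.0,
                 window=200, min_samples=50, cooldown_s=60.0,
                 clock=time.monotonic):
        self.primary = primary
        self.fallback = fallback
        self.threshold_ms = threshold_ms
        self.min_samples = min_samples
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.latencies = deque(maxlen=window)
        self.tripped_until = float("-inf")

    def _p99(self):
        ordered = sorted(self.latencies)
        # Nearest-rank percentile over the current window.
        return ordered[math.ceil(len(ordered) * 99 / 100) - 1]

    def record_primary(self, elapsed_ms):
        """Record one primary-model latency and trip the breaker on a p99 breach."""
        self.latencies.append(elapsed_ms)
        if len(self.latencies) >= self.min_samples and self._p99() > self.threshold_ms:
            self.tripped_until = self.clock() + self.cooldown_s
            # Start from a clean window when the primary is retried.
            self.latencies.clear()

    def choose(self):
        """Model to use for the next request."""
        if self.clock() < self.tripped_until:
            return self.fallback
        return self.primary


# Usage with a fake clock so the example runs instantly.
now = [0.0]
failover = LatencyFailover("primary-model", "fallback-model",
                           min_samples=10, cooldown_s=30.0,
                           clock=lambda: now[0])
for _ in range(10):
    failover.record_primary(2000.0)  # sustained slow responses
print(failover.choose())             # fallback-model
now[0] = 31.0                        # cooldown has elapsed
print(failover.choose())             # primary-model
```

The cooldown is the simplest recovery rule I could think of. A more careful version would keep a trickle of traffic on the primary so recovery is measured and not assumed.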

Caching: The Boring Optimization That Pays For Everything

I'll tell you a secret about cloud economics: the best dollar you save is the dollar you don't spend. Before I touched the model selection logic at all, I instrumented a Redis cache in front of our inference endpoints. We had a surprising amount of repeat traffic — recurring questions, templated prompts, autocomplete-style inputs — and the hit rate climbed to 40% within a week.

A 40% hit rate on a high-traffic inference path is enormous. Every hit is a request we never pay tokens for, so it translated to roughly a 40% reduction in token spend on that path, and cached responses came back with a p99 under 200 milliseconds instead of 1.4 seconds. That's the kind of win where finance is happy and product is happy, and as a cloud architect those are the wins that get you a bigger budget next quarter.
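For a sense of the shape, here's a minimal sketch of an exact-match response cache. It is illustrative only: `CachedCompleter`, the key format, and the one-hour TTL are assumptions, and `InMemoryStore` is a stand-in so the example runs without a Redis server. It mimics the `get(key)` and `set(key, value, ex=seconds)` calls from redis-py; a real Redis client returns bytes unless it is created with `decode_responses=True`.

```python
import hashlib
import json


class InMemoryStore:
    """Stand-in for a Redis client so the sketch runs offline."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value, ex=None):
        # `ex` mirrors redis-py's expiry in seconds; this stand-in ignores it.
        self._data[key] = value


class CachedCompleter:
    def __init__(self, store, complete_fn, ttl_s=3600):
        self.store = store
        self.complete_fn = complete_fn  # callable(model, prompt) -> str
        self.ttl_s = ttl_s
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt):
        # Hash model and prompt together so different models never share entries.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return "llm:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, model, prompt):
        key = self._key(model, prompt)
        cached = self.store.get(key)
        if cached is not None:
            self.hits += 1
            return cached
        self.misses += 1
        text = self.complete_fn(model, prompt)
        self.store.set(key, text, ex=self.ttl_s)
        return text

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_llm(model, prompt):
    # Stand-in for the real API call.
    return f"[{model}] answer to: {prompt}"


cache = CachedCompleter(InMemoryStore(), fake_llm)
for prompt in ["reset password?", "reset password?", "pricing?",
               "reset password?", "pricing?"]:
    cache.complete("THUDM/glm-4-plus", prompt)
print(f"hit rate: {cache.hit_rate():.0%}")  # hit rate: 60%
```

Exact-match keys only help with literally repeated prompts, which is why templated and autocomplete-style traffic benefits the most. Normalizing whitespace or casing before hashing would raise the hit rate at the risk of serving a subtly wrong answer.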

Streaming for Perceived Latency

This one isn't really a cost optimization, but it lives in the same neighborhood. Streaming completions through Server-Sent Events doesn't reduce token cost — you still pay for every token that comes out — but it dramatically lowers perceived latency. Users see the first token in around 200-300ms, and the experience feels responsive even when the full completion takes 1.2 seconds.
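To put a number on perceived latency, you can wrap whatever stream the client returns and record when the first chunk arrives. The sketch below is a generic wrapper rather than anything specific to one SDK: `StreamTimer` is a hypothetical helper, and the fake generator stands in for a real streaming response.

```python
import time


class StreamTimer:
    """Wrap an iterable of chunks and record time to first chunk and total time."""

    def __init__(self, chunks, clock=time.monotonic):
        self._chunks = chunks
        self._clock = clock  # injectable so timings can be tested deterministically
        self.first_chunk_ms = None
        self.total_ms = None

    def __iter__(self):
        start = self._clock()
        for chunk in self._chunks:
            if self.first_chunk_ms is None:
                self.first_chunk_ms = (self._clock() - start) * 1000
            yield chunk
        self.total_ms = (self._clock() - start) * 1000


# Fake stream driven by a fake clock, so the example runs instantly.
now = [0.0]


def fake_stream():
    now[0] += 0.25   # first chunk arrives after 250 ms
    yield "Hel"
    now[0] += 0.95   # the rest arrives 950 ms later
    yield "lo"


timer = StreamTimer(fake_stream(), clock=lambda: now[0])
text = "".join(timer)
print(text, round(timer.first_chunk_ms), round(timer.total_ms))  # Hello 250 1200
```

The clock here starts when iteration begins. To include connection and queueing time, start your own timer before sending the request and subtract from there.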

Our throughput stats settled at around 320 tokens per second per region once we tuned the batching, which gave us enough headroom to ride out traffic spikes without scaling the cluster out. Auto-scaling kicked in maybe twice a month during promotional campaigns, and the rest of the time we ran lean.

The Production Numbers (After Three Months)

Let me put some actual numbers on this so you can calibrate expectations for your own migration:

  • Monthly inference spend dropped 58% year-over-year
  • p99 latency on chat endpoints: 1.18 seconds (down from 1.6s on GPT-4o)
  • 99.95% observed uptime across all three regions
  • Cache hit rate stabilized at 42%
  • Average benchmark score across our eval suite: 84.6%

That last bullet is the one I want to highlight. The narrative in some corners of our industry is that cheaper models mean worse quality. That's not what we observed. For our specific workload mix — structured extraction, summarization, classification — the gap between GPT-4o and the DeepSeek/GLM/Qwen tier was within the noise floor of our evaluation methodology.

A Few Hard-Learned Lessons

If I could go back and tell my past self three things before starting this project, they'd be these:

First, don't trust a single benchmark. I built an internal eval suite of 400 prompts drawn from real production traffic (with PII scrubbed, obviously). Generic benchmarks like MMLU are useful for orientation but they're not your traffic.
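"Within the noise floor" is worth quantifying before you trust it. A rough sketch, assuming each prompt is scored pass/fail and treating the two runs as independent samples (both are simplifications, and the pass counts below are made up): compare the gap between two pass rates against the standard error of their difference.

```python
import math


def pass_rate_interval(passed, total, z=1.96):
    """Pass rate with a normal-approximation 95% interval."""
    p = passed / total
    se = math.sqrt(p * (1 - p) / total)
    return p, p - z * se, p + z * se


def within_noise(passed_a, passed_b, total, z=1.96):
    """True if the gap between two pass rates is inside z standard errors."""
    pa, pb = passed_a / total, passed_b / total
    se_diff = math.sqrt(pa * (1 - pa) / total + pb * (1 - pb) / total)
    return abs(pa - pb) <= z * se_diff


# Illustrative counts on a 400-prompt suite (not our real results).
rate, low, high = pass_rate_interval(338, 400)
print(f"pass rate {rate:.1%}, 95% interval {low:.1%} to {high:.1%}")
print(within_noise(338, 344, 400))  # True: a 1.5 point gap is not distinguishable
print(within_noise(338, 380, 400))  # False: a 10.5 point gap is
```

With 400 prompts, a pass rate in the mid-80s carries an interval of roughly plus or minus 3.5 points, so gaps of a point or two tell you very little. Because both models see the same prompts, a paired comparison would be tighter than this independent-sample approximation.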

Second, watch the rate-limit dashboard like a hawk during the first week. We hit an unexpected quota ceiling on day three because we'd underestimated weekend traffic from the APAC region. The failover logic saved us, but only because it existed.

Third, set up a kill switch. I'm serious. We had one Friday afternoon where a misconfigured prompt template started generating 8K-token outputs on what should have been 200-token summaries. A kill switch in our router — one that caps max_tokens per request based on prompt type — saved us from a roughly $14,000 incident.
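A minimal sketch of that kind of cap, assuming the router already knows the prompt type. The type names and ceilings here are hypothetical, not our production values.

```python
# Hypothetical per-prompt-type ceilings on generated tokens.
MAX_TOKENS_BY_TYPE = {
    "summary": 300,
    "extraction": 500,
    "chat": 1000,
}
# Conservative ceiling for any prompt type the table doesn't know about.
DEFAULT_MAX_TOKENS = 256


def capped_max_tokens(prompt_type, requested=None):
    """Return the max_tokens value to send, never above the type's ceiling."""
    ceiling = MAX_TOKENS_BY_TYPE.get(prompt_type, DEFAULT_MAX_TOKENS)
    if requested is None or requested <= 0:
        return ceiling
    return min(requested, ceiling)


print(capped_max_tokens("summary", 8000))  # 300
print(capped_max_tokens("chat", 200))      # 200
print(capped_max_tokens("unknown-type"))   # 256
```

The returned value goes into the `max_tokens` argument of the completion call, so a runaway template gets truncated output instead of a runaway invoice. Defaulting unknown types to the lowest ceiling means a new, unregistered prompt type fails cheap.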

A Tiny Script I Wish I'd Written Sooner

Here's the small monitoring script I now have running on every inference pod. It tracks p99 latency over rolling windows and alerts when we breach the SLA. It's embarrassingly simple, but it's caught three drift incidents before users noticed:

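import math
from collections import deque


class P99Tracker:
    def __init__(self, window_size: int = 1000, sla_ms: float = 1500):
        # Rolling window of recent latencies; the oldest samples fall off automatically.
        self.window = deque(maxlen=window_size)
        self.sla_ms = sla_ms
        self.breach_count = 0

    def record(self, elapsed_ms: float):
        self.window.append(elapsed_ms)
        # Wait for 100 samples so a handful of slow requests can't trip the alert.
        if len(self.window) < 100:
            return
        p99 = self._p99()  # sort once per record and reuse the value
        if p99 > self.sla_ms:
            self.breach_count += 1
            # Alert on every 10th breaching sample to avoid flooding the log.
            if self.breach_count % 10 == 0:
                print(f"[ALERT] p99 latency {p99:.1f}ms exceeds SLA {self.sla_ms}ms")

    def _p99(self) -> float:
        # Nearest-rank percentile: with 100 samples this is the 99th value, not the max.
        sorted_window = sorted(self.window)
        idx = math.ceil(len(sorted_window) * 99 / 100) - 1
        return sorted_window[idx]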

That's it. No Prometheus exporter, no fancy histograms. Just a deque, a sort, and an alert. Run it as a sidecar, ship the breach counts to your observability stack, and move on with your life.

What I'd Tell Another Architect

If you're staring down a runaway AI API bill right now, here's the path I'd recommend. Start by writing down your SLA in concrete terms — p99 latency, uptime, throughput — and refuse to discuss model selection until that document exists. Then build a small eval suite from your real production traffic. Then look at the pricing tables honestly, including the providers you've been ignoring.

The market in 2026 is genuinely competitive. With 184 models on offer through Global API, with prices spanning from $0.01 to $3.50 per million tokens, there's almost certainly a
