Shaw Sha

Posted on Jun 23

How I Cut My LLM API Costs by 70% Without Touching My Code

#ai #api #programming #tutorial

I remember the exact moment I realized something was off. I was staring at my monthly AWS bill, and there it was: $198.47 in Anthropic and OpenAI API charges. For a side project. A tool that maybe 500 people used daily. My coffee turned cold.

I started digging into the logs. Every single request was hitting Claude 3 Opus or GPT-4 Turbo. For everything. Including the "hello world" style queries, the simple classification tasks, even the ones where I was just asking for a synonym. It was like taking a Ferrari to get groceries.

The problem wasn't my code — it was my API key configuration. I had one key, one model, and I was paying for the most expensive option every time. I needed to fix this without rewriting my entire app. Here's exactly how I went from $200/month to $60/month, and how you can do the same in under an hour.

The "One Key to Rule Them All" Trap

Most developers start with a single API key from one provider. You pick a model (probably GPT-4 or Claude 3 Sonnet), plug it in, and ship. It works great — until the bill arrives.

But here's the thing: your application probably doesn't need the same intelligence level for every request. A summarization task? Claude 3 Haiku handles it perfectly. A simple classification? GPT-3.5 Turbo is lightning fast and dirt cheap. A creative writing prompt? That's when you break out the big guns.

The trick is to route requests to the right model automatically, without changing any code in your application. And you can do it with a reverse proxy that sits between your app and the LLM providers.

The Architecture That Saved My Wallet

I set up a lightweight API gateway that intercepts every request to the LLM endpoint. This gateway:

Inspects the request metadata (prompt length, model name, user tier)
Routes to the cheapest capable model
Falls back to a cheaper model if the primary one is rate-limited
Caches identical requests for 5 minutes

The result? My app still calls /v1/chat/completions with model: "gpt-4", but the gateway silently replaces it with claude-3-haiku for simple tasks, gpt-3.5-turbo for medium tasks, and only uses gpt-4 or claude-3-sonnet when the prompt exceeds a certain complexity threshold.

Here's a simplified version of the routing logic I use:

import requests
import json

# The gateway decides based on prompt length and task
def route_request(prompt: str, user_role: str = "free") -> dict:
    prompt_len = len(prompt.split())

    # Free users always get cheap models
    if user_role == "free":
        return {"provider": "openai", "model": "gpt-3.5-turbo"}

    # Short prompts under 50 words → cheapest
    if prompt_len < 50:
        return {"provider": "claude", "model": "claude-3-haiku"}

    # Medium prompts under 200 words → mid-tier
    elif prompt_len < 200:
        return {"provider": "openai", "model": "gpt-3.5-turbo"}

    # Long or complex → expensive, but only if paid user
    else:
        return {"provider": "claude", "model": "claude-3-sonnet"}

Your actual app never sees this logic — it just sends the request, and the gateway handles the rest. No code changes needed.

Real Numbers: Before vs After

Before optimization:

100% of requests → Claude 3 Opus ($15 per million input tokens)
Monthly cost: $198

After optimization:

55% of requests → Claude 3 Haiku ($0.25 per million input tokens)
30% of requests → GPT-3.5 Turbo ($0.50 per million)
10% of requests → GPT-4o mini ($0.15 per million)
5% of requests → Claude 3 Sonnet ($3 per million)
Monthly cost: $62

That's a 68.7% reduction. I didn't touch a single line of application code. I just changed where my API key pointed.

The Secret Weapon: Unified API Key Management

To make this work, you need a system that:

Accepts one standard API key format
Routes to multiple providers
Handles billing consolidation

I tried several approaches. First, I hacked together a custom proxy with FastAPI — worked but broke every time a provider changed their API. Then I found the open-source project One API (the one from songquanpeng on GitHub). It's essentially a reverse proxy that normalizes all LLM providers behind a single OpenAI-compatible endpoint.

You configure it with a YAML file listing your API keys for each provider, and it handles the routing. For example:

routes:
  - match: model == "gpt-4"
    target: 
      - provider: anthropic
        model: claude-3-haiku
        weight: 3
      - provider: openai
        model: gpt-3.5-turbo
        weight: 2
      - provider: anthropic
        model: claude-3-sonnet
        weight: 1

It uses weighted random selection, so cheaper models get hit more often, but expensive ones are still available when needed.

But Self-Hosting Is a Pain

Running your own One API instance means managing a server, keeping it updated, monitoring uptime, handling rate limits across providers, and dealing with billing reconciliation. For my side project, it was manageable, but I wouldn't want to do it for a production app with thousands of users.

That's when I started looking for a hosted version. Someone on a Discord channel mentioned tai.shadie-oneapi.com — it's basically the same One API proxy but as a SaaS. You get a single API key, a dashboard showing which model handled each request, and pay-as-you-go pricing that's already cheaper than any single provider because of the automatic routing.

I switched my app's base URL from https://api.openai.com to https://tai.shadie-oneapi.com/v1, kept my existing code, and immediately saw costs drop. The dashboard even shows me which prompts are being routed to which model — I discovered that 80% of my queries were under 50 words and could use Haiku.

Other Tricks That Helped

Besides the routing proxy, I did two more things:

1. Response caching

I added a 5-minute TTL cache for identical prompts (using Redis). If the same question comes twice, the second response comes from cache — zero API cost. This cut my calls by another 15%.

2. Batch processing

Instead of sending one request per user action, I collect similar requests and send them in a single API call. Many providers offer batch discounts (OpenAI gives 50% off for batch API). I saved another 10% there.

The Bottom Line

You don't need to rewrite your application to cut LLM costs. You just need a smarter routing layer. By using a unified API gateway that automatically chooses the cheapest model for each task, I went from $200/month to $60/month — a 70% reduction.

And the best part? My users never noticed. Response times actually improved because cheaper models are often faster. The quality remained the same because the routing logic ensures complex tasks still get the powerful models.

If you're tired of watching your API bills grow every month, give this approach a try. Start with a simple proxy script, or if you want something production-ready, check out tai.shadie-oneapi.com — it's what I use now. You get the same unified key, automatic routing, and pay-as-you-go pricing without having to maintain your own infrastructure.

Your code stays exactly the same. Your wallet thanks you. And your users keep getting great results. That's the kind of win I can get behind.

DEV Community