You shipped your AI feature. Users love it. Then the invoice arrives and you realize you're spending $4,000/month on tokens for an app with 200 users. That math doesn't work.
I've spent the last year optimizing LLM costs across production systems — cutting bills from $3,200/month down to $480 without touching product quality. Here's the actual playbook.
1. Understand Where Your Money Actually Goes
Before optimizing anything, instrument your costs. Most teams fly blind.
Add token tracking to every API call:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}]
)

# Log this to your analytics, tagged by endpoint or feature
print(f"Input: {response.usage.input_tokens} | Output: {response.usage.output_tokens}")
When you aggregate a week of logs, you'll almost always find two things:
- 20% of your call types consume 80% of your costs (usually long-context or high-frequency endpoints)
- Output tokens are your biggest lever — they cost 3-5x more than input tokens per unit
Fix the expensive 20% first. Everything else is noise.
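Here's a minimal sketch of that aggregation step. It assumes you log one row per call with a call_type tag plus token counts; the file name, column names, and per-million-token prices below are illustrative, so substitute your own:

# Hypothetical log format: one CSV row per call with call_type, input_tokens, output_tokens
import csv
from collections import defaultdict

INPUT_PRICE_PER_M = 3.00    # example input price per million tokens
OUTPUT_PRICE_PER_M = 15.00  # example output price per million tokens

costs = defaultdict(float)
with open("token_logs.csv") as f:
    for row in csv.DictReader(f):
        cost = (int(row["input_tokens"]) * INPUT_PRICE_PER_M
                + int(row["output_tokens"]) * OUTPUT_PRICE_PER_M) / 1_000_000
        costs[row["call_type"]] += cost

# The top handful of call types will almost always dominate the total
for call_type, total in sorted(costs.items(), key=lambda kv: -kv[1]):
    print(f"{call_type}: ${total:.2f}")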
2. Prompt Caching Is a Cheat Code
If you're not using prompt caching on Anthropic or OpenAI, you're leaving the single biggest optimization on the table.
The idea: mark parts of your prompt as cacheable. When the same prefix hits the API again, you pay a fraction of the normal input cost (roughly 10% on Anthropic; OpenAI's automatic caching gives a smaller discount on cached input). For system prompts, documents, or tool definitions that stay constant across requests, this is massive.
On Anthropic:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": your_long_system_prompt,
"cache_control": {"type": "ephemeral"}
}
],
messages=conversation_history
)
If your system prompt is 2,000 tokens and you're handling 10,000 requests/day, that's 20M tokens/day. At $3/M input tokens vs $0.30/M cached — you've just saved $54/day without changing a single user-facing behavior.
RAG pipelines are the obvious win here: cache your retrieved context chunks across similar queries.
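A sketch of that pattern: Anthropic accepts cache_control on individual content blocks inside messages, not just the system prompt, so retrieved context can be marked cacheable whenever the same chunks come back verbatim. retrieve_chunks and user_question below are hypothetical stand-ins for your own pipeline:

context = "\n\n".join(retrieve_chunks(user_question))

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                # The retrieved context is marked cacheable; follow-up queries that
                # pull the exact same chunks pay the cached rate for this block
                {"type": "text", "text": context, "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": user_question},
            ],
        }
    ],
)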
3. Right-Size Your Models Ruthlessly
Using GPT-4o or Claude Opus for everything is like renting a helicopter to commute to work. It works. It's absurd.
The practical tiering that works in production:
| Task | Model | Why |
|---|---|---|
| Classification, routing, intent detection | Haiku / GPT-4o-mini | Fast, cheap, accurate enough |
| Summarization, extraction, RAG answers | Sonnet / GPT-4o | Good balance |
| Complex reasoning, code gen, multi-step | Opus / o3 | Actually needs it |
Build a routing layer. Before hitting your expensive model, run a cheap classifier:
def route_request(user_message: str) -> str:
    # Simple heuristics first — no LLM needed
    if len(user_message) < 50 and is_simple_query(user_message):
        return "haiku"
    # Use a cheap model to classify complexity
    complexity = classify_with_haiku(user_message)
    return "opus" if complexity == "high" else "sonnet"
In practice, 60-70% of requests can be handled by cheaper models. Users don't notice the difference on straightforward tasks.
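To make the routing concrete, here's a small dispatch sketch on top of route_request. The tier-to-model mapping is illustrative, and the haiku/opus IDs are placeholders for whichever versions you actually deploy:

# Illustrative mapping from router tier to model ID
MODEL_TIERS = {
    "haiku": "<your-haiku-model-id>",
    "sonnet": "claude-sonnet-4-6",
    "opus": "<your-opus-model-id>",
}

def answer(user_message: str) -> str:
    tier = route_request(user_message)
    response = client.messages.create(
        model=MODEL_TIERS[tier],
        max_tokens=1024,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text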
4. Output Token Control Is Underrated
Developers obsess over input token optimization but ignore output tokens — which are where most bills accumulate.
Three tactics that work immediately:
Set hard max_tokens limits. Don't let the model write a novel when you need a paragraph. If your feature returns summaries, 300 tokens should be your ceiling, not 4,096.
Explicit length instructions in your prompt. "Respond in 2-3 sentences maximum" works better than you'd expect. LLMs follow length instructions reliably.
Use structured output to eliminate padding. Instead of asking for freeform text and parsing it, request JSON:
# Verbose: "Please analyze the sentiment of the following text and explain your reasoning..."
# → 200+ tokens of hedging and explanation
# Tight: "Classify sentiment. Return JSON: {sentiment: positive|negative|neutral, score: 0-1}"
# → 20 tokens, same signal
Switching from freeform to structured output on classification tasks routinely cuts output costs by 70-90%.
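Here's a sketch combining all three tactics: a hard max_tokens cap, an explicit brevity instruction, and the JSON shape spelled out in the prompt. The wording and the 300-token ceiling are examples, and text stands in for whatever you're classifying:

import json

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=300,  # hard ceiling; the model cannot exceed it
    messages=[{
        "role": "user",
        "content": (
            "Classify the sentiment of the text below. "
            'Return only JSON: {"sentiment": "positive|negative|neutral", "score": 0-1}. '
            "No explanation.\n\n" + text
        ),
    }],
)
result = json.loads(response.content[0].text)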
5. Semantic Caching at the Application Layer
API-level caching handles identical prompts. Semantic caching handles similar ones — which is where real production traffic lives.
The pattern: embed incoming queries, check cosine similarity against a cache of recent query-response pairs, serve cached responses when similarity exceeds a threshold (0.92 works well in practice).
def semantic_cache_lookup(query: str, cache: list, threshold=0.92):
    query_embedding = embed(query)
    for cached_query, cached_response, cached_embedding in cache:
        similarity = cosine_similarity(query_embedding, cached_embedding)
        if similarity >= threshold:
            return cached_response  # Cache hit — zero API cost
    return None  # Cache miss — call the API
For FAQ-style apps, support bots, or any domain with query clustering, this can cut API calls by 30-50%. Users asking "how do I reset my password" and "forgot my password how to reset" get the same answer — you only paid once.
Use Redis with vector search (RedisStack) or a lightweight in-memory store for prototypes.
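To close the loop on the lookup above, here's a sketch of the write path. call_llm and embed are hypothetical stand-ins for your completion and embedding calls:

def answer_with_cache(query: str, cache: list, threshold=0.92) -> str:
    cached = semantic_cache_lookup(query, cache, threshold)
    if cached is not None:
        return cached  # served from cache, zero API spend

    response = call_llm(query)                      # your normal completion call
    cache.append((query, response, embed(query)))   # store for future near-duplicates
    return response

In production you'd also cap the cache size and expire entries when the underlying answers change.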
6. Batch Processing and Async Patterns
Not every request needs a real-time response. This distinction is worth thousands of dollars per month.
Anthropic's Batch API (and OpenAI's equivalent) costs 50% less than standard API calls. If you're running enrichment pipelines, nightly reports, bulk classification, or any non-interactive workload — there is zero reason to use synchronous calls.
# Firing 1,000 calls concurrently is faster, but you still pay full price
# (and it requires anthropic.AsyncAnthropic rather than the sync client)
results = await asyncio.gather(*[
    client.messages.create(...) for item in batch
])
# Use the Batch API at half cost — results come back within ~24 hours
batch = client.messages.batches.create(requests=[...])
Beyond batching: audit your event-driven flows. Teams often trigger LLM calls on every webhook, every database write, every user action — when the actual requirement is "process this within 5 minutes." Queuing and processing in batches costs less and reduces rate-limit errors.
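Here's a minimal sketch of that queue-and-flush pattern. The flush interval and the process_batch handler are illustrative:

import asyncio

queue: asyncio.Queue = asyncio.Queue()

async def enqueue(event) -> None:
    # Webhooks and DB writes just drop work onto the queue and return immediately
    await queue.put(event)

async def flush_worker(interval_seconds: int = 300) -> None:
    # Every few minutes, drain whatever accumulated and process it as one batch
    while True:
        await asyncio.sleep(interval_seconds)
        batch = []
        while not queue.empty():
            batch.append(queue.get_nowait())
        if batch:
            await process_batch(batch)  # e.g. one Batch API submission for the whole set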
The Compounding Effect
These optimizations aren't additive — they multiply.
A real example: a customer support bot running GPT-4o on everything, no caching, no output limits, synchronous for all requests.
- Switch classification to GPT-4o-mini: −55% cost
- Add prompt caching for system prompt + docs: −40% of remaining
- Output token limits + structured responses: −35% of remaining
- Semantic cache layer: −25% of remaining
Combined: the original $3,200 bill dropped to roughly $480. The product got faster and users noticed nothing.
The mistake is treating LLM costs as fixed infrastructure. They're not. They're engineering decisions, and the right decisions compound hard.
I compiled everything into a practical guide: AI API Cost Optimization Handbook