Mudassir Khan

Posted on Jul 1

LLM Cost Optimization: Cutting Inference Bills Without Killing Quality

#ai #llm #tutorial #webdev

You can cut your LLM API spend by 50 to 90% without switching models or degrading output quality. The techniques exist, the docs are public, and most teams are not using them. Here is what actually moves the needle.

Where your LLM bill actually comes from

Every API call charges you for input tokens plus output tokens. Simple math, but "input tokens" is a bigger footgun than it looks.

Most production workloads send the same system prompt, instructions, or retrieval context on every single request. If your system prompt is 4,000 tokens and you fire 100 requests per minute, that's 400,000 input tokens per minute burning at full price. Before you optimize anything else, map where your token spend is actually going.

Three buckets drive most bills:

Repeated context: system prompts, tool definitions, retrieved documents sent on every call
Frontier model overuse: sending simple classification tasks to GPT-4o or Claude Sonnet when a cheaper model handles them just fine
Bloated outputs: getting paragraph prose back when you need a 10-token structured field

Once you know which bucket hurts most, the fix is obvious. The rest of this article covers each one with concrete numbers.
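To make the "map where your token spend is going" step concrete, here is a minimal sketch that totals estimated spend per bucket from request logs. The log field names and the per-token prices are hypothetical placeholders, not values from any provider; substitute your own.

```python
from collections import Counter

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def spend_by_bucket(request_logs):
    """Sum estimated spend per bucket across logged requests.

    Each log entry is a dict of token counts keyed by bucket name. The
    "output" bucket is billed at the output rate, everything else at the
    input rate.
    """
    totals = Counter()
    for entry in request_logs:
        for bucket, tokens in entry.items():
            rate = OUTPUT_USD_PER_MTOK if bucket == "output" else INPUT_USD_PER_MTOK
            totals[bucket] += tokens / 1_000_000 * rate
    return dict(totals)


# One minute of traffic: 100 requests, each with a 4,000-token system prompt.
logs = [
    {"system_prompt": 4000, "retrieved_docs": 1500, "user_message": 200, "output": 300},
] * 100

costs = spend_by_bucket(logs)
for bucket, usd in sorted(costs.items(), key=lambda kv: -kv[1]):
    print(f"{bucket:15s} ${usd:.4f} per minute")
```

With these made-up numbers the system prompt dominates, which is the pattern that makes caching the first thing to try.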

Prompt caching: the highest ROI optimization nobody uses

If your app sends the same prefix on every call (system prompt, tool definitions, few shot examples), prompt caching is the single highest impact change you can make.

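Anthropic's caching API lets you mark a portion of your input as cacheable with a cache_control block. The first call writes the cache at a 25% premium over the normal input token price (1.25x). Every subsequent cache hit costs 10% of normal. That is a 90% cost reduction on the cached portion for every call after the first, as long as calls keep arriving within the cache lifetime (5 minutes by default, refreshed on each hit).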

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """
You are a code review assistant. Your job is to identify bugs,
suggest improvements, and flag security issues in submitted code.
... (your 2000-token system prompt here) ...
"""

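# Placeholder input; in a real app this is the code submitted for review.
user_code = "def add(a, b):\n    return a - b"

response = client.messages.create(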
    model="claude-sonnet-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": user_code}
    ]
)

On the first call you pay the cache write cost (125% of the normal input price). From the second call onward, every hit costs 10% of what you would normally pay. For an app firing 1,000 calls per hour against a 3,000-token system prompt, that is 3 million prefix tokens per hour billed at 10% of full price, a saving equivalent to about 2.7 million full-price tokens per hour.
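A quick way to check the arithmetic for your own traffic is sketched below. The multipliers default to a 1.25x cache write and a 0.10x cache read, which reflects Anthropic's published pricing for the default 5-minute cache at the time of writing. Treat them, and the $3 per million tokens price, as assumptions to verify against the current pricing page.

```python
def cached_prefix_cost(prefix_tokens, calls, usd_per_mtok,
                       write_multiplier=1.25, read_multiplier=0.10):
    """Return (uncached_cost, cached_cost) in USD for the shared prefix only.

    Assumes the first call writes the cache and every later call hits it,
    i.e. calls arrive often enough that the cache never expires.
    """
    if calls < 1:
        return 0.0, 0.0
    per_call = prefix_tokens / 1_000_000 * usd_per_mtok
    uncached = per_call * calls
    cached = per_call * write_multiplier + per_call * read_multiplier * (calls - 1)
    return uncached, cached


# The example from the text: 1,000 calls per hour, 3,000-token system prompt.
# The $3.00 per million input tokens price is a hypothetical placeholder.
uncached, cached = cached_prefix_cost(prefix_tokens=3000, calls=1000, usd_per_mtok=3.00)
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, saved {1 - cached / uncached:.1%}")
```

Under these assumptions caching already pays for itself on the second call: two uncached calls cost 2.0x the prefix price, while one write plus one read costs 1.35x.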

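OpenAI applies prompt caching automatically to prompts of 1,024 tokens or more, with no cache_control marker and no write surcharge. The discount on cached input tokens varies by model, so check the pricing page, but the principle is the same: keep the static prefix first and the variable content last so the prefix can be reused.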

Prompt caching helps most when: your system prompt is over 1,000 tokens, you are running a RAG pipeline that injects the same documents repeatedly, or you have an agent framework with large tool definitions reused across turns.

Model routing: stop sending everything to the frontier model

Not every task needs GPT-4o or Claude Sonnet. Intent classification, slot extraction, summarization, and simple Q&A are well within what GPT-4o-mini or Claude Haiku can handle. The cost difference is roughly 10 to 20 times per token.

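Routing 40% of traffic to a model that is 10 to 20 times cheaper cuts total API spend by roughly 36 to 38%, assuming the routed requests carry about the same token volume as the rest. Halving the bill takes routing a little over half of your traffic. Either way, you have not touched a single user facing feature. The key is picking the right tasks.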

import OpenAI from "openai";

const client = new OpenAI();

type TaskComplexity = "simple" | "complex";

function classifyTask(userMessage: string): TaskComplexity {
  const reasoningKeywords = [
    "explain why",
    "compare",
    "analyze",
    "critique",
    "design architecture"
  ];

  const needsReasoning = reasoningKeywords.some(kw =>
    userMessage.toLowerCase().includes(kw)
  );

  if (!needsReasoning && userMessage.length < 200) {
    return "simple";
  }
  return "complex";
}

async function routedCompletion(userMessage: string) {
  const complexity = classifyTask(userMessage);
  const model = complexity === "simple" ? "gpt-4o-mini" : "gpt-4o";

  const response = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: userMessage }],
  });

  return { response, model_used: model };
}

This is a naive router based on keywords and message length. Production versions use a small classifier (or another LLM) to score complexity before routing. Projects like LiteLLM and RouteLLM give you prebuilt routers if you don't want to build your own.
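Before investing in a better router, it may help to estimate what routing is worth. The sketch below is back-of-the-envelope math, assuming routed requests have the same average token count as the rest of your traffic. The function names are illustrative and not part of any library.

```python
def routing_savings(routed_fraction: float, price_ratio: float) -> float:
    """Fraction of total spend saved when `routed_fraction` of token volume
    moves to a model that is `price_ratio` times cheaper per token."""
    if price_ratio <= 1:
        raise ValueError("price_ratio must be greater than 1")
    return routed_fraction * (1 - 1 / price_ratio)


def fraction_needed(target_savings: float, price_ratio: float) -> float:
    """Share of traffic that must be routed to reach `target_savings`."""
    if price_ratio <= 1:
        raise ValueError("price_ratio must be greater than 1")
    return target_savings / (1 - 1 / price_ratio)


for ratio in (10, 20):
    saved = routing_savings(0.40, ratio)
    needed = fraction_needed(0.50, ratio)
    print(f"{ratio}x cheaper: routing 40% saves {saved:.0%}; "
          f"halving spend needs {needed:.0%} routed")
```

The savings are capped by the routed fraction itself, so the quality of the router matters more than squeezing the price ratio.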

The failure mode to watch: tasks that look simple ("what does this error mean?") but need surrounding context to answer well. Build an escalation path so users can retry against the full model when the cheaper one falls short.
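One possible shape for that escalation path is sketched here. The model calls are passed in as plain callables so the logic stays provider-agnostic; `is_good_enough` is a hypothetical hook that could be an automated check or simply a user clicking "retry".

```python
from typing import Callable


def answer_with_escalation(
    prompt: str,
    cheap_model: Callable[[str], str],
    strong_model: Callable[[str], str],
    is_good_enough: Callable[[str, str], bool],
) -> tuple[str, str]:
    """Try the cheap model first; fall back to the strong model if the check fails.

    Returns (answer, tier) so callers can log how often escalation happens.
    """
    draft = cheap_model(prompt)
    if is_good_enough(prompt, draft):
        return draft, "cheap"
    return strong_model(prompt), "strong"


# Stub models for illustration only; real ones would call an API.
def cheap_stub(prompt: str) -> str:
    return "" if "stack trace" in prompt else "short answer"


def strong_stub(prompt: str) -> str:
    return "detailed answer"


def non_empty(prompt: str, answer: str) -> bool:
    return bool(answer.strip())


print(answer_with_escalation("hello", cheap_stub, strong_stub, non_empty))
print(answer_with_escalation("what does this stack trace mean?",
                             cheap_stub, strong_stub, non_empty))
```

Logging the returned tier gives you the escalation rate, which is the number to watch when tuning the router.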

Batching and async inference for workloads that can wait

If you are running evals, processing documents, generating reports, or doing any bulk inference where a 24-hour window is acceptable, the OpenAI Batch API cuts costs by 50% versus synchronous pricing.

You submit a JSONL file. OpenAI processes it asynchronously. You pull results when they are ready. No streaming, no realtime SLAs.

import openai
import json
from pathlib import Path

client = openai.OpenAI()

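# `documents` maps a document ID to its text; placeholder values shown here.
documents = {"doc-1": "First document text", "doc-2": "Second document text"}

requests = []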
for doc_id, text in documents.items():
    requests.append({
        "custom_id": doc_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "user", "content": f"Summarize in 3 sentences: {text}"}
            ],
            "max_tokens": 150
        }
    })

batch_path = Path("/tmp/batch_input.jsonl")
batch_path.write_text("\n".join(json.dumps(r) for r in requests))

with open(batch_path, "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch submitted: {batch.id}")
# Poll client.batches.retrieve(batch.id) to check status
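The polling step from that last comment could look roughly like the sketch below. It assumes the `client` and `batch` objects from the block above and the output-file layout described in OpenAI's Batch API docs (one JSON object per line, with `custom_id` and a `response` holding the completion body). The helper names are my own.

```python
import json
import time

# Statuses after which the batch will not change again.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(client, batch_id, poll_seconds=60.0):
    """Poll until the batch reaches a terminal status, then return the batch object."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        time.sleep(poll_seconds)


def parse_batch_output(jsonl_text):
    """Map custom_id -> completion text for each successful line of the output file."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response")
        if record.get("error") or not response or response.get("status_code") != 200:
            continue  # failed requests are skipped here; handle them separately
        results[record["custom_id"]] = response["body"]["choices"][0]["message"]["content"]
    return results


# Usage with the `client` and `batch` from the block above:
# finished = wait_for_batch(client, batch.id)
# if finished.status == "completed":
#     output = client.files.content(finished.output_file_id).text
#     summaries = parse_batch_output(output)
```

Keying results by `custom_id` matters because the output file is not guaranteed to preserve input order.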

Stack this with model routing: batch jobs on the cheap model with the 50% discount applied on top. That is where the cost math gets seriously interesting.

Anthropic's Message Batches API offers the same trade: asynchronous processing at a 50% discount. The API shape differs but the pricing pattern is the same.

Output schema discipline to shrink response tokens

Freeform prose outputs are expensive. When your downstream code parses a JSON blob out of a markdown code block, you are paying for the explanation text, the code fences, and the commentary. None of that is useful to you.

Structured outputs fix this. Both OpenAI and Anthropic support schema constrained generation that returns only the fields you defined, as JSON. The example below forces a tool call on Anthropic, so the model's entire response is the tool's arguments.

from anthropic import Anthropic
import json

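client = Anthropic()

# Placeholder input; in a real app this is the text to analyze.
user_text = "The new release fixed the login bug, but checkout is still slow."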

response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=256,
    tools=[{
        "name": "extract_sentiment",
        "description": "Extract sentiment analysis from text",
        "input_schema": {
            "type": "object",
            "properties": {
                "sentiment": {
                    "type": "string",
                    "enum": ["positive", "negative", "neutral"]
                },
                "confidence": {"type": "number"},
                "topics": {
                    "type": "array",
                    "items": {"type": "string"},
                    "maxItems": 3
                }
            },
            "required": ["sentiment", "confidence", "topics"]
        }
    }],
    tool_choice={"type": "tool", "name": "extract_sentiment"},
    messages=[{"role": "user", "content": user_text}]
)

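# With tool_choice forcing the tool, the response contains a tool_use block
# whose .input is already a parsed dict, so json.loads is not needed.
tool_call = next(block for block in response.content if block.type == "tool_use")
result = tool_call.input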

Forcing the tool call means the model emits only the tool's arguments and stops once the object is complete. No hedging, no markdown formatting, no explanatory prose it doesn't need to write. Output token counts drop noticeably on structured tasks compared to the freeform equivalent.
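Unless a provider's strict schema mode is enabled, tool arguments may not always match the schema exactly, so a cheap check before trusting the payload can be worthwhile. Here is a minimal sketch that validates a payload against the extract_sentiment schema above. The function name is hypothetical, and a JSON Schema library would do the same job more generally.

```python
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}
REQUIRED_FIELDS = {"sentiment", "confidence", "topics"}


def validate_sentiment(payload: dict) -> dict:
    """Return the payload unchanged if it fits the schema, else raise ValueError."""
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if payload["sentiment"] not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {payload['sentiment']!r}")
    confidence = payload["confidence"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    topics = payload["topics"]
    if (not isinstance(topics, list) or len(topics) > 3
            or not all(isinstance(t, str) for t in topics)):
        raise ValueError("topics must be a list of at most 3 strings")
    return payload


sample = {"sentiment": "positive", "confidence": 0.92, "topics": ["login", "checkout"]}
print(validate_sentiment(sample))
```

A failed validation is also a reasonable trigger for a retry or for escalating to a larger model.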

This one bit me in prod before I switched: an eval pipeline returning full explanation paragraphs for every item when all I needed was a score and a label. Two fields. The verbosity was pure waste.

FAQ

How do you reduce LLM API costs?

The highest ROI levers in order: (1) prompt caching for repeated context, (2) model routing for tasks that don't need frontier models, (3) batching for async workloads, (4) schema constrained outputs to shrink response size. Each one targets a different cost bucket and they stack.

Is prompt caching worth it for LLM cost savings?

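Yes, if your system prompt or retrieval context exceeds 1,000 tokens and you send it repeatedly. Anthropic's cache read price is 10% of normal input pricing, and the cache write costs 125%. On a 4,000-token system prompt fired 500 times per day, you pay the write price once and 10% for the other 499, provided calls arrive often enough to keep the cache warm (the default lifetime is 5 minutes, refreshed on each hit). If traffic is bursty, you pay the write price again after each gap. The math is not subtle.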

What is the cheapest way to run an LLM in production?

Self hosted open weight models (Llama, Mistral, Qwen) undercut API pricing at scale but require infrastructure you have to operate. For workloads running against vendor APIs: combine routing (cheap model for simple tasks) plus batching (50% off async) plus caching (up to 90% off repeated context). Most teams are not doing all three, which means most teams are overpaying.

I go deeper on production LLM inference optimization on my blog, including a breakdown of what each technique actually saves at different call volumes.

If you want this wired up on your own stack end to end, that is exactly the kind of work I take on.

What routing or caching setup is your team running in production? Drop a comment. Curious what numbers people are actually seeing.

Top comments (2)

HuiXia-Meshs • Jul 6

Good practical guide. One pattern I keep seeing work well: pair a cheap model (DeepSeek V4 Flash) as default with a higher-quality fallback (Claude Sonnet) only when confidence scoring flags a low-quality response. That hybrid approach can cut costs 60-80% while keeping task quality within 5% of "always use frontier." The missing piece in most tutorials is how to set up that confidence threshold reliably.

Mudassir Khan • Jul 7

the confidence threshold question is the right one — most teams start with a fixed score cutoff (e.g. 'if confidence < 0.7, escalate') and then spend weeks tuning it because the cheap model's self reported confidence is often miscalibrated, especially on out of distribution inputs.

what worked better for us: use a separate judge call (cheap model, 10 tokens) that scores the response on a rubric, not the confidence the main model reports on itself. latency hit is small; accuracy on 'should this have escalated' improved noticeably.

are you running the confidence check as a secondary call or reading it from the primary response?