Jesse

Posted on May 27 • Edited on May 29

How I Cut My LLM API Costs by 90%: A Battle-Tested Token Optimization Guide

Last Updated: May 2026

Audience: Backend engineers, ML engineers, and product developers building LLM-powered applications

Introduction

If you're building production LLM applications, you've probably watched your token costs spiral out of control faster than expected. A moderately successful chatbot can easily burn through $10,000/month, and a high-traffic API integration can hit six figures.

This guide covers battle-tested strategies for reducing token spend without sacrificing output quality. Most of them apply to any LLM provider, though prompt caching and batch discounts depend on what each provider supports.

What you'll learn:

How to cut the cost of repeated input tokens by up to 90% with prompt caching
When to use batch processing for 50% cost savings
How to choose the right model for each task
Architecture patterns that scale token efficiency

1. Understanding the Cost Structure

Before optimizing, you need to understand where your money goes.

Current Pricing (May 2026)

International Providers:

Provider	Model	Input (per MTok)	Output (per MTok)	Context Window
Anthropic	Claude Opus 4.7	$5.00	$25.00	1M tokens
Anthropic	Claude Sonnet 4.6	$3.00	$15.00	1M tokens
Anthropic	Claude Haiku 4.5	$1.00	$5.00	200K tokens
OpenAI	GPT-4o	$2.50	$10.00	128K tokens
OpenAI	GPT-4o-mini	$0.15	$0.60	128K tokens
Google	Gemini 1.5 Pro	$3.50	$10.50	2M tokens

Chinese Providers:

Provider	Model	Input (per MTok)	Output (per MTok)	Context Window
DeepSeek	DeepSeek V4 Pro	$0.14	$0.28	128K tokens
DeepSeek	DeepSeek V4 Flash	$0.07	$0.14	128K tokens
Zhipu AI	GLM 5.1	$0.14	$0.28	128K tokens
Zhipu AI	GLM 5V Turbo	$0.14	$0.28	128K tokens

Key insight: In the tables above, output tokens cost 2-5x as much as input tokens (5x for the Claude models, 2x for DeepSeek and GLM). Optimizing output length often has the highest ROI.

Cost Formula

Total Cost = (Input Tokens × Input Price) + (Output Tokens × Output Price)

For a typical chatbot conversation (2000 input tokens, 500 output tokens) using DeepSeek V4 Pro:

Input cost: 0.002 MTok × $0.14/MTok = $0.00028
Output cost: 0.0005 MTok × $0.28/MTok = $0.00014
Total: $0.00042 per conversation

At 10,000 conversations/day, that's $4.20/day or ~$126/month.

Cost comparison: The same conversation using Claude Sonnet 4.6 would cost $0.0135—32x more expensive than DeepSeek V4 Pro.
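The formula above is easy to wrap in a helper so that estimates stay consistent. This is a minimal sketch: the function name `request_cost` is illustrative, and the prices are the per-MTok figures from the tables above.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in dollars for one request; prices are in dollars per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# The chatbot example: 2,000 input tokens and 500 output tokens
deepseek = request_cost(2000, 500, 0.14, 0.28)
sonnet = request_cost(2000, 500, 3.00, 15.00)

print(f"DeepSeek V4 Pro:   ${deepseek:.5f} per conversation")
print(f"Claude Sonnet 4.6: ${sonnet:.4f} per conversation")
print(f"DeepSeek monthly at 10,000/day: ${deepseek * 10_000 * 30:.2f}")
```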

2. Prompt Caching: The Highest-Impact Optimization

Prompt caching is the single most effective cost reduction technique available today. Both Anthropic and OpenAI now support it natively.

How It Works

When you send a request with prompt caching enabled, the provider caches the prefix of your prompt. Subsequent requests with the same prefix reuse the cached version, dramatically reducing both cost and latency.

Anthropic Prompt Caching

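Anthropic's implementation bills cache reads at 10% of the base input price, a 90% saving on cached input tokens. Cache writes cost more than base input (1.25x with the 5-minute TTL, 2x with the 1-hour TTL):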

Operation	Price (Claude Sonnet 4.6)
Base input	$3.00 / MTok
Cache write (5min TTL)	$3.75 / MTok (1.25x)
Cache write (1h TTL)	$6.00 / MTok (2x)
Cache read	$0.30 / MTok (0.1x)

Implementation:

import anthropic

client = anthropic.Anthropic()

# System prompt with cache control
system_prompt = [
    {
        "type": "text",
        "text": """You are an expert code reviewer. Follow these guidelines:
        - Focus on security vulnerabilities
        - Check for performance issues
        - Verify error handling
        - Suggest improvements with code examples

        [Your full system prompt here...]""",
        "cache_control": {"type": "ephemeral"}
    }
]

# First request: cache write
response1 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=system_prompt,
    messages=[{"role": "user", "content": "Review this Python function..."}]
)

# Subsequent requests: cache read (90% cheaper)
response2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=system_prompt,  # Same system prompt = cache hit
    messages=[{"role": "user", "content": "Review this other function..."}]
)

Real-world savings: A code review tool processing 1,000 requests/day with a 2,000-token system prompt:

Without caching: $6.00/day for system prompt tokens
With caching: $0.60/day for system prompt tokens (cache reads only; each cache write adds a 25% premium on that request)
Savings: $5.40/day ($162/month)
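A slightly fuller estimate also charges for cache writes, which happen whenever the cached prefix has expired. The sketch below assumes the Sonnet multipliers from the table (1.25x write, 0.1x read). The function name and the `cache_writes` parameter are illustrative, and the real number of writes depends on how your traffic lines up with the cache TTL.

```python
def daily_prefix_cost(prefix_tokens: int, requests: int, base_price: float,
                      cache_writes: int = 0,
                      write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    """Daily cost in dollars of sending a shared prompt prefix.

    cache_writes=0 means caching is disabled and every request pays base price.
    Otherwise `cache_writes` requests pay the write price and the rest pay the read price.
    """
    if cache_writes == 0:
        return prefix_tokens * requests * base_price / 1_000_000
    reads = requests - cache_writes
    per_token = cache_writes * base_price * write_mult + reads * base_price * read_mult
    return prefix_tokens * per_token / 1_000_000


no_cache = daily_prefix_cost(2000, 1000, 3.00)
steady = daily_prefix_cost(2000, 1000, 3.00, cache_writes=1)    # cache never expires
bursty = daily_prefix_cost(2000, 1000, 3.00, cache_writes=100)  # frequent expiry

print(f"no cache: ${no_cache:.2f}, steady: ${steady:.2f}, bursty: ${bursty:.2f}")
```

Even with a cache write on every tenth request, the prefix costs far less than sending it uncached each time.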

Automatic Caching (New in 2026)

Anthropic now supports automatic caching for multi-turn conversations:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},  # Auto-cache last block
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi! How can I help?"},
        {"role": "user", "content": "What's the weather?"}
    ]
)

The cache point automatically moves forward as conversations grow. No manual breakpoint management needed.

OpenAI Prompt Caching

OpenAI automatically caches prompts longer than 1,024 tokens (for most models). Cached input tokens are billed at 50% of the standard rate.

from openai import OpenAI

client = OpenAI()

# OpenAI automatically caches long prompts
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Your long system prompt here..."},
        {"role": "user", "content": "Your query here"}
    ]
)

No code changes required—caching happens automatically.
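To confirm that caching is actually kicking in, you can read the cached-token count from the response's usage data. The helper below is a sketch: `cached_fraction` is an illustrative name, and it assumes the Chat Completions usage object exposes `prompt_tokens` and `prompt_tokens_details.cached_tokens`. The stand-in object at the bottom only mimics that shape so the example runs offline.

```python
from types import SimpleNamespace


def cached_fraction(usage) -> float:
    """Share of prompt tokens served from cache, between 0.0 and 1.0."""
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    return cached / usage.prompt_tokens if usage.prompt_tokens else 0.0


# Stand-in for `response.usage` from a real API call (assumed shape)
fake_usage = SimpleNamespace(
    prompt_tokens=2048,
    prompt_tokens_details=SimpleNamespace(cached_tokens=1536),
)
print(f"{cached_fraction(fake_usage):.0%} of prompt tokens were cached")
```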

Best Practices for Prompt Caching

Place static content first: System prompts, tool definitions, and context should come before dynamic content.
Use explicit breakpoints strategically: For multi-section prompts, place cache_control on sections that change at different frequencies.
Pre-warm caches: Send a "warmup" request before users arrive to eliminate first-request latency.

# Pre-warm cache before users arrive
prewarm = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1,  # Smallest documented value; the single output token is discarded
    system=[
        {
            "type": "text",
            "text": "Your system prompt...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "warmup"}]
)

Monitor cache hit rates: Track cache_read_input_tokens and cache_creation_input_tokens in API responses.
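A cache hit rate can be derived from those two fields plus `input_tokens`. This sketch assumes Anthropic's usage semantics, where `input_tokens` counts only the uncached tokens, so the three fields add up to the total input. `cache_hit_rate` is an illustrative name, and the stand-in objects mimic `response.usage`.

```python
from types import SimpleNamespace


def cache_hit_rate(usages) -> float:
    """Fraction of all input tokens that were served from the prompt cache."""
    read = sum(u.cache_read_input_tokens or 0 for u in usages)
    written = sum(u.cache_creation_input_tokens or 0 for u in usages)
    uncached = sum(u.input_tokens for u in usages)
    total = read + written + uncached
    return read / total if total else 0.0


# Stand-ins for `response.usage`: one cache write followed by two cache reads
usages = [
    SimpleNamespace(input_tokens=50, cache_creation_input_tokens=2000, cache_read_input_tokens=0),
    SimpleNamespace(input_tokens=60, cache_creation_input_tokens=0, cache_read_input_tokens=2000),
    SimpleNamespace(input_tokens=40, cache_creation_input_tokens=0, cache_read_input_tokens=2000),
]
print(f"cache hit rate: {cache_hit_rate(usages):.1%}")
```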

3. Model Selection: Right-Size for the Task

Not every task needs the most expensive model. Implement a routing system that matches tasks to appropriate models.

Task-Based Routing

import anthropic

def route_task(task_type: str, complexity: int) -> str:
    """Route tasks to appropriate models based on type and complexity."""

    routing_table = {
        # Simple tasks: use cheapest model
        "classification": {"low": "claude-haiku-4-5", "high": "claude-sonnet-4-6"},
        "summarization": {"low": "claude-haiku-4-5", "high": "claude-sonnet-4-6"},
        "translation": {"low": "claude-haiku-4-5", "high": "claude-sonnet-4-6"},

        # Complex tasks: use capable model
        "code_generation": {"low": "claude-sonnet-4-6", "high": "claude-opus-4-7"},
        "reasoning": {"low": "claude-sonnet-4-6", "high": "claude-opus-4-7"},
        "analysis": {"low": "claude-sonnet-4-6", "high": "claude-opus-4-7"},
    }

    complexity_level = "high" if complexity > 7 else "low"
    return routing_table.get(task_type, {}).get(complexity_level, "claude-sonnet-4-6")


def call_llm(prompt: str, task_type: str, complexity: int):
    """Call LLM with appropriate model based on task."""
    client = anthropic.Anthropic()
    model = route_task(task_type, complexity)

    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )

    return response.content[0].text

Cost Comparison

For a typical application with mixed tasks (input prices shown; Claude output prices scale by the same ratios):

Task Distribution	Fixed (Sonnet)	Routed (Mixed)	Savings
60% simple tasks	$3.00/MTok	$1.00/MTok	67%
30% medium tasks	$3.00/MTok	$3.00/MTok	0%
10% complex tasks	$3.00/MTok	$5.00/MTok	-67%
Weighted average	$3.00/MTok	$2.00/MTok	33%
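The weighted average is each tier's traffic share times its price, summed. A small sketch for experimenting with other traffic mixes follows. `blended_price` is an illustrative helper, and the shares and input prices are the ones from the three task rows above.

```python
def blended_price(mix: dict[str, tuple[float, float]]) -> float:
    """Weighted average price; mix maps a label to (traffic share, price per MTok)."""
    total_share = sum(share for share, _ in mix.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * price for share, price in mix.values())


routed = blended_price({
    "simple (Haiku)": (0.60, 1.00),
    "medium (Sonnet)": (0.30, 3.00),
    "complex (Opus)": (0.10, 5.00),
})
savings = 1 - routed / 3.00  # versus sending everything to Sonnet

print(f"blended input price: ${routed:.2f}/MTok, savings: {savings:.0%}")
```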

4. Batch Processing: 50% Savings for Async Workloads

For tasks that don't require immediate responses, batch processing offers 50% cost savings.

Anthropic Message Batches API

import time

import anthropic
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

client = anthropic.Anthropic()

# Create batch
batch = client.messages.batches.create(
    requests=[
        Request(
            custom_id=f"review-{i}",
            params=MessageCreateParamsNonStreaming(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=[{"role": "user", "content": f"Review code snippet {i}..."}]
            )
        )
        for i in range(100)  # 100 requests in one batch
    ]
)

# Poll once a minute until the batch has finished processing
while True:
    batch_status = client.messages.batches.retrieve(batch.id)
    if batch_status.processing_status == "ended":
        break
    time.sleep(60)

# Process results
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text}")

Batch Pricing (50% Discount)

Model	Standard Input	Batch Input	Standard Output	Batch Output
Claude Opus 4.7	$5.00	$2.50	$25.00	$12.50
Claude Sonnet 4.6	$3.00	$1.50	$15.00	$7.50
Claude Haiku 4.5	$1.00	$0.50	$5.00	$2.50

When to use batch processing:

Large-scale evaluations
Content moderation
Data analysis
Bulk content generation
Code review pipelines

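Combining batch + caching: You can stack batch processing with prompt caching. The discounts multiply, so a cache read inside a batch is billed at 0.5 × 0.1 = 5% of the base input price, for up to 95% savings on input tokens. Cache hits within a batch are best-effort, since requests are processed concurrently.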
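To see how the two discounts interact, here is a small sketch that applies them as multipliers to a base price. It assumes the 0.5x batch and 0.1x cache-read multipliers quoted in this guide; `effective_input_price` is an illustrative name.

```python
def effective_input_price(base_price: float, batch: bool = False,
                          cache_read: bool = False) -> float:
    """Price per MTok after applying the batch (0.5x) and cache-read (0.1x) multipliers."""
    price = base_price
    if batch:
        price *= 0.5
    if cache_read:
        price *= 0.1
    return price


base = 3.00  # Claude Sonnet 4.6 input, dollars per MTok
for batch in (False, True):
    for cached in (False, True):
        price = effective_input_price(base, batch=batch, cache_read=cached)
        print(f"batch={batch!s:5} cache_read={cached!s:5} -> ${price:.2f}/MTok")
```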

5. Output Optimization

Since output tokens cost 2-5x as much as input tokens, depending on the provider, optimizing output length has high ROI.

Limit Output Length

# Explicit token limit
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,  # Limit output
    messages=[{"role": "user", "content": "Summarize this article in 3 bullet points."}]
)

# Prompt-based length control
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Explain quantum computing. Keep it under 100 words."
    }]
)
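Keep in mind that `max_tokens` is a hard cap: the model stops when it hits the limit rather than writing a shorter answer. One way to catch this is to inspect the response's `stop_reason`, which the Messages API sets to `"max_tokens"` when output was cut off. The helper name and the stand-in response objects below are illustrative.

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """True when generation stopped because it hit the max_tokens cap."""
    return response.stop_reason == "max_tokens"


# Stand-ins for real API responses (assumed shape: a `stop_reason` attribute)
complete = SimpleNamespace(stop_reason="end_turn")
cut_off = SimpleNamespace(stop_reason="max_tokens")

for name, resp in [("complete", complete), ("cut_off", cut_off)]:
    if was_truncated(resp):
        print(f"{name}: truncated, consider raising max_tokens or tightening the prompt")
    else:
        print(f"{name}: finished normally")
```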

Structured Output

Request structured output to reduce verbose explanations:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": """Analyze this code. Return JSON:
        {
            "issues": ["issue1", "issue2"],
            "severity": "high|medium|low",
            "suggestions": ["suggestion1", "suggestion2"]
        }"""
    }]
)
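A prompt-level JSON request is not a guarantee, so it helps to validate the reply before using it. The parser below is a sketch that checks for the three keys requested in the prompt above. `parse_review` is an illustrative name and the sample reply is made up for the example.

```python
import json


def parse_review(text: str) -> dict:
    """Extract and validate the JSON object from a model reply."""
    # Models sometimes wrap JSON in prose or code fences, so slice from the
    # first "{" to the last "}" before parsing
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(text[start:end + 1])

    missing = {"issues", "severity", "suggestions"} - data.keys()
    if missing:
        raise ValueError(f"reply is missing keys: {sorted(missing)}")
    if data["severity"] not in {"high", "medium", "low"}:
        raise ValueError(f"unexpected severity: {data['severity']!r}")
    return data


reply = (
    "Here is the analysis:\n"
    '{"issues": ["SQL injection"], "severity": "high", '
    '"suggestions": ["Use parameterized queries"]}'
)
print(parse_review(reply)["severity"])
```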

Streaming

Streaming doesn't reduce token costs, but it improves perceived latency:

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a function..."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

6. Context Management

Long conversations accumulate tokens quickly. Implement strategies to manage context efficiently.

Sliding Window

class ConversationManager:
    """Keeps the first message plus the most recent messages within a token budget."""

    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.messages = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._trim()

    def _trim(self):
        """Drop the oldest messages after index 0 until the history fits the budget."""
        total = self._count_tokens()
        while total > self.max_tokens and len(self.messages) > 2:
            # messages[0] is assumed to hold pinned context (e.g. the initial
            # instructions), so removal starts at index 1
            removed = self.messages.pop(1)
            total -= self._count_tokens([removed])

    def _count_tokens(self, messages=None):
        """Estimate token count at roughly 4 characters per token."""
        # Compare against None so an explicit empty list is not replaced
        msgs = self.messages if messages is None else messages
        return sum(len(m["content"]) // 4 for m in msgs)  # Rough estimate
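One caveat with dropping single messages is that the history can end up starting with an assistant turn. A variant that keeps dropping until a user message is first avoids this. This is a sketch under the same rough 4-characters-per-token estimate; `trim_turns` and `estimate_tokens` are illustrative names.

```python
def estimate_tokens(messages: list[dict]) -> int:
    """Rough token estimate: about 4 characters per token."""
    return sum(len(m["content"]) // 4 for m in messages)


def trim_turns(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest messages until the history fits, keeping a user turn first."""
    trimmed = list(messages)  # work on a copy so the caller's list is untouched
    while len(trimmed) > 1 and estimate_tokens(trimmed) > max_tokens:
        trimmed.pop(0)
        # Keep dropping until the history starts with a user message again
        while len(trimmed) > 1 and trimmed[0]["role"] != "user":
            trimmed.pop(0)
    return trimmed


history = [
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 400},
    {"role": "assistant", "content": "d" * 400},
    {"role": "user", "content": "e" * 400},
]
kept = trim_turns(history, max_tokens=300)
print([m["role"] for m in kept], estimate_tokens(kept))
```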

Conversation Summarization

For very long conversations, periodically summarize:

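import anthropic

def format_messages(messages: list) -> str:
    """Render messages as 'role: content' lines (minimal stand-in for the assumed helper)."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

def summarize_conversation(messages: list) -> tuple[str, list]:
    """Compress older turns into a summary; return (system_text, recent_messages)."""
    client = anthropic.Anthropic()

    summary_response = client.messages.create(
        model="claude-haiku-4-5",  # Use cheap model for summarization
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation in 50 words:\n{format_messages(messages[:-1])}"
        }]
    )

    summary = summary_response.content[0].text

    # The Messages API takes the system prompt as a top-level `system` parameter,
    # not as a "system" role inside `messages`, so the two parts are returned separately
    return (
        f"Previous conversation summary: {summary}",
        [messages[-1]],  # Keep last message for context
    )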

7. Semantic Caching

For applications with repetitive queries, implement semantic caching to avoid redundant API calls.

import anthropic
import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = {}
        self.threshold = similarity_threshold

    def get(self, query: str):
        """Find semantically similar cached response."""
        query_embedding = self.model.encode(query)

        # Linear scan over every cached entry; returns the first one above the threshold
        for cached_query, (cached_embedding, response) in self.cache.items():
            similarity = np.dot(query_embedding, cached_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(cached_embedding)
            )
            if similarity > self.threshold:
                return response

        return None

    def set(self, query: str, response: str):
        """Cache response with query embedding."""
        embedding = self.model.encode(query)
        self.cache[query] = (embedding, response)

# Usage
client = anthropic.Anthropic()
cache = SemanticCache()

def get_llm_response(query: str) -> str:
    # Check cache first ("is not None" so an empty cached reply still counts as a hit)
    cached = cache.get(query)
    if cached is not None:
        return cached

    # Call LLM if not cached
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    )

    result = response.content[0].text
    cache.set(query, result)
    return result
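Returning the first cached entry that clears the threshold does not necessarily return the closest one. A variant that picks the best match can be sketched as follows, using plain Python lists in place of real embeddings so it runs without a model. `best_match`, `cosine`, and the toy vectors are illustrative.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(query_vec, entries, threshold=0.95):
    """Return the response of the most similar entry above threshold, else None.

    `entries` is a list of (embedding, response) pairs.
    """
    best_score, best_response = threshold, None
    for embedding, response in entries:
        score = cosine(query_vec, embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response


entries = [
    ([1.0, 0.0, 0.0], "answer about billing"),
    ([0.9, 0.1, 0.0], "answer about invoices"),
    ([0.0, 1.0, 0.0], "answer about shipping"),
]
print(best_match([0.92, 0.08, 0.0], entries))  # both billing entries pass; invoices is closer
print(best_match([0.0, 0.0, 1.0], entries))    # nothing similar enough
```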

8. Monitoring and Cost Tracking

You can't optimize what you don't measure. Implement comprehensive token monitoring.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class TokenUsage:
    timestamp: datetime
    model: str
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    cache_write_tokens: int
    cost: float

class TokenMonitor:
    def __init__(self):
        self.usage_log = []

    def log(self, model: str, input_tokens: int, output_tokens: int,
            cache_read: int = 0, cache_write: int = 0):
        """Log token usage with cost calculation."""
        cost = self._calculate_cost(model, input_tokens, output_tokens, 
                                     cache_read, cache_write)

        usage = TokenUsage(
            timestamp=datetime.now(),
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cache_read_tokens=cache_read,
            cache_write_tokens=cache_write,
            cost=cost
        )
        self.usage_log.append(usage)
        return cost

    def _calculate_cost(self, model, input_tokens, output_tokens, 
                        cache_read, cache_write):
        """Calculate cost based on model pricing."""
        pricing = {
            "claude-opus-4-7": {"input": 5.0, "output": 25.0, "cache_read": 0.50},
            "claude-sonnet-4-6": {"input": 3.0, "output": 15.0, "cache_read": 0.30},
            "claude-haiku-4-5": {"input": 1.0, "output": 5.0, "cache_read": 0.10},
        }

        p = pricing.get(model, pricing["claude-sonnet-4-6"])

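        # `input_tokens` here is the TOTAL input, including cached tokens. Anthropic's
        # usage.input_tokens excludes cache reads and writes, so when logging from an
        # API response, pass input_tokens + cache_read + cache_write.
        uncached_input = input_tokens - cache_read - cache_write
        input_cost = (uncached_input * p["input"] +
                      cache_read * p["cache_read"] +
                      cache_write * p["input"] * 1.25) / 1_000_000  # 1.25x = 5-minute cache write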

        output_cost = output_tokens * p["output"] / 1_000_000

        return input_cost + output_cost

    def get_daily_summary(self):
        """Get daily cost summary."""
        today = datetime.now().date()
        today_usage = [u for u in self.usage_log if u.timestamp.date() == today]

        return {
            "total_cost": sum(u.cost for u in today_usage),
            "total_requests": len(today_usage),
            "total_input_tokens": sum(u.input_tokens for u in today_usage),
            "total_output_tokens": sum(u.output_tokens for u in today_usage),
            "cache_hit_rate": self._calculate_cache_hit_rate(today_usage)
        }

    def _calculate_cache_hit_rate(self, usage_list):
        """Calculate cache hit rate."""
        total_input = sum(u.input_tokens for u in usage_list)
        total_cache_read = sum(u.cache_read_tokens for u in usage_list)
        return total_cache_read / total_input if total_input > 0 else 0
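Once daily cost is tracked, a budget check can sit on top of it. The sketch below is one possible shape. `budget_status`, the 80% warning ratio, and the dollar figures are assumptions for illustration, and in practice the result would feed whatever alerting system you already use.

```python
def budget_status(spend: float, daily_budget: float, warn_ratio: float = 0.8) -> str:
    """Classify today's spend as 'ok', 'warn', or 'over' relative to the budget."""
    if daily_budget <= 0:
        raise ValueError("daily_budget must be positive")
    if spend >= daily_budget:
        return "over"
    if spend >= warn_ratio * daily_budget:
        return "warn"
    return "ok"


for spend in (10.0, 42.0, 55.0):
    print(f"${spend:.2f} of $50.00 -> {budget_status(spend, 50.0)}")
```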

9. Architecture Patterns

Pattern 1: Tiered Processing

User Request
    ↓
[Classifier] (Haiku - cheap)
    ├─ simple → [Simple Handler] (Haiku) → Response
    └─ complex → [Complex Handler] (Sonnet/Opus) → Response
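In code, this pattern can be as small as a dispatcher that takes the classifier and the two handlers as callables. This is a sketch that reads the diagram as a routing decision. The names `handle_request` and `toy_classifier` are illustrative, and the lambdas stand in for real LLM calls.

```python
from typing import Callable


def handle_request(
    request: str,
    classify: Callable[[str], str],
    simple_handler: Callable[[str], str],
    complex_handler: Callable[[str], str],
) -> str:
    """Send the request to the cheap handler unless the classifier flags it as complex."""
    label = classify(request)
    handler = complex_handler if label == "complex" else simple_handler
    return handler(request)


# Stand-in classifier; in a real system this would be a cheap LLM call
def toy_classifier(request: str) -> str:
    return "complex" if len(request.split()) > 8 else "simple"


reply = handle_request(
    "Reset my password",
    classify=toy_classifier,
    simple_handler=lambda r: f"[haiku] {r}",
    complex_handler=lambda r: f"[sonnet] {r}",
)
print(reply)
```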

Pattern 2: Cache Layer

User Request
    ↓
[Semantic Cache] → Cache Hit? → Return cached response
    ↓ Cache Miss
[Prompt Cache Layer] → Add cache_control markers
    ↓
[LLM API] → Response
    ↓
[Cache Storage] → Store for future

Pattern 3: Batch Pipeline

[Data Source]
    ↓
[Batch Collector] → Accumulate requests
    ↓
[Batch API] → Process asynchronously (50% discount)
    ↓
[Result Distributor] → Send results to users
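The collector stage of this pipeline can be sketched as a small buffer that submits once it is full. `BatchCollector` and its `submit` callback are illustrative; in a real pipeline `submit` would wrap the Batch API call, and `flush` would also run on a timer.

```python
from typing import Callable


class BatchCollector:
    """Accumulates requests and hands them to `submit` in fixed-size batches."""

    def __init__(self, submit: Callable[[list[dict]], None], batch_size: int = 100):
        self.submit = submit
        self.batch_size = batch_size
        self.pending: list[dict] = []

    def add(self, custom_id: str, prompt: str) -> None:
        self.pending.append({"custom_id": custom_id, "prompt": prompt})
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Submit whatever is pending, e.g. on a timer or at shutdown."""
        if self.pending:
            batch, self.pending = self.pending, []
            self.submit(batch)


submitted = []  # stand-in for a call to the Batch API
collector = BatchCollector(submit=submitted.append, batch_size=3)
for i in range(7):
    collector.add(f"review-{i}", f"Review code snippet {i}...")
collector.flush()  # send the final partial batch

print([len(b) for b in submitted])
```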

10. Real-World Case Study

Scenario: Customer support chatbot processing 5,000 conversations/day

Before optimization:

Model: Claude Sonnet 4.6 (fixed)
Average tokens: 3,000 input, 800 output per conversation
Daily cost: $105.00 (5,000 × (3,000 × $3.00 + 800 × $15.00) / 1M)
Monthly cost: ~$3,150

After optimization:

Model routing: 70% Haiku, 30% Sonnet
Prompt caching: 90% cache hit rate on system prompt
Output limits: Reduced average output to 400 tokens
Daily cost: $12.50
Monthly cost: ~$375

Total savings: 88%

11. Provider Agnostic Tips

When working with multiple LLM providers or switching between them:

Abstract your LLM layer: Use a unified interface that makes it easy to switch providers.
Test with multiple providers: Some tasks work equally well with cheaper providers.
Monitor provider-specific features: Prompt caching, batch processing, and pricing vary significantly.
Consider Chinese models: For cost-sensitive applications, Chinese models like DeepSeek and GLM offer significantly lower pricing. Services like Token China provide unified API access to these models with OpenAI-compatible endpoints—no Chinese phone number required, and you get 100K free tokens to start.
Negotiate volume discounts: For high-volume applications, contact providers directly for custom pricing.
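The first tip, abstracting the LLM layer, can be sketched with a small interface that application code depends on. Everything here is illustrative: `LLMProvider`, `EchoProvider`, and `answer` are hypothetical names, and the stand-in provider just echoes the prompt where a real adapter would call a provider SDK.

```python
from typing import Protocol


class LLMProvider(Protocol):
    """Minimal interface the rest of the application codes against."""

    def complete(self, prompt: str, max_tokens: int) -> str: ...


class EchoProvider:
    """Offline stand-in; a real adapter would wrap a provider SDK call here."""

    def __init__(self, name: str):
        self.name = name

    def complete(self, prompt: str, max_tokens: int) -> str:
        return f"[{self.name}] {prompt}"


def answer(provider: LLMProvider, question: str) -> str:
    # Application code depends only on the interface, so swapping providers
    # means constructing a different adapter, not editing call sites
    return provider.complete(question, max_tokens=256)


print(answer(EchoProvider("provider-a"), "Summarize this ticket"))
print(answer(EchoProvider("provider-b"), "Summarize this ticket"))
```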

12. Checklist

Before deploying to production, verify:

[ ] System prompts are optimized and use prompt caching
[ ] Model routing is implemented for different task types
[ ] Output length limits are set appropriately
[ ] Batch processing is used for async workloads
[ ] Token monitoring and alerting is in place
[ ] Semantic caching is implemented for repetitive queries
[ ] Conversation context is managed efficiently
[ ] Cost budgets and alerts are configured

Resources

Anthropic Prompt Caching Documentation
Anthropic Batch Processing Documentation
OpenAI Pricing
Google AI Pricing
Token China - Unified API for DeepSeek, GLM, and more (OpenAI-compatible)

TL;DR: Used prompt caching (90% savings on cached tokens), model routing (about 33% average savings on the example mix), batch processing (50% savings), and output optimization to cut LLM API costs by more than 80% in the case study. Consider Chinese models like DeepSeek for even cheaper alternatives.

