DEV Community

Jesse

How I Cut My LLM API Costs by 90%: A Battle-Tested Token Optimization Guide

Last Updated: May 2026

Audience: Backend engineers, ML engineers, and product developers building LLM-powered applications


Introduction

If you're building production LLM applications, you've probably watched your token costs spiral out of control faster than expected. A moderately successful chatbot can easily burn through $10,000/month, and a high-traffic API integration can hit six figures.

This guide covers battle-tested strategies for reducing token consumption without sacrificing output quality. These techniques are based on industry best practices and can be applied to any LLM provider.

What you'll learn:

  • How to reduce input token costs by 60-90% with prompt caching
  • When to use batch processing for 50% cost savings
  • How to choose the right model for each task
  • Architecture patterns that scale token efficiency

1. Understanding the Cost Structure

Before optimizing, you need to understand where your money goes.

Current Pricing (May 2026)

International Providers:

Provider  | Model             | Input (per MTok) | Output (per MTok) | Context Window
Anthropic | Claude Opus 4.7   | $5.00            | $25.00            | 1M tokens
Anthropic | Claude Sonnet 4.6 | $3.00            | $15.00            | 1M tokens
Anthropic | Claude Haiku 4.5  | $1.00            | $5.00             | 200K tokens
OpenAI    | GPT-4o            | $2.50            | $10.00            | 128K tokens
OpenAI    | GPT-4o-mini       | $0.15            | $0.60             | 128K tokens
Google    | Gemini 1.5 Pro    | $3.50            | $10.50            | 2M tokens

Chinese Providers:

Provider | Model             | Input (per MTok) | Output (per MTok) | Context Window
DeepSeek | DeepSeek V4 Pro   | $0.14            | $0.28             | 128K tokens
DeepSeek | DeepSeek V4 Flash | $0.07            | $0.14             | 128K tokens
Zhipu AI | GLM 5.1           | $0.14            | $0.28             | 128K tokens
Zhipu AI | GLM 5V Turbo      | $0.14            | $0.28             | 128K tokens

Key insight: Output tokens cost 2-5x as much as input tokens in the tables above (5x for the Claude models, 4x for OpenAI, 3x for Gemini, 2x for DeepSeek and GLM). Optimizing output length often has the highest ROI.

Cost Formula

Total Cost = (Input Tokens × Input Price) + (Output Tokens × Output Price)

For a typical chatbot conversation (2000 input tokens, 500 output tokens) using DeepSeek V4 Pro:

  • Input cost: 0.002 × $0.14 = $0.00028
  • Output cost: 0.0005 × $0.28 = $0.00014
  • Total: $0.00042 per conversation

At 10,000 conversations/day, that's $4.20/day or ~$126/month.

Cost comparison: The same conversation using Claude Sonnet 4.6 would cost $0.0135, about 32x the DeepSeek V4 Pro price.
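
The arithmetic above is easy to get wrong by a factor of 1,000, so it helps to keep it in one small function. This is a minimal sketch: `request_cost` is a name chosen for illustration, and the prices are simply copied from the tables in this section.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in USD for one request; prices are USD per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Prices copied from the tables above (USD per MTok).
deepseek = request_cost(2000, 500, input_price=0.14, output_price=0.28)
sonnet = request_cost(2000, 500, input_price=3.00, output_price=15.00)

print(f"DeepSeek V4 Pro:   ${deepseek:.5f} per conversation")
print(f"Claude Sonnet 4.6: ${sonnet:.5f} per conversation")
print(f"DeepSeek monthly at 10,000/day (30 days): ${deepseek * 10_000 * 30:.2f}")
```

Swapping in your own token counts and prices gives a quick per-conversation estimate before you commit to a model.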


2. Prompt Caching: The Highest-Impact Optimization

Prompt caching is the single most effective cost reduction technique available today. Both Anthropic and OpenAI now support it natively.

How It Works

When you send a request with prompt caching enabled, the provider caches the prefix of your prompt. Subsequent requests with the same prefix reuse the cached version, dramatically reducing both cost and latency.
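
To see why only the prefix matters, here is a toy model of prefix matching. It is not how any provider implements caching (real caches work on tokenized blocks and have minimum lengths); whole words stand in for tokens and `shared_prefix_len` is a hypothetical helper.

```python
def shared_prefix_len(cached: list[str], request: list[str]) -> int:
    """Number of leading tokens the new request shares with the cached prompt."""
    n = 0
    for a, b in zip(cached, request):
        if a != b:
            break
        n += 1
    return n


system = ["You", "are", "a", "code", "reviewer", "."]

# Static content first: the system prompt is a reusable prefix.
static_first_a = system + ["Review", "function", "A"]
static_first_b = system + ["Review", "function", "B"]

# Dynamic content first: the prompts diverge at the very first token.
dynamic_first_a = ["#17"] + system
dynamic_first_b = ["#18"] + system

print(shared_prefix_len(static_first_a, static_first_b))    # reusable tokens
print(shared_prefix_len(dynamic_first_a, dynamic_first_b))  # reusable tokens
```

Anything placed before the first changing token can be reused; anything after it cannot, no matter how similar it looks.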

Anthropic Prompt Caching

Anthropic's implementation offers 90% savings on cached input tokens:

Operation              | Price (Claude Sonnet 4.6)
Base input             | $3.00 / MTok
Cache write (5min TTL) | $3.75 / MTok (1.25x)
Cache write (1h TTL)   | $6.00 / MTok (2x)
Cache read             | $0.30 / MTok (0.1x)

Implementation:

import anthropic

client = anthropic.Anthropic()

# System prompt with cache control
system_prompt = [
    {
        "type": "text",
        "text": """You are an expert code reviewer. Follow these guidelines:
        - Focus on security vulnerabilities
        - Check for performance issues
        - Verify error handling
        - Suggest improvements with code examples

        [Your full system prompt here...]""",
        "cache_control": {"type": "ephemeral"}
    }
]

# First request: cache write
response1 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=system_prompt,
    messages=[{"role": "user", "content": "Review this Python function..."}]
)

# Subsequent requests: cache read (90% cheaper)
response2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=system_prompt,  # Same system prompt = cache hit
    messages=[{"role": "user", "content": "Review this other function..."}]
)

Real-world savings: A code review tool processing 1,000 requests/day with a 2,000-token system prompt:

  • Without caching: $6.00/day for system prompt tokens
  • With caching: $0.60/day for system prompt tokens
  • Savings: $5.40/day ($162/month)
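
The figures above treat every request as a cache read. A slightly fuller sketch also counts cache writes, which are billed at 1.25x with the 5-minute TTL. The function names and the "50 expirations per day" scenario are assumptions for illustration; the multipliers come from the pricing table above.

```python
WRITE_MULT = 1.25  # 5-minute TTL cache write, from the table above
READ_MULT = 0.10   # cache read, from the table above


def uncached_cost(requests: int, prompt_tokens: int, base_price: float) -> float:
    """Daily USD cost of the system prompt with no caching."""
    return requests * prompt_tokens * base_price / 1_000_000


def cached_cost(requests: int, prompt_tokens: int, base_price: float,
                writes: int) -> float:
    """`writes` requests pay the write rate; the rest pay the read rate."""
    reads = requests - writes
    per_token = base_price / 1_000_000
    return prompt_tokens * per_token * (writes * WRITE_MULT + reads * READ_MULT)


base = uncached_cost(1000, 2000, 3.00)
best = cached_cost(1000, 2000, 3.00, writes=1)
# Hypothetical: traffic is bursty enough that the cache expires 50 times a day.
bursty = cached_cost(1000, 2000, 3.00, writes=50)

print(f"No caching:        ${base:.2f}/day")
print(f"One cache write:   ${best:.2f}/day")
print(f"Fifty cache writes: ${bursty:.2f}/day")
```

Even with frequent expirations the cached cost stays far below the uncached one, because a single write is paid back by the first two reads.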

Automatic Caching (New in 2026)

Anthropic now supports automatic caching for multi-turn conversations:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},  # Auto-cache last block
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi! How can I help?"},
        {"role": "user", "content": "What's the weather?"}
    ]
)

The cache point automatically moves forward as conversations grow. No manual breakpoint management needed.

OpenAI Prompt Caching

OpenAI automatically caches prompts longer than 1,024 tokens (for most models). Cached input tokens are billed at 50% of the standard rate.

from openai import OpenAI

client = OpenAI()

# OpenAI automatically caches long prompts
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Your long system prompt here..."},
        {"role": "user", "content": "Your query here"}
    ]
)

No code changes required—caching happens automatically.

Best Practices for Prompt Caching

  1. Place static content first: System prompts, tool definitions, and context should come before dynamic content.

  2. Use explicit breakpoints strategically: For multi-section prompts, place cache_control on sections that change at different frequencies.

  3. Pre-warm caches: Send a "warmup" request before users arrive to eliminate first-request latency.

# Pre-warm cache before users arrive
prewarm = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1,  # Minimal output; the reply is discarded
    system=[
        {
            "type": "text",
            "text": "Your system prompt...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "warmup"}]
)
  4. Monitor cache hit rates: Track cache_read_input_tokens and cache_creation_input_tokens in API responses.
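
A per-response hit rate can be derived from those usage fields. This sketch assumes that `input_tokens` counts only the uncached part of the prompt, so the total prompt size is the sum of all three fields; `cache_hit_rate` is a hypothetical helper and the `SimpleNamespace` object stands in for `response.usage`.

```python
from types import SimpleNamespace


def cache_hit_rate(usage) -> float:
    """Fraction of prompt tokens served from cache for one response."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = read + written + uncached
    return read / total if total else 0.0


# Stand-in for `response.usage`; the numbers are made up for illustration.
usage = SimpleNamespace(input_tokens=50,
                        cache_creation_input_tokens=0,
                        cache_read_input_tokens=1950)
print(f"{cache_hit_rate(usage):.1%}")
```

A rate that stays near zero usually means the prefix is changing between requests or the prompt is below the provider's minimum cacheable length.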

3. Model Selection: Right-Size for the Task

Not every task needs the most expensive model. Implement a routing system that matches tasks to appropriate models.

Task-Based Routing

import anthropic

def route_task(task_type: str, complexity: int) -> str:
    """Route tasks to appropriate models based on type and complexity."""

    routing_table = {
        # Simple tasks: use cheapest model
        "classification": {"low": "claude-haiku-4-5", "high": "claude-sonnet-4-6"},
        "summarization": {"low": "claude-haiku-4-5", "high": "claude-sonnet-4-6"},
        "translation": {"low": "claude-haiku-4-5", "high": "claude-sonnet-4-6"},

        # Complex tasks: use capable model
        "code_generation": {"low": "claude-sonnet-4-6", "high": "claude-opus-4-7"},
        "reasoning": {"low": "claude-sonnet-4-6", "high": "claude-opus-4-7"},
        "analysis": {"low": "claude-sonnet-4-6", "high": "claude-opus-4-7"},
    }

    complexity_level = "high" if complexity > 7 else "low"
    return routing_table.get(task_type, {}).get(complexity_level, "claude-sonnet-4-6")


def call_llm(prompt: str, task_type: str, complexity: int):
    """Call LLM with appropriate model based on task."""
    client = anthropic.Anthropic()
    model = route_task(task_type, complexity)

    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )

    return response.content[0].text

Cost Comparison

For a typical application with mixed tasks:

Task Distribution | Fixed (Sonnet) | Routed (Mixed) | Savings
60% simple tasks  | $3.00/MTok     | $1.00/MTok     | 67%
30% medium tasks  | $3.00/MTok     | $3.00/MTok     | 0%
10% complex tasks | $3.00/MTok     | $5.00/MTok     | -67%
Weighted average  | $3.00/MTok     | $2.00/MTok     | 33%
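
The weighted average row can be recomputed from the per-tier input prices, which makes it easy to test other traffic mixes. This is a small sketch; the `tiers` dictionary just restates the shares and prices from the table.

```python
# (share of traffic, routed input price in USD per MTok), from the table above
tiers = {
    "simple": (0.60, 1.00),
    "medium": (0.30, 3.00),
    "complex": (0.10, 5.00),
}
FIXED_PRICE = 3.00  # everything on Sonnet

blended = sum(share * price for share, price in tiers.values())
savings = 1 - blended / FIXED_PRICE

print(f"Blended input price: ${blended:.2f}/MTok")
print(f"Savings vs fixed Sonnet: {savings:.0%}")
```

The result is sensitive to the share of complex tasks: every extra point of traffic sent to Opus costs five times what a point sent to Haiku does.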

4. Batch Processing: 50% Savings for Async Workloads

For tasks that don't require immediate responses, batch processing offers 50% cost savings.

Anthropic Message Batches API

import anthropic
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

client = anthropic.Anthropic()

# Create batch
batch = client.messages.batches.create(
    requests=[
        Request(
            custom_id=f"review-{i}",
            params=MessageCreateParamsNonStreaming(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=[{"role": "user", "content": f"Review code snippet {i}..."}]
            )
        )
        for i in range(100)  # 100 requests in one batch
    ]
)

# Poll for results
import time
while True:
    batch_status = client.messages.batches.retrieve(batch.id)
    if batch_status.processing_status == "ended":
        break
    time.sleep(60)

# Process results
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text}")

Batch Pricing (50% Discount)

Model             | Standard Input | Batch Input | Standard Output | Batch Output
Claude Opus 4.7   | $5.00          | $2.50       | $25.00          | $12.50
Claude Sonnet 4.6 | $3.00          | $1.50       | $15.00          | $7.50
Claude Haiku 4.5  | $1.00          | $0.50       | $5.00           | $2.50

When to use batch processing:

  • Large-scale evaluations
  • Content moderation
  • Data analysis
  • Bulk content generation
  • Code review pipelines

Combining batch + caching: You can stack batch processing with prompt caching for up to 95% savings on input tokens (50% batch + 90% cache read).
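
The 95% figure comes from multiplying the two discounts rather than adding them. A minimal sketch, with `effective_input_price` as a hypothetical helper and the multipliers taken from the pricing tables above:

```python
BATCH_MULT = 0.50       # batch discount
CACHE_READ_MULT = 0.10  # cache read rate relative to base input


def effective_input_price(base_price: float, batch: bool, cache_read: bool) -> float:
    """Input price per MTok after applying whichever discounts are active."""
    price = base_price
    if batch:
        price *= BATCH_MULT
    if cache_read:
        price *= CACHE_READ_MULT
    return price


for batch in (False, True):
    for cached in (False, True):
        p = effective_input_price(3.00, batch, cached)
        print(f"batch={batch!s:5} cached={cached!s:5} -> ${p:.2f}/MTok")
```

With both discounts active, Sonnet input drops from $3.00 to $0.15 per MTok, which is 5% of the base price.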


5. Output Optimization

Since output tokens cost several times as much as input tokens (5x on the Claude models used here), optimizing output length has high ROI.

Limit Output Length

# Explicit token limit
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,  # Limit output
    messages=[{"role": "user", "content": "Summarize this article in 3 bullet points."}]
)

# Prompt-based length control
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Explain quantum computing. Keep it under 100 words."
    }]
)
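
A hard max_tokens cap cuts the reply off mid-sentence when it is reached, so it is worth checking for truncation. The Messages API reports this through the response's `stop_reason` field; `was_truncated` below is a hypothetical helper, and the `SimpleNamespace` objects stand in for real responses.

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """True when generation stopped because it hit the max_tokens cap."""
    return getattr(response, "stop_reason", None) == "max_tokens"


# Stand-ins for real API responses, for illustration only.
complete = SimpleNamespace(stop_reason="end_turn")
cut_off = SimpleNamespace(stop_reason="max_tokens")

print(was_truncated(complete))
print(was_truncated(cut_off))
```

If truncation shows up often, tighten the prompt-based length instruction rather than only raising the cap.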

Structured Output

Request structured output to reduce verbose explanations:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": """Analyze this code. Return JSON:
        {
            "issues": ["issue1", "issue2"],
            "severity": "high|medium|low",
            "suggestions": ["suggestion1", "suggestion2"]
        }"""
    }]
)
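
Prompt-only JSON requests are not guaranteed to come back as clean JSON, so parse defensively. This is a sketch under that assumption; `parse_review` is a hypothetical helper and the sample replies are made up.

```python
import json
from typing import Optional


def parse_review(raw: str) -> Optional[dict]:
    """Parse the JSON object spanning the first '{' to the last '}', else None."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


reply = ('Here is the analysis:\n'
         '{"issues": ["SQL injection"], "severity": "high", "suggestions": []}')
print(parse_review(reply))
print(parse_review("Sorry, I cannot analyze this."))
```

Returning None on failure lets the caller decide whether to retry, which matters because each retry is another paid request.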

Streaming

Streaming doesn't reduce token costs, but it improves perceived latency:

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a function..."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

6. Context Management

Long conversations accumulate tokens quickly. Implement strategies to manage context efficiently.

Sliding Window

class ConversationManager:
    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.messages = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._trim()

    def _trim(self):
        """Keep system prompt + recent messages within token budget."""
        total = self._count_tokens()
        while total > self.max_tokens and len(self.messages) > 2:
            # Remove oldest message (preserve system prompt)
            removed = self.messages.pop(1)
            total -= self._count_tokens([removed])

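    def _count_tokens(self, messages=None):
        """Estimate token count (simplified)."""
        # Use an explicit None check so an empty list is counted as zero tokens
        msgs = self.messages if messages is None else messages
        return sum(len(m["content"]) // 4 for m in msgs)  # Rough estimate: ~4 characters per token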

Conversation Summarization

For very long conversations, periodically summarize:

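import anthropic


def format_messages(messages: list) -> str:
    """Render messages as plain 'role: content' lines for the summarizer."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


def summarize_conversation(messages: list) -> tuple[str, list]:
    """Compress a long conversation into (system_text, remaining_messages)."""
    client = anthropic.Anthropic()

    summary_response = client.messages.create(
        model="claude-haiku-4-5",  # Use cheap model for summarization
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation in 50 words:\n{format_messages(messages)}"
        }]
    )

    summary = summary_response.content[0].text

    # The Messages API accepts only "user" and "assistant" roles inside `messages`,
    # so the summary is returned as text for the top-level `system` parameter.
    return (
        f"Previous conversation summary: {summary}",
        [messages[-1]]  # Keep last message for context
    )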

7. Semantic Caching

For applications with repetitive queries, implement semantic caching to avoid redundant API calls.

import anthropic
import numpy as np
from sentence_transformers import SentenceTransformer

client = anthropic.Anthropic()

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = {}
        self.threshold = similarity_threshold

    def get(self, query: str):
        """Find semantically similar cached response."""
        query_embedding = self.model.encode(query)

        for cached_query, (cached_embedding, response) in self.cache.items():
            similarity = np.dot(query_embedding, cached_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(cached_embedding)
            )
            if similarity > self.threshold:
                return response

        return None

    def set(self, query: str, response: str):
        """Cache response with query embedding."""
        embedding = self.model.encode(query)
        self.cache[query] = (embedding, response)

# Usage
cache = SemanticCache()

def get_llm_response(query: str) -> str:
    # Check cache first
    cached = cache.get(query)
    if cached is not None:  # An empty string is still a valid cached response
        return cached

    # Call LLM if not cached
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    )

    result = response.content[0].text
    cache.set(query, result)
    return result
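
The lookup above returns the first cached entry that clears the threshold. A variation is to return the closest one instead. The sketch below uses plain Python lists and made-up 3-dimensional vectors so it runs without an embedding model; `cosine` and `best_match` are hypothetical helpers, not part of any library.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(query_vec, entries, threshold=0.95):
    """Return the cached response with the highest similarity above threshold."""
    best_score, best_response = threshold, None
    for vec, response in entries:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response


# Toy 3-dimensional "embeddings"; real ones come from an embedding model.
entries = [
    ([1.0, 0.0, 0.0], "answer about refunds"),
    ([0.0, 1.0, 0.0], "answer about shipping"),
]
print(best_match([0.99, 0.05, 0.0], entries))  # close to the first entry
print(best_match([0.6, 0.6, 0.5], entries))    # close to neither
```

The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask, too high and the cache rarely hits.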

8. Monitoring and Cost Tracking

You can't optimize what you don't measure. Implement comprehensive token monitoring.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class TokenUsage:
    timestamp: datetime
    model: str
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    cache_write_tokens: int
    cost: float

class TokenMonitor:
    def __init__(self):
        self.usage_log = []

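    def log(self, model: str, input_tokens: int, output_tokens: int,
            cache_read: int = 0, cache_write: int = 0):
        """Log token usage with cost calculation.

        input_tokens is the total prompt size, including cached tokens.
        Anthropic's usage.input_tokens excludes cache reads and writes, so
        add those fields to it before calling.
        """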
        cost = self._calculate_cost(model, input_tokens, output_tokens, 
                                     cache_read, cache_write)

        usage = TokenUsage(
            timestamp=datetime.now(),
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cache_read_tokens=cache_read,
            cache_write_tokens=cache_write,
            cost=cost
        )
        self.usage_log.append(usage)
        return cost

    def _calculate_cost(self, model, input_tokens, output_tokens, 
                        cache_read, cache_write):
        """Calculate cost based on model pricing."""
        pricing = {
            "claude-opus-4-7": {"input": 5.0, "output": 25.0, "cache_read": 0.50},
            "claude-sonnet-4-6": {"input": 3.0, "output": 15.0, "cache_read": 0.30},
            "claude-haiku-4-5": {"input": 1.0, "output": 5.0, "cache_read": 0.10},
        }

        p = pricing.get(model, pricing["claude-sonnet-4-6"])

        # Uncached input tokens
        uncached_input = input_tokens - cache_read - cache_write
        input_cost = (uncached_input * p["input"] + 
                      cache_read * p["cache_read"] + 
                      cache_write * p["input"] * 1.25) / 1_000_000

        output_cost = output_tokens * p["output"] / 1_000_000

        return input_cost + output_cost

    def get_daily_summary(self):
        """Get daily cost summary."""
        today = datetime.now().date()
        today_usage = [u for u in self.usage_log if u.timestamp.date() == today]

        return {
            "total_cost": sum(u.cost for u in today_usage),
            "total_requests": len(today_usage),
            "total_input_tokens": sum(u.input_tokens for u in today_usage),
            "total_output_tokens": sum(u.output_tokens for u in today_usage),
            "cache_hit_rate": self._calculate_cache_hit_rate(today_usage)
        }

    def _calculate_cache_hit_rate(self, usage_list):
        """Calculate cache hit rate."""
        total_input = sum(u.input_tokens for u in usage_list)
        total_cache_read = sum(u.cache_read_tokens for u in usage_list)
        return total_cache_read / total_input if total_input > 0 else 0
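
A daily summary becomes more useful once it is compared against a budget. The sketch below is one simple way to do that; `check_budget`, the 80% warning ratio, and the dollar amounts are all assumptions for illustration.

```python
def check_budget(daily_cost: float, daily_budget: float,
                 warn_ratio: float = 0.8) -> str:
    """Classify spend as 'ok', 'warning', or 'exceeded' against a daily budget."""
    if daily_cost >= daily_budget:
        return "exceeded"
    if daily_cost >= warn_ratio * daily_budget:
        return "warning"
    return "ok"


# Made-up spend figures checked against a hypothetical $50/day budget.
for spend in (12.50, 42.00, 55.00):
    print(f"${spend:.2f} -> {check_budget(spend, daily_budget=50.0)}")
```

In practice the first argument would be the "total_cost" value from the monitor's daily summary, checked on a schedule.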

9. Architecture Patterns

Pattern 1: Tiered Processing

User Request
    ↓
[Classifier] (Haiku - cheap)
    ├─ simple  → [Simple Handler] (Haiku) → Response
    └─ complex → [Complex Handler] (Sonnet/Opus) → Response
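
In code, the tiered pattern reduces to a classifier plus two interchangeable handlers. This is a sketch with stub components: `handle_request` and `toy_classifier` are hypothetical, and the word-count rule only stands in for a real (cheap) classification call.

```python
from typing import Callable


def handle_request(text: str,
                   classify: Callable[[str], str],
                   simple_handler: Callable[[str], str],
                   complex_handler: Callable[[str], str]) -> str:
    """Send the request to the cheap handler unless the classifier says 'complex'."""
    label = classify(text)
    handler = complex_handler if label == "complex" else simple_handler
    return handler(text)


# Stub components for illustration; in practice each would wrap an LLM call.
def toy_classifier(text: str) -> str:
    return "complex" if len(text.split()) > 20 else "simple"


def cheap(text: str) -> str:
    return f"[haiku] {text}"


def capable(text: str) -> str:
    return f"[sonnet] {text}"


print(handle_request("Reset my password", toy_classifier, cheap, capable))
```

Passing the handlers in as arguments keeps the routing logic testable without any API calls.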

Pattern 2: Cache Layer

User Request
    ↓
[Semantic Cache] → Cache Hit? → Return cached response
    ↓ Cache Miss
[Prompt Cache Layer] → Add cache_control markers
    ↓
[LLM API] → Response
    ↓
[Cache Storage] → Store for future

Pattern 3: Batch Pipeline

[Data Source]
    ↓
[Batch Collector] → Accumulate requests
    ↓
[Batch API] → Process asynchronously (50% discount)
    ↓
[Result Distributor] → Send results to users
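
The collector stage can be as small as a list with a size trigger. In this sketch `BatchCollector` is a hypothetical class and `submit` is whatever callable hands a list of requests to the batch API; here it just appends to a local list so the example runs offline.

```python
from typing import Callable


class BatchCollector:
    """Accumulate requests and hand them off in groups of `batch_size`."""

    def __init__(self, batch_size: int, submit: Callable[[list], None]):
        self.batch_size = batch_size
        self.submit = submit
        self.pending: list = []

    def add(self, request) -> None:
        self.pending.append(request)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Submit whatever is pending (call this on a timer or at shutdown)."""
        if self.pending:
            self.submit(self.pending)
            self.pending = []


submitted = []
collector = BatchCollector(batch_size=3, submit=submitted.append)
for i in range(7):
    collector.add({"custom_id": f"review-{i}"})
collector.flush()  # push the final partial batch

print([len(batch) for batch in submitted])
```

A production version would also flush on a timer so that a slow trickle of requests does not wait indefinitely for the batch to fill.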

10. Real-World Case Study

Scenario: Customer support chatbot processing 5,000 conversations/day

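Before optimization:

  • Model: Claude Sonnet 4.6 (fixed)
  • Average tokens: 3,000 input, 800 output per conversation
  • Cost per conversation: (3,000 × $3.00 + 800 × $15.00) / 1,000,000 = $0.021
  • Daily cost: $105.00
  • Monthly cost: ~$3,150

After optimization:

  • Model routing: 70% Haiku, 30% Sonnet
  • Prompt caching: 90% cache hit rate on system prompt
  • Output limits: Reduced average output to 400 tokens
  • Daily cost: $40.00 from routing and output limits alone (5,000 × $0.008 blended). Prompt caching lowers this further, in proportion to how much of the 3,000 input tokens is system prompt, down to a floor of $16.00 set by output tokens
  • Monthly cost: ~$1,200 before caching savings

Total savings: at least 62%, with a ceiling of about 85% as the cached share of the input grows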


11. Provider Agnostic Tips

When working with multiple LLM providers or switching between them:

  1. Abstract your LLM layer: Use a unified interface that makes it easy to switch providers.

  2. Test with multiple providers: Some tasks work equally well with cheaper providers.

  3. Monitor provider-specific features: Prompt caching, batch processing, and pricing vary significantly.

  4. Consider Chinese models: For cost-sensitive applications, Chinese models like DeepSeek and GLM offer significantly lower pricing. Services like Token China provide unified API access to these models with OpenAI-compatible endpoints—no Chinese phone number required, and you get 100K free tokens to start.

  5. Negotiate volume discounts: For high-volume applications, contact providers directly for custom pricing.
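
For the first tip, a thin interface is usually enough. The sketch below uses `typing.Protocol`; `LLMProvider`, `Completion`, and `EchoProvider` are hypothetical names, and the echo provider exists only so the example runs without network access.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Completion:
    text: str
    input_tokens: int
    output_tokens: int


class LLMProvider(Protocol):
    def complete(self, prompt: str, max_tokens: int) -> Completion: ...


class EchoProvider:
    """Stand-in provider for tests; a real one would wrap a vendor SDK."""

    def complete(self, prompt: str, max_tokens: int) -> Completion:
        words = prompt.split()[:max_tokens]
        return Completion(" ".join(words), len(prompt.split()), len(words))


def answer(provider: LLMProvider, prompt: str) -> str:
    # Application code depends only on the LLMProvider interface.
    return provider.complete(prompt, max_tokens=256).text


print(answer(EchoProvider(), "hello world"))
```

Returning token counts in the common `Completion` type keeps cost tracking provider-independent as well.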


12. Checklist

Before deploying to production, verify:

  • [ ] System prompts are optimized and use prompt caching
  • [ ] Model routing is implemented for different task types
  • [ ] Output length limits are set appropriately
  • [ ] Batch processing is used for async workloads
  • [ ] Token monitoring and alerting is in place
  • [ ] Semantic caching is implemented for repetitive queries
  • [ ] Conversation context is managed efficiently
  • [ ] Cost budgets and alerts are configured


TL;DR: Used prompt caching (90% savings on cached tokens), model routing (33% average savings in the example mix), batch processing (50% savings), and output optimization to reduce LLM API costs by 60-85% in the case study. Consider Chinese models like DeepSeek for even cheaper alternatives.


