Originally published on ScaleMind Blog
An AI gateway is the fastest path to LLM cost optimization, but you can start cutting costs in minutes with these five tested strategies for 2026.
TL;DR
If you're spending over $1k/month on LLM APIs without optimization, you're overpaying by at least 40%. This guide covers five strategies you can implement today:
| Strategy | Time to Implement | Expected Savings |
|---|---|---|
| Prompt Caching | 10 minutes | 50-90% on cached tokens |
| Model Routing | 2-4 hours | 20-60% |
| Semantic Caching | 1-2 hours | 15-30% |
| Batch Processing | 30 minutes | 50% on async workloads |
| AI Gateway | 5 minutes | 40-70% combined |
Why Are LLM Costs Spiraling Out of Control?
LLM costs scale linearly with usage, but most teams don't notice the bleeding until it's too late. The problem isn't model pricing; it's routing every request to the most expensive model regardless of task complexity.
Here's the current cost landscape for 1 million input tokens (December 2025):
| Model Family | Model | Cost / 1M Input | Best For |
|---|---|---|---|
| Frontier | GPT-4o | $2.50 | Complex reasoning, coding |
| Frontier | Claude 3.5 Sonnet | $3.00 | Nuanced writing, RAG |
| Efficient | GPT-4o Mini | $0.15 | Summarization, classification |
| Efficient | Claude 3.5 Haiku | $0.80 | Speed-critical tasks |
| Ultra-Low | Gemini 1.5 Flash-8B | $0.0375 | Bulk processing, extraction |
Sources: OpenAI Pricing, Anthropic Pricing, Google AI Pricing
The math is brutal: processing 100M tokens/month with Claude 3.5 Sonnet costs ~$300 in input tokens alone. Route 50% of that traffic to Gemini 1.5 Flash-8B and that portion drops from $150 to $1.88. The savings compound fast.
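A quick way to sanity-check the math for your own traffic, as a minimal sketch using the December 2025 input prices from the table above (swap in your own token volumes and routed fraction):

```python
# Input-token prices per 1M tokens (December 2025, from the table above)
PRICE_SONNET = 3.00      # Claude 3.5 Sonnet
PRICE_FLASH_8B = 0.0375  # Gemini 1.5 Flash-8B

monthly_tokens_m = 100   # 100M input tokens per month
routed_fraction = 0.5    # share of traffic moved to the cheaper model

before = monthly_tokens_m * PRICE_SONNET
after = (monthly_tokens_m * (1 - routed_fraction) * PRICE_SONNET
         + monthly_tokens_m * routed_fraction * PRICE_FLASH_8B)

print(f"Before:  ${before:.2f}/month")            # $300.00
print(f"After:   ${after:.2f}/month")             # $151.88
print(f"Savings: {(1 - after / before):.0%}")     # 49%
```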
This guide is for engineering teams ready to stop the bleeding today, not next quarter.
How Does Prompt Caching Reduce LLM Costs?
Prompt caching stores frequently-used context (system prompts, RAG documents, few-shot examples) so you don't pay full price every time you resend them. Both OpenAI and Anthropic offer native prompt caching with substantial discounts.
When it works best:
- Chatbots - System prompt + conversation history sent repeatedly
- RAG applications - Same documents analyzed across multiple queries
- Coding assistants - Full codebase context included with every request
Implementation: Anthropic Prompt Caching
```python
import anthropic

client = anthropic.Anthropic()
# large_context and query are placeholders for your long document and the user's question

# Before: paying full price for the same 10k-token context every request
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system="You are a legal analyst assistant. " + large_context,
    messages=[{"role": "user", "content": query}]
)

# After: ~90% discount on cached tokens
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal analyst assistant.",
            "cache_control": {"type": "ephemeral"}  # Cache this block
        },
        {
            "type": "text",
            "text": large_context,  # Your 10k-token document
            "cache_control": {"type": "ephemeral"}  # Cached at ~90% discount
        }
    ],
    messages=[{"role": "user", "content": query}]
)

# Check your savings
print(f"Cache creation: {response.usage.cache_creation_input_tokens} tokens")
print(f"Cache read: {response.usage.cache_read_input_tokens} tokens (90% off)")
```
Savings breakdown:
| Provider | Cache Discount | Notes |
|---|---|---|
| Anthropic | ~90% on reads | Requires explicit cache_control blocks |
| OpenAI | ~50% on reads | Automatic for prompts > 1,024 tokens |
Time to implement: 10 minutes for existing prompts.
See the Anthropic Prompt Caching docs for minimum token requirements and TTL details.
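On the OpenAI side caching is automatic, but it's worth confirming it's actually kicking in. A minimal sketch, assuming a recent openai Python SDK that exposes prompt_tokens_details in the usage object, and reusing the large_context and query placeholders from the snippet above:

```python
from openai import OpenAI

client = OpenAI()

# Keep the long, shared context at the start of the prompt; OpenAI caches
# the common prefix automatically once it exceeds ~1,024 tokens.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a legal analyst assistant. " + large_context},
        {"role": "user", "content": query},
    ],
)

usage = response.usage
cached = usage.prompt_tokens_details.cached_tokens  # 0 on the first call, then the cached prefix size
print(f"Prompt tokens: {usage.prompt_tokens}, cached (50% off): {cached}")
```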
How Does Model Routing Cut Costs Without Sacrificing Quality?
Model routing directs each request to the cheapest model capable of handling it. Using GPT-4o for password reset instructions is like hiring a PhD to answer the phone: expensive and unnecessary.
The routing principle:
| Task Type | Use This Model | Why |
|---|---|---|
| Creative writing, complex reasoning, coding | GPT-4o, Claude 3.5 Sonnet | Requires frontier intelligence |
| Summarization, classification, extraction | GPT-4o Mini, Haiku | 10-20x cheaper, same quality |
| Bulk data processing | Gemini Flash, open-source | Sub-penny per request |
Implementation: Basic Complexity Router
```python
from openai import OpenAI

client = OpenAI()

def classify_complexity(prompt: str) -> str:
    """
    Simple heuristic router. Production systems often use:
    - A small classifier model (BERT, DistilBERT)
    - Keyword/regex matching
    - Token count thresholds
    """
    complexity_signals = ["code", "reason", "analyze", "compare", "debug"]
    if len(prompt) > 2000:
        return "complex"
    if any(signal in prompt.lower() for signal in complexity_signals):
        return "complex"
    return "simple"

def route_request(prompt: str):
    complexity = classify_complexity(prompt)
    if complexity == "simple":
        # 94% cheaper than GPT-4o
        return client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}]
        )
    else:
        return client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )

# Real-world scenario:
# - 70% of requests are simple (FAQ, summaries, classification)
# - 30% require frontier models
#
# Cost calculation (per 1M input tokens):
# Before: 100% at $2.50 = $2.50 avg
# After:  (0.7 × $0.15) + (0.3 × $2.50) = $0.855 avg
# Savings: ~66%
```
Expected savings: 20-60% depending on your traffic mix. Most B2B apps see 60-70% of requests routable to efficient models.
What is Semantic Caching and How Does It Work?
Semantic caching uses vector embeddings to recognize that "How do I reset my password?" and "I forgot my password, help!" are the same question. Instead of hitting the LLM again, it returns the cached response at zero API cost.
Standard Redis caching only works on exact string matches. Semantic caching works on meaning.
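Before reaching for a framework, it helps to see how little is going on under the hood. A minimal in-memory sketch with hypothetical helper names, assuming the OpenAI embeddings API and numpy; a real deployment would use a vector store instead of a Python list:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
_cache: list[tuple[np.ndarray, str]] = []  # (prompt embedding, cached response)

def _embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cached_completion(prompt: str, threshold: float = 0.9) -> str:
    """Return a cached answer if a semantically similar prompt was seen before."""
    query_vec = _embed(prompt)
    for vec, answer in _cache:
        similarity = float(np.dot(query_vec, vec) /
                           (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        if similarity >= threshold:  # close enough in meaning: skip the LLM call
            return answer
    # Cache miss: pay for one LLM call, then store the result
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content
    _cache.append((query_vec, answer))
    return answer
```

This is essentially what the LangChain setup below does for you, with Redis as the vector store and the threshold expressed as a distance score.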
Implementation: LangChain + Redis
```python
from langchain.cache import RedisSemanticCache  # langchain_community.cache in newer LangChain versions
from langchain.globals import set_llm_cache
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini")

# One-time setup
set_llm_cache(RedisSemanticCache(
    redis_url="redis://localhost:6379",
    embedding=OpenAIEmbeddings(),
    score_threshold=0.2  # Lower = stricter matching
))

# First request: hits API, costs money, stores response
response_1 = llm.invoke("What's the refund policy?")

# Second request: semantically similar, returns from cache ($0)
response_2 = llm.invoke("How can I get my money back?")

# Third request: different enough, hits API
response_3 = llm.invoke("What products do you sell?")
```
When semantic caching shines:
- Customer support bots (20-40% query overlap typical)
- FAQ-style applications
- Search result explanations
- Any high-repetition use case
Expected savings: If 20% of your queries are semantically similar, you save 20% immediately. The embedding lookup cost is negligible (~$0.02/1M tokens with text-embedding-3-small).
For a deeper dive, see our Semantic Caching for LLMs guide.
When Should You Use Batch Processing for LLM Requests?
Batch processing is the right choice for any workload that doesn't need real-time responses. OpenAI's Batch API offers a flat 50% discount for requests that can wait up to 24 hours (jobs often finish well before the deadline).
Ideal batch candidates:
- Nightly content generation
- Sentiment analysis on yesterday's support tickets
- Bulk document summarization
- Evaluation and testing pipelines
Implementation: OpenAI Batch API
```python
from openai import OpenAI

client = OpenAI()

# Step 1: Create a JSONL file with requests
# batch_requests.jsonl:
# {"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize: ..."}]}}
# {"custom_id": "req-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize: ..."}]}}

# Step 2: Upload the file
batch_file = client.files.create(
    file=open("batch_requests.jsonl", "rb"),
    purpose="batch"
)

# Step 3: Create the batch job (50% discount tier)
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch {batch_job.id} created. Status: {batch_job.status}")

# GPT-4o batch pricing:
# - Standard: $2.50/1M input, $10.00/1M output
# - Batch:    $1.25/1M input, $5.00/1M output
# Savings: 50% flat
```
Expected savings: 50% on all eligible workloads. If 30% of your LLM usage is async, that's 15% off your total bill.
See OpenAI's Batch API cookbook for error handling and retrieval patterns.
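The cookbook covers error handling in depth, but the retrieval side is short enough to sketch here: poll the job, then download the output file once it completes. A minimal sketch, assuming the client and batch_job from the snippet above:

```python
import time

# Poll until the batch reaches a terminal state
while True:
    job = client.batches.retrieve(batch_job.id)
    if job.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)

if job.status == "completed":
    # Results come back as a JSONL file, one line per custom_id
    output = client.files.content(job.output_file_id)
    with open("batch_results.jsonl", "wb") as f:
        f.write(output.read())
```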
What is an AI Gateway and Why Does It Matter?
An AI gateway is a proxy layer between your application and LLM providers that handles routing, caching, fallbacks, and cost optimization automatically. Instead of implementing the four strategies above separately, a gateway gives you all of them out of the box.
What an AI gateway handles:
| Feature | DIY Effort | With Gateway |
|---|---|---|
| Model routing | Custom classifier + routing logic | Config file |
| Prompt caching | Provider-specific implementation | Automatic |
| Semantic caching | Redis + embeddings + maintenance | Built-in |
| Failover (OpenAI down → Anthropic) | Complex error handling | Automatic |
| Cost tracking | Custom logging + dashboards | Real-time UI |
The tradeoff: You're adding a dependency. The benefit is shipping faster and not maintaining infrastructure that isn't your core product.
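To make the failover row concrete, here is roughly what the DIY version looks like. A minimal sketch using the official OpenAI and Anthropic SDKs; real gateways layer retries, timeouts, and health checks on top:

```python
import anthropic
from openai import OpenAI, OpenAIError

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

def complete_with_fallback(prompt: str) -> str:
    try:
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
    except OpenAIError:
        # OpenAI outage or rate limit: fall back to Anthropic
        response = anthropic_client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
```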
```python
# Before: direct OpenAI call (no caching, no fallback, no cost tracking)
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

# After: ScaleMind gateway (same API, automatic optimization)
from scalemind import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # Routes to cheapest capable model
    messages=[{"role": "user", "content": prompt}]
)
# Automatic: caching, fallback to Anthropic if OpenAI fails, cost logging
```
Expected savings: 40-70% combined, depending on workload characteristics.
Building an AI-powered frontend? Tools like Forge can generate the UI in minutes while an AI gateway handles your backend cost optimization.
Read more: What is an AI Gateway? and AI Gateway vs API Gateway: What's the Difference?
Case Studies: Real Companies, Real Savings
Jellypod: 88% Cost Reduction
Problem: Jellypod converts newsletters into podcasts. They were using GPT-4 for every summarization task, burning cash as usage scaled.
Solution: Implemented model routing (Strategy 2) and fine-tuned a smaller Mistral model for their specific summarization task.
Result: Inference costs dropped from ~$10/1M tokens to ~$1.20/1M tokens, an 88% reduction without quality loss for their use case.
Supernormal: 80% Cost Reduction
Problem: Supernormal's AI meeting note-taker faced spiraling costs as user growth exploded.
Solution: Moved to specialized fine-tuned infrastructure, optimized prompt context length, and implemented intelligent routing.
Result: 80% cost reduction, enabling them to scale to thousands of daily meetings without linear cost growth.
Source: Confident AI Case Study
The 24-Hour Implementation Checklist
Here's your action plan, prioritized by effort-to-impact ratio.
Hours 0-2: Audit and Baseline
- [ ] Export usage logs from OpenAI/Anthropic dashboards
- [ ] Identify your top 3 most expensive prompts (longest or most frequent)
- [ ] Calculate current cost-per-user or cost-per-request baseline
Hours 2-4: Quick Wins (No Code)
- [ ] Move all background jobs to Batch API (50% savings, 30 min work)
- [ ] Switch obvious low-stakes features to `gpt-4o-mini`
- [ ] Review system prompts: can any be shortened?
Hours 4-8: Code Changes
- [ ] Enable prompt caching on all system prompts > 1,024 tokens (Anthropic) or let it auto-enable (OpenAI)
- [ ] Set up semantic caching with Redis if you use LangChain
Hours 8-24: Routing Infrastructure
- [ ] Build and deploy the `classify_complexity()` router
- [ ] Start at 30% of traffic routed to cheaper models and monitor quality (see the sketch after this checklist)
- [ ] Increase routing percentage as confidence grows
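The sketch referenced above: a hypothetical percentage gate in front of the `classify_complexity()` router from earlier, so only a slice of "simple" traffic goes to the cheaper model while you watch quality.

```python
import random

CHEAP_ROLLOUT = 0.30  # start by routing 30% of eligible traffic to the cheap model

def pick_model(prompt: str) -> str:
    # classify_complexity() is the heuristic router defined in the routing section
    if classify_complexity(prompt) == "simple" and random.random() < CHEAP_ROLLOUT:
        return "gpt-4o-mini"
    return "gpt-4o"

# Ratchet CHEAP_ROLLOUT toward 1.0 as quality metrics hold up.
```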
What Results Can You Expect?
| Optimization Level | Strategies Implemented | Typical Savings |
|---|---|---|
| Basic | Prompt caching + batch API | 30-40% |
| Intermediate | + Model routing | 50-60% |
| Advanced | + Semantic caching + AI gateway | 60-70% |
ROI example: A startup spending $5,000/month on LLM APIs implements basic + routing optimizations. At 50% savings, that's $30,000/year back in the budget-from one day of engineering work.
Key Takeaways
- Start with prompt caching. It's 10 minutes of work for immediate savings on any repeated context.
- Route by complexity. Most production traffic doesn't need GPT-4o. Build a simple classifier and start at 30% routing.
- Batch everything async. If it can wait 24 hours, it should use the Batch API (50% off).
- Semantic caching compounds. High-repetition use cases (support, FAQ, search) see 20%+ savings.
- Gateways handle the complexity. If you don't want to maintain routing/caching infrastructure, tools like ScaleMind handle it automatically.
The tools to cut your bill in half exist right now. You don't need to wait for GPT-5 to lower prices.
Try ScaleMind for automated cost optimization →
Resources
- OpenAI API Pricing (Official)
- Anthropic Prompt Caching Documentation
- OpenAI Batch API Cookbook
- ScaleMind: What is an AI Gateway?
- ScaleMind: AI Gateway vs API Gateway
- ScaleMind: Semantic Caching for LLMs
- DesignRevision: Building AI-Powered Applications
Try ScaleMind for automated cost optimization → scalemind.ai