If you're building AI-powered applications, you've probably noticed your API bills climbing faster than your user growth. With frontier models like Claude Opus 4.5 ($5/$25 per 1M tokens) and GPT-5.2 Pro ($21/$168 per 1M tokens), even moderate usage can cost thousands per month.
After analyzing production workloads from enterprise customers, we discovered that 30-43% of API costs stem from suboptimal routing and unnecessarily verbose prompts. Here's how we built an API middleware layer that eliminates this waste while maintaining 91.94% accuracy in task classification.
The Cost Problem
Let's look at a typical developer workflow:
// Common pattern: Send everything to the flagship model
const response = await anthropic.messages.create({
  model: "claude-opus-4.5",
  max_tokens: 4096,
  messages: [{
    role: "user",
    content: "Summarize this customer email..." // Simple task
  }]
});
Cost for 100 requests/day: ~$180/month
The issue? You're paying $25/M output tokens for a task that Claude Haiku ($5/M output) could handle equally well.
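For context, here's the back-of-the-envelope arithmetic behind that ~$180/month figure. The per-request token counts are assumptions for illustration; only the Opus 4.5 prices come from the numbers above:

// Rough cost model for 100 requests/day on Claude Opus 4.5 ($5/$25 per 1M tokens).
// Token counts per request are illustrative assumptions, not measurements.
const INPUT_PRICE_PER_M = 5;   // USD per 1M input tokens
const OUTPUT_PRICE_PER_M = 25; // USD per 1M output tokens

const requestsPerMonth = 100 * 30;    // 3,000 requests
const inputTokensPerRequest = 2_000;  // assumed
const outputTokensPerRequest = 2_000; // assumed

const costPerRequest =
  (inputTokensPerRequest / 1_000_000) * INPUT_PRICE_PER_M +
  (outputTokensPerRequest / 1_000_000) * OUTPUT_PRICE_PER_M; // ≈ $0.06

console.log((costPerRequest * requestsPerMonth).toFixed(2)); // ≈ 180.00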
The Three-Layer Architecture
We built Prompt Optimizer API as a transparent middleware layer that sits between your application and LLM providers. It operates on three levels:
Layer 1: Intelligent Caching (10% savings)
The first layer identifies duplicate or near-duplicate requests:
// Prompt Optimizer API automatically detects duplicates
const cachedResponse = await cache.lookup(
  hashPrompt(userMessage, { ignoreMinorVariations: true })
);

if (cachedResponse && cachedResponse.age < MAX_CACHE_AGE) {
  return cachedResponse; // Zero cost
}
How it works:
- Semantic hashing of prompts (not just string matching)
- TTL-based invalidation for time-sensitive content
- Automatic cache warming for common patterns
Real-world impact: Customer support applications with FAQ-style queries see 15-20% cache hit rates, translating to 10% cost reduction on average.
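For concreteness, here's a minimal sketch of the semantic lookup described above. The similarity threshold, TTL value, and the embed helper are assumptions about how such a cache could work, not the production implementation:

// Sketch of a semantic cache: normalize the prompt, embed it, and match stored
// entries by cosine similarity with a TTL check. Thresholds are illustrative.
interface CacheEntry {
  vector: number[];
  response: string;
  createdAt: number;
}

const SIMILARITY_THRESHOLD = 0.95;
const MAX_CACHE_AGE_MS = 3_600_000; // 1-hour TTL (assumed)

// Hypothetical embedding helper (e.g. a small local model or an embeddings API).
declare function embed(text: string): Promise<number[]>;

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function lookup(prompt: string, entries: CacheEntry[]): Promise<string | null> {
  // Normalization absorbs minor wording/whitespace variations before hashing/embedding.
  const normalized = prompt.trim().toLowerCase().replace(/\s+/g, ' ');
  const vector = await embed(normalized);
  const now = Date.now();
  for (const entry of entries) {
    const fresh = now - entry.createdAt < MAX_CACHE_AGE_MS;
    if (fresh && cosine(vector, entry.vector) >= SIMILARITY_THRESHOLD) {
      return entry.response; // near-duplicate hit, zero API cost
    }
  }
  return null;
}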
Layer 2: Tiered Model Routing (30-40% savings)
The core innovation is context detection. We trained a lightweight classifier (91.94% accuracy) that routes requests to the optimal model tier:
interface RoutingDecision {
  complexity: 'simple' | 'moderate' | 'complex';
  recommendedModel: string;
  confidenceScore: number;
}

const decision = await classifier.analyze(prompt);

const modelMap = {
  simple: 'claude-haiku-4.5',    // $1/$5 per 1M
  moderate: 'claude-sonnet-4.5', // $3/$15 per 1M
  complex: 'claude-opus-4.5'     // $5/$25 per 1M
};

const response = await llm.generate({
  model: modelMap[decision.complexity],
  prompt: prompt
});
Classification criteria:
- Token count and structural complexity
- Presence of reasoning keywords ("analyze", "evaluate", "design")
- Code generation vs. text generation
- Domain specificity (legal, medical, general)
Real-world impact: 30-40% of requests route to cheaper models, saving $50-80 per $200 baseline spend.
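The production classifier is a trained model, but a rough heuristic version of the same criteria might look like the sketch below. The thresholds, keyword lists, and chars-to-tokens ratio are illustrative assumptions:

// Heuristic stand-in for the trained classifier, covering the criteria listed above.
type Complexity = 'simple' | 'moderate' | 'complex';

const REASONING_KEYWORDS = /\b(analyze|evaluate|design|architect|prove|compare)\b/i;
const CODE_MARKERS = /\bfunction\b|\bclass\b|\bdef\b|\bimport\b|=>/;
const SPECIALIST_DOMAINS = /\b(legal|contract|compliance|diagnosis|medical)\b/i;

function classify(prompt: string): Complexity {
  const tokenEstimate = Math.ceil(prompt.length / 4); // rough chars-to-tokens ratio
  let score = 0;
  if (tokenEstimate > 1_500) score += 1;           // long, structurally complex input
  if (REASONING_KEYWORDS.test(prompt)) score += 1; // reasoning-heavy phrasing
  if (CODE_MARKERS.test(prompt)) score += 1;       // code generation or review
  if (SPECIALIST_DOMAINS.test(prompt)) score += 1; // domain-specific content
  if (score >= 3) return 'complex';
  if (score >= 1) return 'moderate';
  return 'simple';
}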
Layer 3: Prompt Optimization (savings on the remaining ~50%)
For requests that must go to flagship models, we optimize the prompt itself:
// Before optimization
const verbosePrompt = `
Please analyze this code and tell me what it does.
I need you to be very detailed and thorough.
Make sure you explain every part carefully.
${codeSnippet}
`;
// After optimization (automatic)
const optimizedPrompt = `Analyze this code:\n\n${codeSnippet}`;
Optimization techniques:
- Instruction compression: Remove redundant phrasing
- Context pruning: Strip unnecessary metadata
- Format standardization: Use efficient prompt templates
- Token-aware truncation: Smart context window management
Real-world impact: 20-30% reduction in prompt tokens on the roughly 50% of requests that still route to flagship models.
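A minimal sketch of instruction compression, assuming the filler phrases to strip are known ahead of time; the phrase list here is illustrative, not the production rule set:

// Strip common filler instructions and collapse whitespace before sending the prompt.
const FILLER_PATTERNS: RegExp[] = [
  /please\s+/gi,
  /i need you to be very detailed and thorough\.?/gi,
  /make sure you explain every part carefully\.?/gi,
  /as an ai language model,?/gi,
];

function compressInstructions(prompt: string): string {
  let out = prompt;
  for (const pattern of FILLER_PATTERNS) {
    out = out.replace(pattern, '');
  }
  // Collapse runs of spaces and excess blank lines left behind by the removals.
  return out.replace(/[ \t]+/g, ' ').replace(/\n{3,}/g, '\n\n').trim();
}

// Applied to the verbose example above, this yields roughly
// "analyze this code and tell me what it does." followed by the code snippet.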
Total Savings Calculation
Here's how the layers compound:
Baseline cost: $200/month
├─ 10% cached (free) → $20 saved
├─ 30-40% to cheaper models → $60-80 saved
└─ 50% optimized but still flagship → $6-12 saved (prompt-token reduction)
Total savings (conservative end of each range): $86/month (43%)
Final cost: $114/month
The Layer 3 figure is smaller in dollar terms than the raw 20-30% token reduction suggests because prompt optimization only trims input tokens, which are priced well below output tokens.
Integration Guide
Option 1: Drop-in Replacement (Simplest)
Replace your LLM SDK initialization:
// Before
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY
});

// After (with Prompt Optimizer)
import { PromptOptimizer } from '@promptoptimizer/sdk';

const anthropic = new PromptOptimizer({
  apiKey: process.env.PROMPT_OPTIMIZER_KEY,
  provider: 'anthropic',
  fallbackKey: process.env.ANTHROPIC_API_KEY
});

// Same API surface - no changes needed at call sites
const response = await anthropic.messages.create({
  model: "claude-opus-4.5", // May be downgraded automatically
  messages: [{ role: "user", content: "..." }]
});
Option 2: API Gateway Pattern (Enterprise)
Deploy as a reverse proxy:
# docker-compose.yml
services:
  prompt-optimizer:
    image: promptoptimizer/gateway:latest
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - CACHE_BACKEND=redis
      - CACHE_TTL=3600
    ports:
      - "8080:8080"

  redis:
    image: redis:7-alpine
    volumes:
      - cache-data:/data

volumes:
  cache-data:
Route all LLM traffic through the gateway:
// Configure SDK to use local gateway
const anthropic = new Anthropic({
  baseURL: 'http://localhost:8080/v1/anthropic',
  apiKey: process.env.ANTHROPIC_API_KEY
});
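Since the compose file above also configures an OpenAI key, the same pattern should extend to the OpenAI SDK; the /v1/openai path is an assumption that the gateway namespaces providers the same way as the Anthropic route shown above:

// Point the OpenAI SDK at the same local gateway (assumed /v1/openai route).
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'http://localhost:8080/v1/openai',
  apiKey: process.env.OPENAI_API_KEY
});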
Option 3: Kubernetes Sidecar (Cloud-Native)
apiVersion: v1
kind: Pod
metadata:
  name: ai-app
spec:
  containers:
    - name: app
      image: your-app:latest
      env:
        - name: LLM_ENDPOINT
          value: "http://localhost:8080"
    - name: prompt-optimizer
      image: promptoptimizer/sidecar:latest
      ports:
        - containerPort: 8080
      env:
        - name: UPSTREAM_PROVIDERS
          value: "anthropic,openai,google"
        - name: CACHE_MODE
          value: "distributed"
Monitoring and Observability
The system exposes metrics for cost tracking:
// Built-in analytics
const stats = await optimizer.getStats();

console.log(stats);
/*
{
  totalRequests: 10000,
  cacheHitRate: 0.12,
  routingBreakdown: {
    simple: 0.35,   // → Haiku
    moderate: 0.40, // → Sonnet
    complex: 0.25   // → Opus
  },
  costSavings: {
    baseline: 245.60,
    actual: 139.99,
    savedPercentage: 43.0
  }
}
*/
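Using the getStats() surface shown above, a simple cost alert takes only a few lines. The 30% threshold and hourly interval are arbitrary choices for illustration:

// Poll the optimizer's stats and warn when realized savings drop below a target.
const SAVINGS_TARGET = 0.30;             // alert if we save less than 30%
const POLL_INTERVAL_MS = 60 * 60 * 1000; // check hourly

setInterval(async () => {
  // `optimizer` is the same instance used in the stats example above
  const stats = await optimizer.getStats();
  const saved = stats.costSavings.savedPercentage / 100;
  if (saved < SAVINGS_TARGET) {
    console.warn(
      `Savings at ${(saved * 100).toFixed(1)}%, below the ${SAVINGS_TARGET * 100}% target`,
      stats.routingBreakdown
    );
  }
}, POLL_INTERVAL_MS);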
Use Cases
1. Customer Support Automation
// Route based on query complexity
const supportBot = new PromptOptimizer({
  provider: 'anthropic',
  routingStrategy: {
    faq: 'claude-haiku-4.5',       // Simple lookups
    triage: 'claude-sonnet-4.5',   // Classification
    escalation: 'claude-opus-4.5'  // Complex issues
  }
});
Typical savings: 45-50% (high FAQ volume)
2. CI/CD Code Review
// Optimize for batch processing
const codeReviewer = new PromptOptimizer({
  provider: 'openai',
  batchMode: true,
  caching: { enabled: true, ttl: 86400 }, // Cache by file hash
  routing: 'complexity-based'
});

for (const file of changedFiles) {
  await codeReviewer.review(file); // Smart routing per file
}
Typical savings: 35-40% (many simple linting-style reviews)
3. Multi-Model RAG Pipeline
// Use cheapest model for retrieval, flagship for synthesis
const rag = new PromptOptimizer({
  steps: [
    { task: 'embed', model: 'text-embedding-3-small' },
    { task: 'rerank', model: 'claude-haiku-4.5' },
    { task: 'synthesize', model: 'claude-opus-4.5' }
  ]
});
Typical savings: 40-45% (optimization at each stage)
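The call for executing a configured pipeline isn't shown above; a hypothetical invocation might look like the following, where rag.run, its argument shape, and knowledgeBaseChunks are assumptions for illustration:

// Hypothetical pipeline execution: embedding and reranking run on cheap models,
// only the final synthesis step touches the flagship model.
const answer = await rag.run({
  query: 'How do I rotate API keys without downtime?',
  documents: knowledgeBaseChunks, // pre-chunked corpus (assumed variable)
});

console.log(answer.text);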
Performance Characteristics
| Metric | Value |
|---|---|
| Routing latency overhead | 12-18ms (p95) |
| Classification accuracy | 91.94% |
| Cache hit rate (typical) | 10-15% |
| False downgrades | <3% (quality monitoring) |
Security and Privacy
- Zero data retention: Prompts are not logged or stored
- End-to-end encryption: TLS 1.3 for all traffic
- SOC 2 Type II compliant: Annual audits
- GDPR/CCPA ready: No PII processing
Cost Calculator
Want to estimate your savings? We built an interactive calculator as a Reddit Devvit app:
- Live demo: r/cost_calculator_dev
- Source code: GitHub
The calculator uses real-world pricing data updated weekly via automated Perplexity tasks.
Getting Started
# Install SDK
npm install @promptoptimizer/sdk
# Or use Docker
docker pull promptoptimizer/gateway:latest
# Self-hosted (open source core)
git clone https://github.com/promptoptimizer/core
cd core && docker-compose up
Pricing:
- Free tier: 10K requests/month
- Pro: $49/month (500K requests)
- Enterprise: Custom (self-hosted or dedicated)
Conclusion
By treating LLM API routing as a systems problem rather than a prompt engineering problem, we've achieved:
- 43% cost reduction for heavy users
- 30% savings for development teams
- 91.94% accuracy in task classification
- <20ms latency overhead
The three-layer architecture (caching, tiered routing, optimization) works because modern frontier models are often over-provisioned for the task at hand. A $25/M output token model is incredible for research and complex reasoning, but overkill for "summarize this email."
Smart routing isn't about sacrificing quality—it's about matching the right tool to the job.
About the Author: Built by the Prompt Optimizer team. We're ex-Google/Meta engineers obsessed with making AI infrastructure more efficient.
Learn more: promptoptimizer-blog.vercel.app
Try the cost calculator: reddit.com/r/cost_calculator_dev