# LLM Cost Optimizer
LLM API costs compound fast — a prototype that costs $5/day can become $500/day in production. This toolkit gives you the instrumentation and strategies to cut LLM spending by 40-70% without sacrificing output quality. Token usage tracking, intelligent model routing, semantic caching, batch processing, and budget alerts — all in one package.
## Key Features
- Token Usage Tracking — Instrument every LLM call with precise input/output token counts, costs, and latency per model, user, and feature
- Smart Model Routing — Automatically route simple queries to cheap models (GPT-4o-mini) and complex queries to powerful models (GPT-4o) based on task complexity scoring
- Semantic Caching — Cache responses by semantic similarity, not just exact match. "What's the weather in NYC?" and "NYC weather today?" hit the same cache entry
- Batch Processing — Queue non-urgent requests and process them in bulk at 50% lower cost using batch APIs
- Budget Alerting — Set daily/weekly/monthly spend limits with Slack/email notifications and automatic circuit breakers
- Prompt Compression — Automatically shorten prompts by removing redundant context while preserving meaning
- Cost Forecasting — Project future costs based on usage trends and planned feature launches
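To make the semantic-caching idea concrete, here is a minimal in-memory sketch (an illustration, not the toolkit's actual implementation): entries are matched by cosine similarity of their embeddings rather than exact string equality, so differently worded but equivalent queries can share a cache entry.

```python
import math

def cosine(a, b):
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TinySemanticCache:
    """Toy cache: return the stored response whose embedding is most
    similar to the query, if it clears the similarity threshold."""

    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, embedding):
        best_score, best_response = 0.0, None
        for cached_emb, response in self.entries:
            score = cosine(embedding, cached_emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, embedding, response):
        self.entries.append((embedding, response))
```

A query embedded close to a cached one (similarity above the threshold) returns the stored response; anything dissimilar falls through to a real API call.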
## Quick Start

```python
from cost_optimizer import CostTracker, ModelRouter, SemanticCache

# 1. Wrap your LLM client with cost tracking
tracker = CostTracker(
    storage="sqlite:///costs.db",
    alert_threshold_daily=50.00,  # Alert at $50/day
    alert_channel="slack",
)

# 2. Set up model routing
router = ModelRouter(
    rules=[
        {"complexity": "low", "model": "gpt-4o-mini", "max_tokens": 500},
        {"complexity": "medium", "model": "gpt-4o-mini", "max_tokens": 2000},
        {"complexity": "high", "model": "gpt-4o", "max_tokens": 4000},
    ],
    complexity_classifier="keyword",  # keyword | llm | embedding
)

# 3. Add semantic caching
cache = SemanticCache(
    backend="redis",
    embedding_model="text-embedding-3-small",
    similarity_threshold=0.92,
    ttl_seconds=86400,
)

# 4. Use in your application
@tracker.track(feature="customer_support")
@cache.cached()
def answer_question(question: str) -> str:
    model = router.select_model(question)
    response = llm_client.chat(model=model, messages=[{"role": "user", "content": question}])
    return response.content

answer = answer_question("What's your return policy?")
```
## Architecture

```
Application Request
        │
        ▼
┌───────────────┐
│ Cost Tracker  │──── Log request metadata (pre-call)
└───────┬───────┘
        ▼
┌───────────────┐
│ Semantic Cache│──── Hit? Return cached response
└───────┬───────┘
        │ Miss
        ▼
┌───────────────┐
│ Model Router  │──── Score complexity → select model
└───────┬───────┘
        ▼
┌────────────────┐
│ Prompt Compress│──── Shorten prompt if over budget
└───────┬────────┘
        │
        ├── Urgent ──▶ Direct API call
        │
        └── Non-urgent ──▶ Batch queue (50% cheaper)
                │
        ┌───────┘
        ▼
┌───────────────┐
│ Cost Tracker  │──── Log tokens, cost, latency (post-call)
└───────┬───────┘
        ▼
┌───────────────┐
│ Budget Check  │──── Over limit? Alert + circuit break
└───────────────┘
```
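The flow in the diagram can be sketched as a single request handler; the component interfaces below (`log_request`, `get`/`put`, `select_model`, `compress`) are hypothetical placeholders for illustration, not the toolkit's real API:

```python
def handle_request(prompt, *, tracker, cache, router, compressor, call_llm):
    """Sketch of the pipeline: log, check cache, route, compress, call, log."""
    tracker.log_request(prompt)          # pre-call metadata
    cached = cache.get(prompt)
    if cached is not None:
        return cached                    # cache hit: zero API cost
    model = router.select_model(prompt)  # complexity score -> model choice
    prompt = compressor.compress(prompt) # shrink oversized context
    response = call_llm(model, prompt)   # direct (or batched) API call
    tracker.log_response(response)       # tokens, cost, latency
    cache.put(prompt, response)
    return response
```

The key property is ordering: the cache sits before routing and compression, so a hit skips every downstream cost entirely.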
## Usage Examples

### Cost Dashboard Queries

```python
from cost_optimizer import CostTracker

tracker = CostTracker(storage="sqlite:///costs.db")

# Daily cost breakdown by model
daily = tracker.report(period="today", group_by="model")
for row in daily:
    print(f"{row.model}: {row.requests} reqs, {row.tokens:,} tokens, ${row.cost:.2f}")

# Weekly trend by feature
weekly = tracker.report(period="7d", group_by="feature")
for row in weekly:
    print(f"{row.feature}: ${row.cost:.2f} ({row.cost_change:+.1%} vs last week)")

# Identify the most expensive queries
expensive = tracker.top_queries(period="24h", limit=10, sort_by="cost")
for q in expensive:
    print(f"${q.cost:.3f} | {q.model} | {q.prompt_preview[:80]}...")
```
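Under the hood, per-call cost is simple arithmetic: input and output tokens are priced separately per million. A minimal sketch with illustrative prices (always check your provider's current rate card, since these numbers change):

```python
# Illustrative per-million-token prices in dollars -- not authoritative.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def call_cost(model, input_tokens, output_tokens, prices=PRICES):
    """Dollar cost of one call: tokens priced per million, input and
    output counted separately."""
    p = prices[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

This is also why routing matters: at these example rates, answering the same query on gpt-4o instead of gpt-4o-mini costs roughly 17x more per input token.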
### Batch Processing for Non-Urgent Requests

```python
from cost_optimizer.batch import BatchQueue

queue = BatchQueue(
    provider="openai",
    check_interval=300,  # Check for results every 5 minutes
    max_batch_size=1000,
)

# Queue requests (returns immediately)
job_ids = []
for question in faq_questions:
    job_id = queue.enqueue(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    job_ids.append(job_id)

# Process the batch (50% cheaper than individual calls)
results = queue.process_batch()
for job_id, response in zip(job_ids, results):
    print(f"{job_id}: {response.content[:100]}")
```
### Prompt Compression

```python
from cost_optimizer.compression import PromptCompressor

compressor = PromptCompressor(
    method="extractive",         # extractive | abstractive
    target_ratio=0.6,            # Reduce prompt to 60% of original length
    preserve_instructions=True,  # Never compress system instructions
)

original_prompt = "Here is the full context of the document..."  # ~2,000 tokens
compressed = compressor.compress(original_prompt)
print(f"Reduced from {compressed.original_token_count} to {compressed.token_count} tokens")
print(f"Estimated savings: ${compressed.cost_savings:.3f}")
```
## Configuration

```yaml
# cost_optimizer_config.yaml
tracking:
  storage: "sqlite:///costs.db"  # sqlite | postgres
  retention_days: 365
  log_prompts: false             # Don't store prompt text (privacy)

routing:
  classifier: "keyword"          # keyword | llm | embedding
  rules:
    - complexity: "low"
      keywords: ["simple", "yes/no", "list"]
      model: "gpt-4o-mini"
      max_tokens: 500
    - complexity: "high"
      keywords: ["analyze", "compare", "explain in detail"]
      model: "gpt-4o"
      max_tokens: 4000
  default_model: "gpt-4o-mini"

caching:
  backend: "redis"
  redis_url: "redis://localhost:6379/0"
  embedding_model: "text-embedding-3-small"
  similarity_threshold: 0.92
  ttl_seconds: 86400
  max_cache_entries: 100000

alerts:
  daily_limit: 100.00
  weekly_limit: 500.00
  monthly_limit: 1500.00
  channels:
    - type: "slack"
      webhook: "${SLACK_WEBHOOK_URL}"
    - type: "email"
      to: "user@example.com"
  circuit_breaker:
    enabled: true
    threshold: 200.00            # Hard stop at $200/day
    fallback: "return_error"     # return_error | use_cache_only

batch:
  enabled: true
  provider: "openai"
  max_batch_size: 1000
  check_interval_seconds: 300
```
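The `${SLACK_WEBHOOK_URL}` placeholder implies environment-variable substitution at config load time. A minimal sketch of how that expansion could work before the YAML is parsed (the `expand_env` helper is hypothetical, not part of the toolkit):

```python
import os
import re

def expand_env(text):
    """Replace ${VAR} placeholders with environment values. Raise on an
    unset variable rather than silently passing a literal "${VAR}" string
    to an alert channel."""
    def sub(match):
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"config references unset env var: {name}")
        return os.environ[name]
    return re.sub(r"\$\{(\w+)\}", sub, text)
```

Failing fast on a missing variable is deliberate here: a webhook URL that ends up as the literal text `${SLACK_WEBHOOK_URL}` is exactly the kind of silent misconfiguration that makes budget alerts go nowhere.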
## Best Practices

- Track before optimizing — Instrument all calls for 1-2 weeks to identify where costs actually come from before making changes.
- Cache deterministic queries — Customer support FAQs, documentation lookups, and classification tasks have high cache hit potential.
- Route by task, not by user — A simple question from a premium user still only needs gpt-4o-mini.
- Set circuit breakers — A bug in a loop can burn through your monthly budget in minutes. Hard limits prevent this.
- Batch everything that can wait — Nightly report generation, content indexing, and analytics don't need real-time responses.
- Compress long contexts — RAG contexts often contain 80% irrelevant text. Compress before sending to the LLM.
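The circuit-breaker practice above amounts to a spend counter that resets each day and refuses calls past a hard limit. `DailyCircuitBreaker` below is an illustrative toy, not the toolkit's implementation:

```python
import datetime

class DailyCircuitBreaker:
    """Toy hard-stop: accumulate spend per calendar day and block further
    calls once a daily dollar limit is reached."""

    def __init__(self, daily_limit):
        self.daily_limit = daily_limit
        self.day = None
        self.spent = 0.0

    def record(self, cost, today=None):
        today = today or datetime.date.today()
        if today != self.day:  # new day: reset the counter
            self.day, self.spent = today, 0.0
        self.spent += cost

    def allow(self, today=None):
        today = today or datetime.date.today()
        return today != self.day or self.spent < self.daily_limit
```

Checking `allow()` before each API call, and `record()` after it, turns a runaway loop into at most one day's limit of damage instead of a full month's budget.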
## Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| Semantic cache hit rate is very low | Similarity threshold too high (0.98+) | Lower `similarity_threshold` to 0.90-0.92 and monitor response quality |
| Model router sends everything to cheap model | Complexity classifier too conservative | Add more keywords for "high" complexity or switch to embedding classifier |
| Budget alerts fire but no one notices | Slack webhook expired or email filtered | Test alert channels weekly; add a secondary channel as backup |
| Batch processing results arrive too late | `check_interval` too high or batch API backlogged | Reduce interval to 60 seconds; set priority on time-sensitive batches |
This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [LLM Cost Optimizer] with all files, templates, and documentation for $29.
Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.