# LLM Cost Optimizer
LLM API costs compound fast — a prototype that costs $5/day can become $500/day in production. This toolkit gives you the instrumentation and strategies to cut LLM spending by 40-70% without sacrificing output quality. Token usage tracking, intelligent model routing, semantic caching, batch processing, and budget alerts — all in one package.
## Key Features
- Token Usage Tracking — Instrument every LLM call with precise input/output token counts, costs, and latency per model, user, and feature
- Smart Model Routing — Automatically route simple queries to cheap models (GPT-4o-mini) and complex queries to powerful models (GPT-4o) based on task complexity scoring
- Semantic Caching — Cache responses by semantic similarity, not just exact match. "What's the weather in NYC?" and "NYC weather today?" hit the same cache entry
- Batch Processing — Queue non-urgent requests and process them in bulk at 50% lower cost using batch APIs
- Budget Alerting — Set daily/weekly/monthly spend limits with Slack/email notifications and automatic circuit breakers
- Prompt Compression — Automatically shorten prompts by removing redundant context while preserving meaning
- Cost Forecasting — Project future costs based on usage trends and planned feature launches
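To make the semantic-caching idea concrete, here is a minimal in-memory sketch (an illustration, not the toolkit's actual implementation): entries are matched by cosine similarity of their embeddings rather than exact string equality, so differently worded but equivalent queries can share a cache entry.

```python
import math

def cosine(a, b):
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TinySemanticCache:
    """Toy cache: return the stored response whose embedding is most
    similar to the query, if it clears the similarity threshold."""

    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, embedding):
        best_score, best_response = 0.0, None
        for cached_emb, response in self.entries:
            score = cosine(embedding, cached_emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, embedding, response):
        self.entries.append((embedding, response))
```

A query embedded close to a cached one (similarity above the threshold) returns the stored response; anything dissimilar falls through to a real API call.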
## Quick Start

```python
from cost_optimizer import CostTracker, ModelRouter, SemanticCache

# 1. Wrap your LLM client with cost tracking
tracker = CostTracker(
    storage="sqlite:///costs.db",
    alert_threshold_daily=50.00,  # Alert at $50/day
    alert_channel="slack",
)

# 2. Set up model routing
router = ModelRouter(
    rules=[
        {"complexity": "low", "model": "gpt-4o-mini", "max_tokens": 500},
        {"complexity": "medium", "model": "gpt-4o-mini", "max_tokens": 2000},
        {"complexity": "high", "model": "gpt-4o", "max_tokens": 4000},
    ],
    complexity_classifier="keyword",  # keyword | llm | embedding
)

# 3. Add semantic caching
cache = SemanticCache(
    backend="redis",
    embedding_model="text-embedding-3-small",
    similarity_threshold=0.92,
    ttl_seconds=86400,
)

# 4. Use in your application
@tracker.track(feature="customer_support")
@cache.cached()
def answer_question(question: str) -> str:
    model = router.select_model(question)
    response = llm_client.chat(model=model, messages=[{"role": "user", "content": question}])
    return response.content

answer = answer_question("What's your return policy?")
```
## Architecture

```
Application Request
        │
        ▼
┌───────────────┐
│ Cost Tracker  │──── Log request metadata (pre-call)
└───────┬───────┘
        ▼
┌───────────────┐
│ Semantic Cache│──── Hit? Return cached response
└───────┬───────┘
        │ Miss
        ▼
┌───────────────┐
│ Model Router  │──── Score complexity → select model
└───────┬───────┘
        ▼
┌────────────────┐
│ Prompt Compress│──── Shorten prompt if over budget
└───────┬────────┘
        │
        ├── Urgent ──▶ Direct API call
        │
        └── Non-urgent ──▶ Batch queue (50% cheaper)
                │
        ┌───────┘
        ▼
┌───────────────┐
│ Cost Tracker  │──── Log tokens, cost, latency (post-call)
└───────┬───────┘
        ▼
┌───────────────┐
│ Budget Check  │──── Over limit? Alert + circuit break
└───────────────┘
```
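The flow in the diagram can be sketched as a single request handler; the component interfaces below (`log_request`, `get`/`put`, `select_model`, `compress`) are hypothetical placeholders for illustration, not the toolkit's real API:

```python
def handle_request(prompt, *, tracker, cache, router, compressor, call_llm):
    """Sketch of the pipeline: log, check cache, route, compress, call, log."""
    tracker.log_request(prompt)          # pre-call metadata
    cached = cache.get(prompt)
    if cached is not None:
        return cached                    # cache hit: zero API cost
    model = router.select_model(prompt)  # complexity score -> model choice
    prompt = compressor.compress(prompt) # shrink oversized context
    response = call_llm(model, prompt)   # direct (or batched) API call
    tracker.log_response(response)       # tokens, cost, latency
    cache.put(prompt, response)
    return response
```

The key property is ordering: the cache sits before routing and compression, so a hit skips every downstream cost entirely.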
## Usage Examples

### Cost Dashboard Queries

```python
from cost_optimizer import CostTracker

tracker = CostTracker(storage="sqlite:///costs.db")

# Daily cost breakdown by model
daily = tracker.report(period="today", group_by="model")
for row in daily:
    print(f"{row.model}: {row.requests} reqs, {row.tokens:,} tokens, ${row.cost:.2f}")

# Weekly trend by feature
weekly = tracker.report(period="7d", group_by="feature")
for row in weekly:
    print(f"{row.feature}: ${row.cost:.2f} ({row.cost_change:+.1%} vs last week)")

# Identify the most expensive queries
expensive = tracker.top_queries(period="24h", limit=10, sort_by="cost")
for q in expensive:
    print(f"${q.cost:.3f} | {q.model} | {q.prompt_preview[:80]}...")
```
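Under the hood, per-call cost is simple arithmetic: input and output tokens are priced separately per million. A minimal sketch with illustrative prices (always check your provider's current rate card, since these numbers change):

```python
# Illustrative per-million-token prices in dollars -- not authoritative.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def call_cost(model, input_tokens, output_tokens, prices=PRICES):
    """Dollar cost of one call: tokens priced per million, input and
    output counted separately."""
    p = prices[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

This is also why routing matters: at these example rates, answering the same query on gpt-4o instead of gpt-4o-mini costs roughly 17x more per input token.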
### Batch Processing for Non-Urgent Requests

```python
from cost_optimizer.batch import BatchQueue

queue = BatchQueue(
    provider="openai",
    check_interval=300,  # Check for results every 5 minutes
    max_batch_size=1000,
)

# Queue requests (returns immediately)
job_ids = []
for question in faq_questions:
    job_id = queue.enqueue(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    job_ids.append(job_id)

# Process the batch (50% cheaper than individual calls)
results = queue.process_batch()
for job_id, response in zip(job_ids, results):
    print(f"{job_id}: {response.content[:100]}")
```
### Prompt Compression

```python
from cost_optimizer.compression import PromptCompressor

compressor = PromptCompressor(
    method="extractive",         # extractive | abstractive
    target_ratio=0.6,            # Reduce prompt to 60% of original length
    preserve_instructions=True,  # Never compress system instructions
)

original_prompt = "Here is the full context of the document..."  # ~2,000 tokens
compressed = compressor.compress(original_prompt)
print(f"Reduced from {compressed.original_token_count} to {compressed.token_count} tokens")
print(f"Estimated savings: ${compressed.cost_savings:.3f}")
```
## Configuration

```yaml
# cost_optimizer_config.yaml
tracking:
  storage: "sqlite:///costs.db"  # sqlite | postgres
  retention_days: 365
  log_prompts: false             # Don't store prompt text (privacy)

routing:
  classifier: "keyword"          # keyword | llm | embedding
  rules:
    - complexity: "low"
      keywords: ["simple", "yes/no", "list"]
      model: "gpt-4o-mini"
      max_tokens: 500
    - complexity: "high"
      keywords: ["analyze", "compare", "explain in detail"]
      model: "gpt-4o"
      max_tokens: 4000
  default_model: "gpt-4o-mini"

caching:
  backend: "redis"
  redis_url: "redis://localhost:6379/0"
  embedding_model: "text-embedding-3-small"
  similarity_threshold: 0.92
  ttl_seconds: 86400
  max_cache_entries: 100000

alerts:
  daily_limit: 100.00
  weekly_limit: 500.00
  monthly_limit: 1500.00
  channels:
    - type: "slack"
      webhook: "${SLACK_WEBHOOK_URL}"
    - type: "email"
      to: "user@example.com"
  circuit_breaker:
    enabled: true
    threshold: 200.00            # Hard stop at $200/day
    fallback: "return_error"     # return_error | use_cache_only

batch:
  enabled: true
  provider: "openai"
  max_batch_size: 1000
  check_interval_seconds: 300
```
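The `${SLACK_WEBHOOK_URL}` placeholder implies environment-variable substitution at config load time. A minimal sketch of how that expansion could work before the YAML is parsed (the `expand_env` helper is hypothetical, not part of the toolkit):

```python
import os
import re

def expand_env(text):
    """Replace ${VAR} placeholders with environment values. Raise on an
    unset variable rather than silently passing a literal "${VAR}" string
    to an alert channel."""
    def sub(match):
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"config references unset env var: {name}")
        return os.environ[name]
    return re.sub(r"\$\{(\w+)\}", sub, text)
```

Failing fast on a missing variable is deliberate here: a webhook URL that ends up as the literal text `${SLACK_WEBHOOK_URL}` is exactly the kind of silent misconfiguration that makes budget alerts go nowhere.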
## Best Practices

- Track before optimizing — Instrument all calls for 1-2 weeks to identify where costs actually come from before making changes.
- Cache deterministic queries — Customer support FAQs, documentation lookups, and classification tasks have high cache hit potential.
- Route by task, not by user — A simple question from a premium user still only needs gpt-4o-mini.
- Set circuit breakers — A bug in a loop can burn through your monthly budget in minutes. Hard limits prevent this.
- Batch everything that can wait — Nightly report generation, content indexing, and analytics don't need real-time responses.
- Compress long contexts — RAG contexts often contain 80% irrelevant text. Compress before sending to the LLM.
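The circuit-breaker practice above amounts to a spend counter that resets each day and refuses calls past a hard limit. `DailyCircuitBreaker` below is an illustrative toy, not the toolkit's implementation:

```python
import datetime

class DailyCircuitBreaker:
    """Toy hard-stop: accumulate spend per calendar day and block further
    calls once a daily dollar limit is reached."""

    def __init__(self, daily_limit):
        self.daily_limit = daily_limit
        self.day = None
        self.spent = 0.0

    def record(self, cost, today=None):
        today = today or datetime.date.today()
        if today != self.day:  # new day: reset the counter
            self.day, self.spent = today, 0.0
        self.spent += cost

    def allow(self, today=None):
        today = today or datetime.date.today()
        return today != self.day or self.spent < self.daily_limit
```

Checking `allow()` before each API call, and `record()` after it, turns a runaway loop into at most one day's limit of damage instead of a full month's budget.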
## Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| Semantic cache hit rate is very low | Similarity threshold too high (0.98+) | Lower `similarity_threshold` to 0.90-0.92 and monitor response quality |
| Model router sends everything to cheap model | Complexity classifier too conservative | Add more keywords for "high" complexity or switch to embedding classifier |
| Budget alerts fire but no one notices | Slack webhook expired or email filtered | Test alert channels weekly; add a secondary channel as backup |
| Batch processing results arrive too late | `check_interval` too high or batch API backlogged | Reduce interval to 60 seconds; set priority on time-sensitive batches |
This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [LLM Cost Optimizer] with all files, templates, and documentation for $29.
Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.