DEV Community

Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

LLM Cost Optimizer

LLM API costs compound fast — a prototype that costs $5/day can become $500/day in production. This toolkit gives you the instrumentation and strategies to cut LLM spending by 40-70% without sacrificing output quality. Token usage tracking, intelligent model routing, semantic caching, batch processing, and budget alerts — all in one package.

Key Features

  • Token Usage Tracking — Instrument every LLM call with precise input/output token counts, costs, and latency per model, user, and feature
  • Smart Model Routing — Automatically route simple queries to cheap models (GPT-4o-mini) and complex queries to powerful models (GPT-4o) based on task complexity scoring
  • Semantic Caching — Cache responses by semantic similarity, not just exact match. "What's the weather in NYC?" and "NYC weather today?" hit the same cache entry
  • Batch Processing — Queue non-urgent requests and process them in bulk at 50% lower cost using batch APIs
  • Budget Alerting — Set daily/weekly/monthly spend limits with Slack/email notifications and automatic circuit breakers
  • Prompt Compression — Automatically shorten prompts by removing redundant context while preserving meaning
  • Cost Forecasting — Project future costs based on usage trends and planned feature launches
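The savings from routing come straight from per-token pricing. As a rough illustration (the prices below are assumptions for the sketch; check your provider's current rate card), the cost of a call is just token counts times price:

```python
# Illustrative per-1M-token prices (assumed; verify against your provider's pricing)
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call, given token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Routing a 1,000-in / 500-out call to the cheap model instead of gpt-4o:
expensive = call_cost("gpt-4o", 1000, 500)
cheap = call_cost("gpt-4o-mini", 1000, 500)
print(f"${expensive:.5f} vs ${cheap:.5f}")
```

At these assumed rates the cheap model is roughly 15x less per call, which is where most of the 40-70% headline savings comes from.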

Quick Start

from cost_optimizer import CostTracker, ModelRouter, SemanticCache

# 1. Wrap your LLM client with cost tracking
tracker = CostTracker(
    storage="sqlite:///costs.db",
    alert_threshold_daily=50.00,  # Alert at $50/day
    alert_channel="slack",
)

# 2. Set up model routing
router = ModelRouter(
    rules=[
        {"complexity": "low", "model": "gpt-4o-mini", "max_tokens": 500},
        {"complexity": "medium", "model": "gpt-4o-mini", "max_tokens": 2000},
        {"complexity": "high", "model": "gpt-4o", "max_tokens": 4000},
    ],
    complexity_classifier="keyword",  # keyword | llm | embedding
)

# 3. Add semantic caching
cache = SemanticCache(
    backend="redis",
    embedding_model="text-embedding-3-small",
    similarity_threshold=0.92,
    ttl_seconds=86400,
)

# 4. Use in your application
@tracker.track(feature="customer_support")
@cache.cached()
def answer_question(question: str) -> str:
    model = router.select_model(question)
    response = llm_client.chat(model=model, messages=[{"role": "user", "content": question}])
    return response.content

answer = answer_question("What's your return policy?")
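The semantic cache matches paraphrases by embedding similarity rather than exact strings. Here is a toy sketch of the idea, using a bag-of-words vector as a stand-in for a real embedding model (the library above uses text-embedding-3-small; with real embeddings the threshold would be much higher, e.g. 0.92):

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class TinySemanticCache:
    """Return a cached response when a new query is close enough to an old one."""
    def __init__(self, threshold: float = 0.45):  # low because bag-of-words is crude
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, query: str, response: str) -> None:
        self.entries.append((toy_embed(query), response))

    def get(self, query: str):
        q = toy_embed(query)
        scored = [(cosine(q, emb), resp) for emb, resp in self.entries]
        best = max(scored, default=(0.0, None))
        return best[1] if best[0] >= self.threshold else None

cache = TinySemanticCache()
cache.put("What's the weather in NYC?", "Sunny, 72F")
hit = cache.get("NYC weather today?")   # paraphrase lands on the same entry
```

The threshold is the knob that trades hit rate against answer quality, exactly as in the `similarity_threshold` setting above.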

Architecture

Application Request
        │
        ▼
┌───────────────┐
│ Cost Tracker  │──── Log request metadata (pre-call)
└───────┬───────┘
        ▼
┌───────────────┐
│ Semantic Cache│──── Hit? Return cached response
└───────┬───────┘
        │ Miss
        ▼
┌───────────────┐
│ Model Router  │──── Score complexity → select model
└───────┬───────┘
        ▼
┌────────────────┐
│ Prompt Compress│
└───────┬────────┘
        │
        ├── Urgent ──▶ Direct API call
        │
        └── Non-urgent ──▶ Batch queue (50% cheaper)
                              │
        ┌─────────────────────┘
        ▼
┌───────────────┐
│ Cost Tracker  │──── Log tokens, cost, latency (post-call)
└───────┬───────┘
        ▼
┌───────────────┐
│ Budget Check  │──── Over limit? Alert + circuit break
└───────────────┘
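The flow above can be condensed into a toy pipeline. Everything here is a stand-in (an exact-match cache instead of a semantic one, a fake API call, a flat per-call cost), but the stage ordering mirrors the diagram:

```python
from dataclasses import dataclass, field

@dataclass
class ToyPipeline:
    """Stage ordering from the diagram; every stage is a stand-in."""
    cache: dict = field(default_factory=dict)
    daily_spend: float = 0.0
    daily_limit: float = 200.0

    def route(self, prompt: str) -> str:
        # keyword complexity scoring, as in the router rules
        heavy = ("analyze", "compare", "explain in detail")
        return "gpt-4o" if any(k in prompt.lower() for k in heavy) else "gpt-4o-mini"

    def handle(self, prompt: str) -> str:
        if prompt in self.cache:                     # cache hit: zero cost
            return self.cache[prompt]
        if self.daily_spend >= self.daily_limit:     # budget check / circuit breaker
            raise RuntimeError("daily budget exceeded")
        model = self.route(prompt)                   # model routing
        response = f"[{model}] answer to: {prompt}"  # stand-in for the real API call
        self.daily_spend += 0.01                     # stand-in cost accounting
        self.cache[prompt] = response
        return response
```

A real implementation would also compress the prompt and divert non-urgent work to the batch queue before the API call, as the diagram shows.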

Usage Examples

Cost Dashboard Queries

from cost_optimizer import CostTracker

tracker = CostTracker(storage="sqlite:///costs.db")

# Daily cost breakdown by model
daily = tracker.report(period="today", group_by="model")
for row in daily:
    print(f"{row.model}: {row.requests} reqs, {row.tokens:,} tokens, ${row.cost:.2f}")

# Weekly trend by feature
weekly = tracker.report(period="7d", group_by="feature")
for row in weekly:
    print(f"{row.feature}: ${row.cost:.2f} ({row.cost_change:+.1%} vs last week)")

# Identify most expensive queries
expensive = tracker.top_queries(period="24h", limit=10, sort_by="cost")
for q in expensive:
    print(f"${q.cost:.3f} | {q.model} | {q.prompt_preview[:80]}...")
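Because the tracker writes to SQLite, you can also query the data directly when the built-in reports don't cut it. The schema below is an assumption for illustration; the tracker's real table layout may differ:

```python
import sqlite3

# Hypothetical schema; connect to "costs.db" to run this against real data.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE calls (
    ts TEXT, model TEXT, feature TEXT,
    input_tokens INTEGER, output_tokens INTEGER, cost REAL)""")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2025-06-01", "gpt-4o-mini", "customer_support", 800, 200, 0.0002),
        ("2025-06-01", "gpt-4o-mini", "customer_support", 600, 150, 0.0002),
        ("2025-06-01", "gpt-4o", "analytics", 3000, 1000, 0.0175),
    ],
)

# Daily cost by model: the SQL behind report(period="today", group_by="model")
for model, reqs, cost in conn.execute(
    "SELECT model, COUNT(*), SUM(cost) FROM calls "
    "GROUP BY model ORDER BY SUM(cost) DESC"
):
    print(f"{model}: {reqs} reqs, ${cost:.4f}")
```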

Batch Processing for Non-Urgent Requests

from cost_optimizer.batch import BatchQueue

queue = BatchQueue(
    provider="openai",
    check_interval=300,    # Check for results every 5 minutes
    max_batch_size=1000,
)

# Queue requests (returns immediately)
job_ids = []
for question in faq_questions:
    job_id = queue.enqueue(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    job_ids.append(job_id)

# Process batch (50% cheaper than individual calls)
results = queue.process_batch()
for job_id, response in zip(job_ids, results):
    print(f"{job_id}: {response.content[:100]}")
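Under the hood, provider batch APIs such as OpenAI's accept a JSONL file with one request per line. A minimal sketch of serializing queued requests into that shape (the `custom_id` naming scheme here is illustrative):

```python
import json

def to_batch_jsonl(requests: list) -> str:
    """Serialize queued chat requests into batch-API JSONL, one request per line."""
    lines = []
    for i, req in enumerate(requests):
        lines.append(json.dumps({
            "custom_id": f"job-{i}",          # used to match results back to jobs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": req["model"], "messages": req["messages"]},
        }))
    return "\n".join(lines)

faq_questions = ["What's your return policy?", "Do you ship internationally?"]
jsonl = to_batch_jsonl([
    {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": q}]}
    for q in faq_questions
])
```

Results come back keyed by `custom_id` rather than in submission order, which is why the queue hands you job IDs.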

Prompt Compression

from cost_optimizer.compression import PromptCompressor

compressor = PromptCompressor(
    method="extractive",       # extractive | abstractive
    target_ratio=0.6,          # Reduce prompt to 60% of original length
    preserve_instructions=True, # Never compress system instructions
)

original_prompt = "Here is the full context of the document..."  # ~2,000 tokens in practice
compressed = compressor.compress(original_prompt)
print(f"Reduced from {compressed.original_token_count} to {compressed.token_count} tokens")
print(f"Estimated savings: ${compressed.cost_savings:.3f}")
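For a sense of what extractive compression does, here is a minimal stand-alone sketch: score sentences by average word frequency and keep the top ones until the word budget is reached. It is a crude salience proxy for illustration, not the library's actual algorithm:

```python
import re
from collections import Counter

def extractive_compress(text: str, target_ratio: float = 0.6) -> str:
    """Keep the highest-scoring sentences until ~target_ratio of the words remain."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    def avg_freq(sentence: str) -> float:
        words = re.findall(r"\w+", sentence.lower())
        return sum(freq[w] for w in words) / max(1, len(words))

    # Rank sentences from most to least salient
    ranked = sorted(range(len(sentences)), key=lambda i: -avg_freq(sentences[i]))
    budget = target_ratio * sum(freq.values())
    kept, used = set(), 0
    for i in ranked:
        n = len(re.findall(r"\w+", sentences[i]))
        if not kept or used + n <= budget:  # always keep at least one sentence
            kept.add(i)
            used += n
    return " ".join(sentences[i] for i in sorted(kept))  # restore original order
```

Real compressors use embeddings or a small LLM to judge relevance, but the keep-the-salient-fraction structure is the same.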

Configuration

# cost_optimizer_config.yaml
tracking:
  storage: "sqlite:///costs.db"   # sqlite | postgres
  retention_days: 365
  log_prompts: false              # Don't store prompt text (privacy)

routing:
  classifier: "keyword"           # keyword | llm | embedding
  rules:
    - complexity: "low"
      keywords: ["simple", "yes/no", "list"]
      model: "gpt-4o-mini"
      max_tokens: 500
    - complexity: "high"
      keywords: ["analyze", "compare", "explain in detail"]
      model: "gpt-4o"
      max_tokens: 4000
  default_model: "gpt-4o-mini"

caching:
  backend: "redis"
  redis_url: "redis://localhost:6379/0"
  embedding_model: "text-embedding-3-small"
  similarity_threshold: 0.92
  ttl_seconds: 86400
  max_cache_entries: 100000

alerts:
  daily_limit: 100.00
  weekly_limit: 500.00
  monthly_limit: 1500.00
  channels:
    - type: "slack"
      webhook: "${SLACK_WEBHOOK_URL}"
    - type: "email"
      to: "user@example.com"
  circuit_breaker:
    enabled: true
    threshold: 200.00            # Hard stop at $200/day
    fallback: "return_error"     # return_error | use_cache_only

batch:
  enabled: true
  provider: "openai"
  max_batch_size: 1000
  check_interval_seconds: 300
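The `${SLACK_WEBHOOK_URL}` placeholder implies environment-variable substitution at config load time. A minimal stdlib sketch of that expansion step, assuming the file is read as text before YAML parsing:

```python
import os
import re

def expand_env(text: str) -> str:
    """Replace ${VAR} placeholders with environment values; leave unknown ones alone."""
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: os.environ.get(m.group(1), m.group(0)),
                  text)

os.environ["SLACK_WEBHOOK_URL"] = "https://hooks.slack.com/services/T000/B000/XXX"
line = 'webhook: "${SLACK_WEBHOOK_URL}"'
print(expand_env(line))  # webhook: "https://hooks.slack.com/services/T000/B000/XXX"
```

Leaving unknown placeholders untouched (rather than substituting an empty string) makes missing secrets fail loudly at validation instead of silently.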

Best Practices

  1. Track before optimizing — Instrument all calls for 1-2 weeks to identify where costs actually come from before making changes.
  2. Cache deterministic queries — Customer support FAQs, documentation lookups, and classification tasks have high cache hit potential.
  3. Route by task, not by user — A simple question from a premium user still only needs gpt-4o-mini.
  4. Set circuit breakers — A bug in a loop can burn through your monthly budget in minutes. Hard limits prevent this.
  5. Batch everything that can wait — Nightly report generation, content indexing, and analytics don't need real-time responses.
  6. Compress long contexts — RAG contexts often contain 80% irrelevant text. Compress before sending to the LLM.
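Practice 4 deserves a few lines of code of its own. A minimal circuit breaker is just cumulative spend checked against a hard limit before every call (a sketch, not the toolkit's implementation):

```python
class SpendCircuitBreaker:
    """Hard stop once cumulative spend crosses the limit."""
    def __init__(self, daily_limit: float):
        self.daily_limit = daily_limit
        self.spent = 0.0

    def record(self, cost: float) -> None:
        self.spent += cost

    def allow(self) -> bool:
        return self.spent < self.daily_limit

breaker = SpendCircuitBreaker(daily_limit=200.0)
breaker.record(150.0)
assert breaker.allow()
breaker.record(75.0)
assert not breaker.allow()   # a runaway loop gets cut off here, not at invoice time
```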

Troubleshooting

  • Semantic cache hit rate is very low. Cause: similarity threshold set too high (0.98+). Fix: lower similarity_threshold to 0.90-0.92 and monitor response quality.
  • Model router sends everything to the cheap model. Cause: complexity classifier too conservative. Fix: add more keywords for "high" complexity, or switch to the embedding classifier.
  • Budget alerts fire but no one notices. Cause: Slack webhook expired or email filtered. Fix: test alert channels weekly and add a secondary channel as backup.
  • Batch processing results arrive too late. Cause: check_interval too high or batch API backlogged. Fix: reduce the interval to 60 seconds and set priority on time-sensitive batches.

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete LLM Cost Optimizer with all files, templates, and documentation for $29.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

