AI API costs are the new cloud bill. Developers are spending $100-$500/month on Claude Code, Cursor, and custom AI pipelines — and most of that spend is avoidable.
Here are the strategies that actually work, with real numbers.
## Strategy 1: Use the Right Model for Each Task (40-60% savings)
This is the single biggest lever. Most developers use one model for everything. That's like using a sports car for grocery runs. (Prices below are $ per million input/output tokens.)
| Task | Expensive Model | Right Model | Savings |
|---|---|---|---|
| Architecture design | Claude Opus ($5/$25) | Claude Opus ($5/$25) | 0% (worth it) |
| Code generation | Claude Opus ($5/$25) | Claude Sonnet ($3/$15) | 40% |
| Test generation | Claude Sonnet ($3/$15) | DeepSeek V3 ($0.27/$1.10) | 93% |
| Documentation | Claude Sonnet ($3/$15) | DeepSeek V3 ($0.27/$1.10) | 93% |
| Linting/formatting | Claude Sonnet ($3/$15) | Claude Haiku ($1/$5) | 67% |
Real example: A developer doing 200 API sessions/month:
- All Sonnet: $270/month
- Smart routing (20% Sonnet, 30% Haiku, 50% DeepSeek): $55/month
- Savings: 80%
```python
from openai import OpenAI

client = OpenAI(base_url="https://api.futurmix.ai/v1", api_key="key")

# Complex task → expensive model
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Refactor this authentication system..."}],
)

# Bulk task → cheap model
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Add docstrings to all functions in..."}],
)
```
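If you'd rather not pick a model by hand on every call, a small routing table does the job. Here's a minimal sketch; the task categories and the model IDs in `MODEL_FOR_TASK` are illustrative assumptions, so adjust them to your stack:

```python
# Hypothetical task → model routing table, based on the table above
MODEL_FOR_TASK = {
    "architecture": "claude-opus-4-7",
    "codegen": "claude-sonnet-4-6",
    "tests": "deepseek-chat",
    "docs": "deepseek-chat",
    "lint": "claude-haiku",
}

def route_completion(client, task_type, prompt):
    """Send the prompt to the cheapest model that handles this task type."""
    model = MODEL_FOR_TASK.get(task_type, "claude-sonnet-4-6")  # safe default
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

# e.g. route_completion(client, "docs", "Add docstrings to utils.py")
```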
## Strategy 2: Route Through a Discounted Gateway (10-30% savings)
Multi-model API gateways negotiate volume pricing with providers. You get the exact same models at lower per-token costs:
| Model | Direct Price | Via Gateway | Savings |
|---|---|---|---|
| Claude Sonnet 4.6 | $3 / $15 | $2.70 / $13.50 | 10% |
| Claude Opus 4.7 | $5 / $25 | $4.50 / $22.50 | 10% |
| GPT-5.5 | $3 / $12 | $2.10 / $8.40 | 30% |
| DeepSeek V3 | $0.27 / $1.10 | $0.19 / $0.77 | 30% |
This stacks with Strategy 1. Same code, same models, lower prices.
Setup for common tools:
```bash
# Claude Code
export ANTHROPIC_BASE_URL="https://api.futurmix.ai"

# Aider
aider --openai-api-base https://api.futurmix.ai/v1

# Cursor: Settings → Models → Custom API Base
```
## Strategy 3: Reduce Context Size (20-40% savings)
Every file in your context window costs tokens, and most setups send far more of the codebase than the model actually needs.
**For Claude Code:**

Create a `.claudeignore` file to exclude irrelevant directories:
```
node_modules/
dist/
build/
*.lock
*.min.js
__pycache__/
.git/
coverage/
```
**For Aider:**

Use `.aiderignore` and limit the repo map:

```bash
aider --map-tokens 1024  # limit repo map to 1024 tokens
```
**For custom pipelines:**

Be surgical about what you include in the prompt. Don't dump an entire file when you only need one function.
```python
# Bad: sending the entire file (10K tokens)
prompt = f"Fix the bug in this file:\n{entire_file_content}"

# Good: sending only the relevant function (500 tokens)
prompt = f"Fix the bug in this function:\n{function_source}"
```
## Strategy 4: Cache Responses (15-30% savings)
If you're sending the same (or similar) prompts repeatedly, cache the responses:
```python
import hashlib
import json
import os

CACHE_DIR = ".ai-cache"

def cached_completion(client, model, messages, **kwargs):
    # Create a deterministic cache key from model + prompt
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    cache_path = os.path.join(CACHE_DIR, f"{key}.json")

    # Return the cached response if it exists
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            result = json.load(f)
        result["cached"] = True
        return result

    # Otherwise make the API call
    response = client.chat.completions.create(
        model=model, messages=messages, **kwargs
    )

    # Cache the response for next time
    os.makedirs(CACHE_DIR, exist_ok=True)
    result = {"content": response.choices[0].message.content, "model": model}
    with open(cache_path, "w") as f:
        json.dump(result, f)

    result["cached"] = False
    return result
```
This is especially effective for:
- Code review prompts on unchanged files
- Documentation generation (regenerating same docs)
- Test generation (test suites don't change often)
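Usage is then a drop-in replacement for the raw client call, reusing the gateway client from Strategy 1; the prompt here is just an example:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.futurmix.ai/v1", api_key="key")

result = cached_completion(
    client, "deepseek-chat",
    [{"role": "user", "content": "Generate docstrings for utils.py"}],
)
print(result["content"])
print("served from cache" if result["cached"] else "fresh API call")
```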
## Strategy 5: Use Prompt Caching (Anthropic-specific, up to 90% on cached tokens)
Anthropic offers prompt caching. If the same prefix appears across requests within the cache's lifetime (five minutes by default for the ephemeral cache), cached tokens cost 90% less:
```python
from anthropic import Anthropic

client = Anthropic()

# First request: full price for all tokens
response = client.messages.create(
    model="claude-sonnet-4-6-20260514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": large_system_prompt,  # 10K tokens
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "Question 1"}]
)

# Second request: 90% off for cached system prompt tokens
response = client.messages.create(
    model="claude-sonnet-4-6-20260514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": large_system_prompt,  # cached! 90% cheaper
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "Question 2"}]
)
```
If your system prompt is 10K tokens and you make 100 requests, prompt caching saves:
- Without caching: 10K × 100 × $3/M = $3.00
- With caching: 10K × $3/M + 10K × 99 × $0.30/M = $0.33
- Savings: ~89%

(Strictly speaking, Anthropic charges a ~25% premium on cache-write tokens, so real savings are slightly lower, but the picture doesn't change.)
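A quick sanity check of that arithmetic (ignoring the cache-write premium):

```python
SONNET_INPUT = 3.00   # $ per million input tokens
CACHE_READ = 0.30     # $ per million cached tokens (90% off)
tokens, requests = 10_000, 100

without = tokens * requests * SONNET_INPUT / 1e6
with_cache = (tokens * SONNET_INPUT + tokens * (requests - 1) * CACHE_READ) / 1e6
print(f"${without:.2f} -> ${with_cache:.2f} ({1 - with_cache / without:.0%} saved)")
# $3.00 -> $0.33 (89% saved)
```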
## Strategy 6: Batch Similar Requests (10-20% savings)
Instead of making individual API calls, batch similar tasks:
```python
# Bad: 50 separate API calls for 50 functions
for func in functions:
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": f"Add docstring to: {func}"}]
    )

# Good: one API call with all functions batched
all_functions = "\n\n".join(functions)
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": f"Add docstrings to all functions:\n{all_functions}"}]
)
```
Batching reduces:
- Per-request overhead (connection setup, headers)
- System prompt duplication (sent once instead of 50 times)
- Total token count (model processes context once)
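One practical caveat: don't let a single batch blow past the model's context window. A simple chunking sketch, assuming the `functions` list and `client` from above; the batch size of 10 is an arbitrary starting point:

```python
def batch_prompts(items, batch_size=10):
    """Yield newline-joined chunks so each request stays a manageable size."""
    for i in range(0, len(items), batch_size):
        yield "\n\n".join(items[i:i + batch_size])

for chunk in batch_prompts(functions, batch_size=10):
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": f"Add docstrings to all functions:\n{chunk}"}],
    )
```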
## Strategy 7: Monitor and Set Alerts
You can't optimize what you don't measure. Track your API spend:
```python
class CostMonitor:
    # $ per million (input, output) tokens; numbers from the pricing tables above
    PRICING = {
        "claude-sonnet-4-6": (3.00, 15.00),
        "claude-haiku": (1.00, 5.00),
        "deepseek-chat": (0.27, 1.10),
    }

    def __init__(self, monthly_budget=100):
        self.budget = monthly_budget
        self.spent = 0.0

    def _calculate_cost(self, model, input_tokens, output_tokens):
        input_price, output_price = self.PRICING[model]
        return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

    def track(self, model, input_tokens, output_tokens):
        cost = self._calculate_cost(model, input_tokens, output_tokens)
        self.spent += cost
        if self.spent > self.budget * 0.8:
            print(f"⚠️ WARNING: ${self.spent:.2f}/${self.budget} budget used")
        return cost
```
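Wiring it into the earlier client is one extra line per call; token counts come back on the response's `usage` field:

```python
monitor = CostMonitor(monthly_budget=100)

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Summarize this changelog..."}],
)
monitor.track(
    "deepseek-chat",
    response.usage.prompt_tokens,
    response.usage.completion_tokens,
)
```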
Most API gateways also provide built-in usage dashboards and spending alerts.
## Combined Impact
Here's what happens when you stack all strategies:
| Strategy | Savings | Cumulative |
|---|---|---|
| Baseline (Sonnet for everything) | — | $270/mo |
| + Model routing | -60% | $108/mo |
| + Gateway discount | -15% | $92/mo |
| + Context optimization | -25% | $69/mo |
| + Response caching | -20% | $55/mo |
| Total | ~80% | ~$55/mo |
From $270/month to $55/month — same quality, same workflow.
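Note that the discounts multiply rather than add, which is why the total lands at ~80% rather than the 120% you'd get by summing the rows:

```python
bill = 270.0
for step, cut in [("model routing", 0.60), ("gateway discount", 0.15),
                  ("context optimization", 0.25), ("response caching", 0.20)]:
    bill *= 1 - cut
    print(f"after {step}: ${bill:.0f}/mo")
# after model routing: $108/mo ... after response caching: $55/mo
```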
## Get Started
FuturMix offers one API key for 22+ models at 10-30% off. OpenAI-compatible, works with Claude Code, Cursor, Aider, and your custom code.
```bash
export OPENAI_BASE_URL="https://api.futurmix.ai/v1"
export OPENAI_API_KEY="your-key"
```
Start with Strategy 1 (model routing) and Strategy 2 (gateway discounts): between them, the only changes are a model name and a base URL, and they deliver the biggest savings immediately.
What's your AI API bill looking like? Share your cost optimization wins in the comments.