Originally shared on LinkedIn where it reached 10,000+ professionals. Expanded with technical details and actionable strategies here.
A client was burning through ₹85,000/month on AI API calls.
In one weekend, we cut that to under ₹12,000.
I'm not exaggerating. This is real money saved by a real company that had no idea they were hemorrhaging cash on AI infrastructure.
The worst part? They thought that's just "the cost of doing AI."
It's not.
If you're using AI APIs in production—Claude, GPT-4, Gemini, whatever—there's a 90% chance you're overspending. This article walks through exactly how we diagnosed the problem and fixed it.
The Problem: Why Your AI Bill Looks Like a Mortgage Payment
Before the audit, here's what the client was doing:
Using Opus (₹15/1M tokens) for simple tasks — asking GPT-4-level models to classify customer emails or generate summaries when Haiku (₹0.80/1M tokens) would work fine.
Zero prompt caching — sending the same 50KB system prompt with every API call. That's like shipping the same dictionary with every word lookup.
No batch processing — every request hit the API in real-time, even for non-urgent tasks that could run at night at 50% discount.
Redundant API calls — the app was sending the same input to the API multiple times within seconds because there was no response caching. Imagine asking your friend the same question twice instead of remembering the answer.
Token bloat — prompts weren't optimized. XML wrappers, unnecessary examples, verbose instructions—all padding the token count.
The client thought they needed the "best" model for everything. In reality, they needed the right model for each task.
The Audit: Finding Your Money Leaks
Here's the systematic process we followed:
Step 1: Collect Everything
We enabled detailed logging on all API calls:
```python
import json
from datetime import datetime

def log_api_call(model, input_tokens, output_tokens, task_type, prompt_hash):
    log_entry = {
        "timestamp": datetime.now().isoformat(),
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "task_type": task_type,
        "prompt_hash": prompt_hash,  # To detect redundancy
        # calculate_cost is a small helper that applies your per-model
        # rates -- see the Cost Calculators section below
        "total_cost_usd": calculate_cost(model, input_tokens, output_tokens)
    }
    with open("api_audit.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")
```
One week of logs told us everything.
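Once the log exists, a short script surfaces where the money actually goes. Here's a minimal sketch that aggregates the `api_audit.jsonl` file from Step 1 by task type (the field names match the logger above):

```python
import json
from collections import defaultdict

def summarize_audit_log(path="api_audit.jsonl"):
    """Aggregate calls, tokens, and cost per task type from the audit log."""
    totals = defaultdict(lambda: {"calls": 0, "cost": 0.0, "tokens": 0})
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            bucket = totals[entry["task_type"]]
            bucket["calls"] += 1
            bucket["cost"] += entry["total_cost_usd"]
            bucket["tokens"] += entry["input_tokens"] + entry["output_tokens"]
    # Sort by cost descending so the biggest leaks surface first
    return sorted(totals.items(), key=lambda kv: -kv[1]["cost"])
```

Run it after a week of logging and the top two or three rows are almost always where your optimization effort should start.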
Step 2: Categorize by Task Type
We grouped calls into buckets:
- Email classification — 12,000 calls/week (should be Haiku)
- Content summarization — 4,000 calls/week (Sonnet max)
- Complex research synthesis — 800 calls/week (Opus only)
- Code review — 2,200 calls/week (Sonnet)
- Prompt engineering iteration — 3,500 calls/week (varies, but lots of waste here)
Step 3: Match Model to Complexity
This is where the magic happens. Not every task needs Opus.
| Task Type | Current Model | Optimal Model | Annual Savings |
|---|---|---|---|
| Email classification | GPT-4 (₹12/1M in) | Claude Haiku (₹0.80/1M in) | ₹42,000+ |
| Customer support QA | GPT-4 | Claude Sonnet (₹3/1M in) | ₹28,000+ |
| Content summarization | GPT-4 Turbo | Claude Sonnet | ₹18,500+ |
| Complex research | GPT-4 Turbo | Claude Opus (₹15/1M) | ₹0 (necessary) |
| Prompt iteration | GPT-4 | Claude Haiku | ₹8,500+ |
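The table above can be codified as a small routing function so the model choice is enforced in code, not left to each call site. A sketch, using the model IDs that appear elsewhere in this article — the Sonnet ID is a placeholder, so substitute the current one from your provider's docs:

```python
# Cheapest-capable-model routing, based on the audit's task buckets
TASK_TIERS = {
    "email_classification": "haiku",
    "customer_support_qa": "sonnet",
    "content_summarization": "sonnet",
    "complex_research": "opus",
    "prompt_iteration": "haiku",
}

MODEL_IDS = {
    "haiku": "claude-haiku-4-5-20251001",
    "sonnet": "claude-sonnet-placeholder",  # placeholder -- use the real ID
    "opus": "claude-opus-4-1-20250805",
}

def pick_model(task_type: str) -> str:
    # Unknown tasks fall back to the mid tier, not the most expensive one
    tier = TASK_TIERS.get(task_type, "sonnet")
    return MODEL_IDS[tier]
```

Centralizing the mapping also means a future price change is a one-line edit instead of a codebase-wide hunt.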
Step 4: Spot Redundancy
Using the prompt hash from our logs, we found:
- Same system prompt sent 47,000 times in one month. With caching, that would be 1 full call + 46,999 cached calls (90% cost reduction on those).
- Same user inputs processed multiple times within seconds. A simple `(input_hash, output)` cache would eliminate this.
- Retry loops without exponential backoff — failed calls being retried immediately, multiplying tokens wasted.
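The retry problem in particular is cheap to fix. A minimal backoff wrapper (the delays and retry count here are illustrative defaults, not tuned values):

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry a flaky API call with exponential backoff and jitter.

    Immediate retries multiply wasted tokens; backing off gives rate
    limits and transient errors time to clear before you pay again.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries -- surface the error
            # 1s, 2s, 4s, 8s ... plus jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

In production you'd narrow the `except` to the retryable error types your SDK raises (rate limits, timeouts) rather than catching everything.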
The 5 Optimization Strategies That Worked
1. Model Matching (30% savings)
Simple rule: Use the simplest model that works for your task.
- Haiku — Classification, formatting, extraction, simple routing (₹0.80/1M input)
- Sonnet — Summarization, content generation, moderate reasoning (₹3/1M input)
- Opus — Complex multi-step reasoning, code review, research synthesis (₹15/1M input)
We moved 60% of the client's workload from Opus/GPT-4 to Haiku/Sonnet.
Testing: Don't guess. Run a small labeled batch through each candidate model and measure quality and latency. On simple tasks, Haiku typically trades a roughly 10% quality drop for a 50-90% cost reduction.
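That spot-check can be a few lines of code. A sketch of the harness — `classify_fn` is a stand-in for whatever function wraps your API call, and `samples` is a small hand-labeled set:

```python
def compare_models(classify_fn, samples, models):
    """Run a labeled sample through each candidate model and report accuracy,
    so the downgrade decision is measured rather than guessed.

    classify_fn(model, text) -> predicted label
    samples: list of (text, expected_label) pairs
    """
    report = {}
    for model in models:
        correct = sum(
            1 for text, expected in samples
            if classify_fn(model, text) == expected
        )
        report[model] = correct / len(samples)
    return report
```

Even 50-100 labeled examples is usually enough to see whether the cheap model clears your quality bar.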
2. Prompt Caching (45% savings)
Claude's prompt caching feature is a game-changer. It works like this:
- First call: the full prompt is sent and the static prefix is written to the cache (billed at the normal input rate, ₹15 per 1M tokens here, plus a small cache-write premium).
- Subsequent calls while the cache is warm: the cached prefix is billed at roughly 10% of the base input rate (about ₹1.50 per 1M cached tokens); only the new query part pays full price.
```python
import anthropic

client = anthropic.Anthropic()

# Static system prompt (gets cached)
SYSTEM_PROMPT = """You are a customer support email classifier.
Categories: billing, technical_support, feature_request, complaint, other.
Respond with JSON: {"category": "...", "confidence": 0.95, "reasoning": "..."}"""

# This prompt gets cached on first call
response = client.messages.create(
    model="claude-opus-4-1-20250805",
    max_tokens=200,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # Cache for 5 mins
        }
    ],
    messages=[
        {"role": "user", "content": "Can you help me reset my password?"}
    ]
)

print(response.usage)
# usage.cache_read_input_tokens > 0 means caching worked
```
For the client:
- ~50K-token system prompt × 30,000 calls/month = 1.5B cached tokens
- Cached tokens cost 90% less
- Monthly savings: ₹18,000+
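The arithmetic behind that estimate is simple enough to sanity-check yourself. A sketch using the article's ₹15/1M base rate and a ~90% discount on cache reads (this ignores the small cache-write premium, so treat the result as an upper-bound estimate):

```python
def caching_savings(prompt_tokens, calls_per_month, base_rate_per_m, cache_discount=0.90):
    """Estimate monthly savings from caching a static prompt prefix.

    Without caching, every call pays full price for the static prompt;
    with caching, repeat calls pay roughly (1 - cache_discount) of it.
    """
    full_cost = prompt_tokens * calls_per_month / 1_000_000 * base_rate_per_m
    cached_cost = full_cost * (1 - cache_discount)
    return full_cost - cached_cost

# ~50K-token prompt, 30,000 calls/month, ₹15 per 1M tokens
print(caching_savings(50_000, 30_000, 15))
```

Plugging in the client's numbers gives a ceiling of roughly ₹20K/month, consistent with the ₹18,000+ observed after the write premium and cache misses.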
3. Batch Processing (35% savings for async tasks)
Not every API call needs to be real-time. Use the batch API for non-urgent tasks:
- Normal API: ₹15/1M input tokens
- Batch API: ₹7.50/1M input tokens (50% discount, results within 24 hours)
```python
import anthropic
import json

client = anthropic.Anthropic()

# Prepare batch requests
batch_requests = []
for email in emails_to_classify:  # 10,000 emails
    batch_requests.append({
        "custom_id": f"email-{email['id']}",
        "params": {
            "model": "claude-opus-4-1-20250805",
            "max_tokens": 200,
            "system": "Classify this email...",
            "messages": [
                {"role": "user", "content": email['text']}
            ]
        }
    })

# Submit batch
batch = client.beta.messages.batches.create(
    requests=batch_requests
)
print(f"Batch ID: {batch.id}")
# Process results later
For the client's non-urgent tasks (nightly reports, weekly summaries), batch processing cut costs in half.
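To close the loop, here's a sketch of collecting the results once the batch finishes. The attribute names (`processing_status`, `custom_id`, `result.type`, `result.message`) follow the shape of the batches API used above, but verify them against the current SDK docs:

```python
import time

def wait_and_collect(client, batch_id, poll_seconds=60):
    """Poll until the batch ends, then return {custom_id: response_text}."""
    while True:
        batch = client.beta.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        time.sleep(poll_seconds)
    return collect_results(client.beta.messages.batches.results(batch_id))

def collect_results(results):
    """Keep only succeeded requests; errored/expired ones get retried separately."""
    out = {}
    for r in results:
        if r.result.type == "succeeded":
            out[r.custom_id] = r.result.message.content[0].text
    return out
```

The `custom_id` you set at submit time is what lets you match each result back to the original email.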
4. Response Caching (20% savings)
Don't ask the API the same question twice:
```python
import hashlib

import anthropic

client = anthropic.Anthropic()
cache = {}  # Or Redis for production

def get_classification(text, model):
    # Create deterministic hash of input
    input_hash = hashlib.sha256(text.encode()).hexdigest()

    # Check cache first
    if input_hash in cache:
        return cache[input_hash]

    # Call API (max_tokens is required by the Messages API)
    response = client.messages.create(
        model=model,
        max_tokens=200,
        messages=[{"role": "user", "content": text}]
    )
    result = response.content[0].text
    cache[input_hash] = result
    return result

# First call: hits API
result1 = get_classification("Can I reset my password?", "claude-haiku-4-5-20251001")

# Second call: same input, returns from cache instantly
result2 = get_classification("Can I reset my password?", "claude-haiku-4-5-20251001")
```
The client had duplicate email classifications happening within seconds. A simple cache (Redis, in-memory, doesn't matter) eliminated ~20% of redundant calls.
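Two refinements worth adding before production: key the cache on the model as well as the text (the same input can yield different answers on different models), and give entries an expiry so answers don't go stale forever. A self-contained in-memory sketch of that pattern — the same idea as a Redis `SETEX`:

```python
import hashlib
import time

class TTLCache:
    """Response cache with expiry, keyed on (model, input text)."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}

    def _key(self, text, model):
        # Include the model in the key so a Haiku answer is never
        # served where an Opus answer was expected
        return hashlib.sha256(f"{model}:{text}".encode()).hexdigest()

    def get(self, text, model):
        entry = self._store.get(self._key(text, model))
        if entry is None:
            return None
        value, expires = entry
        if time.time() > expires:
            del self._store[self._key(text, model)]  # evict stale entry
            return None
        return value

    def set(self, text, model, value):
        self._store[self._key(text, model)] = (value, time.time() + self.ttl)
```

Swap the dict for Redis when you need the cache shared across processes or servers.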
5. Token Optimization (15% savings)
Smaller prompts = fewer tokens = lower costs.
Before:

```text
You are an expert customer service AI. Your role is to take customer emails
and classify them into one of several predefined categories. Please read
the email carefully and determine which category it belongs to.

Categories:
1. Billing & Payments
2. Technical Support & Bugs
3. Feature Requests
4. Complaints & Escalations
5. Other

Please respond in JSON format with the following structure:
{
  "category": "...",
  "confidence": 0.0-1.0,
  "reasoning": "..."
}

Here's the email:
```

After:

```text
Classify the email into: billing, technical, feature_request, complaint, other.
Response: {"category": "...", "confidence": 0.0-1.0}
Email:
```
Same output, 60% fewer tokens.
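You can quantify a trim like this before shipping it. The heuristic below (~4 characters per token for English) is rough — use your provider's tokenizer or token-counting endpoint for exact numbers — but it's plenty for comparing a before/after rewrite:

```python
def rough_token_count(text):
    """Very rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def trim_saving(before, after):
    """Percentage of (estimated) tokens removed by a prompt rewrite."""
    b, a = rough_token_count(before), rough_token_count(after)
    return round(100 * (b - a) / b)
```

Multiply the percentage by your monthly input-token spend on that prompt to see whether the rewrite is worth the regression-testing effort.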
The Results: Before & After
| Metric | Before | After | Savings |
|---|---|---|---|
| Monthly API cost | ₹85,000 | ₹12,000 | ₹73,000 (86%) |
| Avg tokens/call | 2,840 | 1,100 | 61% reduction |
| API calls/month | 340,000 | 280,000 | 18% fewer (redundancy removed) |
| Latency (p95) | 850ms | 420ms | 50% faster (cache hits) |
| System accuracy | 94% | 94.2% | +0.2% (better model matching) |
The real win? The client now has better performance AND lower costs. That's the point of optimization—efficiency compounds.
Your Turn: The Checklist
You can apply this to any AI project. Here's what to do Monday morning:
Week 1: Audit
- [ ] Enable detailed logging on all API calls (model, tokens, task type)
- [ ] Collect one week of data
- [ ] Categorize tasks by complexity
- [ ] Calculate per-task costs
Week 2: Low-Hanging Fruit
- [ ] Downgrade non-critical tasks to cheaper models (test quality first)
- [ ] Implement response caching if you have duplicate inputs
- [ ] Optimize prompts (remove verbosity, examples, unnecessary structure)
Week 3: Advanced Optimizations
- [ ] Set up prompt caching for static system prompts
- [ ] Migrate non-urgent tasks to batch processing
- [ ] Implement request deduplication
Week 4: Measure & Iterate
- [ ] Compare new costs vs. baseline
- [ ] Monitor quality metrics (didn't go down, right?)
- [ ] Automate the audit—keep logging, check monthly
Tools & Resources
Logging & Analytics:
- Python `anthropic` SDK (built-in logging)
- langsmith.com (trace API calls across frameworks)
- Datadog / Elastic (if you're at scale)

Caching:
- Redis (production caching)
- Python `functools.lru_cache` (for testing)
- Claude prompt caching (native, no setup)

Batch Processing:
- Claude Batch API (native)
- OpenAI Batch API

Cost Calculators:
- anthropic.com/pricing (Claude)
- openai.com/pricing (GPT models)
- Build your own: `(input_tokens × input_rate + output_tokens × output_rate) / 1_000_000`
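That formula is a few lines of Python. The input rates below are the ones quoted in this article; the output rates are illustrative placeholders — pull current numbers from the providers' pricing pages before trusting the result:

```python
# Per-million-token rates as (input_rate, output_rate).
# Input rates match this article; output rates are placeholders.
RATES = {
    "haiku": (0.80, 4.00),
    "sonnet": (3.00, 15.00),
    "opus": (15.00, 75.00),
}

def call_cost(model, input_tokens, output_tokens):
    """(input_tokens * input_rate + output_tokens * output_rate) / 1M."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```

This is also a drop-in candidate for the `calculate_cost` helper the audit logger in Step 1 assumes.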
The Bigger Picture
This client went from thinking "AI is expensive" to building a sustainable, cost-efficient AI infrastructure.
The same principles apply whether you're:
- Building an AI chatbot for customer support
- Automating content generation at scale
- Running batch analysis on document collections
- Fine-tuning models for specific tasks
The key insight: every rupee you save on API costs is a rupee that can go toward better infrastructure, faster iteration, or hiring more engineers.
Start with the audit. Everything else follows.
About the Author
I'm Archit Mittal (@automate-archit), an automation engineer helping companies save money and time with AI workflows. I've helped 30+ clients optimize their AI spending, cut API costs by 50-85%, and build sustainable automation architectures.
If you're dealing with similar cost problems or want to discuss AI automation strategies, let's connect:
- LinkedIn: @automate-archit
- Twitter/X: @automate_archit
- 1-on-1 consultations: topmate.io/automate_archit
Feel free to share this with your team. Cost optimization benefits everyone.