Originally shared on LinkedIn where it reached 10,000+ professionals. Expanded with technical details and actionable strategies here.
A client was burning through ₹85,000/month on AI API calls.
In one weekend, we cut that to under ₹12,000.
I'm not exaggerating. This is real money saved by a real company that had no idea they were hemorrhaging cash on AI infrastructure.
The worst part? They thought that's just "the cost of doing AI."
It's not.
If you're using AI APIs in production—Claude, GPT-4, Gemini, whatever—there's a 90% chance you're overspending. This article walks through exactly how we diagnosed the problem and fixed it.
The Problem: Why Your AI Bill Looks Like a Mortgage Payment
Before the audit, here's what the client was doing:
Using Opus (₹15/1M tokens) for simple tasks — asking GPT-4-level models to classify customer emails or generate summaries when Haiku (₹0.80/1M tokens) would work fine.
Zero prompt caching — sending the same 50KB system prompt with every API call. That's like shipping the same dictionary with every word lookup.
No batch processing — every request hit the API in real-time, even for non-urgent tasks that could run at night at 50% discount.
Redundant API calls — the app was sending the same input to the API multiple times within seconds because there was no response caching. Imagine asking your friend the same question twice instead of remembering the answer.
Token bloat — prompts weren't optimized. XML wrappers, unnecessary examples, verbose instructions—all padding the token count.
The client thought they needed the "best" model for everything. In reality, they needed the right model for each task.
The Audit: Finding Your Money Leaks
Here's the systematic process we followed:
Step 1: Collect Everything
We enabled detailed logging on all API calls:
```python
import json
from datetime import datetime

def log_api_call(model, input_tokens, output_tokens, task_type, prompt_hash):
    log_entry = {
        "timestamp": datetime.now().isoformat(),
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "task_type": task_type,
        "prompt_hash": prompt_hash,  # To detect redundancy
        # calculate_cost is a small helper that applies your per-model
        # rates -- see the Cost Calculators section below
        "total_cost_usd": calculate_cost(model, input_tokens, output_tokens)
    }
    with open("api_audit.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")
```
One week of logs told us everything.
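Once the log exists, a short script surfaces where the money actually goes. Here's a minimal sketch that aggregates the `api_audit.jsonl` file from Step 1 by task type (the field names match the logger above):

```python
import json
from collections import defaultdict

def summarize_audit_log(path="api_audit.jsonl"):
    """Aggregate calls, tokens, and cost per task type from the audit log."""
    totals = defaultdict(lambda: {"calls": 0, "cost": 0.0, "tokens": 0})
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            bucket = totals[entry["task_type"]]
            bucket["calls"] += 1
            bucket["cost"] += entry["total_cost_usd"]
            bucket["tokens"] += entry["input_tokens"] + entry["output_tokens"]
    # Sort by cost descending so the biggest leaks surface first
    return sorted(totals.items(), key=lambda kv: -kv[1]["cost"])
```

Run it after a week of logging and the top two or three rows are almost always where your optimization effort should start.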
Step 2: Categorize by Task Type
We grouped calls into buckets:
- Email classification — 12,000 calls/week (should be Haiku)
- Content summarization — 4,000 calls/week (Sonnet max)
- Complex research synthesis — 800 calls/week (Opus only)
- Code review — 2,200 calls/week (Sonnet)
- Prompt engineering iteration — 3,500 calls/week (varies, but lots of waste here)
Step 3: Match Model to Complexity
This is where the magic happens. Not every task needs Opus.
| Task Type | Current Model | Optimal Model | Annual Savings |
|---|---|---|---|
| Email classification | GPT-4 (₹12/1M in) | Claude Haiku (₹0.80/1M in) | ₹42,000+ |
| Customer support QA | GPT-4 | Claude Sonnet (₹3/1M in) | ₹28,000+ |
| Content summarization | GPT-4 Turbo | Claude Sonnet | ₹18,500+ |
| Complex research | GPT-4 Turbo | Claude Opus (₹15/1M) | ₹0 (necessary) |
| Prompt iteration | GPT-4 | Claude Haiku | ₹8,500+ |
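The table above can be codified as a small routing function so the model choice is enforced in code, not left to each call site. A sketch, using the model IDs that appear elsewhere in this article — the Sonnet ID is a placeholder, so substitute the current one from your provider's docs:

```python
# Cheapest-capable-model routing, based on the audit's task buckets
TASK_TIERS = {
    "email_classification": "haiku",
    "customer_support_qa": "sonnet",
    "content_summarization": "sonnet",
    "complex_research": "opus",
    "prompt_iteration": "haiku",
}

MODEL_IDS = {
    "haiku": "claude-haiku-4-5-20251001",
    "sonnet": "claude-sonnet-placeholder",  # placeholder -- use the real ID
    "opus": "claude-opus-4-1-20250805",
}

def pick_model(task_type: str) -> str:
    # Unknown tasks fall back to the mid tier, not the most expensive one
    tier = TASK_TIERS.get(task_type, "sonnet")
    return MODEL_IDS[tier]
```

Centralizing the mapping also means a future price change is a one-line edit instead of a codebase-wide hunt.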
Step 4: Spot Redundancy
Using the prompt hash from our logs, we found:
- Same system prompt sent 47,000 times in one month. With caching, that would be 1 full call + 46,999 cached calls (90% cost reduction on those).
- Same user inputs processed multiple times within seconds. A simple `(input_hash, output)` cache would eliminate this.
- Retry loops without exponential backoff — failed calls being retried immediately, multiplying tokens wasted.
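The retry problem in particular is cheap to fix. A minimal backoff wrapper (the delays and retry count here are illustrative defaults, not tuned values):

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry a flaky API call with exponential backoff and jitter.

    Immediate retries multiply wasted tokens; backing off gives rate
    limits and transient errors time to clear before you pay again.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries -- surface the error
            # 1s, 2s, 4s, 8s ... plus jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

In production you'd narrow the `except` to the retryable error types your SDK raises (rate limits, timeouts) rather than catching everything.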
The 5 Optimization Strategies That Worked
1. Model Matching (30% savings)
Simple rule: Use the simplest model that works for your task.
- Haiku — Classification, formatting, extraction, simple routing (₹0.80/1M input)
- Sonnet — Summarization, content generation, moderate reasoning (₹3/1M input)
- Opus — Complex multi-step reasoning, code review, research synthesis (₹15/1M input)
We moved 60% of the client's workload from Opus/GPT-4 to Haiku/Sonnet.
Testing: Don't guess. Run a small labeled batch through each candidate model and measure quality and latency. On simple tasks, Haiku typically trades a roughly 10% quality drop for a 50-90% cost reduction.
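That spot-check can be a few lines of code. A sketch of the harness — `classify_fn` is a stand-in for whatever function wraps your API call, and `samples` is a small hand-labeled set:

```python
def compare_models(classify_fn, samples, models):
    """Run a labeled sample through each candidate model and report accuracy,
    so the downgrade decision is measured rather than guessed.

    classify_fn(model, text) -> predicted label
    samples: list of (text, expected_label) pairs
    """
    report = {}
    for model in models:
        correct = sum(
            1 for text, expected in samples
            if classify_fn(model, text) == expected
        )
        report[model] = correct / len(samples)
    return report
```

Even 50-100 labeled examples is usually enough to see whether the cheap model clears your quality bar.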
2. Prompt Caching (45% savings)
Claude's prompt caching feature is a game-changer. It works like this:
- First call: the full prompt is sent and the static prefix is written to the cache (billed at the normal input rate, ₹15 per 1M tokens here, plus a small cache-write premium).
- Subsequent calls while the cache is warm: the cached prefix is billed at roughly 10% of the base input rate (about ₹1.50 per 1M cached tokens); only the new query part pays full price.
```python
import anthropic

client = anthropic.Anthropic()

# Static system prompt (gets cached)
SYSTEM_PROMPT = """You are a customer support email classifier.
Categories: billing, technical_support, feature_request, complaint, other.
Respond with JSON: {"category": "...", "confidence": 0.95, "reasoning": "..."}"""

# This prompt gets cached on first call
response = client.messages.create(
    model="claude-opus-4-1-20250805",
    max_tokens=200,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # Cache for 5 mins
        }
    ],
    messages=[
        {"role": "user", "content": "Can you help me reset my password?"}
    ]
)

print(response.usage)
# usage.cache_read_input_tokens > 0 means caching worked
```
For the client:
- ~50K-token system prompt × 30,000 calls/month = 1.5B cached tokens
- Cached tokens cost 90% less
- Monthly savings: ₹18,000+
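The arithmetic behind that estimate is simple enough to sanity-check yourself. A sketch using the article's ₹15/1M base rate and a ~90% discount on cache reads (this ignores the small cache-write premium, so treat the result as an upper-bound estimate):

```python
def caching_savings(prompt_tokens, calls_per_month, base_rate_per_m, cache_discount=0.90):
    """Estimate monthly savings from caching a static prompt prefix.

    Without caching, every call pays full price for the static prompt;
    with caching, repeat calls pay roughly (1 - cache_discount) of it.
    """
    full_cost = prompt_tokens * calls_per_month / 1_000_000 * base_rate_per_m
    cached_cost = full_cost * (1 - cache_discount)
    return full_cost - cached_cost

# ~50K-token prompt, 30,000 calls/month, ₹15 per 1M tokens
print(caching_savings(50_000, 30_000, 15))
```

Plugging in the client's numbers gives a ceiling of roughly ₹20K/month, consistent with the ₹18,000+ observed after the write premium and cache misses.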
3. Batch Processing (35% savings for async tasks)
Not every API call needs to be real-time. Use the batch API for non-urgent tasks:
- Normal API: ₹15/1M input tokens
- Batch API: ₹7.50/1M input tokens (50% discount, results within 24 hours)
```python
import anthropic
import json

client = anthropic.Anthropic()

# Prepare batch requests
batch_requests = []
for email in emails_to_classify:  # 10,000 emails
    batch_requests.append({
        "custom_id": f"email-{email['id']}",
        "params": {
            "model": "claude-opus-4-1-20250805",
            "max_tokens": 200,
            "system": "Classify this email...",
            "messages": [
                {"role": "user", "content": email['text']}
            ]
        }
    })

# Submit batch
batch = client.beta.messages.batches.create(
    requests=batch_requests
)
print(f"Batch ID: {batch.id}")
# Process results later
For the client's non-urgent tasks (nightly reports, weekly summaries), batch processing cut costs in half.
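To close the loop, here's a sketch of collecting the results once the batch finishes. The attribute names (`processing_status`, `custom_id`, `result.type`, `result.message`) follow the shape of the batches API used above, but verify them against the current SDK docs:

```python
import time

def wait_and_collect(client, batch_id, poll_seconds=60):
    """Poll until the batch ends, then return {custom_id: response_text}."""
    while True:
        batch = client.beta.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        time.sleep(poll_seconds)
    return collect_results(client.beta.messages.batches.results(batch_id))

def collect_results(results):
    """Keep only succeeded requests; errored/expired ones get retried separately."""
    out = {}
    for r in results:
        if r.result.type == "succeeded":
            out[r.custom_id] = r.result.message.content[0].text
    return out
```

The `custom_id` you set at submit time is what lets you match each result back to the original email.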
4. Response Caching (20% savings)
Don't ask the API the same question twice:
```python
import hashlib

import anthropic

client = anthropic.Anthropic()
cache = {}  # Or Redis for production

def get_classification(text, model):
    # Create deterministic hash of input
    input_hash = hashlib.sha256(text.encode()).hexdigest()

    # Check cache first
    if input_hash in cache:
        return cache[input_hash]

    # Call API (max_tokens is required by the Messages API)
    response = client.messages.create(
        model=model,
        max_tokens=200,
        messages=[{"role": "user", "content": text}]
    )
    result = response.content[0].text
    cache[input_hash] = result
    return result

# First call: hits API
result1 = get_classification("Can I reset my password?", "claude-haiku-4-5-20251001")

# Second call: same input, returns from cache instantly
result2 = get_classification("Can I reset my password?", "claude-haiku-4-5-20251001")
```
The client had duplicate email classifications happening within seconds. A simple cache (Redis, in-memory, doesn't matter) eliminated ~20% of redundant calls.
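Two refinements worth adding before production: key the cache on the model as well as the text (the same input can yield different answers on different models), and give entries an expiry so answers don't go stale forever. A self-contained in-memory sketch of that pattern — the same idea as a Redis `SETEX`:

```python
import hashlib
import time

class TTLCache:
    """Response cache with expiry, keyed on (model, input text)."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}

    def _key(self, text, model):
        # Include the model in the key so a Haiku answer is never
        # served where an Opus answer was expected
        return hashlib.sha256(f"{model}:{text}".encode()).hexdigest()

    def get(self, text, model):
        entry = self._store.get(self._key(text, model))
        if entry is None:
            return None
        value, expires = entry
        if time.time() > expires:
            del self._store[self._key(text, model)]  # evict stale entry
            return None
        return value

    def set(self, text, model, value):
        self._store[self._key(text, model)] = (value, time.time() + self.ttl)
```

Swap the dict for Redis when you need the cache shared across processes or servers.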
5. Token Optimization (15% savings)
Smaller prompts = fewer tokens = lower costs.
Before:

```text
You are an expert customer service AI. Your role is to take customer emails
and classify them into one of several predefined categories. Please read
the email carefully and determine which category it belongs to.

Categories:
1. Billing & Payments
2. Technical Support & Bugs
3. Feature Requests
4. Complaints & Escalations
5. Other

Please respond in JSON format with the following structure:
{
  "category": "...",
  "confidence": 0.0-1.0,
  "reasoning": "..."
}

Here's the email:
```

After:

```text
Classify the email into: billing, technical, feature_request, complaint, other.
Response: {"category": "...", "confidence": 0.0-1.0}
Email:
```
Same output, 60% fewer tokens.
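You can quantify a trim like this before shipping it. The heuristic below (~4 characters per token for English) is rough — use your provider's tokenizer or token-counting endpoint for exact numbers — but it's plenty for comparing a before/after rewrite:

```python
def rough_token_count(text):
    """Very rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def trim_saving(before, after):
    """Percentage of (estimated) tokens removed by a prompt rewrite."""
    b, a = rough_token_count(before), rough_token_count(after)
    return round(100 * (b - a) / b)
```

Multiply the percentage by your monthly input-token spend on that prompt to see whether the rewrite is worth the regression-testing effort.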
The Results: Before & After
| Metric | Before | After | Savings |
|---|---|---|---|
| Monthly API cost | ₹85,000 | ₹12,000 | ₹73,000 (86%) |
| Avg tokens/call | 2,840 | 1,100 | 61% reduction |
| API calls/month | 340,000 | 280,000 | 18% fewer (redundancy removed) |
| Latency (p95) | 850ms | 420ms | 50% faster (cache hits) |
| System accuracy | 94% | 94.2% | +0.2% (better model matching) |
The real win? The client now has better performance AND lower costs. That's the point of optimization—efficiency compounds.
Your Turn: The Checklist
You can apply this to any AI project. Here's what to do Monday morning:
Week 1: Audit
- [ ] Enable detailed logging on all API calls (model, tokens, task type)
- [ ] Collect one week of data
- [ ] Categorize tasks by complexity
- [ ] Calculate per-task costs
Week 2: Low-Hanging Fruit
- [ ] Downgrade non-critical tasks to cheaper models (test quality first)
- [ ] Implement response caching if you have duplicate inputs
- [ ] Optimize prompts (remove verbosity, examples, unnecessary structure)
Week 3: Advanced Optimizations
- [ ] Set up prompt caching for static system prompts
- [ ] Migrate non-urgent tasks to batch processing
- [ ] Implement request deduplication
Week 4: Measure & Iterate
- [ ] Compare new costs vs. baseline
- [ ] Monitor quality metrics (didn't go down, right?)
- [ ] Automate the audit—keep logging, check monthly
Tools & Resources
Logging & Analytics:
- Python `anthropic` SDK (built-in logging)
- langsmith.com (trace API calls across frameworks)
- Datadog / Elastic (if you're at scale)

Caching:
- Redis (production caching)
- Python `functools.lru_cache` (for testing)
- Claude prompt caching (native, no setup)

Batch Processing:
- Claude Batch API (native)
- OpenAI Batch API

Cost Calculators:
- anthropic.com/pricing (Claude)
- openai.com/pricing (GPT models)
- Build your own: `(input_tokens × input_rate + output_tokens × output_rate) / 1_000_000`
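That formula is a few lines of Python. The input rates below are the ones quoted in this article; the output rates are illustrative placeholders — pull current numbers from the providers' pricing pages before trusting the result:

```python
# Per-million-token rates as (input_rate, output_rate).
# Input rates match this article; output rates are placeholders.
RATES = {
    "haiku": (0.80, 4.00),
    "sonnet": (3.00, 15.00),
    "opus": (15.00, 75.00),
}

def call_cost(model, input_tokens, output_tokens):
    """(input_tokens * input_rate + output_tokens * output_rate) / 1M."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```

This is also a drop-in candidate for the `calculate_cost` helper the audit logger in Step 1 assumes.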
The Bigger Picture
This client went from thinking "AI is expensive" to building a sustainable, cost-efficient AI infrastructure.
The same principles apply whether you're:
- Building an AI chatbot for customer support
- Automating content generation at scale
- Running batch analysis on document collections
- Fine-tuning models for specific tasks
The key insight: every rupee you save on API costs is a rupee that can go toward better infrastructure, faster iteration, or hiring more engineers.
Start with the audit. Everything else follows.
About the Author
I'm Archit Mittal (@automate-archit), an automation engineer helping companies save money and time with AI workflows. I've helped 30+ clients optimize their AI spending, cut API costs by 50-85%, and build sustainable automation architectures.
If you're dealing with similar cost problems or want to discuss AI automation strategies, let's connect:
- LinkedIn: @automate-archit
- Twitter/X: @automate_archit
- 1-on-1 consultations: topmate.io/automate_archit
Feel free to share this with your team. Cost optimization benefits everyone.