Archit Mittal

How I Saved a Client ₹85,000/Month on AI API Costs — A Practical Breakdown

Originally shared on LinkedIn where it reached 10,000+ professionals. Expanded with technical details and actionable strategies here.


A client was burning through ₹85,000/month on AI API calls.

In one weekend, we cut that to under ₹12,000.

I'm not exaggerating. This is real money saved by a real company that had no idea they were hemorrhaging cash on AI infrastructure.

The worst part? They thought that's just "the cost of doing AI."

It's not.

If you're using AI APIs in production—Claude, GPT-4, Gemini, whatever—there's a 90% chance you're overspending. This article walks through exactly how we diagnosed the problem and fixed it.


The Problem: Why Your AI Bill Looks Like a Mortgage Payment

Before the audit, here's what the client was doing:

  1. Using Opus (₹15/1M tokens) for simple tasks — asking GPT-4-level models to classify customer emails or generate summaries when Haiku (₹0.80/1M tokens) would work fine.

  2. Zero prompt caching — sending the same 50KB system prompt with every API call. That's like shipping the same dictionary with every word lookup.

  3. No batch processing — every request hit the API in real-time, even for non-urgent tasks that could run at night at 50% discount.

  4. Redundant API calls — the app was calling the same input multiple times within seconds because there was no response caching. Imagine asking your friend the same question twice instead of remembering the answer.

  5. Token bloat — prompts weren't optimized. XML wrappers, unnecessary examples, verbose instructions—all padding the token count.

The client thought they needed the "best" model for everything. In reality, they needed the right model for each task.


The Audit: Finding Your Money Leaks

Here's the systematic process we followed:

Step 1: Collect Everything

We enabled detailed logging on all API calls:

import json
from datetime import datetime

# Illustrative rates in ₹ per 1M tokens as (input, output); swap in your
# provider's current pricing table
RATES = {
    "claude-opus-4-1-20250805": (15.0, 75.0),
    "claude-haiku-4-5-20251001": (0.80, 4.0),
}

def calculate_cost(model, input_tokens, output_tokens):
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

def log_api_call(model, input_tokens, output_tokens, task_type, prompt_hash):
    log_entry = {
        "timestamp": datetime.now().isoformat(),
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "task_type": task_type,
        "prompt_hash": prompt_hash,  # To detect redundancy
        "total_cost_usd": calculate_cost(model, input_tokens, output_tokens)
    }
    with open("api_audit.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")

One week of logs told us everything.
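Once the JSONL log exists, a few lines of aggregation surface the biggest cost buckets. A minimal sketch, assuming the field names used in the logging snippet above:

```python
import json
from collections import defaultdict

def summarize_audit_log(path):
    """Aggregate call counts, tokens, and cost per (task_type, model) bucket."""
    totals = defaultdict(lambda: {"calls": 0, "tokens": 0, "cost": 0.0})
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            key = (entry["task_type"], entry["model"])
            totals[key]["calls"] += 1
            totals[key]["tokens"] += entry["input_tokens"] + entry["output_tokens"]
            totals[key]["cost"] += entry["total_cost_usd"]
    # Biggest money leaks first
    return sorted(totals.items(), key=lambda kv: kv[1]["cost"], reverse=True)
```

The top rows of this summary are where to start: a cheap task running on an expensive model at high volume is the classic leak.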

Step 2: Categorize by Task Type

We grouped calls into buckets:

  • Email classification — 12,000 calls/week (should be Haiku)
  • Content summarization — 4,000 calls/week (Sonnet max)
  • Complex research synthesis — 800 calls/week (Opus only)
  • Code review — 2,200 calls/week (Sonnet)
  • Prompt engineering iteration — 3,500 calls/week (varies, but lots of waste here)

Step 3: Match Model to Complexity

This is where the magic happens. Not every task needs Opus.

| Task Type | Current Model | Optimal Model | Annual Savings |
|---|---|---|---|
| Email classification | GPT-4 (₹12/1M in) | Claude Haiku (₹0.80/1M in) | ₹42,000+ |
| Customer support QA | GPT-4 | Claude Sonnet (₹3/1M in) | ₹28,000+ |
| Content summarization | GPT-4 Turbo | Claude Sonnet | ₹18,500+ |
| Complex research | GPT-4 Turbo | Claude Opus (₹15/1M in) | ₹0 (necessary) |
| Prompt iteration | GPT-4 | Claude Haiku | ₹8,500+ |

Step 4: Spot Redundancy

Using the prompt hash from our logs, we found:

  • Same system prompt sent 47,000 times in one month. With caching, that would be 1 full call + 46,999 cached calls (90% cost reduction on those).
  • Same user inputs processed multiple times (within seconds). A simple (input_hash, output) cache would eliminate this.
  • Retry loops without exponential backoff — failed calls being retried immediately, multiplying tokens wasted.
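The retry problem from that last bullet is worth fixing immediately, since every blind retry re-bills the full prompt. A minimal backoff sketch (the function and parameter names are illustrative, not from the client's codebase):

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry fn with exponential backoff plus jitter instead of hammering the API."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # Out of retries; surface the error
            # Double the wait each attempt, cap it, and add jitter
            # so many clients don't retry in lockstep
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

Wrap the API call site in this (or use your SDK's built-in retry settings) so transient failures stop multiplying token spend.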

The 5 Optimization Strategies That Worked

1. Model Matching (30% savings)

Simple rule: Use the simplest model that works for your task.

  • Haiku — Classification, formatting, extraction, simple routing (₹0.80/1M input)
  • Sonnet — Summarization, content generation, moderate reasoning (₹3/1M input)
  • Opus — Complex multi-step reasoning, code review, research synthesis (₹15/1M input)

We moved 60% of the client's workload from Opus/GPT-4 to Haiku/Sonnet.

Testing: Don't guess. Run a small batch through each model and measure latency and quality. In practice, dropping to Haiku typically costs around 10% in quality for a 50-90% cost reduction.
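The "simplest model that works" rule can be encoded as a small routing table so the choice is made in one place instead of scattered across call sites. A sketch; the task names are illustrative, and the model IDs follow the ones used elsewhere in this article (the Sonnet ID is my assumption):

```python
# Map each task type to the cheapest model that passed the quality test.
# Task names and model IDs are illustrative, not the client's actual config.
MODEL_FOR_TASK = {
    "email_classification": "claude-haiku-4-5-20251001",
    "summarization": "claude-sonnet-4-5-20250929",
    "code_review": "claude-sonnet-4-5-20250929",
    "research_synthesis": "claude-opus-4-1-20250805",
}

def pick_model(task_type, default="claude-sonnet-4-5-20250929"):
    """Return the cheapest adequate model; unknown tasks fall back to a mid-tier default."""
    return MODEL_FOR_TASK.get(task_type, default)
```

Centralizing this also makes later re-audits trivial: change one dict entry, re-run the quality batch, done.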

2. Prompt Caching (45% savings)

Claude's prompt caching feature is a game-changer. It works like this:

  • First call: the full prompt is processed and the static prefix is written to the cache. Cache writes carry a small premium (about 25% over the base input rate).
  • Subsequent calls within the cache window: you still send the full prompt, but the cached prefix is read at roughly a tenth of the base input rate (₹1.50/1M for Opus instead of ₹15/1M).
import anthropic

client = anthropic.Anthropic()

# Static system prompt (gets cached)
SYSTEM_PROMPT = """You are a customer support email classifier.
Categories: billing, technical_support, feature_request, complaint, other.
Respond with JSON: {"category": "...", "confidence": 0.95, "reasoning": "..."}"""

# The system block below is cached on the first call. Note: prompts shorter
# than the model's minimum cacheable length (~1024 tokens on Opus/Sonnet)
# are not cached; this short example only illustrates the API shape.
response = client.messages.create(
    model="claude-opus-4-1-20250805",
    max_tokens=200,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # Cache for 5 mins
        }
    ],
    messages=[
        {"role": "user", "content": "Can you help me reset my password?"}
    ]
)

print(response.usage)
# usage.cache_read_input_tokens > 0 means caching worked

For the client:

  • 50KB system prompt (~12,500 tokens at roughly 4 characters/token) × 30,000 calls/month ≈ 375M cached tokens
  • Cached tokens cost 90% less to read
  • Monthly savings: ₹18,000+

3. Batch Processing (35% savings for async tasks)

Not every API call needs to be real-time. Use the batch API for non-urgent tasks:

  • Normal API: ₹15/1M input tokens
  • Batch API: ₹7.50/1M input tokens (50% discount, results within 24 hours)
import anthropic
import json

client = anthropic.Anthropic()

# Prepare batch requests
batch_requests = []
for email in emails_to_classify:  # 10,000 emails
    batch_requests.append({
        "custom_id": f"email-{email['id']}",
        "params": {
            "model": "claude-haiku-4-5-20251001",  # Classification is a Haiku-tier task
            "max_tokens": 200,
            "system": "Classify this email...",
            "messages": [
                {"role": "user", "content": email['text']}
            ]
        }
    })

# Submit batch
batch = client.beta.messages.batches.create(
    requests=batch_requests
)

print(f"Batch ID: {batch.id}")
# Process results later

For the client's non-urgent tasks (nightly reports, weekly summaries), batch processing cut costs in half.
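One practical wrinkle: a single batch submission has size limits (Anthropic documents a cap on requests per batch; check the current docs for the exact numbers), so large jobs need to be split. A generic chunking helper:

```python
def chunk_requests(requests, max_per_batch=10_000):
    """Yield successive batch-sized slices of a large request list.

    max_per_batch here is a conservative illustrative value; set it from
    your provider's documented per-batch limits.
    """
    for i in range(0, len(requests), max_per_batch):
        yield requests[i:i + max_per_batch]
```

Each chunk then goes through its own `batches.create(...)` call, and the `custom_id` fields let you stitch results back together afterwards.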

4. Response Caching (20% savings)

Don't ask the API the same question twice:

import hashlib

import anthropic

client = anthropic.Anthropic()
cache = {}  # Or Redis for production

def get_classification(text, model):
    # Hash the model together with the input so different models
    # don't share cache entries
    input_hash = hashlib.sha256(f"{model}:{text}".encode()).hexdigest()

    # Check cache first
    if input_hash in cache:
        return cache[input_hash]

    # Call API
    response = client.messages.create(
        model=model,
        messages=[{"role": "user", "content": text}]
    )

    result = response.content[0].text
    cache[input_hash] = result
    return result

# First call: hits API
result1 = get_classification("Can I reset my password?", "claude-haiku-4-5-20251001")

# Second call: same input, returns from cache instantly
result2 = get_classification("Can I reset my password?", "claude-haiku-4-5-20251001")

The client had duplicate email classifications happening within seconds. A simple cache (Redis, in-memory, doesn't matter) eliminated ~20% of redundant calls.
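One caveat on the plain dict above: it grows without bound and never invalidates stale answers. Since the duplicates here arrived within seconds of each other, a cache with a short TTL is enough. A minimal in-memory sketch (in production, Redis with an EXPIRE on each key does the same job):

```python
import time

class TTLCache:
    """In-memory response cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        hit = self._store.get(key)
        if hit is None:
            return None
        value, expires_at = hit
        if time.monotonic() > expires_at:
            # Entry is stale; drop it and report a miss
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)
```

Swap the bare `cache = {}` for an instance of this and the duplicate-within-seconds calls disappear without the cache serving week-old answers.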

5. Token Optimization (15% savings)

Smaller prompts = fewer tokens = lower costs.

Before:

You are an expert customer service AI. Your role is to take customer emails
and classify them into one of several predefined categories. Please read
the email carefully and determine which category it belongs to.

Categories:
1. Billing & Payments
2. Technical Support & Bugs
3. Feature Requests
4. Complaints & Escalations
5. Other

Please respond in JSON format with the following structure:
{
  "category": "...",
  "confidence": 0.0-1.0,
  "reasoning": "..."
}

Here's the email:

After:

Classify the email into: billing, technical, feature_request, complaint, other.
Response: {"category": "...", "confidence": 0.0-1.0}

Email:

Near-identical classifications, roughly 60% fewer input tokens. Dropping the "reasoning" field also trims output tokens, which are the more expensive side of the bill.
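To verify that a rewrite like this actually shrinks the prompt, a rough estimate is fine during iteration (about 4 characters per token for English text; use your provider's token counter for billing-grade numbers). A hypothetical helper:

```python
def rough_token_count(text):
    """Very rough token estimate (~4 chars/token for English prose)."""
    return max(1, len(text) // 4)

def prompt_savings(before, after):
    """Percent reduction in estimated tokens between two prompt versions."""
    b, a = rough_token_count(before), rough_token_count(after)
    return round(100 * (b - a) / b, 1)
```

Run every prompt rewrite through a check like this and keep the wins; a few percent per prompt compounds across hundreds of thousands of calls.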


The Results: Before & After

| Metric | Before | After | Savings |
|---|---|---|---|
| Monthly API cost | ₹85,000 | ₹12,000 | ₹73,000 (86%) |
| Avg tokens/call | 2,840 | 1,100 | 61% reduction |
| API calls/month | 340,000 | 280,000 | 18% fewer (redundancy removed) |
| Latency (p95) | 850ms | 420ms | 50% faster (cache hits) |
| System accuracy | 94% | 94.2% | +0.2% (better model matching) |

The real win? The client now has better performance AND lower costs. That's the point of optimization—efficiency compounds.


Your Turn: The Checklist

You can apply this to any AI project. Here's what to do Monday morning:

Week 1: Audit

  • [ ] Enable detailed logging on all API calls (model, tokens, task type)
  • [ ] Collect one week of data
  • [ ] Categorize tasks by complexity
  • [ ] Calculate per-task costs

Week 2: Low-Hanging Fruit

  • [ ] Downgrade non-critical tasks to cheaper models (test quality first)
  • [ ] Implement response caching if you have duplicate inputs
  • [ ] Optimize prompts (remove verbosity, examples, unnecessary structure)

Week 3: Advanced Optimizations

  • [ ] Set up prompt caching for static system prompts
  • [ ] Migrate non-urgent tasks to batch processing
  • [ ] Implement request deduplication

Week 4: Measure & Iterate

  • [ ] Compare new costs vs. baseline
  • [ ] Monitor quality metrics (didn't go down, right?)
  • [ ] Automate the audit—keep logging, check monthly

Tools & Resources

Logging & Analytics:

  • Python anthropic SDK (built-in logging)
  • langsmith.com (trace API calls across frameworks)
  • Datadog / Elastic (if you're at scale)

Caching:

  • Redis (production caching)
  • LRU cache (Python functools.lru_cache for testing)
  • Claude prompt caching (native, no setup)

Batch Processing:

  • Claude Batch API (native)
  • OpenAI Batch API

Cost Calculators:

  • anthropic.com/pricing (Claude)
  • openai.com/pricing (GPT models)
  • Build your own: (input_tokens × input_rate + output_tokens × output_rate) / 1_000_000

The Bigger Picture

This client went from thinking "AI is expensive" to building a sustainable, cost-efficient AI infrastructure.

The same principles apply whether you're:

  • Building an AI chatbot for customer support
  • Automating content generation at scale
  • Running batch analysis on document collections
  • Fine-tuning models for specific tasks

The key insight: Every rupee you save on API costs is a rupee that can go toward better infrastructure, faster iteration, or hiring more engineers.

Start with the audit. Everything else follows.


About the Author

I'm Archit Mittal (@automate-archit), an automation engineer helping companies save money and time with AI workflows. I've helped 30+ clients optimize their AI spending, cut API costs by 50-85%, and build sustainable automation architectures.

If you're dealing with similar cost problems or want to discuss AI automation strategies, let's connect.

Feel free to share this with your team. Cost optimization benefits everyone.
