DEV Community

Abid Ali

How I Found $1,240/Month in Wasted LLM API Costs (And Built a Tool to Find Yours)

Common leaks like context bloat and missing caching

I was spending close to $2,800/month on OpenAI and Anthropic APIs across a few projects.

I knew some of it was wasteful. I just couldn't prove it. The provider dashboards show you one number — your total bill. That's like getting an electricity bill with no breakdown. Is it the AC? The lights? The server room? No idea.

So I built a tool to find out. What it discovered was honestly embarrassing.

What I found

34% of my summarizer calls were retries. The prompt asked for JSON, but the model kept wrapping it in markdown code blocks. My parser rejected it. The retry loop ran the same call again. And again. Each retry cost money. Total waste: about $140/month — from a six-word fix I could have made months ago.
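The prompt-side fix is to ask for raw JSON explicitly (on OpenAI, `response_format={"type": "json_object"}` enforces it), but a tolerant parser on the receiving end also kills the retry loop. A minimal sketch of that idea, not code from the tool itself:

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse JSON from a model reply, tolerating markdown code fences.

    Stripping the fence before giving up avoids paying for a
    full-price retry when the model wraps valid JSON in ```json ... ```.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Remove a leading ```json (or bare ```) fence and a trailing ```
        stripped = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
        return json.loads(stripped)
```

Either approach alone would have eliminated that $140/month.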

85% of my classifier calls were duplicates. Same input, same output, full price every time. No caching. 723 of 847 weekly calls were completely redundant. A simple cache would have saved $310/month.
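An exact-duplicate cache needs nothing more than a dict keyed on a hash of the request. This is a generic sketch to show the idea, not the profiler's implementation; `call_fn` stands in for whatever function actually hits the API:

```python
import hashlib
import json

_cache: dict[str, str] = {}

def cached_call(call_fn, model: str, prompt: str) -> str:
    """Memoize exact-duplicate calls: the same (model, prompt) pair
    never hits the API twice."""
    key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_fn(model, prompt)
    return _cache[key]
```

Real code also wants a TTL and an eviction policy, which is what the tool's `@cache` decorator (shown below in the features section) handles for you.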

My classifier was using GPT-4o for a yes/no task. The output was always under 10 tokens — one of five fixed labels. GPT-4o-mini produces identical results at a fraction of the cost. Savings: $71/month.

My chatbot was stuffing the entire conversation history into every call. By message 20, the input was 3,200 tokens and growing. Only the last few messages mattered. Truncating to the last 5 saves $155/month.
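Truncation itself is nearly a one-liner if you keep the system prompt pinned. A sketch assuming OpenAI-style message dicts with a `role` key:

```python
def truncate_history(messages: list[dict], keep_last: int = 5) -> list[dict]:
    """Keep the system prompt (if any) plus the last `keep_last` messages.

    Caps input tokens so per-call cost stays flat instead of growing
    with every turn of the conversation.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]
```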

Total: $1,240/month in waste out of a $2,847 monthly spend. That's 43.5%.

The tool: LLM Cost Profiler

I packaged all of this into an open-source Python CLI. Here's how it works.

Step 1: Install

pip install llm-spend-profiler

Step 2: Wrap your client (2 lines of code)

from llm_cost_profiler import wrap
from openai import OpenAI

client = wrap(OpenAI())

That's it. Your code works exactly as before. Every API call is now silently logged to a local SQLite database. If logging fails for any reason, it fails silently — your app is never affected.
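The fail-silent principle is worth copying in any instrumentation code: wrap the entire write in a try/except so a locked database or a missing directory never propagates to the caller. A simplified sketch of that pattern (the schema here is made up for illustration, not the profiler's actual table):

```python
import sqlite3

def log_call_safely(db_path: str, model: str, tokens: int, cost: float) -> None:
    """Record one API call; swallow any logging error so the app never breaks."""
    try:
        con = sqlite3.connect(db_path)
        con.execute(
            "CREATE TABLE IF NOT EXISTS calls (model TEXT, tokens INT, cost REAL)"
        )
        con.execute("INSERT INTO calls VALUES (?, ?, ?)", (model, tokens, cost))
        con.commit()
        con.close()
    except Exception:
        pass  # instrumentation must never take down the caller's app
```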

Works with Anthropic too:

from anthropic import Anthropic
client = wrap(Anthropic())

Step 3: See where your money goes

$ llmcost report
LLM Cost Report — Last 7 Days
========================================
Total: $847.32 | 2.4M tokens | 12,847 calls

By Feature:
  summarizer         $412.80  (48.7%)  ████████████████████
  chatbot            $203.11  (24.0%)  ████████████
  classifier          $89.40  (10.5%)  █████
  content_gen         $78.22   (9.2%)  ████
  extraction          $41.50   (4.9%)  ██
  untagged            $22.29   (2.6%)  █

Warnings:
  ⚠ summarizer: 34% of calls are retries ($140.15 wasted)
  ⚠ chatbot: avg 3,200 input tokens but only 180 output tokens (context bloat)
  ⚠ classifier: using gpt-4o but output is always <10 tokens (cheaper model works)

Step 4: Find the waste

$ llmcost optimize
LLM Cost Optimization Report
========================================
Current monthly spend (projected): $2,847
Potential savings found: $1,240/month (43.5%)

  #1 CACHE — classifier.py:34                        [SAVE $310/mo]
     85% of calls are exact duplicates (723 of 847/week)
     → Add @cache decorator
     Confidence: HIGH

  #2 RETRY FIX — content_gen.py:112                   [SAVE $180/mo]
     28% retry rate from JSON parse errors
     → Fix prompt to return raw JSON
     Confidence: HIGH

  #3 MODEL DOWNGRADE — classifier.py:34               [SAVE $71/mo]
     Output is always <10 tokens, one of 5 fixed labels
     → Switch gpt-4o to gpt-4o-mini
     Confidence: MEDIUM

  #4 CONTEXT BLOAT — chatbot.py:123                   [SAVE $155/mo]
     Avg 3,200 input tokens, growing over conversation
     → Truncate history to last 5 messages
     Confidence: MEDIUM

Each recommendation includes the exact file and line number, estimated monthly savings, and a confidence level.

Other features worth knowing about

llmcost hotspots — ranks your code locations by cost. Auto-detected from the Python call stack, no manual annotation needed:

Top Cost Hotspots:
  1. features/summarizer.py:47   summarize_doc()    $412.80/week   4,201 calls
  2. api/chat.py:123             handle_message()   $203.11/week   3,892 calls
  3. pipeline/classify.py:34     classify_text()     $89.40/week   2,847 calls

llmcost compare — week-over-week comparison to catch sudden spikes.

llmcost dashboard — opens a local web dashboard at localhost:8177 with treemap charts, cost timelines, and an optimization waterfall. Single HTML file, no npm, no build step.

Tagging — group costs by feature, customer, or environment:

from llm_cost_profiler import tag

with tag(feature="summarizer", customer="acme_corp"):
    response = client.chat.completions.create(...)

Caching decorator — stop paying for duplicate calls:

from llm_cost_profiler import cache

@cache(ttl=3600)
def classify_text(text):
    return client.chat.completions.create(...)

How it works under the hood

  • Wrapper: Transparent proxy pattern — intercepts method calls without monkey-patching.
  • Storage: SQLite with WAL mode at ~/.llmcost/data.db. Thread-safe.
  • Pricing: Built-in lookup table for OpenAI and Anthropic models.
  • Call site detection: Walks the Python call stack to auto-detect which function triggered each call.
  • Zero dependencies: Only uses the Python standard library.
  • Privacy: Everything stays local. Nothing is sent anywhere.
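The call-site detection is the neat trick in that list. The general technique is to walk `inspect.stack()` upward until you leave the library's own directory; the first frame outside it is the user code that triggered the call. A minimal illustration of that idea (`skip_dir` is a stand-in for the library's install path; this is not the tool's actual code):

```python
import inspect
import os

def detect_call_site(skip_dir: str) -> str:
    """Walk up the stack and return 'file:line function' for the first
    frame outside `skip_dir`, i.e. the user code that made the call."""
    for frame_info in inspect.stack()[1:]:  # [0] is this function itself
        if not frame_info.filename.startswith(skip_dir):
            return (f"{os.path.basename(frame_info.filename)}:"
                    f"{frame_info.lineno} {frame_info.function}")
    return "unknown"
```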

Try it on your codebase

If you're making LLM API calls in any project, I'm genuinely curious what it finds. In my experience, every codebase has at least one surprise — usually duplicate calls that nobody knew about.

GitHub: https://github.com/BuildWithAbid/llm-cost-profiler
Install: pip install llm-spend-profiler
License: MIT

If you find issues or have ideas for what else it should detect, open an issue or drop a comment here. This is my first open-source project and I'd love feedback.

Top comments (10)

Valentin Monteiro

The context bloat part is the scariest one honestly. I had the same issue on a conversational agent, input tokens silently ballooning with every message. Didn't catch it until I actually profiled it, and the fix was stupidly simple (sliding window + summarization).

Abid Ali

The silent part is what gets you — it's not like there's an error or a warning, the thing just keeps working while the bill quietly climbs. Sliding window makes sense for most cases but I'm curious about the summarization piece — did you roll your own or use something off the shelf? I've been thinking about whether summarization is worth the extra call cost vs just truncating, especially for agents where older context might still matter.

Valentin Monteiro

Exactly, the silent bleed is the worst part. For summarization vs truncation: I started with a basic LLM-generated summary (just a cheap model call on older context), and honestly the extra cost is negligible compared to what you save by not sending 15k tokens of raw history every call. For agents specifically, truncation is a trap. You lose the "why" behind earlier decisions and the agent starts looping or contradicting itself.

BTW I checked out your work in more detail after this. Ended up opening a PR. Might start using the tool for real, the approach makes sense for what I'm dealing with. 😊

Abid Ali

The "why" framing is exactly right — truncation feels safe until the agent starts making decisions that contradict something it agreed to 10 messages ago and you're left debugging behavior that makes no sense in isolation. Going to rethink the context handling recommendation in the docs based on this.

And the PR genuinely made my day — just saw it. First external contribution, and from someone who actually ran into the problem in production. That's the feedback loop I was hoping for. Will review it properly tonight and get back to you. Really appreciate it.

Abid Ali

Quick update — just merged your PR into main. Latency reporting with p50/p95/p99 is now part of the tool. Really appreciate it, first external contribution and it was a proper feature not just a typo fix.

Valentin Monteiro

Pleasure mate 🔥

Jonathan Murray

Caching is criminally underused for LLM calls. So many teams are re-sending identical or near-identical prompts and paying for it every time. The other big one is context window bloat - stuffing way more into the prompt than necessary because it feels safer. At $2k/month the gains from optimizing are real. Is your tool available publicly or still internal?

Abid Ali

Totally agree on both — caching especially feels like something everyone knows they should do but never prioritizes until the bill hurts. The near-duplicate problem is sneaky too, exact duplicates are easy to cache but prompts that are 95% the same with a different user name or timestamp still hit the API fresh every time.

Yeah it's public, just pushed it last week — pip install llm-spend-profiler, repo at github.com/BuildWithAbid/llm-cost-profiler. Still early but it detects the main patterns: duplicate calls, retry waste, context bloat, and model downgrade opportunities. Would love to know what it finds on your codebase if you try it.

Socials Megallm

Tracking waste is useful, but not every expensive call is actually a leak. Some teams intentionally pay for over-provisioning in staging to catch quality drops early. I would rather optimize model routing first than just flag high usage as a problem to fix.

Abid Ali

That's a fair distinction. The tool doesn't flag high usage as a problem — it flags specific patterns that are almost always waste regardless of intent: exact duplicate calls with no caching, retry loops from parse failures, classifiers outputting one of five fixed labels on a frontier model. Those aren't over-provisioning decisions, they're usually just blind spots.

The model routing angle is interesting though — that's genuinely a different problem. Right now it detects obvious downgrade candidates based on output patterns but it's not doing smart routing. That's probably the next useful thing to build on top of this kind of usage data. Are you doing routing manually or with something like LiteLLM?