DEV Community

How I Found $1,240/Month in Wasted LLM API Costs (And Built a Tool to Find Yours)

Abid Ali on April 05, 2026

I was spending about $2,000/month on OpenAI and Anthropic APIs across a few projects. I knew some of it was wasteful. I just couldn't prove it. Th...

Read full post

Valentin Monteiro • Apr 7

The context bloat part is the scariest one honestly. I had the same issue on a conversational agent, input tokens silently ballooning with every message. Didn't catch it until I actually profiled it, and the fix was stupidly simple (sliding window + summarization).

Abid Ali • Apr 7

The silent part is what gets you — it's not like there's an error or a warning, the thing just keeps working while the bill quietly climbs. Sliding window makes sense for most cases but I'm curious about the summarization piece — did you roll your own or use something off the shelf? I've been thinking about whether summarization is worth the extra call cost vs just truncating, especially for agents where older context might still matter.

Valentin Monteiro • Apr 7

Exactly, the silent bleed is the worst part. For summarization vs truncation: I started with a basic LLM-generated summary (just a cheap model call on older context), and honestly the extra cost is negligible compared to what you save by not sending 15k tokens of raw history every call. For agents specifically, truncation is a trap. You lose the "why" behind earlier decisions and the agent starts looping or contradicting itself.

BTW I checked out your work in more detail after this. Ended up opening a PR. Might start using the tool for real, the approach makes sense for what I'm dealing with. 😊

Abid Ali • Apr 8

The "why" framing is exactly right — truncation feels safe until the agent starts making decisions that contradict something it agreed to 10 messages ago and you're left debugging behavior that makes no sense in isolation. Going to rethink the context handling recommendation in the docs based on this.

And the PR genuinely made my day — just saw it. First external contribution, and from someone who actually ran into the problem in production. That's the feedback loop I was hoping for. Will review it properly tonight and get back to you. Really appreciate it.

Abid Ali • Apr 8

Quick update — just merged your PR into main. Latency reporting with p50/p95/p99 is now part of the tool. Really appreciate it, first external contribution and it was a proper feature not just a typo fix.

Valentin Monteiro • Apr 8

Pleasure mate 🔥

Jonathan Murray • Apr 6

Caching is criminally underused for LLM calls. So many teams are re-sending identical or near-identical prompts and paying for it every time. The other big one is context window bloat - stuffing way more into the prompt than necessary because it feels safer. At $2k/month the gains from optimizing are real. Is your tool available publicly or still internal?

Abid Ali • Apr 7

Totally agree on both — caching especially feels like something everyone knows they should do but never prioritizes until the bill hurts. The near-duplicate problem is sneaky too, exact duplicates are easy to cache but prompts that are 95% the same with a different user name or timestamp still hit the API fresh every time.

Yeah it's public, just pushed it last week — pip install llm-spend-profiler, repo at github.com/BuildWithAbid/llm-cost-profiler. Still early but it detects the main patterns: duplicate calls, retry waste, context bloat, and model downgrade opportunities. Would love to know what it finds on your codebase if you try it.

Harjot Singh • May 31

$1,240/month in waste is a staggering number and totally believable - it's almost always the same culprits: a premium model doing mechanical work, the full context re-sent every call, and retries nobody's counting. The fact that you had to build a tool to even see it is the real indictment of how opaque this spend is by default.

This is the exact problem space I work in - Moonshift (prompt to a shipped SaaS on your own GitHub+Vercel) is built around routing each phase to the right-sized model so the boring 80% never burns frontier tokens, landing a full build ~$3 flat. Your tool is the diagnosis; per-task routing is the structural cure. The natural next feature: don't just flag waste, suggest "this call class could drop to a cheaper model." What was the single biggest waste bucket you found? (Moonshift's first run is free, no card, if you want to see routing applied end-to-end.)

Socials Megallm • Apr 6

tracking waste is useful, but not every expensive call is actually a leak some teams intentionally pay for over-provisioning in staging to catch quality drops early i would rather optimize model routing first than just flagging high usage as a problem to fix

Abid Ali • Apr 7

That's a fair distinction. The tool doesn't flag high usage as a problem — it flags specific patterns that are almost always waste regardless of intent: exact duplicate calls with no caching, retry loops from parse failures, classifiers outputting one of five fixed labels on a frontier model. Those aren't over-provisioning decisions, they're usually just blind spots.

The model routing angle is interesting though — that's genuinely a different problem. Right now it detects obvious downgrade candidates based on output patterns but it's not doing smart routing. That's probably the next useful thing to build on top of this kind of usage data. Are you doing routing manually or with something like LiteLLM?