DEV Community

kavyarani7

We Were Paying 3.75x More Than Necessary on Every AI API Call — Here's How We Found It

Our Anthropic bill was higher than expected. We had five engineers shipping AI features fast and zero visibility into what each one actually cost. Nobody had reviewed our AI API usage since we started building. This is what we found when we finally looked.

The Bill Arrives

Every team building with LLMs hits this moment. The API bill lands and someone asks "why is this so high?" and nobody has a good answer because nobody was watching.

We weren't being reckless. We were just building. Adding AI features, iterating on prompts, shipping. The cost conversation always felt like something to have later.

Later arrived.

What We Found

We have a service called divergence-detector.js. Its job is to run nightly, find situations where ETF flow signals contradict their underlying sector signals, and generate a 2-sentence plain-English explanation for each divergence found.

Here's the relevant part:

const response = await client.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 150,
  messages: [{ role: 'user', content: prompt }]
});

claude-sonnet-4-6. max_tokens: 150.

Let that sit for a second.

We were using Anthropic's mid-tier reasoning model — priced at $15 per million output tokens — to generate outputs capped at 150 tokens. Two sentences. Every single night.

claude-haiku-4-5 costs $4 per million output tokens and handles 2-sentence structured explanations at identical quality. We were paying 3.75x more than necessary on every single call.
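The arithmetic behind that multiplier, as a small sketch. The two prices are the ones quoted in this post (worth checking against the current pricing page), and `callsPerNight` is a made-up volume for illustration, not our real number. It only looks at output tokens; input tokens are priced separately.

```javascript
// Output-token prices as quoted in this post (USD per million output tokens).
const PRICE_PER_MILLION_OUTPUT = {
  'claude-sonnet-4-6': 15,
  'claude-haiku-4-5': 4,
};

// Output cost of one call, assuming the model uses the full max_tokens cap.
function outputCost(model, outputTokens) {
  return (PRICE_PER_MILLION_OUTPUT[model] / 1_000_000) * outputTokens;
}

const sonnet = outputCost('claude-sonnet-4-6', 150);
const haiku = outputCost('claude-haiku-4-5', 150);

const ratio = sonnet / haiku;        // how many times more the Sonnet call costs
const savings = 1 - haiku / sonnet;  // fraction saved by switching

// Hypothetical volume, for illustration only.
const callsPerNight = 200;
const monthlyDelta = (sonnet - haiku) * callsPerNight * 30;

console.log(`ratio: ${ratio.toFixed(2)}x`);
console.log(`savings: ${(savings * 100).toFixed(0)}%`);
console.log(`monthly difference: $${monthlyDelta.toFixed(2)}`);
```

The ratio does not depend on the token count or the call volume, which is why the same multiplier applies to every call.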

Nobody noticed. It had been running for weeks.

Why This Happens to Every Team

This isn't a mistake unique to us. It's structural.

When you start building with LLMs, you default to the best model. It produces the best output. You're iterating fast, you're not optimising yet, and the cost feels abstract.

Then you ship. The feature works. You move on to the next thing. The model choice becomes load-bearing — nobody wants to touch it in case something breaks. The cost compounds quietly in the background.

Database queries get reviewed. SQL gets optimised. Indexes get added. But AI API calls? They sit in the codebase doing whatever they were doing on day one, forever.

The problem isn't that developers are careless. It's that there's no tool in the development workflow that flags this. No linter, no reviewer, no CI check. Cost is invisible until the bill arrives.

What We Built

We built a GitHub Action that scans for AI API usage on every PR and posts a cost analysis comment automatically — before anything merges.

This is what it looks like on a real PR:

[Screenshot: GitHub PR comment showing AI Architecture Scan results, with a cost delta table comparing the main branch at $157.81 per month against this PR, and 1 warning for expensive model misuse in divergence-detector.js]

The comment shows:

  • Cost delta vs base branch: "this PR adds +$44/month" or "no change". In the screenshot above, the PR adds 0% extra cost.
  • Warnings for expensive model misuse with specific fix recommendations
  • Duplicate AI call patterns that should share a service layer
  • Missing retry/backoff logic that will crash under rate limits
  • Prompt caching opportunities — up to 90% input cost reduction on reused system prompts
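For the prompt caching item, Anthropic's Messages API lets you mark a system block with `cache_control` so that repeated calls can read it from cache instead of paying full input price each time. A minimal sketch of the request shape follows; `STATIC_SYSTEM_PROMPT` and `buildCachedRequest` are hypothetical names, and the prompt text is a placeholder. Caching only takes effect above a model-specific minimum prompt length, so a prompt as short as this placeholder would not actually be cached.

```javascript
// Placeholder for a large, static system prompt that is identical on every call.
const STATIC_SYSTEM_PROMPT = 'You explain ETF flow divergences in two plain-English sentences.';

// Builds the params object to pass to client.messages.create().
// The system prompt is sent as a content block so it can carry cache_control.
function buildCachedRequest(userPrompt) {
  return {
    model: 'claude-haiku-4-5',
    max_tokens: 150,
    system: [
      {
        type: 'text',
        text: STATIC_SYSTEM_PROMPT,
        // Marks everything up to and including this block as cacheable.
        cache_control: { type: 'ephemeral' },
      },
    ],
    // Only the per-call content goes here; it is not part of the cached prefix.
    messages: [{ role: 'user', content: userPrompt }],
  };
}
```

The design point is the split: static text goes in the cached system block, anything that changes per call stays in `messages`.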

The divergence-detector.js finding shows up as a ⚠️ WARN:

[EXPENSIVE_MODEL_FOR_CAPTION_OUTPUT] claude-sonnet-4-6 used with 
max_tokens=150. Outputs ≤300 tokens on structured inputs strongly 
suggest a classification/summarisation task — not a reasoning task. 
claude-haiku-4-5 handles these at equivalent quality and costs ~73% 
less on output tokens. Recommended: switch model and A/B test 20 
sample outputs.

Specific file. Specific issue. Specific fix. Not a generic warning.

How It Works

It's static analysis — not runtime monitoring.

The scanner walks your JS/TS files, finds AI SDK call sites, extracts the model name and max_tokens value, and applies a set of detection rules. The compound rule that caught divergence-detector.js:

Premium model + max_tokens ≤ 300 = strong signal this is a classification or summarisation task, not a reasoning task
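To make the rule concrete, here is a rough sketch of the extract-then-check idea. It is not the scanner's actual code: the regex extraction, the `PREMIUM_MODEL_PATTERN` list, and the function names are all assumptions for illustration, and a regex like this only handles `model` appearing before `max_tokens` in the same object literal.

```javascript
// Assumed definition of "premium" for this sketch, not the scanner's real list.
const PREMIUM_MODEL_PATTERN = /^claude-(opus|sonnet)/;
const SMALL_OUTPUT_LIMIT = 300;

// Naive extraction: find `model: '...'` followed by `max_tokens: N` before the
// object literal closes. A production scanner would more likely walk an AST.
function findCallSites(source) {
  const pattern = /model:\s*['"]([^'"]+)['"][^}]*?max_tokens:\s*(\d+)/g;
  const callSites = [];
  for (const match of source.matchAll(pattern)) {
    callSites.push({ model: match[1], maxTokens: Number(match[2]) });
  }
  return callSites;
}

// The compound rule: premium model AND a small output cap.
function checkCallSite({ model, maxTokens }) {
  if (PREMIUM_MODEL_PATTERN.test(model) && maxTokens <= SMALL_OUTPUT_LIMIT) {
    return { rule: 'EXPENSIVE_MODEL_FOR_CAPTION_OUTPUT', model, maxTokens };
  }
  return null;
}

// Returns one finding per call site that trips the rule.
function scanSource(source) {
  return findCallSites(source).map(checkCallSite).filter(Boolean);
}
```

Neither condition is a problem on its own. A premium model with a large output cap is plausibly doing real reasoning, and a cheap model with a small cap is already right-sized, so only the combination is flagged.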

It also tracks a baseline. On push to main, it saves the current scan to GitHub Actions cache. On every PR, it loads that baseline and computes the delta — so you see what the PR adds to your monthly bill, not just the total.
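The delta itself is simple arithmetic once both scans exist. A sketch, assuming each scan is stored as a map of file path to estimated monthly cost in USD; that shape and the name `computeDelta` are hypothetical, and the cached format the Action really uses may differ.

```javascript
// Sums the estimated monthly cost across all files in one scan.
function totalCost(scan) {
  return Object.values(scan).reduce((sum, cost) => sum + cost, 0);
}

// Compares the PR's scan against the baseline saved from main.
function computeDelta(baseline, current) {
  const baselineTotal = totalCost(baseline);
  const currentTotal = totalCost(current);
  const delta = currentTotal - baselineTotal;
  return {
    baselineTotal,
    currentTotal,
    delta,
    // No percentage is reported when there is nothing to compare against.
    percent: baselineTotal === 0 ? null : (delta / baselineTotal) * 100,
  };
}

// Example: main costs $157.81/month and the PR adds a new $44/month call site.
const mainScan = { 'divergence-detector.js': 157.81 };
const prScan = { 'divergence-detector.js': 157.81, 'new-feature.js': 44 };
const report = computeDelta(mainScan, prScan);
console.log(`+$${report.delta.toFixed(2)}/month`);
```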

What it catches:

  • Expensive models used for simple outputs
  • Large static system prompts missing prompt caching
  • Multiple files calling the same model that could share a service layer
  • API calls with no retry logic
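For the last item, this is roughly the kind of wrapper whose absence matters under rate limits: retry on 429 and 5xx responses with exponential backoff and jitter. It is a minimal sketch with hypothetical names, and it assumes the thrown error carries a numeric `status` property. Some SDKs already retry a few times on their own, so check your client's retry settings before layering this on top.

```javascript
// Delay before retry number `attempt` (0-based): doubles each time, capped at
// maxMs, then randomised so concurrent callers do not all retry in lockstep.
function backoffDelayMs(attempt, { baseMs = 500, maxMs = 8000, random = Math.random } = {}) {
  const exponential = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * exponential);
}

// Retry rate limits (429) and server-side errors (5xx); fail fast on the rest.
function isRetryable(error) {
  const status = error && error.status;
  return status === 429 || (status >= 500 && status < 600);
}

// Wraps any async call, for example () => client.messages.create(params).
async function withRetry(fn, { maxAttempts = 4 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const isLastAttempt = attempt >= maxAttempts - 1;
      if (isLastAttempt || !isRetryable(error)) throw error;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```

Client errors such as 400 are deliberately not retried, since resending the same bad request only adds cost and latency.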

What it doesn't catch:

  • Runtime-constructed prompts (dynamic content assembled at runtime)
  • Actual token consumption (for that, use Helicone or your provider's usage dashboard)
  • Cost from conversation history growth in multi-turn flows
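The last item is worth a quick illustration, because it is the one a static scan underestimates the most. In a multi-turn flow every request resends the whole history, so total input tokens grow roughly quadratically with the number of turns. A sketch with made-up token counts (the function name and all numbers are illustrative):

```javascript
// Total input tokens billed over a conversation in which every request
// resends the system prompt plus the full history so far.
function cumulativeInputTokens(turns, { systemTokens, userTokens, replyTokens }) {
  let total = 0;
  let historyTokens = 0;
  for (let turn = 1; turn <= turns; turn++) {
    // This request: system prompt + everything said so far + the new user message.
    total += systemTokens + historyTokens + userTokens;
    // After the reply, both the user message and the reply join the history.
    historyTokens += userTokens + replyTokens;
  }
  return total;
}

const sizes = { systemTokens: 500, userTokens: 100, replyTokens: 150 };
const staticEstimate = 10 * (sizes.systemTokens + sizes.userTokens); // one call site x 10 calls
const actual = cumulativeInputTokens(10, sizes);
console.log({ staticEstimate, actual });
```

With these numbers a per-call-site estimate sees 6,000 input tokens across ten calls, while the conversation actually sends 17,250.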

Think of it as a linter for AI costs — it catches structural problems at commit time, not a meter that measures runtime consumption.

Supported Providers and Languages

Providers: Anthropic · OpenAI · Google Gemini · AWS Bedrock · LangChain

Languages: JavaScript · TypeScript · JSX · TSX · MJS · CJS

Add It to Any Repo in 2 Minutes

Create .github/workflows/ai-scan.yml:

name: AI Architecture Scan

on:
  pull_request:
    branches: [main, master]
  push:
    branches: [main, master]

jobs:
  ai-scan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: kavyarani7/ai-arch-scanner@v1
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          threshold: '500'

That's it. No API keys required. No external service. Uses GitHub Actions cache for baseline storage — zero additional permissions beyond what the Action already has.

First run establishes the baseline (no delta yet). Every PR after that shows the full cost comparison table.

What to Expect on First Run

The first time the Action runs on a PR you'll see:

📋 First scan on main — no baseline yet. Next PR will show cost delta vs this baseline.

That's normal. Push the workflow file to main first, then open a PR. The second run will show the full delta table.

One Last Thing

Run it on your own codebase before you set any thresholds or gates. See what it finds. The first scan on a real production codebase almost always surfaces at least one call that makes you go "huh, why did we do it that way."

That moment is the whole point.

What's the most expensive AI pattern you've found in your own codebase? Drop it in the comments — genuinely curious what shows up across different teams.
