Our Anthropic bill was higher than expected. We had five engineers shipping AI features fast and zero visibility into what each one actually cost. Nobody had reviewed our AI API usage since we started building. This is what we found when we finally looked.
The Bill Arrives
Every team building with LLMs hits this moment. The API bill lands and someone asks "why is this so high?" and nobody has a good answer because nobody was watching.
We weren't being reckless. We were just building. Adding AI features, iterating on prompts, shipping. The cost conversation always felt like something to have later.
Later arrived.
What We Found
We have a service called divergence-detector.js. Its job is to run nightly, find situations where ETF flow signals contradict their underlying sector signals, and generate a 2-sentence plain-English explanation for each divergence found.
Here's the relevant part:
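```javascript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await client.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 150,
  messages: [{ role: 'user', content: prompt }] // prompt built per divergence
});
```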
claude-sonnet-4-6. max_tokens: 150.
Let that sit for a second.
We were using Anthropic's mid-tier reasoning model — priced at $15 per million output tokens — to generate outputs capped at 150 tokens. Two sentences. Every single night.
claude-haiku-4-5 costs $5 per million output tokens and handles 2-sentence structured explanations at equivalent quality. We were paying 3x more than necessary on every single call.
Nobody noticed. It had been running for weeks.
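The gap is easy to put into numbers. Below is a minimal sketch that prices only output tokens. It assumes list prices of $15 (claude-sonnet-4-6) and $5 (claude-haiku-4-5) per million output tokens, which you should check against Anthropic's current pricing page, plus a hypothetical 40 divergences per night, with every call using its full 150-token budget.

```javascript
// Output-token cost of a nightly job, in USD.
// Prices are per million output tokens (assumed list prices; verify before use).
const PRICE_PER_MTOK_OUTPUT = {
  'claude-sonnet-4-6': 15,
  'claude-haiku-4-5': 5,
};

function monthlyOutputCost(model, { maxTokens, callsPerNight, nightsPerMonth = 30 }) {
  const price = PRICE_PER_MTOK_OUTPUT[model];
  if (price === undefined) throw new Error(`No price for model: ${model}`);
  // Worst case: every call uses its whole max_tokens budget.
  const tokensPerMonth = maxTokens * callsPerNight * nightsPerMonth;
  return (tokensPerMonth / 1_000_000) * price;
}

// Hypothetical workload: 40 divergences explained per night.
const workload = { maxTokens: 150, callsPerNight: 40 };
const sonnet = monthlyOutputCost('claude-sonnet-4-6', workload);
const haiku = monthlyOutputCost('claude-haiku-4-5', workload);

console.log(`sonnet: $${sonnet.toFixed(2)}/month`);
console.log(`haiku:  $${haiku.toFixed(2)}/month`);
console.log(`ratio:  ${(sonnet / haiku).toFixed(2)}x`);
```

Swap in your own call volume. The absolute figures will change, but the ratio depends only on the two prices.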
Why This Happens to Every Team
This isn't a mistake unique to us. It's structural.
When you start building with LLMs, you default to the best model. It produces the best output. You're iterating fast, you're not optimising yet, and the cost feels abstract.
Then you ship. The feature works. You move on to the next thing. The model choice becomes load-bearing — nobody wants to touch it in case something breaks. The cost compounds quietly in the background.
Database queries get reviewed. SQL gets optimised. Indexes get added. But AI API calls? They sit in the codebase doing whatever they were doing on day one, forever.
The problem isn't that developers are careless. It's that there's no tool in the development workflow that flags this. No linter, no reviewer, no CI check. Cost is invisible until the bill arrives.
What We Built
We built a GitHub Action that scans for AI API usage on every PR and posts a cost analysis comment automatically — before anything merges.
On a real PR, the comment shows:
- Cost delta vs the base branch, e.g. "this PR adds +$44/month" or "no change" (on the PR we ran it against, it reported 0% extra cost)
- Warnings for expensive model misuse with specific fix recommendations
- Duplicate AI call patterns that should share a service layer
- Missing retry/backoff logic that will crash under rate limits
- Prompt caching opportunities — up to 90% input cost reduction on reused system prompts
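For Anthropic, you enable prompt caching per request by adding a `cache_control` marker to a content block. Cached reads are then billed at roughly a tenth of the normal input price. Below is a minimal sketch, assuming the `@anthropic-ai/sdk` Messages API. `buildCachedRequest` and its arguments are hypothetical; only the request shape (a `system` array of text blocks carrying `cache_control: { type: 'ephemeral' }`) follows Anthropic's documented format. Prompts shorter than a model-specific minimum length are not cached, so this only pays off for large, static system prompts.

```javascript
// Builds params for client.messages.create() with the static system prompt
// marked as cacheable. Hypothetical helper; only the param shape follows
// Anthropic's Messages API (system as an array of text blocks + cache_control).
function buildCachedRequest({ model, maxTokens, systemPrompt, userContent }) {
  return {
    model,
    max_tokens: maxTokens,
    system: [
      {
        type: 'text',
        text: systemPrompt,
        // The request prefix up to and including this block is cached.
        cache_control: { type: 'ephemeral' },
      },
    ],
    // Only the per-call content changes between requests.
    messages: [{ role: 'user', content: userContent }],
  };
}

// Usage with a real client (needs an API key and network access):
//   const response = await client.messages.create(buildCachedRequest({ ... }));
const params = buildCachedRequest({
  model: 'claude-haiku-4-5',
  maxTokens: 150,
  // Placeholder: a real cached prompt must exceed the model's minimum length.
  systemPrompt: 'You explain ETF/sector signal divergences in two plain-English sentences.',
  userContent: 'Example divergence payload',
});
console.log(JSON.stringify(params.system[0].cache_control));
```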
The divergence-detector.js finding shows up as a ⚠️ WARN:
```
[EXPENSIVE_MODEL_FOR_CAPTION_OUTPUT] claude-sonnet-4-6 used with
max_tokens=150. Outputs ≤300 tokens on structured inputs strongly
suggest a classification/summarisation task — not a reasoning task.
claude-haiku-4-5 handles these at equivalent quality and costs ~73%
less on output tokens. Recommended: switch model and A/B test 20
sample outputs.
```
Specific file. Specific issue. Specific fix. Not a generic warning.
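The "A/B test 20 sample outputs" step can be as simple as running the same prompts through both models and reading the results side by side. Below is a minimal sketch. `collectSideBySide` and the injected `generate` functions are hypothetical, the fake generators stand in for real API calls, and judging quality is left to a human reviewer.

```javascript
// Runs the same prompts through two generators and pairs the outputs for review.
// generateA / generateB: async (prompt) => string (hypothetical signatures).
async function collectSideBySide(prompts, generateA, generateB, sampleSize = 20) {
  // Takes the first sampleSize prompts; shuffle beforehand for a random sample.
  const sample = prompts.slice(0, sampleSize);
  const rows = [];
  for (const prompt of sample) {
    // Sequential on purpose: keeps the request rate low for a one-off comparison.
    const a = await generateA(prompt);
    const b = await generateB(prompt);
    rows.push({ prompt, a, b });
  }
  return rows;
}

// Fake generators standing in for Sonnet and Haiku calls.
const prompts = Array.from({ length: 25 }, (_, i) => `divergence #${i + 1}`);
const fakeSonnet = async (p) => `sonnet: ${p}`;
const fakeHaiku = async (p) => `haiku: ${p}`;

collectSideBySide(prompts, fakeSonnet, fakeHaiku).then((rows) => {
  for (const r of rows.slice(0, 2)) console.log(`${r.prompt}\n  A: ${r.a}\n  B: ${r.b}`);
  console.log(`${rows.length} pairs ready for review`);
});
```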
How It Works
It's static analysis — not runtime monitoring.
The scanner walks your JS/TS files, finds AI SDK call sites, extracts the model name and max_tokens value, and applies a set of detection rules. The compound rule that caught divergence-detector.js:
Premium model + max_tokens ≤ 300 = strong signal this is a classification or summarisation task, not a reasoning task
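Below is a minimal sketch of that compound rule, assuming a regex-based extractor over raw source text. This post doesn't show the real scanner's parser or rule set. `PREMIUM_MODELS`, `extractCallSites` and `checkCaptionRule` are hypothetical names. A regex approach also misses calls whose model or max_tokens come from variables.

```javascript
// Hypothetical set of "premium" model IDs; adjust to your own pricing tiers.
const PREMIUM_MODELS = new Set(['claude-sonnet-4-6', 'claude-opus-4-1']);
const MAX_TOKENS_THRESHOLD = 300;

// Finds `.create({ ... })` calls with a string-literal model and numeric max_tokens.
// Naive: relies on the call ending in `})` and on literal values.
function extractCallSites(source) {
  const sites = [];
  const callRe = /\.create\(\s*\{([\s\S]*?)\}\s*\)/g;
  for (const match of source.matchAll(callRe)) {
    const body = match[1];
    const model = body.match(/model:\s*['"]([^'"]+)['"]/);
    const maxTokens = body.match(/max_tokens:\s*(\d+)/);
    if (model && maxTokens) {
      // 1-based line number of the `.create(` call.
      const line = source.slice(0, match.index).split('\n').length;
      sites.push({ line, model: model[1], maxTokens: Number(maxTokens[1]) });
    }
  }
  return sites;
}

// Premium model + small output budget => likely classification/summarisation.
function checkCaptionRule(site) {
  if (PREMIUM_MODELS.has(site.model) && site.maxTokens <= MAX_TOKENS_THRESHOLD) {
    return {
      rule: 'EXPENSIVE_MODEL_FOR_CAPTION_OUTPUT',
      message: `${site.model} used with max_tokens=${site.maxTokens}`,
    };
  }
  return null;
}

const source = `
const response = await client.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 150,
  messages: [{ role: 'user', content: prompt }]
});`;

for (const site of extractCallSites(source)) {
  const finding = checkCaptionRule(site);
  if (finding) console.log(`line ${site.line}: [${finding.rule}] ${finding.message}`);
}
```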
It also tracks a baseline. On push to main, it saves the current scan to GitHub Actions cache. On every PR, it loads that baseline and computes the delta — so you see what the PR adds to your monthly bill, not just the total.
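The delta step comes down to comparing two scan results. Below is a minimal sketch, assuming each scan is a map from file path to an estimated monthly cost in USD. The post doesn't describe the Action's actual cache format, so `computeDelta`, the scan shape and the file paths are all hypothetical. A missing baseline corresponds to the first-run case, where there is nothing to compare against yet.

```javascript
// Compares a PR scan against the main-branch baseline.
// Both scans: { [filePath]: estimatedMonthlyUsd } (hypothetical shape).
function computeDelta(baseline, current) {
  if (baseline == null) {
    // First run: no cached baseline, so only totals can be reported.
    return { firstRun: true, totalDelta: null, files: [] };
  }
  const paths = new Set([...Object.keys(baseline), ...Object.keys(current)]);
  const files = [];
  for (const path of paths) {
    const before = baseline[path] ?? 0; // absent from baseline: added in this PR
    const after = current[path] ?? 0;   // absent from PR scan: removed in this PR
    if (before !== after) files.push({ path, before, after, delta: after - before });
  }
  const totalDelta = files.reduce((sum, f) => sum + f.delta, 0);
  return { firstRun: false, totalDelta, files };
}

const baseline = { 'src/divergence-detector.js': 2.7 };
const prScan = { 'src/divergence-detector.js': 0.9, 'src/new-feature.js': 44 };
const result = computeDelta(baseline, prScan);
console.log(`this PR changes the estimate by $${result.totalDelta.toFixed(2)}/month`);
```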
What it catches:
- Expensive models used for simple outputs
- Large static system prompts missing prompt caching
- Multiple files calling the same model that could share a service layer
- API calls with no retry logic
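Two of those findings, duplicated call sites and missing retry logic, share a common fix: route calls through one small service module. Below is a minimal sketch, assuming an Anthropic-style client exposing `messages.create` whose errors carry an HTTP `status`. `createExplainer` and its retry policy are assumptions, not the scanner's recommended code. The official Anthropic TypeScript SDK already retries some failures by default (configurable through its `maxRetries` client option), so a wrapper like this mostly matters when you want one explicit policy shared across files.

```javascript
// Shared wrapper so model choice and retry policy live in one place.
// `client`: anything with messages.create(params) -> Promise (hypothetical shape).
function createExplainer(client, {
  model = 'claude-haiku-4-5',
  maxTokens = 150,
  maxAttempts = 3,
  baseDelayMs = 500,
} = {}) {
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

  // One shared entry point instead of per-file messages.create calls.
  return async function explain(prompt) {
    for (let attempt = 1; ; attempt++) {
      try {
        return await client.messages.create({
          model,
          max_tokens: maxTokens,
          messages: [{ role: 'user', content: prompt }],
        });
      } catch (err) {
        // Retry only rate limits (429) and server errors (5xx); rethrow the rest.
        const status = err?.status;
        const retryable = status === 429 || status >= 500;
        if (!retryable || attempt >= maxAttempts) throw err;
        // Exponential backoff (base, 2x base, 4x base, ...) plus random jitter.
        await sleep(baseDelayMs * 2 ** (attempt - 1) + Math.random() * baseDelayMs);
      }
    }
  };
}

// Demo with a fake client that hits a rate limit twice, then succeeds.
let calls = 0;
const fakeClient = {
  messages: {
    async create(params) {
      calls += 1;
      if (calls < 3) throw Object.assign(new Error('rate limited'), { status: 429 });
      return { model: params.model, content: [{ type: 'text', text: 'ok' }] };
    },
  },
};

const explain = createExplainer(fakeClient, { baseDelayMs: 10 });
explain('Explain this divergence.').then((res) => {
  console.log(`succeeded after ${calls} attempts: ${res.content[0].text}`);
});
```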
What it doesn't catch:
- Prompts assembled at runtime from dynamic content
- Actual token consumption (for that, use Helicone or your provider's usage dashboard)
- Cost from conversation history growth in multi-turn flows
Think of it as a linter for AI costs — it catches structural problems at commit time, not a meter that measures runtime consumption.
Supported Providers and Languages
Providers: Anthropic · OpenAI · Google Gemini · AWS Bedrock · LangChain
Languages: JavaScript · TypeScript · JSX · TSX · MJS · CJS
Add It to Any Repo in 2 Minutes
Create .github/workflows/ai-scan.yml:
```yaml
name: AI Architecture Scan
on:
  pull_request:
    branches: [main, master]
  push:
    branches: [main, master]
jobs:
  ai-scan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: kavyarani7/ai-arch-scanner@v1
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          threshold: '500'
```
That's it. No API keys required. No external service. Baseline storage uses the GitHub Actions cache, so the Action needs no permissions beyond the two granted in the workflow.
First run establishes the baseline (no delta yet). Every PR after that shows the full cost comparison table.
What to Expect on First Run
The first time the Action runs on a PR you'll see:
📋 First scan on main— no baseline yet. Next PR will show cost delta vs this baseline.
That's normal. Push the workflow file to main first, then open a PR. The second run will show the full delta table.
Links
- GitHub Marketplace: https://github.com/marketplace/actions/ai-architecture-scanner
- Repo: https://github.com/kavyarani7/ai-arch-scanner
- Zero dependencies — pure Node.js built-ins
- Free, open source, MIT license
One Last Thing
Run it on your own codebase before you set any thresholds or gates. See what it finds. The first scan on a real production codebase almost always surfaces at least one call that makes you go "huh, why did we do it that way."
That moment is the whole point.
What's the most expensive AI pattern you've found in your own codebase? Drop it in the comments — genuinely curious what shows up across different teams.
