Saikat De

Posted on May 25

How I Cut My Anthropic API Bill by 50% With a Local Python Tool

#python #ai #claude #devtool

My Anthropic bill doubled two months in a row. Not because I was building something bigger — because I kept asking the same questions, sending bloated prompts, and defaulting to Sonnet for tasks that Haiku could handle. I built a tool to fix it. Here's how it works.

The Problem

AI API costs compound fast for three reasons. First, if you're iterating on a project, you ask similar questions repeatedly — "how does X work," "what's wrong with this code" — and pay full price every time. Second, prompts accumulate context: documentation snippets, error traces, boilerplate instructions that add hundreds of tokens but contribute nothing to the answer. Third, most people just use whatever model they defaulted to first. Claude Opus at $15/1M input tokens for a query that Haiku could answer for $1/1M is a 15x cost multiplier on every single call.
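To make that multiplier concrete, here is a small sketch of the arithmetic using only the input prices quoted in this post. The `INPUT_PRICE_PER_MTOK` table and the `input_cost` helper are names made up for illustration, and output tokens (billed separately) are left out.

```python
# Input prices in USD per 1M tokens, as quoted in this post.
# Output-token pricing is ignored to keep the comparison simple.
INPUT_PRICE_PER_MTOK = {"haiku": 1.00, "sonnet": 3.00, "opus": 15.00}


def input_cost(model: str, input_tokens: int) -> float:
    """Return the input-side cost in USD for a single call."""
    return INPUT_PRICE_PER_MTOK[model] * input_tokens / 1_000_000


if __name__ == "__main__":
    tokens = 2_000  # hypothetical prompt size
    for model in INPUT_PRICE_PER_MTOK:
        print(f"{model:<7} ${input_cost(model, tokens):.6f}")
    ratio = input_cost("opus", tokens) / input_cost("haiku", tokens)
    print(f"opus / haiku = {ratio:.0f}x")
```

The ratio does not depend on prompt size, which is why the wrong default model hurts on every call, large or small.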

The Solution

I built ai-cost-optimizer, a local CLI that sits between your terminal and the Anthropic API. It runs a semantic cache, a prompt compressor, and a model router on every request before anything hits the network. No cloud service, no subscription, and nothing leaves your machine except the requests that actually go to Anthropic. Just a Python package you install once.

How It Works

1. Semantic Cache

The cache stores every response alongside a vector embedding of the prompt that produced it. On each new request, it computes the embedding for your prompt and checks cosine similarity against everything stored. If similarity is above the threshold (default: 0.80), it returns the cached answer: no API call, zero cost.

$ aiproxy ask "What is the capital of France?"

  Model         claude-haiku-4-5-20251001
  Cached        No
  Input tokens  18
  Output tokens 9
  Cost          $0.000063
  Cache saved   $0.000000 (cached for next time)
  Total saved   $0.000000

$ aiproxy ask "Capital of France?"

  Model         claude-haiku-4-5-20251001
  Cached        Yes
  Input tokens  0
  Output tokens 0
  Cost          $0.000000
  Cache saved   $0.000063
  Total saved   $0.000063

"Capital of France?" is semantically identical to the first query. Cache hit. The API never sees it.

The cache uses sentence-transformers/all-MiniLM-L6-v2 for embeddings (80 MB, runs entirely in-process) and usearch for fast approximate nearest-neighbor (ANN) lookup. Cold load is ~1.5 seconds on first run; subsequent queries take under 100 ms.
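The lookup logic can be sketched in a few lines. This is not the tool's code: it swaps in a toy bag-of-words embedding and a linear scan (instead of all-MiniLM-L6-v2 and usearch) so it runs with only the standard library. `SemanticCache`, `toy_embed`, and `cosine` are illustrative names.

```python
import math
import re
from collections import Counter
from typing import Callable, Optional

Vector = dict[str, float]

# Toy stand-in for a real sentence-embedding model: a bag of words with a
# few stopwords removed. It exists only so the sketch runs without downloads.
_STOPWORDS = {"what", "is", "the", "of", "a", "an"}


def toy_embed(text: str) -> Vector:
    words = re.findall(r"[a-z0-9]+", text.lower())
    return dict(Counter(w for w in words if w not in _STOPWORDS))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    """Linear-scan cache mapping prompt embeddings to stored responses."""

    def __init__(self, embed: Callable[[str], Vector] = toy_embed,
                 threshold: float = 0.80) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def get(self, prompt: str) -> Optional[str]:
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        # Reuse an answer only when the closest match clears the threshold.
        return best_response if best_score > self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))


if __name__ == "__main__":
    cache = SemanticCache()
    cache.put("What is the capital of France?", "Paris")
    print(cache.get("Capital of France?"))        # hit under the toy embedding
    print(cache.get("How do I reverse a list?"))  # miss, prints None
```

Passing a different `embed` callable is the only change needed to try a real embedding model; the threshold check stays the same.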

2. Prompt Compressor

Long prompts are expensive not because they're long, but because most of that length is filler. The compressor uses BM25 to score each sentence by relevance to the query, keeps the top-scoring sentences, and discards the rest.

In plain terms: it reads your prompt, figures out which sentences actually relate to what you're asking, and throws out the ones that don't. No summarization, no LLM — pure lexical scoring, deterministic, fast.
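A minimal sketch of that idea, assuming the query and the surrounding context arrive as separate strings (how the tool itself splits them is not described here). The sentence splitter is deliberately naive, and `k1`, `b`, and `keep` are conventional or hypothetical defaults rather than the tool's settings.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bm25_scores(query: str, sentences: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each sentence against the query with Okapi BM25."""
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: in how many sentences each term appears.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def compress(query: str, context: str, keep: int = 2) -> str:
    """Keep the `keep` highest-scoring sentences, in their original order."""
    sentences = split_sentences(context)
    scores = bm25_scores(query, sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:keep]))


if __name__ == "__main__":
    context = (
        "The cache stores embeddings on disk. Our office is closed on Fridays. "
        "Cache entries expire after seven days. Lunch is served at noon."
    )
    print(compress("When do cache entries expire?", context))
    # The cache stores embeddings on disk. Cache entries expire after seven days.
```

Because scoring is purely lexical, a sentence that answers the question in different words scores zero, which is the trade-off against an embedding-based scorer.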

Real example from a documentation query:

Original prompt:  370 tokens  ($0.001110 at Sonnet pricing)
Compressed:        61 tokens  ($0.000183 at Sonnet pricing)
Tokens saved:     309 tokens  (83% reduction)

The threshold for compression is configurable (MAX_PROMPT_TOKENS=500 in .env). Prompts under that limit are sent as-is.

3. Model Router

The router classifies each prompt and picks the cheapest model that can handle it. The logic is rule-based: token count, keyword signals for complexity, and a few heuristics for code vs. prose vs. reasoning tasks.

Query                              Routed to   Input cost per 1M tokens
"What is 2+2?"                     Haiku       $1.00
"Explain binary search trees"      Sonnet      $3.00
"Review this system architecture"  Opus        $15.00

You can override the router by passing --model explicitly. But if you don't, it defaults to the cheapest model that fits the task, and in practice that means Haiku handles the majority of short factual queries.
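For a feel of what a rule-based router looks like, here is a rough sketch. The keyword sets, the word-count limit, and the `route` function are all hypothetical, picked only so the three example queries above land on the same tiers; word count stands in as a crude proxy for token count.

```python
import re
from typing import Optional

# Hypothetical keyword lists and limits, not the tool's actual rules.
HEAVY_KEYWORDS = {"architecture", "review", "design", "tradeoff", "refactor"}
MEDIUM_KEYWORDS = {"explain", "compare", "debug", "implement", "why"}
SHORT_PROMPT_WORDS = 12


def route(prompt: str, model_override: Optional[str] = None) -> str:
    """Return 'haiku', 'sonnet' or 'opus' for a prompt using simple rules."""
    if model_override:
        # Mirrors passing --model explicitly: the caller's choice wins.
        return model_override
    words = set(re.findall(r"[a-z0-9+]+", prompt.lower()))
    word_count = len(prompt.split())
    has_code = "```" in prompt or "def " in prompt or "class " in prompt
    if HEAVY_KEYWORDS & words:
        return "opus"
    if has_code or MEDIUM_KEYWORDS & words or word_count > SHORT_PROMPT_WORDS:
        return "sonnet"
    return "haiku"


if __name__ == "__main__":
    for query in ("What is 2+2?",
                  "Explain binary search trees",
                  "Review this system architecture"):
        print(f"{query!r} -> {route(query)}")
```

Rules like these are cheap and predictable, but they misroute anything whose difficulty is not visible in its wording, so an explicit override remains useful.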

Real Numbers

After two weeks of normal development use — asking questions about code, debugging errors, generating short snippets:

$ aiproxy stats

  Cache entries       14
  Cache hits           3
  Estimated savings   $0.000300
  Total API calls     23
  Total cost          $0.160000

23 API calls, $0.16 total
566 tokens saved by compression across those calls
12% cache hit rate (low because I was exploring new topics; repetitive workloads hit 30-40%)

The compression savings don't show in stats yet — that's a known gap I'm fixing next.
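One way to reproduce the 12% figure from the stats output, assuming every request is either a cache hit or an API call. That definition is an assumption made to match the numbers, not something the stats command states, and `cache_hit_rate` is an illustrative helper.

```python
def cache_hit_rate(cache_hits: int, api_calls: int) -> float:
    """Fraction of requests answered from cache.

    Assumes every request is either a cache hit or an API call.
    """
    total = cache_hits + api_calls
    return cache_hits / total if total else 0.0


if __name__ == "__main__":
    rate = cache_hit_rate(cache_hits=3, api_calls=23)
    print(f"{rate:.1%}")  # 11.5%, which rounds to 12%
```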

Installation

git clone https://github.com/desaikat/ai-cost-optimizer.git
cd ai-cost-optimizer
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e .

cp .env.example .env
# Add your ANTHROPIC_API_KEY to .env

aiproxy ask "What is the difference between a list and a tuple in Python?"

There's also a Streamlit dashboard (aiproxy-dashboard) that shows cumulative spend, cache hit rate, model distribution, and compression savings over time.

What's Next

OpenAI and Gemini support — the cache and compressor are provider-agnostic; the client layer just needs adapters
Team mode — shared SQLite or Postgres cache so a team stops paying for the same queries independently
REST API mode — drop-in proxy endpoint so any language can use it, not just Python
PyInstaller .exe — zero-install distribution for Windows users who don't want to manage a venv

Try It / Feedback Welcome

Repo: github.com/desaikat/ai-cost-optimizer

Two things I'm actively unsure about and would value input on:

Similarity threshold tuning — 0.80 works for me but may be too aggressive for code queries where small wording differences matter. What threshold makes sense for your use case?
Compression quality — BM25 is fast and deterministic but misses semantic relevance. Has anyone used a lightweight embedding-based sentence scorer that stays under 100ms per prompt?

Open issues and PRs welcome.
