Look, there's a million posts about AI cost optimization that read like they were written by a consulting firm. This isn't one of those. This is what I actually did, including the parts where I felt dumb.
The situation
I run an always-on AI assistant on a VPS. It handles email, calendar, code generation, research, monitoring — basically it runs my digital life. Over the past few months I kept adding features without paying attention to costs, and one day I looked at my bill: $288.61 for February.
For context, that's on a €5/month Hetzner VPS. The infrastructure is cheap. The API calls are not.
Where the money was actually going
When I finally sat down and looked at task-level costs (instead of just the aggregate dashboard), I wanted to crawl under my desk.
40% of my API calls were using Claude Opus. That's the $15-per-million-token model. And what was it doing? Status checks. Formatting dates. Checking if emails were unread. Generating heartbeat responses.
It's like hiring a brain surgeon to take your temperature. It works. But why.
The answer is I was lazy. Opus was the default, and I never questioned it. That's the whole story.
What I changed (and what each thing saved)
1. Model routing — saved about 45%
This was the big one. I set up routing logic that matches tasks to models by complexity:
- Simple stuff (status, formatting, extraction, Q&A) → Claude Haiku at $0.25/M tokens
- Real work (analysis, writing, code review) → Claude Sonnet at $3/M
- Important decisions (architecture, strategy) → Claude Opus at $15/M
After this, 85% of calls go to Haiku. Opus dropped from 40% usage to under 2%.
The implementation is honestly embarrassing in how simple it is:
```python
def route_model(task_type):
    """Pick the cheapest model that can handle the task."""
    if task_type in ['status', 'format', 'extract', 'classify']:
        return 'haiku'
    elif task_type in ['analysis', 'writing', 'code_review']:
        return 'sonnet'
    return 'opus'  # only for the big stuff
```
That's basically it. A three-way if statement saved me $130/month.
2. Local embeddings — saved about 15%
Every time my system searched its memory or did semantic matching, it was making an API call for embeddings. I switched to nomic-embed-text running locally through Ollama.
Quick setup:
```shell
ollama pull nomic-embed-text
```
It's 274MB, runs on CPU, costs $0, and the quality is close enough for retrieval that I genuinely can't tell the difference. Saved about $40/mo.
If you're paying for embedding APIs right now and you have any kind of server or decent machine, just... stop. Run them locally. This is the lowest of low-hanging fruit.
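For reference, here's a minimal sketch of what the local embedding call looks like. It assumes Ollama is running on its default port with its standard `/api/embeddings` endpoint; the `embed` and `cosine` helpers are just names I'm using for illustration, not anything from my actual codebase:

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # Ollama's default local endpoint

def embed(text, model="nomic-embed-text"):
    """Get an embedding vector from the local Ollama server."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

def cosine(a, b):
    """Cosine similarity between two vectors. For retrieval, this is all you need."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm
```

To search memory, you rank stored entries by `cosine(embed(query), embed(entry))` and take the top few. No API key, no per-call cost.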
3. Deduplication — saved about 10%
Found out 30-40% of my API calls were completely redundant. My heartbeat system was checking the same things every 15 minutes whether or not anything had changed.
Added basic state tracking — a JSON file that records what was checked and when. Before making an API call: "did this change since last time?" If no, skip it.
My heartbeat calls went from 96/day to about 5/day. Same coverage.
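The state tracking is roughly this shape (a sketch with hypothetical names; the real thing is just as boring). Each check gets a fingerprint of whatever it watches, like an unread count or a file mtime, and the API call only happens when the fingerprint changes:

```python
import json
import time
from pathlib import Path

STATE_FILE = Path("heartbeat_state.json")  # hypothetical path

def load_state():
    """Read the last-known state, or start fresh if the file doesn't exist."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {}

def should_call(check_name, current_fingerprint, state):
    """Skip the API call if this check's fingerprint hasn't changed."""
    last = state.get(check_name)
    return last is None or last["fingerprint"] != current_fingerprint

def record(check_name, fingerprint, state):
    """Persist what we checked and when, so the next heartbeat can compare."""
    state[check_name] = {"fingerprint": fingerprint, "checked_at": time.time()}
    STATE_FILE.write_text(json.dumps(state))
```

The fingerprint can be as crude as `"3-unread"` for an inbox. If nothing moved, nothing gets sent to the API.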
4. Batch scheduling — saved about 9%
Non-urgent tasks (memory consolidation, background research) got moved to cheaper models running on a schedule. Some of it runs on a local Qwen2.5 7B for free.
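The scheduling side is just a deferred queue. This is a sketch (class and field names are mine, not from my actual setup): non-urgent tasks accumulate, and every few hours the whole batch gets handed to the cheap or local model in one go:

```python
from datetime import datetime, timedelta

class BatchQueue:
    """Collect non-urgent tasks and release them as one scheduled batch."""

    def __init__(self, flush_every=timedelta(hours=6)):
        self.flush_every = flush_every
        self.pending = []
        self.last_flush = datetime.now()

    def add(self, task):
        """Queue a task instead of calling the API right now."""
        self.pending.append(task)

    def due(self, now=None):
        """True once enough time has passed since the last batch."""
        now = now or datetime.now()
        return now - self.last_flush >= self.flush_every

    def flush(self, now=None):
        """Hand back everything queued and reset the timer."""
        batch, self.pending = self.pending, []
        self.last_flush = now or datetime.now()
        return batch
```

Anything that can wait six hours goes in here; the batch then runs against Qwen locally or Haiku at off-peak, instead of hitting Sonnet the moment it comes up.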
What didn't work
Going 100% local. I tried. Qwen2.5 and Llama models handle simple tasks fine but fall apart on anything requiring nuanced reasoning. The quality gap led to more retries and manual fixes. Time cost > money saved. Hybrid wins.
Aggressive prompt shortening. Cut too much context and got worse outputs. Worse outputs need retries. Retries cost money. Net negative.
Token micro-optimization. Spent a few hours trying to squeeze tokens out of individual prompts. Saved maybe $2/month. Not worth it when routing saves $130.
The numbers
| Thing | Before | After |
|---|---|---|
| Monthly cost | $288 | ~$60 |
| Opus usage | 40% of calls | <2% |
| Haiku usage | 25% | 85%+ |
| Local models | 0% | ~15% |
| Functionality | Full | Full (no loss) |
If you want to do this today
- Check your API dashboard. What % of calls use your most expensive model? If it's over 20%, you're overpaying.
- Log task types. Even for a day. See what tasks are actually hitting which models.
- Add routing. Even a basic if/else saves money immediately.
- Look for redundant calls. Anything polling on a timer without checking for changes is probably wasting calls.
- Try local embeddings. nomic-embed-text via Ollama. Seriously just do it.
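For the "log task types" step, a tally is genuinely enough. This is a sketch (function names are made up for illustration); tag each API call with its task type and model, then look at the shares:

```python
from collections import Counter

call_log = Counter()

def log_call(task_type, model):
    """Tally which model each task type actually hits."""
    call_log[(task_type, model)] += 1

def usage_report():
    """Share of calls per model. This is the number to stare at."""
    total = sum(call_log.values())
    per_model = Counter()
    for (_, model), n in call_log.items():
        per_model[model] += n
    return {m: n / total for m, n in per_model.items()}
```

Run it for a day. If your expensive model's share in `usage_report()` is dominated by `status` and `format` tasks, you've found your routing targets.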
The honest takeaway
None of this is technically impressive. A three-way if statement and a JSON state file account for most of my savings. The hard part was sitting down, looking at the numbers, and admitting I'd been wasting money through laziness.
If you're running AI in production and haven't audited your model usage — do it this week. You'll probably find what I found: most of the cost is going to the wrong model.
Running Claude + Ollama on Hetzner. Total cost including VPS: about $80/mo for 24/7 AI automation. Happy to answer questions in the comments.
Top comments (2)
The short version: I was running Claude Opus for everything including status checks and date formatting. That's like hiring a lawyer to answer the phone. A simple routing layer that sends routine tasks to Haiku ($0.25/M tokens) instead of Opus ($15/M) was 45% of the savings by itself.
Second biggest win was local embeddings — nomic-embed-text via Ollama. Runs on CPU, 274MB, quality is fine for retrieval. Was paying ~$40/mo for something I can do for free.
I tried going 100% local first. Quality on complex tasks wasn't there yet — retries and manual fixes ate the savings. Hybrid (local for simple stuff, API for anything needing real reasoning) won pretty clearly.
The embarrassing part is how simple the fix was. A three-way if statement and a state file for deduplication. Most of the waste was just defaulting to the expensive model because I never questioned it.
Stack is Claude (Haiku/Sonnet/Opus) + Ollama on a Hetzner VPS. ~$80/mo total for 24/7 AI automation.
The caching angle is massively underrated. I did almost the exact same optimization path — started with model downgrades, moved to prompt compression, then finally landed on aggressive caching and the numbers were staggering. One thing I'd add: semantic caching, not just exact-match. We use a vector similarity threshold to serve cached responses for 'close enough' prompts, which catches a lot of repeated queries that differ only in formatting or minor phrasing. Combined with request batching for non-real-time use cases, we cut costs by roughly 70% on our summarization pipeline. The upfront engineering work takes maybe a week but the payoff is permanent.
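Editor's note: the semantic cache this commenter describes can be sketched in a few lines. This is an illustration of the idea, not their implementation; the class, threshold, and entry layout are all hypothetical:

```python
import math

class SemanticCache:
    """Serve a cached response when a new prompt's embedding is 'close enough'."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    def get(self, embedding):
        """Return the best cached response above the similarity threshold, else None."""
        best, best_sim = None, self.threshold
        for cached_emb, response in self.entries:
            sim = self._cosine(embedding, cached_emb)
            if sim >= best_sim:
                best, best_sim = response, sim
        return best

    def put(self, embedding, response):
        """Store a fresh API response under its prompt embedding."""
        self.entries.append((embedding, response))
```

On a cache hit you skip the API call entirely; prompts that differ only in phrasing land near each other in embedding space and hit the same entry. The linear scan is fine for small caches; swap in a vector index once entries grow.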