Jacob Blackmer

How I cut my AI API costs 79% — the boring stuff that actually worked

Look, there are a million posts about AI cost optimization that read like they were written by a consulting firm. This isn't one of those. This is what I actually did, including the parts where I felt dumb.

The situation

I run an always-on AI assistant on a VPS. It handles email, calendar, code generation, research, monitoring — basically it runs my digital life. Over the past few months I kept adding features without paying attention to costs and one day I looked at my bill: $288.61 for February.

For context, that's on a €5/month Hetzner VPS. The infrastructure is cheap. The API calls are not.

Where the money was actually going

When I finally sat down and looked at task-level costs (instead of just the aggregate dashboard), I wanted to crawl under my desk.

40% of my API calls were using Claude Opus. That's the $15-per-million-token model. And what was it doing? Status checks. Formatting dates. Checking if emails were unread. Generating heartbeat responses.

It's like hiring a brain surgeon to take your temperature. It works. But why?

The answer is I was lazy. Opus was the default, and I never questioned it. That's the whole story.

What I changed (and what each thing saved)

1. Model routing — saved about 45%

This was the big one. I set up routing logic that matches tasks to models by complexity:

  • Simple stuff (status, formatting, extraction, Q&A) → Claude Haiku at $0.25/M tokens
  • Real work (analysis, writing, code review) → Claude Sonnet at $3/M
  • Important decisions (architecture, strategy) → Claude Opus at $15/M

After this, 85% of calls go to Haiku. Opus dropped from 40% usage to under 2%.

The implementation is honestly embarrassing in how simple it is:

def route_model(task_type):
    """Pick the cheapest model that can handle the task type."""
    if task_type in ['status', 'format', 'extract', 'classify']:
        return 'haiku'
    elif task_type in ['analysis', 'writing', 'code_review']:
        return 'sonnet'
    return 'opus'  # only for the big stuff

That's basically it. A three-way if statement saved me $130/month.
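To see why such a dumb router moves the needle this much, here's a back-of-the-envelope sketch. The monthly token volume is made up for illustration; the per-million-token prices are the ones listed above:

```python
# Per-million-token prices from the tiers above.
PRICE_PER_M = {'haiku': 0.25, 'sonnet': 3.00, 'opus': 15.00}

def monthly_cost(mix, total_tokens_m):
    """mix: {model: fraction of tokens}, total_tokens_m: monthly tokens in millions."""
    return sum(PRICE_PER_M[m] * frac * total_tokens_m for m, frac in mix.items())

# Everything on Opus vs. the routed mix (85% Haiku, ~2% Opus),
# assuming a hypothetical 20M tokens/month.
all_opus = monthly_cost({'opus': 1.0}, total_tokens_m=20)
routed = monthly_cost({'haiku': 0.85, 'sonnet': 0.13, 'opus': 0.02}, total_tokens_m=20)
print(f"all-Opus: ${all_opus:.2f}/mo, routed: ${routed:.2f}/mo")
```

The routed blend costs roughly 6% of the all-Opus blend at the same volume, which is why where the tokens go matters far more than how many there are.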

2. Local embeddings — saved about 15%

Every time my system searched its memory or did semantic matching, it was making an API call for embeddings. I switched to nomic-embed-text running locally through Ollama.

Quick setup:

ollama pull nomic-embed-text

It's 274 MB, runs on CPU, costs $0, and the quality is close enough for retrieval that I genuinely can't tell the difference. Saved about $40/mo.

If you're paying for embedding APIs right now and you have any kind of server or decent machine, just... stop. Run them locally. This is the lowest of low-hanging fruit.
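For the curious, here's a minimal sketch of wiring local embeddings into retrieval. It assumes Ollama's default HTTP API on localhost:11434; the helper names are mine, and only the similarity math runs without a server:

```python
import json
import math
import urllib.request

def get_embedding(text, model="nomic-embed-text", host="http://localhost:11434"):
    """Fetch an embedding from a local Ollama server.

    Assumes Ollama is running and exposes its /api/embeddings endpoint;
    adjust the host/endpoint if your setup differs.
    """
    req = urllib.request.Request(
        f"{host}/api/embeddings",
        data=json.dumps({"model": model, "prompt": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

def cosine_similarity(a, b):
    """Standard cosine similarity for ranking retrieval candidates."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Embed your memory entries once, cache the vectors, and rank by cosine similarity at query time; no per-search API cost.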

3. Deduplication — saved about 10%

Found out 30-40% of my API calls were completely redundant. My heartbeat system was checking the same things every 15 minutes whether or not anything had changed.

Added basic state tracking — a JSON file that records what was checked and when. Before making an API call: "did this change since last time?" If no, skip it.

My heartbeat calls went from 96/day to about 5/day. Same coverage.
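The state tracking can be sketched in a few lines. This is a hedged illustration, not my exact code; the filename and function names are hypothetical:

```python
import hashlib
import json
from pathlib import Path

STATE_FILE = Path("heartbeat_state.json")  # hypothetical state file

def changed_since_last_check(check_name, current_value):
    """Return True (and record the new state) only if the value changed.

    current_value is whatever cheap-to-fetch data the check observes
    (e.g. an unread count); only on a change do we spend an API call.
    """
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    fingerprint = hashlib.sha256(
        json.dumps(current_value, sort_keys=True).encode()
    ).hexdigest()
    if state.get(check_name) == fingerprint:
        return False  # nothing changed since last time — skip the call
    state[check_name] = fingerprint
    STATE_FILE.write_text(json.dumps(state))
    return True
```

The heartbeat loop then becomes: fetch the cheap signal, call `changed_since_last_check`, and only hit the API when it returns True.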

4. Batch scheduling — saved about 9%

Non-urgent tasks (memory consolidation, background research) got moved to cheaper models running on a schedule. Some of it runs on a local Qwen2.5 7B for free.
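The scheduling itself is just a lookup table driven by cron. A hedged sketch, with made-up task names and hours:

```python
# Hypothetical schedule: non-urgent tasks get batched onto cheap or local
# models at fixed off-hours instead of running on demand.
BATCH_SCHEDULE = {
    "memory_consolidation": {"model": "qwen2.5:7b", "hour": 3},  # local, free
    "background_research": {"model": "haiku", "hour": 4},
}

def tasks_due(hour):
    """Return the batch tasks scheduled for this hour (called from a cron job)."""
    return [name for name, cfg in BATCH_SCHEDULE.items() if cfg["hour"] == hour]
```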

What didn't work

Going 100% local. I tried. Qwen2.5 and Llama models handle simple tasks fine but fall apart on anything requiring nuanced reasoning. The quality gap led to more retries and manual fixes. Time cost > money saved. Hybrid wins.

Aggressive prompt shortening. Cut too much context and got worse outputs. Worse outputs need retries. Retries cost money. Net negative.

Token micro-optimization. Spent a few hours trying to squeeze tokens out of individual prompts. Saved maybe $2/month. Not worth it when routing saves $130.

The numbers

Thing            Before         After
Monthly cost     $288           ~$60
Opus usage       40% of calls   <2%
Haiku usage      25%            85%+
Local models     0%             ~15%
Functionality    Full           Full (no loss)

If you want to do this today

  1. Check your API dashboard. What % of calls use your most expensive model? If it's over 20%, you're overpaying.
  2. Log task types. Even for a day. See what tasks are actually hitting which models.
  3. Add routing. Even a basic if/else saves money immediately.
  4. Look for redundant calls. Anything polling on a timer without checking for changes is probably wasting calls.
  5. Try local embeddings. nomic-embed-text via Ollama. Seriously just do it.
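For step 2, the logging really can be one function. A minimal sketch; the file name and fields are my own choices:

```python
import csv
import time

LOG_FILE = "api_calls.csv"  # hypothetical log location

def log_call(task_type, model, input_tokens, output_tokens):
    """Append one row per API call; a day of this shows which tasks hit which models."""
    with open(LOG_FILE, "a", newline="") as f:
        csv.writer(f).writerow(
            [time.strftime("%Y-%m-%dT%H:%M:%S"),
             task_type, model, input_tokens, output_tokens]
        )
```

Load the file into a spreadsheet (or pandas) at the end of the day, group by task type, and the mismatches between task complexity and model tier jump out.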

The honest takeaway

None of this is technically impressive. A three-way if statement and a JSON state file account for most of my savings. The hard part was sitting down, looking at the numbers, and admitting I'd been wasting money through laziness.

If you're running AI in production and haven't audited your model usage — do it this week. You'll probably find what I found: most of the cost is going to the wrong model.


Running Claude + Ollama on Hetzner. Total cost including VPS: about $80/mo for 24/7 AI automation. Happy to answer questions in the comments.
