sunyifu

Posted on Mar 12

How I Cut My AI Agent's API Bill by 60% — Without Losing Quality

#ai #opensource #openclaw #selfhosted

Two months into self-hosting my AI agent, I opened my Anthropic dashboard and saw a number I didn't love. $47 for the month. Not catastrophic, but way more than it needed to be.

After a week of tweaking, I got that down to ~$15/month — same quality for daily tasks, same channels, same skills. Here's exactly what I changed.

(If you're new here: Part 1 covers setting up OpenClaw, and Part 2 covers the Skills I use daily.)

Why Was I Overpaying?

The default setup uses one model for everything. I had Claude Sonnet 4 handling every message — including "thanks", "ok", and "what time is it?". That's like using a sports car to go get milk.

Here's what the models actually cost (as of early 2026):

Model	Input	Output	Best For
Claude Haiku 4.5	$1/1M tokens	$5/1M tokens	Quick tasks, greetings, lookups
Claude Sonnet 4	$3/1M tokens	$15/1M tokens	Balanced daily use
Claude Opus 4.6	$5/1M tokens	$25/1M tokens	Deep research, analysis

Haiku is 5x cheaper than Opus on output tokens, and 3x cheaper than Sonnet. For "what's the weather?" or "remind me to check the deployment" — Haiku handles it just fine.

Prices from Anthropic's pricing page. Check there for the latest rates.

Fix #1: Use Different Models for Different Channels

This was the single biggest win. OpenClaw lets you assign models per channel using modelByChannel in your config:

// ~/.openclaw/openclaw.json
{
  "channels": {
    "modelByChannel": {
      "whatsapp": {
        "default": "anthropic/claude-haiku-4-5"
      },
      "telegram": {
        "default": "anthropic/claude-haiku-4-5"
      },
      "discord": {
        "default": "anthropic/claude-sonnet-4"
      },
      "slack": {
        "default": "anthropic/claude-sonnet-4"
      }
    }
  }
}

My logic: WhatsApp and Telegram are mostly quick personal messages — Haiku is perfect. Discord and Slack are where I do actual work stuff (code review, debugging), so those get Sonnet.

This one change shifted about 50% of my traffic to Haiku. Immediate cost drop.

Want smarter routing? The community has built auto-routers like iblai-openclaw-router that analyze message complexity and route to Haiku/Sonnet/Opus automatically. I haven't tried it yet, but the concept is solid — route "hi" to Haiku and "analyze this architecture" to Opus, per message.

Fix #2: Limit Conversation History

This one surprised me. By default, OpenClaw sends your recent conversation history with every request — so the model has context. But that means every "yes" reply carries a long tail of previous messages worth of tokens.

OpenClaw lets you cap this with historyLimit:

{
  "messages": {
    "groupChat": {
      "historyLimit": 10
    }
  },
  "channels": {
    "whatsapp": {
      "dmHistoryLimit": 8
    },
    "telegram": {
      "dmHistoryLimit": 8
    },
    "discord": {
      "historyLimit": 15
    }
  }
}

I keep Discord higher (15) because work conversations need more context. WhatsApp and Telegram get 8 — plenty for casual back-and-forth.

Before this change, I was sending ~10,000 input tokens per request with all the history. After: ~4,000. That alone cut my input costs by more than half.

Fix #3: Use Anthropic's Prompt Caching

This one's not an OpenClaw setting — it's an Anthropic API feature. If your agent sends the same system prompt and conversation prefix with every request (which it does), prompt caching avoids reprocessing those tokens.

The savings are real:

Operation	Cost
Normal input	$3/1M tokens (Sonnet)
Cache write (first request)	$3.75/1M tokens (1.25x)
Cache read (subsequent)	$0.30/1M tokens (0.1x)

That's a 90% discount on repeated context. If your system prompt is 1,000 tokens and you send 50 messages, you pay full price once and 10% for the other 49.

OpenClaw supports this if your API provider has caching enabled. Check your Anthropic dashboard — if you see "cache read tokens" in your usage, it's already working.

Fix #4: Pick a Cheaper Default Model

Sounds obvious, but I was overthinking it. I switched my default from Sonnet to Haiku 4.5 for most channels, and honestly? For 80% of my daily interactions, I can't tell the difference.

Haiku handles:

Quick Q&A and lookups ✅
Smart home commands ✅
Simple reminders and scheduling ✅
Casual conversation ✅

Where I notice the difference: complex code review, long-form writing, and nuanced analysis. For those, I keep Sonnet (or Opus) on my work channels.

The mental shift: default cheap, upgrade where it matters — not the other way around.

Fix #5: Monitor and Set Limits

You can't optimize what you don't measure. OpenClaw has a built-in stats command:

openclaw stats

Today's Usage:
  Input tokens:  45,230
  Output tokens: 12,450
  Estimated cost: $0.23

This month:
  Total tokens: 1,234,567
  Estimated cost: $8.45

I check this weekly. It helps me spot when something's off — like that time a Discord channel was generating way more traffic than I expected.

Also set up usage limits on your Anthropic account directly. The Anthropic Console lets you configure monthly spend caps so you never get a surprise bill. I set mine at $25/month — if I ever hit it, something's wrong.

Quick Math: What Should You Expect to Pay?

Here's a rough formula:

Monthly Cost = (Daily Messages × Avg Tokens × Price per Token) × 30

For 50 messages/day with Claude Sonnet 4 (avg 800 input + 400 output tokens):

Input: 50 × 800 × ($3/1M) × 30 = $3.60
Output: 50 × 400 × ($15/1M) × 30 = $9.00
Total: ~$12.60/month

With half your traffic on Haiku 4.5 instead:

Haiku portion (25 msgs): 25 × 800 × ($1/1M) × 30 + 25 × 400 × ($5/1M) × 30 = $0.60 + $1.50 = $2.10
Sonnet portion (25 msgs): 25 × 800 × ($3/1M) × 30 + 25 × 400 × ($15/1M) × 30 = $1.80 + $4.50 = $6.30
Total: ~$8.40/month

Add prompt caching on top and you're looking at even less.

Three Setups to Get You Started

Tight budget (~$5-10/month):

Default model: Haiku 4.5 for all channels
History limit: 5 messages
Prompt caching: enabled
Good for: personal use, casual messaging

Balanced (~$15-25/month):

Haiku 4.5 for WhatsApp/Telegram, Sonnet for Discord/Slack
History limit: 8-15 messages depending on channel
Prompt caching: enabled
Good for: daily driver with some work use

Quality-first (~$30-50/month):

Sonnet for most channels, Opus for specific work channels
History limit: 20+ messages
Good for: heavy work use, code review, research

More details and full config examples on open-claw.me/blog/openclaw-model-selection-cost.

TL;DR

Route models by channel — Haiku for casual, Sonnet for work. Biggest single win.
Limit conversation history — 8-15 messages is enough for most chats.
Use prompt caching — 90% off repeated context, basically free.
Default to a cheaper model — upgrade where it matters, not everywhere.
Monitor weekly — catch surprises early, set spend caps.

These five changes took my bill from $47 to ~$15. The agent works just as well for daily tasks — it just uses the right tool for the job now.

Config details may vary by OpenClaw version — check docs.openclaw.ai for your setup.

What's your monthly AI agent bill looking like? I'm curious if anyone's running leaner — or if you've found optimizations I missed. Drop a comment.

Top comments (1)

Harjot Singh • May 31

60% without losing quality is the credible range (I get suspicious of the "95% savings!" posts because that usually means a quality cliff somewhere). 60% is what disciplined engineering actually buys: route the easy calls to a cheaper model, cache repeated context, trim the prompt, cap retries. None of those touch quality on the work that matters - they just stop you overpaying for the work that doesn't.

The "without losing quality" guard is the part I'd press on, because it's easy to claim and hard to prove. The teams that get it right have an eval set they run before and after each cost change, so "no quality loss" is measured, not hoped. That measure-don't-hope discipline is central to Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - cheap models do the bulk but every consequential step is verified, which is how a build holds ~$3 flat without quality sliding. Genuinely useful post. How did you verify quality held steady at 60% off - a formal eval set, or spot-checking outputs? The measurement method is what makes the claim trustworthy, and most people skip it.