Pranay Batta

How I Cut My AI App Costs by 52% Without Changing a Single Line of Code

I've been running a customer support automation tool for about six months now. It handles around 15,000 conversations per month across email and chat, all powered by LLMs. The product works great. The OpenAI bill? Not so much.

Last month I hit $6,200 in LLM costs alone. For a bootstrapped SaaS doing maybe $18K MRR, that's not sustainable. I needed to figure out where the money was going and how to stop the bleeding without rebuilding everything.

The Problem: Zero Visibility Into What Was Actually Expensive

My setup was straightforward: Next.js frontend, Node backend, direct API calls to OpenAI for chat completions. Simple, clean, worked fine.

The issue was I had no idea which features were burning through tokens. Was it the email summarization? The suggested response generation? The sentiment analysis that runs on every message? No clue.

I could see my total OpenAI bill. I could see request counts in my logs. But I couldn't connect the two into "this specific feature costs $2,000/month," which is what I needed to make informed decisions about what to optimize.

What I Tried First (That Didn't Work)

Attempt 1: Manual logging

I added logging around every LLM call to track tokens. Realized I was counting input tokens but missing output tokens half the time. Also realized streaming responses don't give you token counts until the end, which my logging wasn't handling.

Gave up after two days of unreliable data.
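
(For anyone who wants to stick with manual logging: the streaming gap is closable by opting into usage reporting on the stream via stream_options.include_usage. A rough sketch of what my logging should have looked like; the model name and console.log are placeholders for whatever you actually use:)

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function loggedChat(messages) {
  const stream = await openai.chat.completions.create({
    model: "gpt-4",                          // placeholder model
    messages,
    stream: true,
    stream_options: { include_usage: true }, // ask for usage on the final chunk
  });

  let text = "";
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta?.content ?? "";
    if (chunk.usage) {
      // Only the last chunk carries usage; log both sides, not just input tokens
      console.log("tokens:", chunk.usage.prompt_tokens, chunk.usage.completion_tokens);
    }
  }
  return text;
}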

Attempt 2: Cheaper models

Switched from GPT-4 to GPT-3.5 for "simple" tasks. Saved maybe 15% but quality dropped noticeably. Users started complaining about worse suggested responses.

Rolled it back.

Attempt 3: Prompt optimization

Spent a week shortening prompts, removing examples, cutting system messages. Saved another 10% on token costs but introduced new bugs where the model misunderstood instructions.

Not worth the engineering time.

Then I Found Bifrost

I was looking for LLM observability tools and kept seeing Bifrost mentioned as an "LLM gateway." Honestly, I thought it was overkill for my use case. I didn't need multi-provider routing or enterprise governance; I just wanted to know where my money was going.

But I tried it anyway because the setup looked trivial. Literally a one-line code change.

Before:

import OpenAI from "openai";

// Direct client: every request goes straight to OpenAI
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

After:

import OpenAI from "openai";

// Same SDK; only the baseURL changes, so requests flow through the local
// Bifrost gateway, which forwards them to OpenAI and records cost per call
const openai = new OpenAI({
  baseURL: "http://localhost:8080/openai",
  apiKey: "bifrost-key"
});

That's it. Bifrost sits between my app and OpenAI. Everything else stayed the same.

What I Learned In The First Week

Bifrost has a built-in dashboard that shows cost per endpoint, per user, per model. Within a week I had data I'd been guessing at for months.

The findings were brutal:

  1. Email summarization was 61% of my total costs. This feature runs on every incoming email before a human even sees it. Turns out most emails don't need AI summarization - they're one-sentence questions. I was using GPT-4 to summarize "What's my order status?" into... "User is asking about order status."
  2. One customer was responsible for 18% of my monthly bill. They had integrated our API into their workflow and were hammering it with requests. Not their fault - we didn't have rate limiting. Just didn't know it was happening.
  3. Our "sentiment analysis" feature was nearly useless and expensive. We ran sentiment analysis on every message. The data showed we never actually used the sentiment score for anything. Cut the entire feature. Saved $800/month.

The Changes I Made (And The Results)

Armed with actual data, I made four changes:

1. Switched Email Summarization to GPT-3.5-Turbo

Only for emails under 100 words. Longer emails still use GPT-4. Quality stayed the same (because short emails don't need the reasoning power of GPT-4). Cost dropped 42% on this feature alone.
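
The routing logic lives in my app, not in Bifrost. It's roughly this, where the helper name and the system prompt are simplified stand-ins for the real thing:

function pickSummarizationModel(emailBody) {
  // Short emails don't need GPT-4-level reasoning; 100 words is the cutoff I settled on
  const words = emailBody.trim().split(/\s+/).length;
  return words < 100 ? "gpt-3.5-turbo" : "gpt-4";
}

async function summarizeEmail(emailBody) {
  const completion = await openai.chat.completions.create({
    model: pickSummarizationModel(emailBody),
    messages: [
      { role: "system", content: "Summarize this customer email in one sentence." },
      { role: "user", content: emailBody },
    ],
  });
  return completion.choices[0].message.content;
}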

2. Added Per-Customer Rate Limiting

Bifrost has built-in rate limiting per virtual key. I created different keys for different customer tiers. High-paying customers get higher limits. Free tier gets throttled.
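
On the application side this is just a matter of which key gets sent; the limits themselves live in Bifrost. Something like this, where the key names and tiers are illustrative rather than my actual config:

import OpenAI from "openai";

// Each tier maps to a different Bifrost virtual key, and each key carries its own rate limit
const BIFROST_KEYS = {
  free: process.env.BIFROST_KEY_FREE,
  pro: process.env.BIFROST_KEY_PRO,
  enterprise: process.env.BIFROST_KEY_ENTERPRISE,
};

function clientForCustomer(customer) {
  return new OpenAI({
    baseURL: "http://localhost:8080/openai",
    apiKey: BIFROST_KEYS[customer.tier],
  });
}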

The customer burning 18% of my budget was on the free tier. They're now limited to 100 requests/day. Offered them a paid plan if they need more. They upgraded.

3. Enabled Semantic Caching

This was the feature I didn't even know I needed.

Bifrost has semantic caching built in. It uses vector similarity to catch questions that are semantically the same even if worded differently.

Example:

  • "How do I reset my password?"
  • "I forgot my password, what should I do?"
  • "Can't log in, need password reset"

All three hit the same cache. Instead of three API calls to OpenAI at $0.03 each, it's one API call and two cache hits at basically zero cost.

After enabling it, my cache hit rate stabilized around 47%. That's 47% of my LLM requests that don't hit OpenAI at all.
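
I haven't read Bifrost's internals, but conceptually a semantic cache boils down to something like this toy version: embed the question, compare it to questions you've already answered, and reuse the answer if it's close enough (the 0.9 threshold and in-memory array are purely illustrative):

const cache = []; // { embedding: number[], response: string }

function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function cachedAnswer(question) {
  const embedding = (await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question,
  })).data[0].embedding;

  // "How do I reset my password?" and its rewordings all land within the threshold
  const hit = cache.find((entry) => cosine(entry.embedding, embedding) > 0.9);
  if (hit) return hit.response;

  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: question }],
  });
  const response = completion.choices[0].message.content;
  cache.push({ embedding, response });
  return response;
}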

4. Set Up Cost Alerts

Bifrost lets you set budget limits per key. I created alerts at 80% of monthly budget. Now if costs spike unexpectedly, I know within hours instead of when the bill arrives.

The Results After 60 Days

Month 1 (Before Bifrost): $6,200 in LLM costs

Month 2 (After changes): $2,950 in LLM costs

52% reduction. Same features. Same quality. Zero changes to application code beyond pointing to Bifrost instead of OpenAI directly.

The breakdown of savings:

  • Semantic caching: ~$1,800/month saved
  • Smarter model selection: ~$900/month saved
  • Rate limiting abusive usage: ~$400/month saved
  • Cutting useless features: ~$800/month saved

The Secondary Benefits I Didn't Expect

Automatic Failover

Bifrost can route to multiple providers. I added Anthropic (Claude) as a backup. When OpenAI had that 4-hour outage last month, Bifrost automatically failed over to Claude. My users didn't notice. I only knew because I checked the dashboard and saw the traffic shift.

Before Bifrost, that outage would have meant 4 hours of my product being completely down.

Better Debugging

The request logs in Bifrost show the full prompt, response, token counts, and latency for every call. When users report issues, I can search for their conversation and see exactly what the LLM received and returned.

Way better than my previous setup of grepping through application logs hoping I logged the right thing.

No Vendor Lock-in

Because Bifrost abstracts the provider, I can test different models without changing code. I've run experiments routing 10% of traffic to Claude to compare quality. If OpenAI pricing changes, I can switch providers in the config, not in the codebase.

What I'd Do Differently

If I were starting over, I'd deploy Bifrost on day one instead of six months in.

The visibility alone is worth it. Even if you're not optimizing costs yet, knowing where your money goes helps you make better product decisions.

I'd also enable semantic caching immediately. The 47% cache hit rate I'm seeing now means I wasted ~$3,000 in the first six months on duplicate requests.

The Technical Setup (For Anyone Curious)

I'm running Bifrost self-hosted on a t3.small EC2 instance ($15/month). It handles 15,000 requests/month with zero issues. Memory usage sits around 120MB.

Semantic caching uses Weaviate for vector storage. I'm running the free self-hosted version. Total infrastructure cost for the LLM gateway: $15/month.

The savings paid for it within the first week.

Is This Just For Cost Optimization?

No. The cost stuff is what got my attention, but Bifrost turned into my LLM infrastructure layer.

It handles:

  • Routing (OpenAI for most, Claude for longer context)
  • Caching (semantic similarity)
  • Rate limiting (per customer tier)
  • Failover (automatic backup to Claude)
  • Observability (request logs, cost tracking, latency)
  • Governance (budget limits, usage alerts)

All without adding complexity to my application code. My backend still just calls openai.chat.completions.create() and everything else happens transparently.
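
Concretely, a typical call in my backend still looks like plain OpenAI SDK usage (function name and prompt shortened for the post):

async function suggestReply(conversationText) {
  // Same call as before Bifrost; routing, caching, rate limits, failover,
  // and logging all happen in the gateway
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "system", content: "Draft a suggested support reply." },
      { role: "user", content: conversationText },
    ],
  });
  return completion.choices[0].message.content;
}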

The Bottom Line

I cut my AI costs in half without changing my product, degrading quality, or spending weeks on optimization.

The key was having visibility into what was actually expensive, then making targeted changes instead of guessing.

If you're running LLM-powered features in production and you don't have per-endpoint cost tracking, you're flying blind. Bifrost gave me the data I needed to stop wasting money.

For anyone building with LLMs: add observability before you need it. Future you will thank you when the bill arrives.

Top comments (1)

Eniek Karels

So you did change a single line...