DEV Community

Mervin
How I Cut My LLM Costs by 90% Without Changing My App Logic

There’s a particular kind of dread that comes with checking your OpenAI billing dashboard mid-month.

I’ve been building a news automation hub that runs 14 editorial workspaces, with summarization, rewriting, fact-checking, SEO-tagging, and translation pipelines running around the clock.

The AI layer was already fairly optimized:

  • Groq
  • Gemini Flash
  • DeepSeek
  • OpenRouter
  • provider rotation
  • fallback logic

But the final fallback was still OpenAI, and once rate limits hit, costs climbed faster than expected.

What I needed wasn’t more routing logic.

I needed a smarter endpoint.


The Problem

My setup already rotated between multiple providers, but the architecture had a weakness:

Provider exhausted
    -> fallback
        -> OpenAI
            -> credits disappear

The more providers I added, the messier things became:

  • more API keys
  • more retry logic
  • more conditional branches
  • more provider-specific handling
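To make the problem concrete, here is a minimal sketch of the kind of app-level fallback chain described above. The provider names match the post, but `callGroq`, `callGemini`, and `callOpenAI` are hypothetical stand-ins that simulate rate limits rather than real SDK calls, and `completeWithFallback` is an illustrative name, not code from my project.

```javascript
// Hypothetical stand-ins for real provider SDK calls.
// Each one either returns text or throws an error with an HTTP-like status.
function rateLimited(name) {
  return async () => {
    const err = new Error(`${name} rate limit exceeded`);
    err.status = 429;
    throw err;
  };
}

const callGroq = rateLimited("groq");
const callGemini = rateLimited("gemini");
const callOpenAI = async (prompt) => `openai:${prompt}`; // the paid last resort

// App-level fallback: every new provider adds another entry, another key,
// and usually another provider-specific special case.
async function completeWithFallback(prompt) {
  const chain = [
    { name: "groq", call: callGroq },
    { name: "gemini", call: callGemini },
    { name: "openai", call: callOpenAI },
  ];
  const attempts = [];
  for (const provider of chain) {
    try {
      const text = await provider.call(prompt);
      return { provider: provider.name, text, attempts };
    } catch (err) {
      // Only fall through on rate limits; anything else is a real failure.
      if (err.status !== 429) throw err;
      attempts.push(provider.name);
    }
  }
  throw new Error("all providers exhausted");
}
```

When the free providers are rate-limited at the same time, every request in this sketch lands on the paid fallback, which is the cost pattern shown in the diagram.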

I was optimizing infrastructure with application code.

That was the mistake.


The Fix

After digging through self-hosted AI tooling, I found freellmapi.

It’s a lightweight OpenAI-compatible proxy that automatically routes requests across multiple free-tier LLM providers:

  • Groq
  • Cerebras
  • SambaNova
  • Cloudflare Workers AI
  • GitHub Models
  • OpenRouter free models
  • and others

Combined free-tier capacity: roughly 800M tokens/month.

The interesting part is that the routing happens inside the proxy — not inside your app.


My Integration

The integration took less than an hour.

1. Deploy the proxy

I ran it on my existing VPS:

  • Node.js 20
  • ~40MB idle RAM
  • localhost only

2. Add provider credentials

In the admin panel, I added:

  • Groq key
  • Cloudflare credentials
  • OpenRouter key


3. Point my app to a single endpoint

import OpenAI from "openai";

// Send every request to the local proxy instead of api.openai.com.
const client = new OpenAI({
  baseURL: "http://localhost:3001/v1",
  apiKey: process.env.LOCAL_ROUTER_KEY
});

That was basically it.

The important detail:

I stopped specifying models for non-critical tasks.

Instead of forcing a specific provider, I let the proxy auto-route requests to whatever free provider was currently available.
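As a rough illustration, a request through the proxy might look like the sketch below. It uses Node's built-in `fetch` against the standard OpenAI-style `/chat/completions` route instead of the SDK, so it runs without extra dependencies. The `"auto"` model name is an assumption for illustration only; check the proxy's documentation for how it expects auto-routed requests to be expressed. The `summarize` helper and its `fetchImpl` option are also hypothetical.

```javascript
// Sketch: send a non-critical task to the local proxy and let it pick the provider.
// ASSUMPTION: "auto" is a placeholder model name, not a documented freellmapi value.
const PROXY_BASE_URL = "http://localhost:3001/v1";

async function summarize(text, { fetchImpl = fetch, model = "auto" } = {}) {
  const res = await fetchImpl(`${PROXY_BASE_URL}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.LOCAL_ROUTER_KEY}`,
    },
    body: JSON.stringify({
      model,
      messages: [
        { role: "system", content: "Summarize the article in two sentences." },
        { role: "user", content: text },
      ],
    }),
  });
  if (!res.ok) {
    throw new Error(`proxy returned HTTP ${res.status}`);
  }
  const data = await res.json();
  // OpenAI-compatible responses put the text in choices[0].message.content.
  return data.choices[0].message.content;
}
```

Because the provider choice lives behind the URL, nothing in this function names Groq, Cerebras, or any other upstream.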

App
  -> freellmapi
      -> Groq
      -> Cloudflare Workers AI
      -> Cerebras
      -> SambaNova
      -> OpenRouter

If Groq rate-limited a request, another provider picked it up.

If a provider became slow, routing shifted automatically.

My application code never needed to know.
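The routing behaviour described above can be sketched roughly as follows. This is not freellmapi's actual implementation, only a minimal illustration of the idea: a provider that returns a rate-limit error is put on a cooldown, and the remaining providers are tried fastest-first. The `Router` class, its default cooldown, and the provider objects are all hypothetical.

```javascript
class Router {
  constructor(providers, { cooldownMs = 60_000, now = () => Date.now() } = {}) {
    this.now = now;
    this.cooldownMs = cooldownMs;
    // Track when each provider may be used again and its last observed latency.
    this.state = providers.map((p) => ({ ...p, availableAt: 0, latencyMs: 0 }));
  }

  // Providers not on cooldown, ordered by last observed latency (fastest first).
  candidates() {
    const t = this.now();
    return this.state
      .filter((p) => p.availableAt <= t)
      .sort((a, b) => a.latencyMs - b.latencyMs);
  }

  async complete(prompt) {
    for (const p of this.candidates()) {
      const started = this.now();
      try {
        const text = await p.call(prompt);
        p.latencyMs = this.now() - started;
        return { provider: p.name, text };
      } catch (err) {
        if (err.status !== 429) throw err;
        // Rate-limited: skip this provider until the cooldown expires.
        p.availableAt = this.now() + this.cooldownMs;
      }
    }
    throw new Error("no provider available");
  }
}
```

The point of the sketch is where this state lives: cooldowns and latency tracking sit in one process behind the endpoint, not scattered through every pipeline that makes a request.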


The Result

Within 24 hours:

  • OpenAI usage dropped by ~90%
  • background AI tasks became almost entirely free-tier
  • no additional retry logic was needed

Most importantly:
I removed provider chaos from my application layer.


What I Learned

When engineers hit rate limits, the instinct is usually:

  • add more providers
  • add more fallback logic
  • add more code

But sometimes the better solution is adding an abstraction layer that absorbs the complexity for you.

Another realization:

Most AI tasks do not require a specific premium model.

For:

  • summaries
  • tagging
  • drafts
  • translations
  • background enrichment

…almost any decent modern 70B model works fine.
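One way to encode that split is a small lookup that sends background work to the proxy and reserves a pinned model for the few tasks that need one. The non-critical task names come from the list above. Everything else here is a placeholder: the `targetFor` helper, the `"auto"` and `"premium-model"` model names, and the choice to send critical tasks straight to a paid API instead of through the proxy with a pinned model.

```javascript
// Tasks where any decent model is good enough, taken from the list above.
const NON_CRITICAL = new Set([
  "summary",
  "tagging",
  "draft",
  "translation",
  "enrichment",
]);

// Hypothetical helper: decide which endpoint and model a task should use.
function targetFor(task) {
  if (NON_CRITICAL.has(task)) {
    // Let the proxy choose; "auto" is a placeholder, not a documented value.
    return { baseURL: "http://localhost:3001/v1", model: "auto" };
  }
  // Anything else keeps a pinned model; "premium-model" is a placeholder name.
  return { baseURL: "https://api.openai.com/v1", model: "premium-model" };
}
```

Defaulting unknown task types to the pinned model is the conservative choice: a new task only moves to the free tier once someone decides it is safe to do so.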


Caveats

Free-tier infrastructure has tradeoffs.

Some providers:

  • have cold starts
  • introduce latency spikes
  • become temporarily unavailable

For real-time user-facing chat systems, you should test failover carefully.

For async pipelines and batch jobs, though, it’s been surprisingly solid.
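For the latency-sensitive case, one common safeguard is a hard deadline on each request, so a cold start or latency spike fails fast instead of hanging. A minimal sketch, assuming a hypothetical `withTimeout` helper that is not part of any library mentioned here:

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

This only stops waiting; it does not cancel the underlying request. With `fetch`, passing `signal: AbortSignal.timeout(ms)` aborts the request itself.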

Also, run this on infrastructure you control.

A proxy like this handles upstream API keys — don’t hand that responsibility to random hosted services.


Final Thought

The biggest optimization wasn’t changing models.

It was removing complexity from the layer that had to manage them.
