SUNNY ANAND
How I Built a Golang AI Gateway to Cut OpenAI Costs by 90%

The $50 Wake-Up Call

Last week, I was building an AI chatbot. It was fun until I saw my OpenAI bill. I was paying for the same questions over and over again.

If 100 users ask "What is the capital of France?", why am I paying OpenAI 100 times to generate the same word: "Paris"?

I realized I needed a Caching Layer. But not just any cache—a Semantic Cache.

The Architecture
I decided to build a middleware that sits between my app and OpenAI.

  • Language: Go (Golang) for high concurrency.
  • Cache: Redis (for speed) + Pinecone (for vector search).
  • Routing: Automatically switches between GPT-4 and Claude 3.
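To make that flow concrete, here is a minimal sketch of the gateway loop in Go. All names here are hypothetical: a plain in-memory map stands in for Redis, and there is no semantic matching yet (that comes next).

```go
package main

import (
	"fmt"
	"sync"
)

// Provider abstracts any upstream LLM API (OpenAI, Anthropic, ...).
type Provider interface {
	Send(prompt string) (string, error)
}

// Gateway sits between the app and the provider, caching answers.
type Gateway struct {
	mu       sync.RWMutex
	cache    map[string]string // stand-in for Redis in this sketch
	upstream Provider
}

func NewGateway(p Provider) *Gateway {
	return &Gateway{cache: map[string]string{}, upstream: p}
}

// Ask returns a cached answer when available; otherwise it calls the
// upstream provider and stores the result for next time.
func (g *Gateway) Ask(prompt string) (string, bool, error) {
	g.mu.RLock()
	if ans, ok := g.cache[prompt]; ok {
		g.mu.RUnlock()
		return ans, true, nil // cache hit: zero API cost
	}
	g.mu.RUnlock()

	ans, err := g.upstream.Send(prompt)
	if err != nil {
		return "", false, err
	}
	g.mu.Lock()
	g.cache[prompt] = ans
	g.mu.Unlock()
	return ans, false, nil
}

// echoProvider is a stand-in upstream that echoes and counts calls.
type echoProvider struct{ calls int }

func (e *echoProvider) Send(prompt string) (string, error) {
	e.calls++
	return "echo: " + prompt, nil
}

func main() {
	p := &echoProvider{}
	g := NewGateway(p)
	g.Ask("What is the capital of France?") // miss: hits the provider
	_, hit, _ := g.Ask("What is the capital of France?")
	fmt.Println(hit, p.calls) // second call is a hit; provider was called once
}
```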

The "Secret Sauce": Vector Caching
A normal cache (Key-Value) is dumb.

  • User A: "Hello" -> Cache Miss
  • User B: "Hi there" -> Cache Miss

They mean the same thing, but the strings are different.

I used OpenAI Embeddings to turn text into vectors (lists of numbers). Now, if the cosine similarity between two prompts is > 0.9, I serve the cached answer.

Result: My API bill dropped by 90% overnight.

Building the "Universal Router"

I didn't want to be locked into OpenAI forever. So I built a Provider interface in Go.

// AIProvider abstracts any LLM backend behind a single method.
type AIProvider interface {
    Send(prompt string) (string, error)
}
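For illustration, here are two stub implementations of that interface. These are stubs only; the real providers would wrap each vendor's HTTP API.

```go
package main

import "fmt"

// AIProvider abstracts any LLM backend behind a single method.
type AIProvider interface {
	Send(prompt string) (string, error)
}

// openAIProvider and claudeProvider are illustrative stubs that just
// tag the prompt with the model family they represent.
type openAIProvider struct{}

func (openAIProvider) Send(prompt string) (string, error) {
	return "[gpt-4] " + prompt, nil
}

type claudeProvider struct{}

func (claudeProvider) Send(prompt string) (string, error) {
	return "[claude-3-opus] " + prompt, nil
}

func main() {
	var p AIProvider = openAIProvider{}
	resp, _ := p.Send("Hello")
	fmt.Println(resp)

	p = claudeProvider{} // swap backends without touching any caller code
	resp, _ = p.Send("Hello")
	fmt.Println(resp)
}
```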

Now, my users can switch models instantly just by changing a JSON parameter:

{
  "model": "claude-3-opus",
  "message": "Write a poem about Rust."
}
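Under the hood, the router can simply decode that JSON body and look the model up in a map of providers. A sketch, with stub providers standing in for the real clients:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AIProvider abstracts any LLM backend behind a single method.
type AIProvider interface {
	Send(prompt string) (string, error)
}

// stubProvider tags responses with the model name it represents.
type stubProvider struct{ name string }

func (s stubProvider) Send(prompt string) (string, error) {
	return "[" + s.name + "] " + prompt, nil
}

// request mirrors the JSON body shown above.
type request struct {
	Model   string `json:"model"`
	Message string `json:"message"`
}

// providers maps model names to backends.
var providers = map[string]AIProvider{
	"gpt-4":         stubProvider{"gpt-4"},
	"claude-3-opus": stubProvider{"claude-3-opus"},
}

// route decodes the body and dispatches to the named provider.
func route(body []byte) (string, error) {
	var req request
	if err := json.Unmarshal(body, &req); err != nil {
		return "", err
	}
	p, ok := providers[req.Model]
	if !ok {
		return "", fmt.Errorf("unknown model %q", req.Model)
	}
	return p.Send(req.Message)
}

func main() {
	body := []byte(`{"model":"claude-3-opus","message":"Write a poem about Rust."}`)
	resp, err := route(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(resp)
}
```

Adding a new backend is then a one-line change: implement the interface and register it in the map.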

The Result: Nexus Gateway
I realized other developers have this exact problem. So I polished the code, added Stripe for billing, built a Next.js Dashboard, and open-sourced it.

I even published a Python SDK so you can use it in 3 lines of code:

pip install nexus-gateway
from nexus_gateway import NexusClient

client = NexusClient(api_key="nk-your-key")
response = client.chat("Hello World")

Try it out
It’s live, open-source, and free to try.

If you are building AI apps and want to save money, give it a shot. I’d love your feedback!

Top comments (4)

Ahmad Shokry

That is genius, I like what you did.

SUNNY ANAND

Thanks, Ahmad! I appreciate that. 🙌
It was a fun challenge trying to squeeze performance out of Go and Redis.
If you have any Python projects running, give the SDK a try (pip install nexus-gateway) and let me know if it helps your latency!

Ahmad Shokry

You can also use the 'max_tokens' parameter to save more

SUNNY ANAND

That is a great point! Limiting max_tokens definitely stops the model from rambling and costing too much.
My goal with Nexus was to bring the cost to $0 for repeated queries. If the answer is in the cache, we don't even hit the API, so max_tokens doesn't even matter!
But you are right—combining max_tokens (for misses) + Nexus Caching (for hits) would be the ultimate money saver. I might add a default limit feature in v2. Thanks for the tip!