SUNNY ANAND
How I Built a Golang AI Gateway to Cut OpenAI Costs by 90%

The $50 Wake-Up Call

Last week, I was building an AI chatbot. It was fun until I saw my OpenAI bill. I was paying for the same questions over and over again.

If 100 users ask "What is the capital of France?", why am I paying OpenAI 100 times to generate the same word: "Paris"?

I realized I needed a Caching Layer. But not just any cache—a Semantic Cache.

The Architecture
I decided to build a middleware that sits between my app and OpenAI.

  • Language: Go (Golang) for high concurrency.
  • Cache: Redis (for speed) + Pinecone (for vector search).
  • Routing: Automatically switches between GPT-4 and Claude 3.
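To make that flow concrete, here is a minimal sketch of the gateway loop in Go. All names here are hypothetical: a plain in-memory map stands in for Redis, and there is no semantic matching yet (that comes next).

```go
package main

import (
	"fmt"
	"sync"
)

// Provider abstracts any upstream LLM API (OpenAI, Anthropic, ...).
type Provider interface {
	Send(prompt string) (string, error)
}

// Gateway sits between the app and the provider, caching answers.
type Gateway struct {
	mu       sync.RWMutex
	cache    map[string]string // stand-in for Redis in this sketch
	upstream Provider
}

func NewGateway(p Provider) *Gateway {
	return &Gateway{cache: map[string]string{}, upstream: p}
}

// Ask returns a cached answer when available; otherwise it calls the
// upstream provider and stores the result for next time.
func (g *Gateway) Ask(prompt string) (string, bool, error) {
	g.mu.RLock()
	if ans, ok := g.cache[prompt]; ok {
		g.mu.RUnlock()
		return ans, true, nil // cache hit: zero API cost
	}
	g.mu.RUnlock()

	ans, err := g.upstream.Send(prompt)
	if err != nil {
		return "", false, err
	}
	g.mu.Lock()
	g.cache[prompt] = ans
	g.mu.Unlock()
	return ans, false, nil
}

// echoProvider is a stand-in upstream that echoes and counts calls.
type echoProvider struct{ calls int }

func (e *echoProvider) Send(prompt string) (string, error) {
	e.calls++
	return "echo: " + prompt, nil
}

func main() {
	p := &echoProvider{}
	g := NewGateway(p)
	g.Ask("What is the capital of France?") // miss: hits the provider
	_, hit, _ := g.Ask("What is the capital of France?")
	fmt.Println(hit, p.calls) // second call is a hit; provider was called once
}
```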

The "Secret Sauce": Vector Caching
A normal cache (Key-Value) is dumb.

  • User A: "Hello" -> Cache Miss
  • User B: "Hi there" -> Cache Miss

They mean the same thing, but the strings are different.

I used OpenAI Embeddings to turn text into vectors (lists of numbers). Now, if the cosine similarity between two prompts is > 0.9, I serve the cached answer.

Result: My API bill dropped by 90% overnight.

Building the "Universal Router"

I didn't want to be locked into OpenAI forever. So I built a Provider interface in Go.

// AIProvider abstracts any LLM backend behind a single method.
type AIProvider interface {
    Send(prompt string) (string, error)
}
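For illustration, here are two stub implementations of that interface. These are stubs only; the real providers would wrap each vendor's HTTP API.

```go
package main

import "fmt"

// AIProvider abstracts any LLM backend behind a single method.
type AIProvider interface {
	Send(prompt string) (string, error)
}

// openAIProvider and claudeProvider are illustrative stubs that just
// tag the prompt with the model family they represent.
type openAIProvider struct{}

func (openAIProvider) Send(prompt string) (string, error) {
	return "[gpt-4] " + prompt, nil
}

type claudeProvider struct{}

func (claudeProvider) Send(prompt string) (string, error) {
	return "[claude-3-opus] " + prompt, nil
}

func main() {
	var p AIProvider = openAIProvider{}
	resp, _ := p.Send("Hello")
	fmt.Println(resp)

	p = claudeProvider{} // swap backends without touching any caller code
	resp, _ = p.Send("Hello")
	fmt.Println(resp)
}
```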

Now, my users can switch models instantly just by changing a JSON parameter:

{
  "model": "claude-3-opus",
  "message": "Write a poem about Rust."
}
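Under the hood, the router can simply decode that JSON body and look the model up in a map of providers. A sketch, with stub providers standing in for the real clients:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AIProvider abstracts any LLM backend behind a single method.
type AIProvider interface {
	Send(prompt string) (string, error)
}

// stubProvider tags responses with the model name it represents.
type stubProvider struct{ name string }

func (s stubProvider) Send(prompt string) (string, error) {
	return "[" + s.name + "] " + prompt, nil
}

// request mirrors the JSON body shown above.
type request struct {
	Model   string `json:"model"`
	Message string `json:"message"`
}

// providers maps model names to backends.
var providers = map[string]AIProvider{
	"gpt-4":         stubProvider{"gpt-4"},
	"claude-3-opus": stubProvider{"claude-3-opus"},
}

// route decodes the body and dispatches to the named provider.
func route(body []byte) (string, error) {
	var req request
	if err := json.Unmarshal(body, &req); err != nil {
		return "", err
	}
	p, ok := providers[req.Model]
	if !ok {
		return "", fmt.Errorf("unknown model %q", req.Model)
	}
	return p.Send(req.Message)
}

func main() {
	body := []byte(`{"model":"claude-3-opus","message":"Write a poem about Rust."}`)
	resp, err := route(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(resp)
}
```

Adding a new backend is then a one-line change: implement the interface and register it in the map.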

The Result: Nexus Gateway
I realized other developers have this exact problem. So I polished the code, added Stripe for billing, built a Next.js Dashboard, and open-sourced it.

I even published a Python SDK so you can use it in 3 lines of code:

pip install nexus-gateway
from nexus_gateway import NexusClient

client = NexusClient(api_key="nk-your-key")
response = client.chat("Hello World")

Try it out
It’s live, open-source, and free to try.

If you are building AI apps and want to save money, give it a shot. I’d love your feedback!

Top comments (4)

Ahmad Shokry

That is genius, I like what you did.

SUNNY ANAND

Thanks, Ahmad! I appreciate that. 🙌
It was a fun challenge trying to squeeze performance out of Go and Redis.
If you have any Python projects running, give the SDK a try (pip install nexus-gateway) and let me know if it helps your latency!

Ahmad Shokry

You can also use the 'max_tokens' parameter to save more

SUNNY ANAND

That is a great point! Limiting max_tokens definitely stops the model from rambling and costing too much.
My goal with Nexus was to bring the cost to $0 for repeated queries. If the answer is in the cache, we don't even hit the API, so max_tokens doesn't even matter!
But you are right—combining max_tokens (for misses) + Nexus Caching (for hits) would be the ultimate money saver. I might add a default limit feature in v2. Thanks for the tip!