The $50 Wake-Up Call
Last week, I was building an AI chatbot. It was fun until I saw my OpenAI bill: I was paying to answer the same questions over and over again.
If 100 users ask "What is the capital of France?", why am I paying OpenAI 100 times to generate the same word: "Paris"?
I realized I needed a Caching Layer. But not just any cache—a Semantic Cache.
The Architecture
I decided to build a middleware layer that sits between my app and OpenAI.
- Language: Go (Golang) for high concurrency.
- Cache: Redis (for speed) + Pinecone (for vector search).
- Routing: Automatically switches between GPT-4 and Claude 3.
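Before drilling into each piece, here is roughly how a request flows through the gateway. This is a hedged sketch rather than the actual Nexus source: cacheLookup, cacheStore, and callModel are hypothetical stand-ins for the semantic cache and the model router covered below.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// Hypothetical helpers, stubbed so the sketch compiles.
func cacheLookup(prompt string) (string, bool)       { return "", false } // semantic cache (next section)
func cacheStore(prompt, answer string)               {}                   // write-through on a miss
func callModel(model, prompt string) (string, error) { return "", nil }   // provider router (below)

func handleChat(w http.ResponseWriter, r *http.Request) {
    var req struct {
        Model   string `json:"model"`
        Message string `json:"message"`
    }
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, "bad request", http.StatusBadRequest)
        return
    }

    // 1. Semantic cache first: a hit never touches a paid API.
    if answer, ok := cacheLookup(req.Message); ok {
        fmt.Fprint(w, answer)
        return
    }

    // 2. Miss: route the prompt to whichever model was requested.
    answer, err := callModel(req.Model, req.Message)
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadGateway)
        return
    }

    // 3. Cache the fresh answer so the next similar prompt is free.
    cacheStore(req.Message, answer)
    fmt.Fprint(w, answer)
}

func main() {
    http.HandleFunc("/v1/chat", handleChat)
    http.ListenAndServe(":8080", nil)
}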
The "Secret Sauce": Vector Caching
A normal key-value cache is dumb:
- User A: "Hello" -> Cache Miss
- User B: "Hi there" -> Cache Miss
They mean the same thing, but the strings are different.
I used OpenAI Embeddings to turn text into vectors (lists of numbers). Now, if the cosine similarity between two prompts is > 0.9, I serve the cached answer.
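To make that concrete, here is a minimal Go sketch of the similarity check. The cosine math is standard; embedText is a hypothetical wrapper around the embeddings endpoint, and in the real gateway the nearest-neighbor search happens inside Pinecone rather than over a Go slice.

package cache

import "math"

// cosineSimilarity compares two equal-length embedding vectors:
// 1.0 means they point the same way (same meaning), 0 means unrelated.
func cosineSimilarity(a, b []float64) float64 {
    var dot, normA, normB float64
    for i := range a {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    if normA == 0 || normB == 0 {
        return 0
    }
    return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

const threshold = 0.9 // above this, two prompts count as the same question

type entry struct {
    vector []float64 // embedding of the original prompt
    answer string    // the response we already paid for
}

// lookup returns a cached answer when a stored prompt is close enough
// in meaning to the incoming one.
func lookup(prompt string, cached []entry) (string, bool) {
    v := embedText(prompt)
    for _, e := range cached {
        if cosineSimilarity(v, e.vector) > threshold {
            return e.answer, true // semantic hit: serve it for free
        }
    }
    return "", false // miss: fall through to the real model
}

func embedText(prompt string) []float64 { return nil } // stand-in for the embeddings API

With real embeddings, "Hello" and "Hi there" should land well above the threshold and share one cached answer.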
Result: My API bill dropped by 90% overnight.
Building the "Universal Router"
I didn't want to be locked into OpenAI forever. So I built a Provider interface in Go.
// AIProvider is the one method every model backend has to implement.
type AIProvider interface {
    Send(prompt string) (string, error)
}
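To show what a concrete backend might look like, here is a sketch of an OpenAI-backed provider using plain net/http. The request and response shapes follow the public Chat Completions API, but treat the details as an assumption; this is not the production Nexus client.

package providers

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

// AIProvider is the interface from above, repeated so this compiles alone.
type AIProvider interface {
    Send(prompt string) (string, error)
}

// OpenAIProvider talks to the Chat Completions endpoint.
type OpenAIProvider struct {
    APIKey string
    Model  string // e.g. "gpt-4"
}

var _ AIProvider = OpenAIProvider{} // compile-time interface check

func (p OpenAIProvider) Send(prompt string) (string, error) {
    payload, err := json.Marshal(map[string]any{
        "model": p.Model,
        "messages": []map[string]string{
            {"role": "user", "content": prompt},
        },
    })
    if err != nil {
        return "", err
    }

    req, err := http.NewRequest(http.MethodPost,
        "https://api.openai.com/v1/chat/completions", bytes.NewReader(payload))
    if err != nil {
        return "", err
    }
    req.Header.Set("Authorization", "Bearer "+p.APIKey)
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    // Decode only what we need: the first choice's text.
    var out struct {
        Choices []struct {
            Message struct {
                Content string `json:"content"`
            } `json:"message"`
        } `json:"choices"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        return "", err
    }
    if len(out.Choices) == 0 {
        return "", fmt.Errorf("openai returned no choices")
    }
    return out.Choices[0].Message.Content, nil
}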
Now, my users can switch models instantly just by changing a JSON parameter:
{
  "model": "claude-3-opus",
  "message": "Write a poem about Rust."
}
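Under the hood, routing is just a switch on that model string. Here is a minimal version, building on the OpenAIProvider sketch above (ClaudeProvider is stubbed; assume it mirrors the same pattern against Anthropic's Messages API):

package providers

import (
    "fmt"
    "os"
)

// ClaudeProvider would mirror OpenAIProvider, pointed at Anthropic.
type ClaudeProvider struct {
    APIKey string
    Model  string
}

func (p ClaudeProvider) Send(prompt string) (string, error) {
    return "", fmt.Errorf("sketch only: call Anthropic's Messages API here")
}

// pick resolves the "model" field from the request JSON to a provider.
// Unknown names return an error instead of a silent default, so a typo
// never burns tokens against the wrong model.
func pick(model string) (AIProvider, error) {
    switch model {
    case "gpt-4":
        return OpenAIProvider{APIKey: os.Getenv("OPENAI_API_KEY"), Model: model}, nil
    case "claude-3-opus":
        return ClaudeProvider{APIKey: os.Getenv("ANTHROPIC_API_KEY"), Model: model}, nil
    default:
        return nil, fmt.Errorf("unknown model %q", model)
    }
}

The nice part of this design: adding a new provider is one struct with one method, plus one new case in the switch.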
The Result: Nexus Gateway
I realized other developers have this exact problem. So I polished the code, added Stripe for billing, built a Next.js Dashboard, and open-sourced it.
I even published a Python SDK so you can use it in 3 lines of code:
pip install nexus-gateway

from nexus_gateway import NexusClient

client = NexusClient(api_key="nk-your-key")
response = client.chat("Hello World")
Try it out
It’s live, open-source, and free to try.
- Live Demo: https://nexus-gateway.org
- GitHub: https://github.com/ANANDSUNNY0899/NexusGateway
If you are building AI apps and want to save money, give it a shot. I’d love your feedback!
Top comments (4)
That is genius, I like what you did!
Thanks, Ahmad! I appreciate that. 🙌
It was a fun challenge trying to squeeze performance out of Go and Redis.
If you have any Python projects running, give the SDK a try (pip install nexus-gateway) and let me know if it helps your latency!
You can also use the max_tokens parameter to save even more.
That is a great point! Limiting max_tokens definitely stops the model from rambling and costing too much.
My goal with Nexus was to bring the cost to $0 for repeated queries. If the answer is in the cache, we never hit the API at all, so max_tokens doesn't even matter!
But you are right—combining max_tokens (for misses) + Nexus Caching (for hits) would be the ultimate money saver. I might add a default limit feature in v2. Thanks for the tip!