Pranay Batta

Building a RAG-Powered Documentation Assistant: Why I Used Bifrost LLM Gateway Instead of Direct API Calls

I needed to build an internal documentation assistant for our engineering team. Nothing fancy - just a RAG system that could answer questions about our codebase, API docs, and internal wikis.

Started simple: embed documents, store in a vector DB, retrieve relevant chunks, send to GPT-4. Standard RAG pipeline. Worked fine in testing.
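For reference, that first version was roughly the shape below. A minimal sketch, not my exact code: the model names, the prompt, and the search_chunks helper (whatever your vector DB client exposes) are illustrative.

import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def answer(question: str, search_chunks) -> str:
    # Embed the question, then pull the most similar chunks from the vector DB.
    query_vec = client.embeddings.create(
        model="text-embedding-3-small",   # illustrative embedding model
        input=question,
    ).data[0].embedding
    chunks = search_chunks(query_vec, top_k=5)   # hypothetical vector-DB lookup

    # Stuff the retrieved context into the prompt and ask the model.
    context = "\n\n".join(chunks)
    response = client.chat.completions.create(
        model="gpt-4o",   # stand-in for whichever GPT-4 variant you use
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content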

Then I deployed it internally and everything got complicated fast.

The Problems That Showed Up In Production

Problem 1: No idea which queries were expensive

Some questions cost $0.02 to answer. Others cost $0.40. I had no visibility into why or which types of questions were burning tokens. Was it retrieving too many chunks? Generating overly verbose answers? No clue.

Problem 2: OpenAI outages killed the whole tool

When OpenAI went down (which happens more often than you'd think), our docs assistant just stopped working. Engineers would ping me asking if it's broken. Yeah, it's broken. Nothing I can do about it.

Problem 3: Similar questions hit the API repeatedly

"How do I authenticate API requests?" and "What's the auth flow for the API?" are semantically identical. But we were hitting OpenAI for both, paying twice for the same answer.

Traditional response caching didn't help because the questions are worded differently.
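A toy example of why: an exact-match cache keys on the literal query string, so the reworded question sails right past it.

cache = {}
q1 = "How do I authenticate API requests?"
q2 = "What's the auth flow for the API?"

cache[q1] = "Use a bearer token in the Authorization header..."
print(cache.get(q2))  # None - same meaning, different wording, cache miss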

Problem 4: Debugging was painful

When someone said "the bot gave me a wrong answer," I had to dig through application logs to find the request, try to reconstruct what context was retrieved, and guess at what went wrong.

No systematic way to see: what did we retrieve? What did we send to the LLM? What did it return?

Why I Chose Bifrost Instead of Building This Myself

I could have built logging, caching, failover logic, and monitoring myself. Probably would have taken a few weeks. But I wanted to ship the docs assistant, not build LLM infrastructure.

Bifrost is an open-source LLM gateway. You point your OpenAI client at it instead of directly at OpenAI, and it handles routing, caching, failover, and observability.

GitHub: maximhq / bifrost

Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost


The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start

Get started

Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…

The setup was genuinely simple. The only code change was pointing the OpenAI client at Bifrost:

import os
from openai import OpenAI

# Point the standard OpenAI client at the local Bifrost gateway
client = OpenAI(
    base_url="http://localhost:8080/openai",
    api_key=os.getenv("BIFROST_KEY")
)

Everything else in my code stayed the same. All the OpenAI SDK features still work - streaming, function calling, vision, everything.
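Streaming, for example, looks exactly like it does against api.openai.com; the only Bifrost-specific bit is the provider-prefixed model name from the quick start above.

# Streaming through Bifrost: identical to streaming against OpenAI directly
stream = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize our deploy process."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)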

What It Actually Solved

Semantic Caching Without Building It

Bifrost's semantic caching uses vector similarity to match questions that mean the same thing but are worded differently.

When someone asks "How do I reset my API key?" and later someone asks "What's the process for regenerating API keys?", the second query hits the cache instead of calling OpenAI.

I didn't have to build embedding generation, vector comparison, or cache invalidation logic. Just enabled it in the config and it worked.
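The idea itself is simple even if the plumbing isn't: embed the incoming question, compare it against embeddings of previously answered questions, and reuse the stored answer above a similarity threshold. A conceptual sketch, not Bifrost's actual implementation:

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_lookup(query_vec, cache, threshold=0.9):
    # cache: list of (question_embedding, stored_answer) pairs
    best = max(cache, key=lambda entry: cosine(query_vec, entry[0]), default=None)
    if best and cosine(query_vec, best[0]) >= threshold:
        return best[1]   # semantically similar question answered before: cache hit
    return None          # cache miss: fall through to the real LLM call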

The cache hit rate for our docs assistant stabilized around 35-40%. That's a meaningful chunk of requests that don't cost anything and return instantly.

Automatic Failover to Anthropic

I added Claude as a backup provider in the Bifrost config. When OpenAI has issues, Bifrost automatically routes to Claude instead.

This happened twice in the first month. OpenAI had elevated error rates for about an hour both times. Bifrost failed over automatically. Our team didn't even notice - the docs assistant kept working.

I got Slack notifications from Bifrost about the failover, but users just got their answers.
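Conceptually, the gateway is doing something like this on every request. An illustrative sketch of the control flow only; the transport and alerting callables are stand-ins, not Bifrost APIs.

def call_with_failover(request, providers, send, on_failover=lambda msg: None):
    # Try providers in configured order; on errors, fall through to the next one.
    # `send(provider, request)` and `on_failover` are hypothetical stand-ins
    # for the gateway's transport and Slack alerting.
    last_error = None
    for provider in providers:
        try:
            return send(provider, request)
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            on_failover(f"{provider} failed, trying next provider")
    raise last_error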

Actual Visibility Into What's Happening

Bifrost logs every request with the full prompt, response, token counts, latency, and cost. When someone reports a bad answer, I can search for their query and see exactly what context was retrieved and what the LLM returned.
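A single logged request looks roughly like this; the field names and numbers are illustrative, not Bifrost's exact schema:

log_record = {
    "query": "How do I rotate the staging API key?",            # made-up example
    "full_prompt": "Context:\n<retrieved chunks>\n\nQuestion: ...",
    "response": "Go to Settings > API Keys, then ...",
    "prompt_tokens": 1840,
    "completion_tokens": 320,
    "latency_ms": 2100,
    "cost_usd": 0.0014,
    "cache_hit": False,
}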

This made debugging way faster. I found issues like:

  • Retrieval returning relevant chunks but the LLM ignoring them
  • Certain question patterns triggering overly long responses
  • Edge cases where we retrieved the wrong documentation version

Without request-level logging, I never would have caught these systematically.

Per-Feature Cost Tracking

I tagged different use cases with different virtual keys:

  • Code questions
  • API documentation queries
  • Deployment questions
  • General wiki search

Now I can see which types of questions cost the most. Turns out deployment questions are way more expensive because they trigger longer context and more detailed responses.
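In practice the tagging is just a separate key per feature path when talking to the gateway. A sketch, assuming virtual keys are passed as the API key like in the client snippet above; the env var names are mine.

import os
from openai import OpenAI

BIFROST_URL = "http://localhost:8080/openai"

# One client per use case, each with its own virtual key, so cost and
# usage roll up per feature instead of one undifferentiated bill.
clients = {
    "code":   OpenAI(base_url=BIFROST_URL, api_key=os.getenv("BIFROST_KEY_CODE")),
    "api":    OpenAI(base_url=BIFROST_URL, api_key=os.getenv("BIFROST_KEY_API_DOCS")),
    "deploy": OpenAI(base_url=BIFROST_URL, api_key=os.getenv("BIFROST_KEY_DEPLOY")),
    "wiki":   OpenAI(base_url=BIFROST_URL, api_key=os.getenv("BIFROST_KEY_WIKI")),
}

def client_for(question_type: str) -> OpenAI:
    return clients.get(question_type, clients["wiki"])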

This helped me optimize where it actually mattered instead of guessing.

The Unexpected Benefits

Testing prompts before deploying them

Bifrost has a UI where you can test prompt changes against your recent queries before updating the production system. I can see how a new prompt would have answered the last 50 questions and compare quality.

Saved me from deploying a "better" prompt that actually made answers worse for a specific class of questions.

No vendor lock-in

Because Bifrost abstracts the provider, I can experiment with different models without changing application code. I've tested routing some queries to Claude to compare answer quality.

If OpenAI pricing changes or a better model comes out, I can switch in the config rather than rewriting code.
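For example, trying Claude on a class of queries is just a different model string on the same client, assuming Claude is configured as a provider in Bifrost (the exact model id below is illustrative):

# Same client, same call shape - only the model string changes
response = client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet-20241022",   # illustrative id; any configured model works
    messages=[{"role": "user", "content": "How do we roll back a bad deploy?"}],
)
print(response.choices[0].message.content)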

Rate limiting I didn't know I needed

We have a few power users who ask a lot of questions. Bifrost's rate limiting prevents any single person from burning through budget during their debugging sessions.

Not a huge issue now, but good to have built in.

What I'd Build Differently Next Time

If I were building another LLM-powered tool, I'd deploy Bifrost from the start instead of adding it later.

The visibility alone is worth it. Knowing what queries cost, which ones fail, and where latency comes from helps you make better product decisions early.

I'd also think harder about semantic caching upfront. The 35-40% cache hit rate I'm seeing means I could've saved a lot on duplicate questions if I'd enabled it on day one.

The Architecture Now

The stack is pretty simple:

  1. User asks a question in Slack
  2. Bot retrieves relevant docs from Weaviate
  3. Bot sends query + context to Bifrost
  4. Bifrost checks the semantic cache first
  5. On a cache miss, Bifrost routes to OpenAI (or Claude if OpenAI is down)
  6. Answer goes back to the user

Bifrost handles caching, failover, logging, and cost tracking. My application just makes a standard OpenAI API call and everything else happens transparently.
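Glued together, the request path is a few lines. A condensed sketch; the Weaviate retrieval helper and Slack wiring are simplified stand-ins.

def handle_slack_question(question: str) -> str:
    # Steps 1-2: retrieve relevant chunks from Weaviate (retrieval details elided)
    chunks = retrieve_from_weaviate(question, top_k=5)   # hypothetical helper
    context = "\n\n".join(chunks)

    # Step 3: send query + context to Bifrost via the standard OpenAI client
    response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer from the provided documentation only."},
            {"role": "user", "content": f"Docs:\n{context}\n\nQuestion: {question}"},
        ],
    )

    # Steps 4-6: caching, failover, logging, and cost tracking happen inside the gateway
    return response.choices[0].message.content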

Running It Self-Hosted

I'm running Bifrost on a small EC2 instance. It's a single Go binary - no Python runtime, no dependency management, just a compiled executable.

Memory usage sits around 100-150MB. It handles our query volume (a couple hundred requests per day) without breaking a sweat.

The semantic caching uses Weaviate for vector storage. I'm running that self-hosted too. Total infrastructure cost for the LLM gateway layer is minimal.

Would I Recommend It?

Yes, especially if you're building anything that makes repeated LLM calls in production.

The problems Bifrost solves aren't obvious until you hit them. You don't think about failover until OpenAI goes down and your product stops working. You don't think about semantic caching until you realize you're paying for the same answer 40 times.

For a documentation assistant specifically, semantic caching is a huge win. People ask the same questions in different ways constantly.

The Setup

Getting started took maybe 30 minutes including reading the docs:

  1. Pull the Docker image or download the binary
  2. Configure API keys for OpenAI (and optionally other providers)
  3. Point your OpenAI client at Bifrost instead of api.openai.com
  4. Enable semantic caching if you want it

That's basically it. The rest is just configuration tuning based on your use case.

Links:

  • Bifrost on GitHub: https://github.com/maximhq/bifrost

The Takeaway

Building LLM features is easy. Operating them reliably in production is harder.

Bifrost gave me failover, caching, observability, and cost control without adding complexity to my application code. Still just calling the standard OpenAI SDK, but with infrastructure that actually helps when things break.

For anyone building RAG systems or other LLM-powered tools: add an abstraction layer early. Future you will appreciate it when you need to debug a production issue or when your LLM provider has an outage.
