Kuldeep Paul

Building an LLM Gateway in Go: What We Learned

Why We Built Bifrost

A year ago, we were routing LLM requests through LiteLLM like everyone else. It worked fine at low scale. Then we hit production traffic.

At 500 RPS, our gateway became the bottleneck. P99 latencies spiked to 20+ seconds. Memory usage climbed continuously. The Python async overhead was killing us.

We tried optimizing. We tried throwing hardware at it. Nothing worked.

So we rebuilt the entire gateway in Go. The result is Bifrost, and it's fully open source.

The Core Architecture

Design Goal: Add <50μs overhead to every LLM request.

Key Decisions:

1. Goroutine-per-request model
Every request gets its own goroutine. Go's runtime handles scheduling. No async/await complexity, no event loop overhead.

```go
func (g *Gateway) HandleRequest(ctx context.Context, req *schemas.CompletionRequest) {
    go func() {
        // Request handling in isolated goroutine
        result := g.processRequest(ctx, req)
        g.sendResponse(result)
    }()
}
```

2. Zero-copy request forwarding
We avoid unnecessary serialization. Request bodies flow through Bifrost without intermediate JSON parsing where possible.
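
A minimal sketch of the pass-through idea (illustrative only, not Bifrost's actual forwarding code; the names forwardRaw and providerURL are ours): the incoming body is streamed to the provider as-is, with no intermediate json.Unmarshal / json.Marshal round trip.

```go
import (
    "context"
    "net/http"
)

// forwardRaw streams the client's request body straight to the provider.
func forwardRaw(ctx context.Context, providerURL string, in *http.Request) (*http.Response, error) {
    // in.Body is passed through untouched: no decode/encode step.
    out, err := http.NewRequestWithContext(ctx, in.Method, providerURL, in.Body)
    if err != nil {
        return nil, err
    }
    out.Header = in.Header.Clone() // reuse the original headers instead of rebuilding them
    return http.DefaultClient.Do(out)
}
```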

3. Async everything else
Logging, metrics, plugin execution - all non-blocking. The hot path stays fast.

```go
// Log asynchronously
go func() {
    g.logger.Log(ctx, logEntry)
}()

// Continue processing immediately
return response
```

Performance Patterns

sync.Pool for allocations
Reduce GC pressure by reusing objects:

```go
var requestPool = sync.Pool{
    New: func() interface{} {
        return &Request{}
    },
}

func getRequest() *Request {
    return requestPool.Get().(*Request)
}
```
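
The other half of the pattern is returning objects to the pool once a request finishes. A minimal sketch (the reset step here is ours, not lifted from Bifrost):

```go
func putRequest(r *Request) {
    *r = Request{} // zero the fields so stale data never leaks into the next request
    requestPool.Put(r)
}
```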

Buffered channels for back-pressure
Prevent goroutine explosion under extreme load:

```go
requestChan := make(chan *Request, 10000)

// Producer
select {
case requestChan <- req:
    // Queued successfully
case <-time.After(100 * time.Millisecond):
    // Back-pressure: reject request
    return ErrOverloaded
}
```
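
On the consuming side, a bounded worker pool can drain the queue so the number of in-flight requests stays capped even when producers burst. A sketch, assuming numWorkers and handle are defined elsewhere:

```go
// Consumers: a fixed set of workers drains the queue.
for i := 0; i < numWorkers; i++ {
    go func() {
        for req := range requestChan {
            handle(req)
        }
    }()
}
```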

Context propagation
Every request carries context for cancellation and timeouts:

```go
ctx, cancel := context.WithTimeout(req.Context(), 30*time.Second)
defer cancel()

resp, err := g.provider.Complete(ctx, req)
```

Plugin System Design

We wanted extensibility without sacrificing performance.

Architecture:

```go
type Plugin interface {
    PreHook(ctx context.Context, req *schemas.CompletionRequest) error
    PostHook(ctx context.Context, req *schemas.CompletionRequest, resp *schemas.CompletionResponse) error
}
```

Plugins run in the request goroutine by default, but can spawn their own goroutines for async work:

```go
func (p *LoggingPlugin) PostHook(ctx context.Context, req *schemas.CompletionRequest, resp *schemas.CompletionResponse) error {
    // Don't block the response
    go p.asyncLog(ctx, req, resp)
    return nil
}
```
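
Putting the two hooks together, the request path conceptually wraps the provider call. A simplified sketch of how that wiring could look (the name executeWithHooks is hypothetical, and error handling and hook ordering are deliberately simplified; this is not Bifrost's actual pipeline):

```go
func (g *Gateway) executeWithHooks(ctx context.Context, req *schemas.CompletionRequest) (*schemas.CompletionResponse, error) {
    // PreHooks run before the provider call; a failing hook short-circuits the request.
    for _, p := range g.plugins {
        if err := p.PreHook(ctx, req); err != nil {
            return nil, err
        }
    }

    resp, err := g.provider.Complete(ctx, req)
    if err != nil {
        return nil, err
    }

    // PostHooks see both request and response; slow work should move to its own goroutine.
    for _, p := range g.plugins {
        if err := p.PostHook(ctx, req, resp); err != nil {
            return nil, err
        }
    }
    return resp, nil
}
```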

Load Balancing Logic

Adaptive load balancing was the hardest part. We track per-key metrics and adjust weights dynamically:

```go
type KeyMetrics struct {
    Latency    time.Duration
    ErrorRate  float64
    LastUpdate time.Time
}

func (lb *LoadBalancer) AdjustWeights() {
    for key, metrics := range lb.metrics {
        score := lb.calculateScore(metrics)
        lb.weights[key] = clamp(score, 0.5, 1.5)
    }
}
```

Keys with low latency and error rates get more traffic. Degraded keys get less.
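
calculateScore and clamp aren't shown above. One plausible shape for them, purely as an illustrative sketch (the constants and formula are ours, not Bifrost's):

```go
// Illustrative scoring only: low latency and low error rate push a key
// toward the 1.5 weight ceiling; degraded keys drift toward the 0.5 floor.
func (lb *LoadBalancer) calculateScore(m KeyMetrics) float64 {
    latencyPenalty := float64(m.Latency) / float64(500*time.Millisecond) // ~1.0 at 500ms
    errorPenalty := m.ErrorRate * 5                                      // ~1.0 at a 20% error rate
    return 1.5 - latencyPenalty - errorPenalty
}

func clamp(v, lo, hi float64) float64 {
    switch {
    case v < lo:
        return lo
    case v > hi:
        return hi
    default:
        return v
    }
}
```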

Observability Without Overhead

Challenge: Capture every request/response without adding latency.

Solution: Async buffered logging with batch writes.

```go
type Logger struct {
    buffer chan *LogEntry
    batch  []*LogEntry
}

func (l *Logger) Start() {
    go func() {
        ticker := time.NewTicker(100 * time.Millisecond)
        for {
            select {
            case entry := <-l.buffer:
                l.batch = append(l.batch, entry)
            case <-ticker.C:
                if len(l.batch) > 0 {
                    l.writeBatch()
                    l.batch = l.batch[:0] // reset after flushing so entries aren't written twice
                }
            }
        }
    }()
}
```

Logs are buffered in memory and written in batches. Zero impact on request latency.
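
On the hot path, enqueuing can stay non-blocking with a select/default, so a full buffer drops a log entry instead of stalling a request. A sketch, assuming drop-on-full is the policy you want (the Enqueue name is ours):

```go
func (l *Logger) Enqueue(entry *LogEntry) {
    select {
    case l.buffer <- entry:
        // handed off to the background batcher
    default:
        // buffer full: drop the entry rather than block the request path
    }
}
```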

Memory Management

Go's GC is good, but we still optimize allocations:

1. Pre-allocate slices

```go
// Bad: zero capacity, so appends keep reallocating
results := []Result{}

// Good: capacity reserved up front
results := make([]Result, 0, expectedSize)
```

2. Reuse buffers

```go
var bufferPool = sync.Pool{
    New: func() interface{} {
        return new(bytes.Buffer)
    },
}
```

3. Limit goroutine lifetime
Always use context for cancellation:

```go
go func(ctx context.Context) {
    select {
    case <-ctx.Done():
        return
    case <-work:
        process()
    }
}(ctx)
```

The Results

After all optimizations:

  • 11μs gateway overhead (vs 600μs for LiteLLM)
  • 68% less memory usage
  • 5,000+ RPS sustained on single instance
  • P99 latency under 1s at 5k RPS

Open Source from Day One

We're not selling licenses. Bifrost is MIT licensed and always will be.

The repo includes:

  • Full source code
  • Benchmark suite
  • Docker setup
  • Production guides
  • Architecture docs

⭐ Star it on GitHub

What We'd Do Differently

1. Start with Go from the beginning
Python was never going to work at scale. Should have recognized this sooner.

2. Build observability in from day one
Retrofitting observability is hard. We built it into the core architecture.

3. Dogfood early
We used our own gateway from the start, which caught issues fast.

Try It

```bash
git clone https://github.com/maximhq/bifrost
cd bifrost
docker compose up
```

Add your API keys at localhost:8080 and start routing. The UI shows real-time latency metrics.

Contributing

We welcome contributions:

  • Performance: Help us go faster
  • Providers: Add new LLM providers
  • Plugins: Share your custom plugins
  • Docs: Improve guides and examples

Join us on GitHub →
