Right, so. This is a post I wish existed six months ago when we were first wiring LLMs into our Go backend at Huma.
Most of the tutorials out there for LLM integration assume you're in Python. Which is fine — a lot of ML infrastructure is Python, and libraries like LangChain, LiteLLM, and friends are well-documented. But if you're running a Go service stack and you want to add LLM calls without bolting on a whole Python sidecar, the path is less obvious.
Here's what we actually learned, including where we went wrong.
## The problem we were solving
We build remote patient monitoring software. Clinicians use dashboards to track patients with chronic conditions — vitals, medication adherence, care notes. We added an LLM-powered summarization layer: given a week's worth of patient data, produce a brief natural-language summary for the clinician at the start of a shift.
Simple enough use case. The constraints were:
- Go service stack (everything is Go at Huma, has been for years)
- Latency-sensitive — this is in a hot path used during ward rounds, so we had a soft target of sub-1s end-to-end
- HIPAA-relevant, so we needed to be clear about data routing
- Needed provider failover — if OpenAI has an outage during peak hours, we can't just have it be broken
## Attempt 1: Python sidecar
The first thing we did was the obvious thing: stood up a small FastAPI service in a sidecar container running LiteLLM. Our Go services would call the sidecar over HTTP, sidecar would call the provider.
It worked. And then we measured it.
- p50 latency (model only): ~400ms
- p50 latency (with sidecar): ~950ms
- p99 latency (model only): ~800ms
- p99 latency (with sidecar): ~1,400ms
The sidecar overhead was 500-600ms. Not at the HTTP level — at the total stack level, including the Python runtime, the LiteLLM library initialisation overhead on cold requests, and the round trip.
We spent a week trying to optimise it. Got it down to about 300ms overhead. Still not good enough.
## Why Python proxies are painful for Go services
This isn't a Python criticism post. Python is fine for what it's for. But if you're integrating a Python proxy into a Go service, you're dealing with:
Different deployment lifecycle — your Go binary is a single static artifact. Python has a requirements.txt, a virtualenv, a cold start time, and dependencies that drift. Keeping these in sync across environments is friction.
Overhead from the library layer — libraries like LiteLLM do a lot. Request transformation, response normalisation, retry logic, fallback handling. That's all valuable, but it has a cost. When it runs in Python on every request path, you feel it.
Debugging across the boundary — when something goes wrong, your traces span two different runtimes with different logging formats. This sounds minor and isn't.
## What we tried instead
I found Bifrost — an open-source LLM gateway written in Go. Apache 2.0, runs as a proper binary. The headline claim is 11µs overhead, which I was skeptical about.
It turned out to be accurate, or close enough that it doesn't matter. In practice we measured sub-1ms gateway overhead even under load. The difference from the Python sidecar was stark.
To get started locally:
```shell
npx -y @maximhq/bifrost
```
That spins up the gateway on localhost:8080. For production we containerise it and run it as a sidecar in the same pod as our Go service — tiny image, proper Go binary, low resource footprint.
## The Go integration
Bifrost exposes a standard OpenAI-compatible HTTP API, so integrating it from Go is just HTTP client work. Here's our client struct:
```go
package llmclient

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type Client struct {
	baseURL    string
	apiKey     string
	httpClient *http.Client
}

func New(baseURL, apiKey string) *Client {
	return &Client{
		baseURL: baseURL,
		apiKey:  apiKey,
		httpClient: &http.Client{
			Timeout: 30 * time.Second,
		},
	}
}

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type CompletionRequest struct {
	Model       string    `json:"model"`
	Messages    []Message `json:"messages"`
	Temperature float32   `json:"temperature,omitempty"`
}

type CompletionResponse struct {
	ID      string `json:"id"`
	Choices []struct {
		Message      Message `json:"message"`
		FinishReason string  `json:"finish_reason"`
	} `json:"choices"`
	Usage struct {
		PromptTokens     int `json:"prompt_tokens"`
		CompletionTokens int `json:"completion_tokens"`
	} `json:"usage"`
}

func (c *Client) Complete(ctx context.Context, model, userPrompt string) (*CompletionResponse, error) {
	reqBody := CompletionRequest{
		Model: model,
		Messages: []Message{
			{Role: "user", Content: userPrompt},
		},
	}

	body, err := json.Marshal(reqBody)
	if err != nil {
		return nil, fmt.Errorf("marshalling request: %w", err)
	}

	req, err := http.NewRequestWithContext(
		ctx,
		http.MethodPost,
		c.baseURL+"/v1/chat/completions",
		bytes.NewReader(body),
	)
	if err != nil {
		return nil, fmt.Errorf("creating http request: %w", err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+c.apiKey)

	resp, err := c.httpClient.Do(req)
	if err != nil {
		return nil, fmt.Errorf("sending request: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("upstream returned status %d", resp.StatusCode)
	}

	var result CompletionResponse
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return nil, fmt.Errorf("decoding response: %w", err)
	}
	return &result, nil
}
```
Nothing exotic. It's just Go HTTP. The gateway handles all the provider-specific bits.
## Handling provider failover
The piece that mattered most for our use case was failover. Bifrost handles this through its routing config — you set up primary and fallback providers, and it handles the retry and fallback logic transparently. From our Go service's perspective, it's still just one HTTP call.
Here's a simplified version of how we define the fallback in the Bifrost config:
```yaml
providers:
  - name: openai
    api_key: ${OPENAI_API_KEY}
    weight: 100
  - name: anthropic
    api_key: ${ANTHROPIC_API_KEY}
    weight: 0 # fallback only

routing:
  strategy: failover
  fallback_on: [5xx, timeout]
```
When OpenAI returns a 5xx or times out, requests automatically fail over to Anthropic. The clinician's dashboard keeps working. Nobody gets paged.
For HIPAA purposes, we audited the data routing — Bifrost passes requests through to the configured provider, it doesn't store or log the content by default. We added our own audit logging at the Go service layer before the request goes out.
## The latency after switching
- p50 latency (gateway overhead): <1ms
- p50 latency (total, model + gateway): ~420ms
- p99 latency (total): ~870ms
Down from 1,400ms p99 to ~870ms. Most of what's left is the model.
That difference is real in the context we're in. Clinicians using a tool during ward rounds notice 1.4 seconds. They don't notice 870ms.
## What I'd do differently from the start
If I were starting this integration again:
- Don't reach for a Python sidecar reflexively. If your stack is Go, there are Go-native options now.
- Measure the proxy overhead early, not after you've invested. We assumed sidecar overhead would be negligible. It wasn't.
- Design failover in from the start. Bolting it on later meant a sprint of refactoring. The gateway approach made this much simpler.
- Keep the LLM client behind an interface. We have a `Summarizer` interface in our domain layer that the LLM client implements. That means we can swap implementations in tests without touching anything else.
```go
type Summarizer interface {
	Summarize(ctx context.Context, patientData PatientWeeklySummary) (string, error)
}

// In tests:
type mockSummarizer struct{}

func (m *mockSummarizer) Summarize(_ context.Context, _ PatientWeeklySummary) (string, error) {
	return "Patient stable. No significant changes.", nil
}
```
Standard Go stuff, but worth saying explicitly: don't let the LLM client leak into your domain logic.
The tl;dr is: if you're building Go services and need LLMs, the Python proxy path has a real latency cost that's easy to miss until you measure it. There are proper Go-native options now. Worth knowing they exist before you commit to an architecture that'll hurt you later.
Happy to share more specifics on the HIPAA logging setup or the failover config if useful — drop it in the comments.