I've spent the last two years integrating LLMs into production systems - a fintech platform processing millions of transactions and a document processing pipeline for government-scale workloads. Go is not the obvious choice for AI work. Python gets all the attention. But for production backends where reliability, cost, and latency actually matter, Go has been the better call every time.
Here's what I've learned.
Why Go for LLM pipelines?
The Python AI ecosystem is rich but chaotic. LangChain adds abstractions on top of abstractions. Dependency conflicts are constant. Async code that works in a notebook falls apart under real concurrency.
Go gives you:
- Predictable memory usage under load
- Goroutines for cheap, real concurrency (not async/await workarounds)
- Compile-time errors that catch the stupid mistakes before they hit production
- A single binary that deploys anywhere
The tradeoff is you write more code. There is no LangChain equivalent in Go. But in my experience, that is a feature. You understand exactly what your pipeline is doing.
The architecture that works
After iterating on several production systems, I keep coming back to this structure:
Request → Validator → ContextBuilder → LLMClient → ResponseParser → Cache → Response
Each stage is a function that takes input and returns output. No magic, no framework. Just functions.
```go
type Pipeline struct {
	validator      Validator
	contextBuilder ContextBuilder
	llmClient      LLMClient
	parser         ResponseParser
	cache          Cache
}

func (p *Pipeline) Run(ctx context.Context, req Request) (Response, error) {
	if err := p.validator.Validate(req); err != nil {
		return Response{}, fmt.Errorf("validation: %w", err)
	}

	// pctx, not "context": avoid shadowing the context package.
	pctx, err := p.contextBuilder.Build(ctx, req)
	if err != nil {
		return Response{}, fmt.Errorf("context: %w", err)
	}

	cacheKey := p.cache.Key(req, pctx)
	if cached, ok := p.cache.Get(cacheKey); ok {
		return cached, nil
	}

	raw, err := p.llmClient.Complete(ctx, pctx)
	if err != nil {
		return Response{}, fmt.Errorf("llm: %w", err)
	}

	result, err := p.parser.Parse(raw)
	if err != nil {
		return Response{}, fmt.Errorf("parse: %w", err)
	}

	p.cache.Set(cacheKey, result)
	return result, nil
}
```
Simple. Each component is testable in isolation. Swap the LLM client without touching anything else.
The mistakes that cost real money
1. Not caching aggressively enough
LLM calls are expensive. In a document processing pipeline, the same context gets rebuilt and sent to the model repeatedly for similar inputs. We added a two-layer cache: in-memory for hot keys, Redis for warm keys. Cost dropped 40% overnight.
The key insight: you do not need semantic similarity matching for most caching. A hash of the normalized input is enough for 80% of cache hits.
```go
func (c *PipelineCache) Key(req Request, ctx PipelineContext) string {
	h := sha256.New()
	h.Write([]byte(req.NormalizedInput()))
	h.Write([]byte(ctx.SystemPrompt))
	return hex.EncodeToString(h.Sum(nil))
}
```
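The two layers behind that cache might look like the sketch below. The `warmStore` interface is a stand-in for a Redis client (any client with Get/Set fits), and values are plain strings for brevity; neither name comes from the original code.

```go
package main

import "sync"

// warmStore abstracts the Redis layer; any client with Get/Set fits.
// (Hypothetical interface, not the real Redis client API.)
type warmStore interface {
	Get(key string) (string, bool)
	Set(key, val string)
}

// TwoLayerCache checks a process-local map first, then the warm store.
type TwoLayerCache struct {
	mu   sync.RWMutex
	hot  map[string]string
	warm warmStore
}

func NewTwoLayerCache(w warmStore) *TwoLayerCache {
	return &TwoLayerCache{hot: make(map[string]string), warm: w}
}

func (c *TwoLayerCache) Get(key string) (string, bool) {
	c.mu.RLock()
	v, ok := c.hot[key]
	c.mu.RUnlock()
	if ok {
		return v, true // hot hit: no network round trip
	}
	if c.warm == nil {
		return "", false
	}
	v, ok = c.warm.Get(key)
	if ok {
		// Promote warm hits so repeats stay in memory.
		c.mu.Lock()
		c.hot[key] = v
		c.mu.Unlock()
	}
	return v, ok
}

func (c *TwoLayerCache) Set(key, val string) {
	c.mu.Lock()
	c.hot[key] = val
	c.mu.Unlock()
	if c.warm != nil {
		c.warm.Set(key, val)
	}
}

// mapStore is an in-memory stand-in for Redis, handy in tests.
type mapStore map[string]string

func (m mapStore) Get(k string) (string, bool) { v, ok := m[k]; return v, ok }
func (m mapStore) Set(k, v string)             { m[k] = v }
```

Write-through on Set keeps both layers consistent; promotion on a warm hit is what makes the hot layer earn its keep.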
2. Treating LLM errors like HTTP errors
LLMs fail in ways HTTP servers do not. You get:
- Rate limit errors (back off and retry)
- Malformed JSON responses (retry with a stricter prompt)
- Responses that are technically valid but semantically wrong (detect and escalate)
- Context window overflows (truncate and retry)
Each failure mode needs a different response. A generic retry loop handles none of them well.
```go
func (c *LLMClient) Complete(ctx context.Context, req CompletionRequest) (string, error) {
	var lastErr error
	for attempt := 0; attempt < c.maxRetries; attempt++ {
		resp, err := c.call(ctx, req)
		if err == nil {
			return resp, nil
		}
		lastErr = err // record every failure, not just the default branch

		var rateLimitErr *RateLimitError
		var contextErr *ContextWindowError
		switch {
		case errors.As(err, &rateLimitErr):
			// Honor the server's retry-after, but stay cancellable.
			wait := time.Duration(rateLimitErr.RetryAfterSeconds) * time.Second
			select {
			case <-time.After(wait):
			case <-ctx.Done():
				return "", ctx.Err()
			}
		case errors.As(err, &contextErr):
			// Shrink the prompt to fit and retry immediately.
			req = req.WithTruncatedContext(contextErr.MaxTokens)
		default:
			time.Sleep(backoff(attempt))
		}
	}
	return "", fmt.Errorf("max retries exceeded: %w", lastErr)
}
```
3. Building prompts with string concatenation
It starts fine. Then someone adds a special case. Then another. Then you have a 200-line function full of if/else building a string and nobody knows what the model is actually seeing.
Treat prompts like templates. Keep them in separate files. Version them.
```go
type PromptTemplate struct {
	System string
	User   string
}

func (t *PromptTemplate) Render(vars map[string]any) (CompletionRequest, error) {
	systemTmpl, err := template.New("system").Parse(t.System)
	if err != nil {
		return CompletionRequest{}, err
	}
	userTmpl, err := template.New("user").Parse(t.User)
	if err != nil {
		return CompletionRequest{}, err
	}
	var sys, usr strings.Builder
	if err := systemTmpl.Execute(&sys, vars); err != nil {
		return CompletionRequest{}, err
	}
	if err := userTmpl.Execute(&usr, vars); err != nil {
		return CompletionRequest{}, err
	}
	return CompletionRequest{System: sys.String(), User: usr.String()}, nil
}
```
When a prompt changes, it is a code change with a diff. Not a hidden string mutation buried in a function.
4. No observability until something breaks in production
Add structured logging from the start. Log the input hash, model, token counts, latency, and cache hit/miss on every call. When costs spike or latency degrades, you need this to debug.
```go
c.logger.Info("llm_call",
	"input_hash", req.Hash(),
	"model", c.model,
	"prompt_tokens", usage.PromptTokens,
	"completion_tokens", usage.CompletionTokens,
	"latency_ms", latency.Milliseconds(),
	"cache_hit", false,
)
```
I use this to build a cost dashboard per feature, per user, per day. Without it you are flying blind.
RAG in Go: keep it boring
Retrieval-augmented generation does not need a vector database for most use cases. I have seen teams spin up Pinecone or Weaviate for a use case that would work fine with Postgres and pgvector.
Start with pgvector. It is boring. It works. You already have Postgres. The query looks like:
```sql
SELECT content, 1 - (embedding <=> $1) AS similarity
FROM documents
WHERE 1 - (embedding <=> $1) > 0.7
ORDER BY similarity DESC
LIMIT 5;
```
Migrate to a dedicated vector store when you have a real performance problem. Not before.
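Calling that query from Go needs nothing exotic either. pgvector accepts its text format (`[0.1,0.2,...]`) as a parameter, so a `database/sql` connection with a Postgres driver is enough. This is a sketch; the `documents` table shape matches the query above, and `vectorLiteral` is a helper I am assuming, not part of any pgvector SDK.

```go
package main

import (
	"database/sql"
	"strconv"
	"strings"
)

// vectorLiteral renders a []float32 in pgvector's text format,
// e.g. "[0.1,0.25,1]", so it can be bound as a query parameter.
func vectorLiteral(v []float32) string {
	parts := make([]string, len(v))
	for i, f := range v {
		parts[i] = strconv.FormatFloat(float64(f), 'f', -1, 32)
	}
	return "[" + strings.Join(parts, ",") + "]"
}

// searchDocuments runs the similarity query above. db must be a
// Postgres connection with the pgvector extension installed
// (not exercised here; sketch only).
func searchDocuments(db *sql.DB, embedding []float32, limit int) ([]string, error) {
	rows, err := db.Query(`
		SELECT content, 1 - (embedding <=> $1) AS similarity
		FROM documents
		WHERE 1 - (embedding <=> $1) > 0.7
		ORDER BY similarity DESC
		LIMIT $2`, vectorLiteral(embedding), limit)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var out []string
	for rows.Next() {
		var content string
		var sim float64
		if err := rows.Scan(&content, &sim); err != nil {
			return nil, err
		}
		out = append(out, content)
	}
	return out, rows.Err()
}
```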
For generating embeddings in Go:
```go
type EmbeddingClient struct {
	httpClient *http.Client
	apiKey     string
	model      string
}

func (e *EmbeddingClient) Embed(ctx context.Context, text string) ([]float32, error) {
	// standard HTTP call to your embedding endpoint
	// returns float32 slice ready for pgvector
}
```
No SDK needed. The API is a POST request. Write the 30 lines and move on.
Structured output without the drama
Getting JSON out of an LLM reliably is harder than it looks. Models hallucinate keys, nest objects incorrectly, or return valid JSON wrapped in markdown code blocks.
What works in production:
- Use a model that supports native JSON mode or function calling
- Define your schema explicitly in the prompt with an example
- Validate with a strict parser, not `json.Unmarshal` directly
- Retry on parse failure with a self-correction prompt: "Your previous response was not valid JSON. Here is the error: {error}. Please try again."
The self-correction step recovers 90% of parse failures without a human in the loop.
Cost math before you build
Before writing a line of code, do the math:
```
daily_requests * avg_input_tokens  * input_price
+ daily_requests * avg_output_tokens * output_price
= daily_cost
```
Then ask: what is the cache hit rate I can realistically achieve? What is the monthly cost at 10x scale?
I have seen products that were profitable at 100 users become unprofitable at 1000 because nobody did this calculation. LLM costs scale linearly. Your pricing needs to account for it.
What I would do differently
If I were starting a new LLM-heavy Go service today:
- Build the cache layer before the LLM integration, not after
- Log token usage from day one, not when the bill arrives
- Keep prompts in files, not code
- Write an LLM interface with a mock implementation for tests
- Do the cost math before committing to a feature
The patterns that work in Go for LLM pipelines are not exotic. They are the same patterns that work for any external API integration: encapsulate, cache, observe, handle errors explicitly. The LLM just has a few extra failure modes.
If you are building something that needs a production AI backend and want to talk through the architecture, I am available for contracts. arif.sh/work