I've spent the last two years integrating LLMs into production systems - a fintech platform processing millions of transactions and a document processing pipeline for government-scale workloads. Go is not the obvious choice for AI work. Python gets all the attention. But for production backends where reliability, cost, and latency actually matter, Go has been the better call every time.
Here's what I've learned.
Why Go for LLM pipelines?
The Python AI ecosystem is rich but chaotic. LangChain adds abstractions on top of abstractions. Dependency conflicts are constant. Async code that works in a notebook falls apart under real concurrency.
Go gives you:
- Predictable memory usage under load
- Goroutines for cheap, real concurrency (not async/await workarounds)
- Compile-time errors that catch the stupid mistakes before they hit production
- A single binary that deploys anywhere
The tradeoff is you write more code. There is no LangChain equivalent in Go. But in my experience, that is a feature. You understand exactly what your pipeline is doing.
The architecture that works
After iterating on several production systems, I keep coming back to this structure:
Request → Validator → ContextBuilder → LLMClient → ResponseParser → Cache → Response
Each stage is a function that takes input and returns output. No magic, no framework. Just functions.
```go
type Pipeline struct {
	validator      Validator
	contextBuilder ContextBuilder
	llmClient      LLMClient
	parser         ResponseParser
	cache          Cache
}

func (p *Pipeline) Run(ctx context.Context, req Request) (Response, error) {
	if err := p.validator.Validate(req); err != nil {
		return Response{}, fmt.Errorf("validation: %w", err)
	}

	// pctx, not "context": avoid shadowing the context package.
	pctx, err := p.contextBuilder.Build(ctx, req)
	if err != nil {
		return Response{}, fmt.Errorf("context: %w", err)
	}

	cacheKey := p.cache.Key(req, pctx)
	if cached, ok := p.cache.Get(cacheKey); ok {
		return cached, nil
	}

	raw, err := p.llmClient.Complete(ctx, pctx)
	if err != nil {
		return Response{}, fmt.Errorf("llm: %w", err)
	}

	result, err := p.parser.Parse(raw)
	if err != nil {
		return Response{}, fmt.Errorf("parse: %w", err)
	}

	p.cache.Set(cacheKey, result)
	return result, nil
}
```
Simple. Each component is testable in isolation. Swap the LLM client without touching anything else.
The mistakes that cost real money
1. Not caching aggressively enough
LLM calls are expensive. In a document processing pipeline, the same context gets rebuilt and sent to the model repeatedly for similar inputs. We added a two-layer cache: in-memory for hot keys, Redis for warm keys. Cost dropped 40% overnight.
The key insight: you do not need semantic similarity matching for most caching. A hash of the normalized input is enough for 80% of cache hits.
```go
func (c *PipelineCache) Key(req Request, ctx PipelineContext) string {
	h := sha256.New()
	h.Write([]byte(req.NormalizedInput()))
	h.Write([]byte(ctx.SystemPrompt))
	return hex.EncodeToString(h.Sum(nil))
}
```
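The two layers behind that cache might look like the sketch below. The `warmStore` interface is a stand-in for a Redis client (any client with Get/Set fits), and values are plain strings for brevity; neither name comes from the original code.

```go
package main

import "sync"

// warmStore abstracts the Redis layer; any client with Get/Set fits.
// (Hypothetical interface, not the real Redis client API.)
type warmStore interface {
	Get(key string) (string, bool)
	Set(key, val string)
}

// TwoLayerCache checks a process-local map first, then the warm store.
type TwoLayerCache struct {
	mu   sync.RWMutex
	hot  map[string]string
	warm warmStore
}

func NewTwoLayerCache(w warmStore) *TwoLayerCache {
	return &TwoLayerCache{hot: make(map[string]string), warm: w}
}

func (c *TwoLayerCache) Get(key string) (string, bool) {
	c.mu.RLock()
	v, ok := c.hot[key]
	c.mu.RUnlock()
	if ok {
		return v, true // hot hit: no network round trip
	}
	if c.warm == nil {
		return "", false
	}
	v, ok = c.warm.Get(key)
	if ok {
		// Promote warm hits so repeats stay in memory.
		c.mu.Lock()
		c.hot[key] = v
		c.mu.Unlock()
	}
	return v, ok
}

func (c *TwoLayerCache) Set(key, val string) {
	c.mu.Lock()
	c.hot[key] = val
	c.mu.Unlock()
	if c.warm != nil {
		c.warm.Set(key, val)
	}
}

// mapStore is an in-memory stand-in for Redis, handy in tests.
type mapStore map[string]string

func (m mapStore) Get(k string) (string, bool) { v, ok := m[k]; return v, ok }
func (m mapStore) Set(k, v string)             { m[k] = v }
```

Write-through on Set keeps both layers consistent; promotion on a warm hit is what makes the hot layer earn its keep.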
2. Treating LLM errors like HTTP errors
LLMs fail in ways HTTP servers do not. You get:
- Rate limit errors (back off and retry)
- Malformed JSON responses (retry with a stricter prompt)
- Responses that are technically valid but semantically wrong (detect and escalate)
- Context window overflows (truncate and retry)
Each failure mode needs a different response. A generic retry loop handles none of them well.
```go
func (c *LLMClient) Complete(ctx context.Context, req CompletionRequest) (string, error) {
	var lastErr error
	for attempt := 0; attempt < c.maxRetries; attempt++ {
		resp, err := c.call(ctx, req)
		if err == nil {
			return resp, nil
		}
		lastErr = err // record every failure, not just the default branch

		var rateLimitErr *RateLimitError
		var contextErr *ContextWindowError
		switch {
		case errors.As(err, &rateLimitErr):
			// Honor the server's retry-after, but stay cancellable.
			wait := time.Duration(rateLimitErr.RetryAfterSeconds) * time.Second
			select {
			case <-time.After(wait):
			case <-ctx.Done():
				return "", ctx.Err()
			}
		case errors.As(err, &contextErr):
			// Shrink the prompt to fit and retry immediately.
			req = req.WithTruncatedContext(contextErr.MaxTokens)
		default:
			time.Sleep(backoff(attempt))
		}
	}
	return "", fmt.Errorf("max retries exceeded: %w", lastErr)
}
```
3. Building prompts with string concatenation
It starts fine. Then someone adds a special case. Then another. Then you have a 200-line function full of if/else building a string and nobody knows what the model is actually seeing.
Treat prompts like templates. Keep them in separate files. Version them.
```go
type PromptTemplate struct {
	System string
	User   string
}

func (t *PromptTemplate) Render(vars map[string]any) (CompletionRequest, error) {
	systemTmpl, err := template.New("system").Parse(t.System)
	if err != nil {
		return CompletionRequest{}, err
	}
	userTmpl, err := template.New("user").Parse(t.User)
	if err != nil {
		return CompletionRequest{}, err
	}
	var sys, usr strings.Builder
	if err := systemTmpl.Execute(&sys, vars); err != nil {
		return CompletionRequest{}, err
	}
	if err := userTmpl.Execute(&usr, vars); err != nil {
		return CompletionRequest{}, err
	}
	return CompletionRequest{System: sys.String(), User: usr.String()}, nil
}
```
When a prompt changes, it is a code change with a diff. Not a hidden string mutation buried in a function.
4. No observability until something breaks in production
Add structured logging from the start. Log the input hash, model, token counts, latency, and cache hit/miss on every call. When costs spike or latency degrades, you need this to debug.
```go
c.logger.Info("llm_call",
	"input_hash", req.Hash(),
	"model", c.model,
	"prompt_tokens", usage.PromptTokens,
	"completion_tokens", usage.CompletionTokens,
	"latency_ms", latency.Milliseconds(),
	"cache_hit", false,
)
```
I use this to build a cost dashboard per feature, per user, per day. Without it you are flying blind.
RAG in Go: keep it boring
Retrieval-augmented generation does not need a vector database for most use cases. I have seen teams spin up Pinecone or Weaviate for a use case that would work fine with Postgres and pgvector.
Start with pgvector. It is boring. It works. You already have Postgres. The query looks like:
```sql
SELECT content, 1 - (embedding <=> $1) AS similarity
FROM documents
WHERE 1 - (embedding <=> $1) > 0.7
ORDER BY similarity DESC
LIMIT 5;
```
Migrate to a dedicated vector store when you have a real performance problem. Not before.
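Calling that query from Go needs nothing exotic either. pgvector accepts its text format (`[0.1,0.2,...]`) as a parameter, so a `database/sql` connection with a Postgres driver is enough. This is a sketch; the `documents` table shape matches the query above, and `vectorLiteral` is a helper I am assuming, not part of any pgvector SDK.

```go
package main

import (
	"database/sql"
	"strconv"
	"strings"
)

// vectorLiteral renders a []float32 in pgvector's text format,
// e.g. "[0.1,0.25,1]", so it can be bound as a query parameter.
func vectorLiteral(v []float32) string {
	parts := make([]string, len(v))
	for i, f := range v {
		parts[i] = strconv.FormatFloat(float64(f), 'f', -1, 32)
	}
	return "[" + strings.Join(parts, ",") + "]"
}

// searchDocuments runs the similarity query above. db must be a
// Postgres connection with the pgvector extension installed
// (not exercised here; sketch only).
func searchDocuments(db *sql.DB, embedding []float32, limit int) ([]string, error) {
	rows, err := db.Query(`
		SELECT content, 1 - (embedding <=> $1) AS similarity
		FROM documents
		WHERE 1 - (embedding <=> $1) > 0.7
		ORDER BY similarity DESC
		LIMIT $2`, vectorLiteral(embedding), limit)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var out []string
	for rows.Next() {
		var content string
		var sim float64
		if err := rows.Scan(&content, &sim); err != nil {
			return nil, err
		}
		out = append(out, content)
	}
	return out, rows.Err()
}
```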
For generating embeddings in Go:
```go
type EmbeddingClient struct {
	httpClient *http.Client
	apiKey     string
	model      string
}

func (e *EmbeddingClient) Embed(ctx context.Context, text string) ([]float32, error) {
	// standard HTTP call to your embedding endpoint
	// returns float32 slice ready for pgvector
}
```
No SDK needed. The API is a POST request. Write the 30 lines and move on.
Structured output without the drama
Getting JSON out of an LLM reliably is harder than it looks. Models hallucinate keys, nest objects incorrectly, or return valid JSON wrapped in markdown code blocks.
What works in production:
- Use a model that supports native JSON mode or function calling
- Define your schema explicitly in the prompt with an example
- Validate with a strict parser, not `json.Unmarshal` directly
- Retry on parse failure with a self-correction prompt: "Your previous response was not valid JSON. Here is the error: {error}. Please try again."
The self-correction step recovers 90% of parse failures without a human in the loop.
Cost math before you build
Before writing a line of code, do the math:
```
daily_requests * avg_input_tokens  * input_price
+ daily_requests * avg_output_tokens * output_price
= daily_cost
```
Then ask: what is the cache hit rate I can realistically achieve? What is the monthly cost at 10x scale?
I have seen products that were profitable at 100 users become unprofitable at 1000 because nobody did this calculation. LLM costs scale linearly. Your pricing needs to account for it.
What I would do differently
If I were starting a new LLM-heavy Go service today:
- Build the cache layer before the LLM integration, not after
- Log token usage from day one, not when the bill arrives
- Keep prompts in files, not code
- Write an LLM interface with a mock implementation for tests
- Do the cost math before committing to a feature
The patterns that work in Go for LLM pipelines are not exotic. They are the same patterns that work for any external API integration: encapsulate, cache, observe, handle errors explicitly. The LLM just has a few extra failure modes.
If you are building something that needs a production AI backend and want to talk through the architecture, I am available for contracts. arif.sh/work