Oscar Rieken

Posted on May 23

Making LLM Calls Reliable: Retry, Semaphore, Cache, and Batch

#go #llm #ai #architecture

When TestSmith generates tests with --llm, it calls an LLM for every public member of every source file being processed. A project with 20 files and 5 public functions each means up to 100 API calls in a single run. That's a lot of surface area for things to go wrong.

Here's the reliability stack we built, layer by layer.

Layer 1: Retry with Exponential Backoff

LLM APIs fail transiently. Rate limits, timeouts, occasional 5xx responses — all of these are recoverable if you wait and retry. We built a retry middleware that wraps any Provider:

type RetryProvider struct {
    inner      Provider
    maxRetries int
}

func (r *RetryProvider) Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error) {
    var lastErr error
    for attempt := 0; attempt < r.maxRetries; attempt++ {
        if attempt > 0 {
            wait := time.Duration(math.Pow(2, float64(attempt))) * 100 * time.Millisecond
            select {
            case <-time.After(wait):
            case <-ctx.Done():
                return CompletionResponse{}, ctx.Err()
            }
        }
        resp, err := r.inner.Complete(ctx, req)
        if err == nil {
            return resp, nil
        }
        lastErr = err
    }
    return CompletionResponse{}, fmt.Errorf("after %d attempts: %w", r.maxRetries, lastErr)
}

MaxRetryAttempts defaults to 3. With exponential backoff, attempt 1 is immediate, attempt 2 waits 200ms, and attempt 3 waits 400ms. Total worst-case backoff is 600ms per call, on top of the time the requests themselves take. That's acceptable latency for a background tool.
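To make the schedule concrete, here is a small sketch that reuses the wait formula from the loop above and prints the delay before each attempt. The helper names backoffDelay and totalBackoff are made up for illustration and are not part of TestSmith.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// backoffDelay mirrors the wait computed inside RetryProvider.Complete.
// attempt is zero-based; the first attempt (0) never waits.
// Hypothetical helper, for illustration only.
func backoffDelay(attempt int) time.Duration {
	if attempt == 0 {
		return 0
	}
	return time.Duration(math.Pow(2, float64(attempt))) * 100 * time.Millisecond
}

// totalBackoff sums the waits across maxRetries attempts.
func totalBackoff(maxRetries int) time.Duration {
	var total time.Duration
	for attempt := 0; attempt < maxRetries; attempt++ {
		total += backoffDelay(attempt)
	}
	return total
}

func main() {
	for attempt := 0; attempt < 3; attempt++ {
		fmt.Printf("attempt %d waits %v\n", attempt+1, backoffDelay(attempt))
	}
	fmt.Println("total backoff:", totalBackoff(3))
}
```

Note that the total only covers time spent sleeping. A call that times out three times still pays for three timeouts.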

Layer 2: Semaphore for Concurrency Control

With up to 100 calls to make, goroutine fan-out is the obvious approach. But hitting an LLM API with 100 concurrent requests triggers rate limiting immediately. A semaphore caps the in-flight calls:

type SemaphoreProvider struct {
    inner Provider
    sem   chan struct{}
}

func NewSemaphoreProvider(inner Provider, maxConcurrent int) *SemaphoreProvider {
    return &SemaphoreProvider{inner: inner, sem: make(chan struct{}, maxConcurrent)}
}

func (s *SemaphoreProvider) Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error) {
    select {
    case s.sem <- struct{}{}:
        defer func() { <-s.sem }()
    case <-ctx.Done():
        return CompletionResponse{}, ctx.Err()
    }
    return s.inner.Complete(ctx, req)
}

MaxConcurrentCalls defaults to 5. Each retry attempt acquires its own semaphore slot — this is important. If retry logic held a slot while waiting between attempts, other goroutines would be blocked unnecessarily. The retry wrapper is the outer layer; semaphore is the inner layer.

The middleware stack assembled by the factory:

retry → semaphore → raw provider
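Here is a rough sketch of how that assembly might look, with a fake provider standing in for the real API so the concurrency cap can be observed. The newStack factory, fakeProvider, and runFanOut are hypothetical names, the request and response types are trimmed to one field each, and the retry loop drops the backoff to keep the example short.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Minimal stand-ins for the article's types; the fields are assumptions.
type CompletionRequest struct{ UserPrompt string }
type CompletionResponse struct{ Text string }

type Provider interface {
	Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error)
}

// SemaphoreProvider caps in-flight calls, as in Layer 2.
type SemaphoreProvider struct {
	inner Provider
	sem   chan struct{}
}

func (s *SemaphoreProvider) Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error) {
	select {
	case s.sem <- struct{}{}:
		defer func() { <-s.sem }()
	case <-ctx.Done():
		return CompletionResponse{}, ctx.Err()
	}
	return s.inner.Complete(ctx, req)
}

// RetryProvider is reduced to a bare loop here; backoff is left out for brevity.
type RetryProvider struct {
	inner      Provider
	maxRetries int
}

func (r *RetryProvider) Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error) {
	var lastErr error
	for attempt := 0; attempt < r.maxRetries; attempt++ {
		resp, err := r.inner.Complete(ctx, req) // each attempt takes its own slot
		if err == nil {
			return resp, nil
		}
		lastErr = err
	}
	return CompletionResponse{}, fmt.Errorf("after %d attempts: %w", r.maxRetries, lastErr)
}

// fakeProvider (hypothetical) records the highest number of simultaneous calls.
type fakeProvider struct{ inFlight, peak int64 }

func (f *fakeProvider) Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error) {
	n := atomic.AddInt64(&f.inFlight, 1)
	defer atomic.AddInt64(&f.inFlight, -1)
	for { // raise peak to n unless another goroutine already recorded more
		p := atomic.LoadInt64(&f.peak)
		if n <= p || atomic.CompareAndSwapInt64(&f.peak, p, n) {
			break
		}
	}
	time.Sleep(5 * time.Millisecond) // simulate network latency
	return CompletionResponse{Text: "ok: " + req.UserPrompt}, nil
}

// newStack is a hypothetical factory: retry -> semaphore -> raw provider.
func newStack(raw Provider, maxRetries, maxConcurrent int) Provider {
	sem := &SemaphoreProvider{inner: raw, sem: make(chan struct{}, maxConcurrent)}
	return &RetryProvider{inner: sem, maxRetries: maxRetries}
}

// runFanOut fires n concurrent calls through the stack and returns the peak
// concurrency the raw provider observed.
func runFanOut(n, maxConcurrent int) int64 {
	raw := &fakeProvider{}
	p := newStack(raw, 3, maxConcurrent)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = p.Complete(context.Background(), CompletionRequest{UserPrompt: "hi"})
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&raw.peak)
}

func main() {
	fmt.Println("peak in-flight calls:", runFanOut(100, 5))
}
```

However many goroutines are launched, the peak reported by the fake provider should never exceed the semaphore size. Because the retry loop sits outside the semaphore, a slot is released as soon as a failed attempt returns.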

Layer 3: Result Cache

Many test generation runs touch the same files repeatedly — watch mode is the extreme case. Calling the LLM for the same source code twice is wasteful. A content-addressed cache avoids it:

type ResultCache struct {
    mu      sync.RWMutex
    entries map[string][]BodyGenResult
    hits    int
    misses  int
}

func cacheKey(req BodyGenRequest) string {
    h := sha256.New()
    fmt.Fprintf(h, "%s\n%s\n%s\n%s", req.Language, req.MemberName, req.SourceCode, req.Framework.Name)
    // Sum(nil) finalizes the digest; a hash.Hash cannot be sliced directly.
    return hex.EncodeToString(h.Sum(nil))
}

The key is a SHA-256 hash of the language, member name, source code, and framework. If the source file changes, the hash changes and the cache misses — you always get fresh results for changed code.

After a run, --verbose prints the cache stats:

LLM cache — hits: 12  misses: 8  entries: 8

Layer 4: Batch Generation

The fan-out approach makes one API call per public member. For a file with 10 functions, that's 10 calls. Batch generation collapses this to one:

func (g *LLMBodyGenerator) GenerateBatchBodies(
    ctx context.Context,
    reqs []BodyGenRequest,
) ([]BodyGenResult, error) {
    prompt := buildBatchPrompt(reqs)
    resp, err := g.provider.Complete(ctx, CompletionRequest{
        SystemPrompt:   batchSystemPrompt,
        UserPrompt:     prompt,
        Model:          g.model,
        MaxTokens:      g.maxTokens * len(reqs), // scale with request count
        Temperature:    g.temperature,
        ResponseFormat: "json_object",            // structured output
    })
    // ...
}

We use OpenAI's response_format: {"type": "json_object"} to get structured output. The model returns a JSON envelope with one entry per member:

{
  "tests": [
    {"name": "ProcessPayment", "code": "func TestProcessPayment(t *testing.T) { ... }"},
    {"name": "RefundPayment",  "code": "func TestRefundPayment(t *testing.T) { ... }"}
  ]
}

We parse the response as JSON first and fall back to a delimiter-based regex parser for providers that don't support structured output.

The pipeline checks for the BatchBodyGenerator interface via type assertion. If the generator implements it, batch mode is used. If not (or if the driver explicitly opts out), it falls back to goroutine fan-out with individual calls. This keeps the interface opt-in and backward compatible.
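The dispatch described above can be sketched roughly like this. Only GenerateBatchBodies comes from the code shown earlier; the GenerateBody method, the allowBatch flag, and the generateAll and modeFor helpers are assumptions made for the example.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

type BodyGenRequest struct{ MemberName string }
type BodyGenResult struct{ MemberName, Code string }

// BodyGenerator is the baseline interface: one call per member.
// The GenerateBody method name is an assumption for this sketch.
type BodyGenerator interface {
	GenerateBody(ctx context.Context, req BodyGenRequest) (BodyGenResult, error)
}

// BatchBodyGenerator is the optional extension, detected by type assertion.
type BatchBodyGenerator interface {
	GenerateBatchBodies(ctx context.Context, reqs []BodyGenRequest) ([]BodyGenResult, error)
}

// generateAll returns the results plus the mode it used ("batch" or "fan-out").
func generateAll(ctx context.Context, gen BodyGenerator, reqs []BodyGenRequest, allowBatch bool) ([]BodyGenResult, string, error) {
	if batch, ok := gen.(BatchBodyGenerator); ok && allowBatch {
		res, err := batch.GenerateBatchBodies(ctx, reqs)
		return res, "batch", err
	}
	results := make([]BodyGenResult, len(reqs))
	errs := make([]error, len(reqs))
	var wg sync.WaitGroup
	for i, req := range reqs {
		wg.Add(1)
		go func(i int, req BodyGenRequest) {
			defer wg.Done()
			results[i], errs[i] = gen.GenerateBody(ctx, req) // each goroutine writes its own index
		}(i, req)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, "fan-out", err
		}
	}
	return results, "fan-out", nil
}

// singleGen implements only BodyGenerator.
type singleGen struct{}

func (singleGen) GenerateBody(_ context.Context, req BodyGenRequest) (BodyGenResult, error) {
	return BodyGenResult{MemberName: req.MemberName, Code: "// test for " + req.MemberName}, nil
}

// batchGen embeds singleGen and adds the batch method.
type batchGen struct{ singleGen }

func (batchGen) GenerateBatchBodies(_ context.Context, reqs []BodyGenRequest) ([]BodyGenResult, error) {
	out := make([]BodyGenResult, 0, len(reqs))
	for _, r := range reqs { // a real generator would make a single LLM call here
		out = append(out, BodyGenResult{MemberName: r.MemberName, Code: "// batched test for " + r.MemberName})
	}
	return out, nil
}

// modeFor reports which path generateAll takes for a given generator.
func modeFor(gen BodyGenerator, allowBatch bool) string {
	reqs := []BodyGenRequest{{MemberName: "A"}, {MemberName: "B"}}
	_, mode, _ := generateAll(context.Background(), gen, reqs, allowBatch)
	return mode
}

func main() {
	fmt.Println(modeFor(batchGen{}, true))  // generator supports batch
	fmt.Println(modeFor(batchGen{}, false)) // driver opted out
	fmt.Println(modeFor(singleGen{}, true)) // generator has no batch method
}
```

Existing generators keep working unchanged because the pipeline only depends on the baseline interface and discovers the batch capability at runtime.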

Observability: Cache Stats

With all this happening in the background, it's useful to know what actually ran. The cacheStatsReporter interface lets the CLI query stats without importing the llm package:

// In cmd/testsmith/generate.go — avoids importing internal/llm from the CLI layer
type cacheStatsReporter interface {
    CacheStats() (hits, misses, size int)
}

func printCacheStats(bg domain.BodyGenerator) {
    if !verbose {
        return
    }
    if r, ok := bg.(cacheStatsReporter); ok {
        hits, misses, size := r.CacheStats()
        fmt.Printf("LLM cache — hits: %d  misses: %d  entries: %d\n", hits, misses, size)
    }
}

This is the interface segregation principle at work: the CLI knows about domain.BodyGenerator (which it needs for the pipeline) and cacheStatsReporter (which it needs for stats output). It doesn't need to know anything else about the LLM implementation.

The Numbers

In practice, on a mid-size Go project with 40 source files and an average of 6 public functions each:

Without batch: 240 API calls, ~4 minutes at 5 concurrent
With batch: 40 API calls (one per file), ~45 seconds
Second run with warm cache: near-instant for unchanged files

The cache and batch generation together turn what would be a "go make coffee" operation into something you can run while you're still in the flow.
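The call counts follow directly from the project shape, and a quick back-of-the-envelope check shows the timings are consistent with each other. The sketch below assumes calls run in full waves of 5, which is a simplification since real calls finish at different times, and it plugs in the rough wall-clock figures quoted above (240 seconds and 45 seconds).

```go
package main

import "fmt"

// waves returns how many rounds are needed when at most maxConcurrent
// calls run at once (ceiling division).
func waves(calls, maxConcurrent int) int {
	return (calls + maxConcurrent - 1) / maxConcurrent
}

func main() {
	const files, perFile, maxConcurrent = 40, 6, 5

	fanOutCalls := files * perFile // one call per public function
	batchCalls := files            // one call per file

	fanOutWaves := waves(fanOutCalls, maxConcurrent)
	batchWaves := waves(batchCalls, maxConcurrent)

	// 240s and 45s are the approximate run times quoted in the article.
	fmt.Printf("fan-out: %d calls, %d waves, %.1fs per wave\n",
		fanOutCalls, fanOutWaves, 240.0/float64(fanOutWaves))
	fmt.Printf("batch:   %d calls, %d waves, %.1fs per wave\n",
		batchCalls, batchWaves, 45.0/float64(batchWaves))
}
```

Under that assumption both modes land at roughly 5 to 6 seconds per wave, so the speedup comes almost entirely from making fewer round trips rather than from faster individual calls.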

Next: how we structure context for both AI agents working on TestSmith itself and for the LLM generating tests for your project.
