DEV Community

arif

AI Agent Failures Are Distributed Systems Failures. Here's the Complete Mapping.

A few months into building an AI agent pipeline for a fintech client, we had a silent failure that cost us three days.

The agent processed a document. Returned a confident-looking response. No error, no exception, no log entry that suggested anything was wrong. That output went into the next step, which used it to write a decision record. The decision record went downstream. Three steps later, a human reviewer flagged something that did not add up.

The root cause was a hallucinated intermediate field. One field. The model had made up a plausible-sounding value for something it should have extracted from the document. Everything downstream had treated that invented value as real.

I had seen this failure mode before. Not in AI. In distributed systems.

The microservice that returns 200 OK while writing corrupted data. The queue consumer that marks a message processed before finishing the work. The retry that fires twice because the ACK never arrived.

Same pattern. Different worker.


The mental model that changes everything

An AI agent is not a program that figures things out. It is a graph of tasks, each processed by a nondeterministic worker. The worker is expensive, slow, and occasionally wrong in ways that look correct. Everything else about the system, including how it should be built, is a distributed systems problem.

Here is the complete mapping:

| Distributed systems concept | AI agent equivalent |
| --- | --- |
| Message queue | Task buffer between agent steps |
| Dead letter queue | Failed tasks awaiting human review |
| Idempotency key | Deterministic task ID preventing duplicate LLM calls |
| Circuit breaker | Model fallback when primary LLM degrades or rate-limits |
| Saga pattern + compensation | Multi-step workflow rollback when a step fails midway |
| Two-phase commit | Human approval gate before any irreversible action |
| Distributed tracing | Per-step observability across agent execution |
| Backpressure | Concurrency limits on parallel LLM calls |
| Bulkhead pattern | Cost and rate isolation between tenants or features |

Every one of these concepts exists because engineers got paged at 3am when they did not have it. AI agent engineers are getting paged for exactly the same reasons. They are building the same workarounds and calling them new things.

The people writing most of the content about AI agents come from ML backgrounds. They have never been responsible for a distributed system that split-brained in production. They are reinventing concepts that have names.


Problem 1: Silent failures that compound across steps

This is the issue nobody who writes tutorials has actually experienced in a high-stakes system. LLM workers do not fail loudly. They return well-formed, confident-looking output that is semantically wrong. In a single-step system a human catches it. In a multi-step agent, the wrong output becomes the input to the next step.

By the time you see the problem, the causal chain is four steps long and the original error is buried.

The distributed systems answer: validate at every step boundary. Do not pass raw LLM output downstream. Treat every LLM response as untrusted input until it passes a validation check.

```go
type StepResult struct {
    Raw    string
    Valid  bool
    Schema string
}

type Validator interface {
    Validate(output, schema string) error
}

func (w *Worker) processStep(ctx context.Context, input string, schema string) (*StepResult, error) {
    raw, err := w.llm.Complete(ctx, input)
    if err != nil {
        return nil, err
    }

    if err := w.validator.Validate(raw, schema); err != nil {
        // self-correction: one retry with the validation error in context
        corrected, err2 := w.llm.Complete(ctx, fmt.Sprintf(
            "Your previous response failed validation: %s\n\nOriginal task: %s\n\nTry again.",
            err.Error(), input,
        ))
        if err2 != nil {
            return nil, fmt.Errorf("correction call failed: %w", err2)
        }
        if err3 := w.validator.Validate(corrected, schema); err3 != nil {
            return nil, fmt.Errorf("validation failed after correction: %w", err3)
        }
        raw = corrected
    }

    return &StepResult{Raw: raw, Valid: true, Schema: schema}, nil
}
```

Self-correction on first validation failure recovers about 90% of cases in practice. The remaining 10% go to a dead letter queue, not to the next step in the pipeline.
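The dead letter queue itself can start small. Here is a minimal in-memory sketch; the `FailedTask` shape and the queue API are illustrative assumptions, and a production version would back this with a durable store rather than a slice:

```go
package main

import (
	"sync"
	"time"
)

// FailedTask captures what a human reviewer needs: the original input,
// the last raw output, and why validation rejected it.
type FailedTask struct {
	TaskID   string
	Input    string
	LastRaw  string
	Reason   string
	FailedAt time.Time
}

// DeadLetterQueue holds tasks that exhausted self-correction.
// In production this would be a durable table, not an in-memory slice.
type DeadLetterQueue struct {
	mu    sync.Mutex
	tasks []FailedTask
}

// Push records a failed task with a timestamp for the review queue.
func (q *DeadLetterQueue) Push(t FailedTask) {
	q.mu.Lock()
	defer q.mu.Unlock()
	t.FailedAt = time.Now()
	q.tasks = append(q.tasks, t)
}

// Pending returns a copy of the queue, e.g. for a review dashboard.
func (q *DeadLetterQueue) Pending() []FailedTask {
	q.mu.Lock()
	defer q.mu.Unlock()
	out := make([]FailedTask, len(q.tasks))
	copy(out, q.tasks)
	return out
}
```

The important property is that a failed task carries its full context, so the reviewer never has to reconstruct what the agent was trying to do.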


Problem 2: Idempotency, the hardest problem in agent systems

This is the one I underestimated the most.

In standard distributed systems, idempotency is straightforward: assign a key, deduplicate on that key, retry safely. In agent systems, there is a complication. The LLM worker is nondeterministic. Retry with the same input and you may get a different output.

This creates a genuine design question: what does it mean to retry an agent step idempotently?

The answer that works in practice: record the output the first time the step runs. On retry, return the recorded output rather than calling the LLM again. The task ID becomes the idempotency key.

```go
type TaskStore interface {
    Get(ctx context.Context, taskID string) (string, bool, error)
    Set(ctx context.Context, taskID string, result string) error
}

func (w *Worker) run(ctx context.Context, task Task) (string, error) {
    // check if already completed
    if result, ok, err := w.store.Get(ctx, task.ID); err != nil {
        return "", fmt.Errorf("store get: %w", err)
    } else if ok {
        return result, nil
    }

    result, err := w.processStep(ctx, task.Input, task.Schema)
    if err != nil {
        return "", err
    }

    if err := w.store.Set(ctx, task.ID, result.Raw); err != nil {
        return "", fmt.Errorf("store set: %w", err)
    }

    return result.Raw, nil
}
```

This also makes your workflows resumable. A pipeline that crashes midway can restart from the last completed step rather than from scratch. In the fintech system this was essential: reprocessing a document from scratch on retry risked producing a different decision record.

The one place this pattern breaks down: when you want a fresh LLM response on retry because the recorded one failed validation. The solution is to record the output only after validation passes. Failed attempts do not get persisted.


Problem 3: Multi-step workflows need a rollback plan

Here is a scenario that forced me to think about this properly.

A four-step pipeline: extract data from a document, validate it against business rules, write a decision record, trigger a downstream system. The first three steps succeed. Step four fails. What do you do?

Without explicit design, you are stuck. The decision record exists. Retrying the whole pipeline from the start would write a duplicate. Doing nothing leaves the system in a broken intermediate state.

The Saga pattern from distributed systems has a direct answer: every step that has side effects needs a compensating action that undoes it.

```go
type Step struct {
    Name       string
    Execute    func(ctx context.Context, state *State) error
    Compensate func(ctx context.Context, state *State) error
}

type Saga struct {
    steps    []Step
    executed []int
}

func (s *Saga) Run(ctx context.Context, state *State) error {
    for i, step := range s.steps {
        if err := step.Execute(ctx, state); err != nil {
            log.Printf("step %s failed, rolling back %d completed steps", step.Name, len(s.executed))
            for j := len(s.executed) - 1; j >= 0; j-- {
                if cerr := s.steps[s.executed[j]].Compensate(ctx, state); cerr != nil {
                    log.Printf("compensation for step %s failed: %v", s.steps[s.executed[j]].Name, cerr)
                }
            }
            return fmt.Errorf("step %d (%s) failed: %w", i, step.Name, err)
        }
        s.executed = append(s.executed, i)
    }
    return nil
}
```

Applied to the document pipeline: if the downstream notification fails, the compensation for the decision-record step deletes it. The system returns to a clean state and the whole workflow can safely retry.

Most agent frameworks do not give you this. They give you a chain of steps and leave failure handling as an exercise for the reader.


Problem 4: Human-in-the-loop as blast radius management

One of the cleanest mental models I have encountered for human review in agent systems: scope the human-in-the-loop requirement to the blast radius of the action.

- A read-only action that retrieves data: no approval needed, just log it.
- A write action that can be undone: soft approval, allow with logging and rollback capability.
- An irreversible action (sending an email, triggering a payment, deleting a record): hard approval gate, requires explicit human confirmation before execution.

This is not a safety net. It is an architecture decision about which operations an automated system is permitted to take unilaterally.

```go
type BlastRadius int

const (
    BlastRadiusRead        BlastRadius = iota // read-only, no approval
    BlastRadiusReversible                     // can be undone, soft approval
    BlastRadiusIrreversible                   // cannot be undone, hard approval
)

type ActionGate interface {
    Request(ctx context.Context, action Action, radius BlastRadius) (Approval, error)
}

type Action struct {
    ID          string
    Description string
    Params      map[string]any
    Radius      BlastRadius
}
```

In a German government system I worked on, any output below a confidence threshold and any action classified as irreversible went to a human review queue. Not because we lacked confidence in the model, but because the downstream consequences of an error had legal weight. The auditors asked specifically about this control. Having it designed explicitly made the audit straightforward.

The engineers I see struggling most with this are the ones trying to automate 100% of cases. A more honest design goal: automate the 90-95% the system can handle confidently, and escalate the rest with full context attached.


Problem 5: Circuit breakers for model failures

LLM APIs have bad days. Rate limits, degraded performance, elevated error rates. You need a circuit breaker.

```go
type State int

const (
    StateClosed   State = iota // normal operation
    StateOpen                  // failing fast
    StateHalfOpen              // testing recovery
)

type CircuitBreaker struct {
    mu          sync.Mutex
    state       State
    failures    int
    threshold   int
    lastFailure time.Time
    timeout     time.Duration
}

func (cb *CircuitBreaker) Allow() bool {
    cb.mu.Lock()
    defer cb.mu.Unlock()

    switch cb.state {
    case StateOpen:
        if time.Since(cb.lastFailure) > cb.timeout {
            cb.state = StateHalfOpen
            return true
        }
        return false
    default:
        return true
    }
}

func (cb *CircuitBreaker) Record(err error) {
    cb.mu.Lock()
    defer cb.mu.Unlock()

    if err != nil {
        cb.failures++
        cb.lastFailure = time.Now()
        if cb.failures >= cb.threshold {
            cb.state = StateOpen
        }
    } else {
        cb.failures = 0
        cb.state = StateClosed
    }
}
```

In a multi-model setup, combine the circuit breaker with a fallback: primary model trips the circuit, requests route to a cheaper backup. This keeps the system available during provider incidents. It also gives you a cheap escape valve when costs spike unexpectedly.

Speaking of costs.


Problem 6: Observability and cost attribution

In the summer of last year, a team publicly shared that they had spent $47k running AI agents in production before understanding where the money was going. The number got attention because people recognized the scenario.

LLM costs are invisible until they are not. A feature that costs nothing at 100 users is a real number at 100,000. An edge case that you never hit in testing might consume 40x the tokens of a typical case.

You cannot find this without logging every call with the context you need to understand it.

```go
type CallTrace struct {
    TraceID          string
    FeatureName      string
    StepName         string
    Model            string
    PromptTokens     int
    CompletionTokens int
    CacheHit         bool
    LatencyMs        int64
    ValidationPassed bool
    Attempt          int
    Timestamp        time.Time
}

func tokenCost(model string, prompt, completion int) float64 {
    // per-token USD rates: {prompt, completion}
    rates := map[string][2]float64{
        "gpt-4o":            {0.0000025, 0.000010},
        "gpt-4o-mini":       {0.00000015, 0.0000006},
        "claude-3-5-sonnet": {0.000003, 0.000015},
        "claude-3-5-haiku":  {0.0000008, 0.000004},
    }
    r, ok := rates[model]
    if !ok {
        return 0
    }
    return float64(prompt)*r[0] + float64(completion)*r[1]
}
```

Store the traces. Aggregate costs by feature, by step, by user. Build a dashboard before the bill surprises you, not after.
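The aggregation itself is a group-by once the traces exist. A sketch, assuming in-memory traces and repeating a trimmed `CallTrace` and `tokenCost` so it stands alone; in production this would be a query against the trace store:

```go
package main

// CallTrace trimmed to the fields aggregation needs;
// the full struct is defined above.
type CallTrace struct {
	FeatureName      string
	Model            string
	PromptTokens     int
	CompletionTokens int
}

// tokenCost as above, trimmed to two models for brevity.
// Per-token USD rates: {prompt, completion}.
func tokenCost(model string, prompt, completion int) float64 {
	rates := map[string][2]float64{
		"gpt-4o":      {0.0000025, 0.000010},
		"gpt-4o-mini": {0.00000015, 0.0000006},
	}
	r, ok := rates[model]
	if !ok {
		return 0
	}
	return float64(prompt)*r[0] + float64(completion)*r[1]
}

// CostByFeature aggregates spend per feature, so the dashboard can
// answer "where is the money going" directly.
func CostByFeature(traces []CallTrace) map[string]float64 {
	totals := make(map[string]float64)
	for _, t := range traces {
		totals[t.FeatureName] += tokenCost(t.Model, t.PromptTokens, t.CompletionTokens)
	}
	return totals
}
```

The same loop with a different key gives you cost by step, by model, or by user.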

The secondary value: when something fails in production, you have the full execution trace. Input, intermediate steps, output, validation result. In a regulated environment this is not optional. Auditors ask for it. In any environment it makes debugging tractable instead of theoretical.


What this means for frameworks

The popular agent frameworks optimize for developer experience. They make it fast to connect LLM calls, experiment with different prompting strategies, and ship a demo.

That is genuinely useful. I have used some of them for prototyping.

The problem is they tend to abstract away exactly what matters in production: queue semantics, idempotency, compensation flows, circuit breakers, cost attribution. You end up responsible for a system you do not fully understand, where the hard parts live inside library code you cannot easily inspect.

The pattern that works: use a framework to figure out what your workflow should do. When you are ready to ship, write the orchestration layer with infrastructure primitives. Keep the LLM integration logic. Replace the framework's queue and retry machinery with something you control.

This sounds like more work. It is less work in practice, because when something goes wrong at 2am you can actually find the problem.


The Germany constraint that turned out to be good engineering

The German government clients had a requirement I initially saw as a handicap: every decision made by the system had to be explainable to a civil servant who had no background in AI.

That requirement killed several architectural ideas that would have worked fine in a consumer product. It forced the system toward a design where every step is logged with its inputs and outputs, the LLM is one component in a documented process rather than a black box, and there is always a human review path for anything below a defined confidence threshold.

The resulting system is more verbose. It is also significantly easier to debug, trivial to audit, and more reliable under real load.

The explainability requirement is coming for regulated industries whether teams are ready for it or not. Financial services, healthcare, legal, government. Design for it now and you get better engineering as a side effect.


Where to start

If you are building an agent system and currently thinking mainly about prompts, here is a concrete sequence:

  1. Define your task schema first. What does a valid output look like at each step? What validation rules apply? This determines your circuit-break conditions and your dead letter queue trigger.

  2. Add idempotency before you add features. Assign a deterministic task ID to every step. Store results. Make it safe to retry.

  3. Design your Saga before you wire up the steps. For every step that writes something, know what the compensation action is.

  4. Set your blast radius policy. Which actions require human approval before execution? Write it down as code, not documentation.

  5. Add the cost trace from day one. Not later. The cost dashboard is significantly harder to retrofit than to build initially.

The prompt engineering is the interesting part. The infrastructure is what makes it shippable.


If you are working on production AI systems and want to compare notes, I am at arif.sh/work.
