Devon
Multi-Agent Systems Break Differently Than Single Agents

A single agent failing is a tractable problem. You have a bad output, a traceback, maybe a timeout. You fix the prompt or swap the model. Multi-agent pipelines fail differently: one agent produces plausible-looking garbage, the next agent consumes it without complaint, and by the time the third agent produces the final output it's confidently wrong in ways that are nearly impossible to trace back to the root cause.

This post covers the mechanics of how failures compound across agent hops, the context propagation problem, and how to instrument a pipeline so you can actually diagnose failures when they happen.

The Compounding Failure Problem

In a single-agent system, the failure surface is contained. Bad input produces bad output and you can observe both.

In a multi-agent pipeline:

Agent A → output_A → Agent B → output_B → Agent C → final_output

If Agent A produces subtly wrong output, Agent B receives it as ground truth. Agent B may produce output that looks internally consistent but is built on a flawed foundation. Agent C synthesizes a final answer from Agent B's compromised output.

The final output can fail in three distinct ways:

  1. Hard failure - Agent C raises an exception or returns an empty result
  2. Soft failure - Agent C returns a plausible but wrong answer with high confidence
  3. Compounding degradation - Each hop degrades quality slightly; the final output is below the threshold for usefulness even though no individual agent "failed"

Soft failures and compounding degradation are far harder to catch. They don't surface in error logs. They surface in user complaints, downstream data quality issues, or silent business logic failures.
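
Compounding degradation is easy to underestimate because it's multiplicative. Under a toy model where each hop preserves a fraction of upstream quality, three hops that each look fine individually can still push the pipeline below a usefulness threshold:

```python
# Toy model: quality compounds multiplicatively across hops.
# Each hop individually retains >= 0.88 -- no hop "failed".
hop_retention = [0.90, 0.88, 0.92]

pipeline_quality = 1.0
for i, r in enumerate(hop_retention):
    pipeline_quality *= r
    print(f"after hop {i}: {pipeline_quality:.3f}")

# 0.90 * 0.88 * 0.92 = 0.729 -- below a 0.75 usefulness threshold,
# even though no single hop dropped below 0.88.
print(pipeline_quality >= 0.75)  # False
```

The threshold and retention numbers are illustrative, but the shape of the problem is real: the product degrades faster than any single factor suggests.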

A Concrete 3-Agent Pipeline

Here's a realistic pipeline: research agent pulls context, analysis agent synthesizes findings, writer agent produces the final output.

from kalibr import Router
from dataclasses import dataclass, field
import uuid
import time

@dataclass
class TraceCapsule:
    """Propagate context and quality signals across agent hops."""
    goal_id: str
    pipeline_id: str
    hop: int = 0
    quality_scores: list[float] = field(default_factory=list)
    failure_flags: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

    def advance(self, quality_score: float, hop_metadata: dict | None = None) -> "TraceCapsule":
        """Return a new capsule for the next hop."""
        return TraceCapsule(
            goal_id=self.goal_id,
            pipeline_id=self.pipeline_id,
            hop=self.hop + 1,
            quality_scores=self.quality_scores + [quality_score],
            failure_flags=self.failure_flags.copy(),
            metadata={**self.metadata, **(hop_metadata or {})}
        )

    def flag_failure(self, reason: str) -> "TraceCapsule":
        return TraceCapsule(
            goal_id=self.goal_id,
            pipeline_id=self.pipeline_id,
            hop=self.hop,
            quality_scores=self.quality_scores.copy(),
            failure_flags=self.failure_flags + [reason],
            metadata=self.metadata.copy()
        )

    @property
    def cumulative_quality(self) -> float:
        if not self.quality_scores:
            return 1.0
        return sum(self.quality_scores) / len(self.quality_scores)

    @property
    def has_failures(self) -> bool:
        return len(self.failure_flags) > 0

Now the three agents, each instrumenting the TraceCapsule:

import openai

client = openai.OpenAI()

def research_agent(query: str, trace: TraceCapsule) -> tuple[str, TraceCapsule]:
    """Hop 1: Retrieve and summarize relevant context."""
    router = Router(
        goal_id=trace.goal_id,
        task_type="research",
        hop=trace.hop
    )
    policy = router.get_policy()

    try:
        response = client.chat.completions.create(
            model=policy.model,
            messages=[
                {"role": "system", "content": "You are a research assistant. Return structured findings."},
                {"role": "user", "content": f"Research this topic and return key facts:\n\n{query}"}
            ],
            temperature=0.2
        )
        content = response.choices[0].message.content

        # Assess output quality before passing downstream
        quality = _assess_research_quality(content)
        if quality < 0.4:
            trace = trace.flag_failure(f"hop_0_low_quality:{quality:.2f}")

        router.record_outcome(
            success=quality >= 0.4,
            quality_score=quality,
            tokens_used=response.usage.total_tokens
        )

        return content, trace.advance(quality, {"hop_0_model": policy.model})

    except Exception as e:
        router.record_outcome(success=False, error=str(e))
        trace = trace.flag_failure(f"hop_0_exception:{type(e).__name__}")
        return "", trace.advance(0.0)


def analysis_agent(research_output: str, query: str, trace: TraceCapsule) -> tuple[str, TraceCapsule]:
    """Hop 2: Analyze the research output and extract insights."""
    # If upstream already failed, short-circuit with lower-capability model
    router = Router(
        goal_id=trace.goal_id,
        task_type="analysis",
        hop=trace.hop,
        upstream_quality=trace.cumulative_quality
    )
    policy = router.get_policy()

    if trace.has_failures and trace.cumulative_quality < 0.3:
        # Upstream quality is too low to invest in expensive synthesis
        trace = trace.flag_failure("hop_1_skipped_upstream_quality_too_low")
        return "", trace.advance(0.0)

    if not research_output.strip():
        trace = trace.flag_failure("hop_1_empty_upstream_input")
        router.record_outcome(success=False, error="empty_input")
        return "", trace.advance(0.0)

    try:
        response = client.chat.completions.create(
            model=policy.model,
            messages=[
                {"role": "system", "content": "Analyze the provided research. Identify key insights and gaps."},
                {"role": "user", "content": f"Original question: {query}\n\nResearch findings:\n{research_output}\n\nProvide structured analysis."}
            ],
            temperature=0.3
        )
        content = response.choices[0].message.content
        quality = _assess_analysis_quality(content)

        router.record_outcome(
            success=quality >= 0.5,
            quality_score=quality,
            tokens_used=response.usage.total_tokens
        )

        return content, trace.advance(quality, {"hop_1_model": policy.model})

    except Exception as e:
        router.record_outcome(success=False, error=str(e))
        trace = trace.flag_failure(f"hop_1_exception:{type(e).__name__}")
        return "", trace.advance(0.0)


def writer_agent(analysis_output: str, query: str, trace: TraceCapsule) -> tuple[str, TraceCapsule]:
    """Hop 3: Synthesize final response."""
    router = Router(
        goal_id=trace.goal_id,
        task_type="synthesis",
        hop=trace.hop,
        upstream_quality=trace.cumulative_quality
    )
    policy = router.get_policy()

    if not analysis_output.strip() or trace.cumulative_quality < 0.25:
        trace = trace.flag_failure("hop_2_cannot_synthesize")
        router.record_outcome(success=False, error="insufficient_upstream_quality")
        return _fallback_response(query, trace), trace.advance(0.0)

    try:
        response = client.chat.completions.create(
            model=policy.model,
            messages=[
                {"role": "system", "content": "Write a clear, direct response based on the analysis provided."},
                {"role": "user", "content": f"Question: {query}\n\nAnalysis:\n{analysis_output}"}
            ],
            temperature=0.5
        )
        content = response.choices[0].message.content
        quality = _assess_synthesis_quality(content, query)

        # Final outcome for the goal - this is what matters
        router.record_goal_outcome(
            goal_id=trace.goal_id,
            success=quality >= 0.6 and not trace.has_failures,
            quality_score=quality,
            pipeline_quality=trace.cumulative_quality,
            failure_flags=trace.failure_flags
        )

        return content, trace.advance(quality)

    except Exception as e:
        router.record_outcome(success=False, error=str(e))
        trace = trace.flag_failure(f"hop_2_exception:{type(e).__name__}")
        return _fallback_response(query, trace), trace.advance(0.0)

And the pipeline orchestrator:

def run_pipeline(query: str) -> dict:
    goal_id = f"goal_{uuid.uuid4().hex[:12]}"
    pipeline_id = f"pipe_{int(time.time())}"

    trace = TraceCapsule(goal_id=goal_id, pipeline_id=pipeline_id)

    # Hop 0: Research
    research_output, trace = research_agent(query, trace)

    # Hop 1: Analysis (receives trace with hop 0 quality baked in)
    analysis_output, trace = analysis_agent(research_output, query, trace)

    # Hop 2: Synthesis
    final_output, trace = writer_agent(analysis_output, query, trace)

    return {
        "output": final_output,
        "goal_id": goal_id,
        "pipeline_quality": trace.cumulative_quality,
        "failure_flags": trace.failure_flags,
        "success": not trace.has_failures and trace.cumulative_quality >= 0.5
    }

Why Per-Request Logging Isn't Enough

Most observability setups track individual requests. OpenAI gives you token counts and latencies per call. LangSmith traces individual chain steps. That's necessary but not sufficient for multi-agent systems.

The problem: you need to know whether the goal succeeded, not just whether each LLM call returned a 200.

Consider this scenario: Agent A returns a 200 with 450 tokens used. Agent B returns a 200 with 380 tokens used. Agent C returns a 200 with 520 tokens used. Your per-request logging shows three successful calls. The user got a wrong answer.

Per-goal outcome tracking means recording a single success/failure signal against the original intent, not against each intermediate step. The TraceCapsule pattern carries that goal ID through every hop so that when Agent C records the final outcome, it's attributable to the goal that initiated the pipeline.

This is also how Kalibr's Thompson Sampling works at the pipeline level. Each execution path through your pipeline (which model at each hop, which retry strategy, which fallback) is a bandit arm. Outcomes recorded against the goal feed the sampler, which updates the probability distributions that determine which path gets selected next time. Per-request logs can't feed this because they don't know whether the goal succeeded.
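
Kalibr's internals aren't shown here, but the underlying mechanism is standard Beta-Bernoulli Thompson Sampling. A generic sketch of how goal-level outcomes drive path selection (class and method names here are illustrative, not Kalibr's API):

```python
import random
from collections import defaultdict

class PathSampler:
    """Beta-Bernoulli Thompson Sampling over pipeline execution paths.
    Each path (model per hop, retry strategy, fallback) is one arm."""

    def __init__(self):
        # Beta(1, 1) prior: one pseudo-success and one pseudo-failure per arm.
        self.alpha = defaultdict(lambda: 1.0)
        self.beta = defaultdict(lambda: 1.0)

    def select(self, paths: list[str]) -> str:
        # Sample a plausible success rate from each arm's posterior; take the max.
        return max(paths, key=lambda p: random.betavariate(self.alpha[p], self.beta[p]))

    def record_goal_outcome(self, path: str, success: bool) -> None:
        # Only the goal-level outcome updates the posterior -- per-request
        # 200s say nothing about whether the goal actually succeeded.
        if success:
            self.alpha[path] += 1
        else:
            self.beta[path] += 1
```

After enough goals, paths whose final outputs actually succeed dominate selection, even if a cheaper path returns more per-request 200s.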

See Why Your AI Agent Retries Are Making Things Worse for how retry decisions at individual hops interact with pipeline-level outcomes.

The Context Propagation Problem

The TraceCapsule isn't just about tracking quality scores. It solves a structural problem: agents in a pipeline have no shared memory by default. Each agent call is stateless. The capsule is the shared state.

This matters when you need to make routing decisions based on upstream quality. Without the capsule, Agent B doesn't know that Agent A produced borderline output. It will happily spend tokens on expensive synthesis of low-quality input.

With the capsule pattern:

  • Agent B checks trace.cumulative_quality before deciding how much to invest
  • The router at each hop can use upstream_quality as a feature for model selection
  • Failed hops are propagated forward so Agent C can decide between synthesis and fallback

The alternative, without explicit context propagation, is that each agent call is made with full model capacity regardless of upstream state. You spend the same tokens whether the pipeline is on track or already compromised.
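
One way to make that concrete is a tiering function that downgrades model capability, and therefore spend, as cumulative upstream quality drops. This is a sketch of the idea, not Kalibr's routing logic; the tier names and thresholds are placeholders:

```python
def select_model_tier(upstream_quality: float) -> str:
    """Hypothetical tiering: don't pay frontier-model prices to
    synthesize input the pipeline has already compromised."""
    if upstream_quality >= 0.7:
        return "frontier"      # pipeline on track: full capability
    if upstream_quality >= 0.4:
        return "mid-tier"      # degraded but salvageable
    return "cheap-or-skip"     # likely headed to fallback anyway
```

In the pipeline above, this is the kind of decision `Router` makes when given `upstream_quality=trace.cumulative_quality`.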

Failure Modes by Hop

Each hop in the pipeline has characteristic failure modes:

Hop 0 (Research/Retrieval)

  • Empty retrieval: no relevant context found, agent fabricates
  • Partial retrieval: some context but missing key facts, downstream analysis has gaps
  • Hallucinated structure: agent returns well-formatted JSON with fabricated values

Hop 1 (Analysis/Synthesis)

  • Uncritical acceptance: agent treats Hop 0 output as ground truth regardless of quality
  • Over-extraction: agent finds patterns in noise, produces confident-looking analysis of garbage
  • Context loss: agent summarizes away the specific facts that were actually needed

Hop 2 (Output/Writer)

  • Confident wrongness: high-quality prose built on flawed analysis
  • Compounding hedges: if upstream agents hedged their outputs, the writer produces vague output
  • Format compliance masking failure: output passes schema validation but fails on content

The quality assessment functions (_assess_research_quality, etc.) in the example above are where you encode your domain-specific checks. They don't have to be sophisticated. A research output with fewer than 100 tokens probably failed. An analysis output with no structured sections probably failed. These heuristics, combined with per-goal outcome tracking, give you enough signal to route intelligently.

Diagnostics with get_insights()

Once you have outcome data flowing through Kalibr, you can query it to understand where your pipeline is degrading:

import kalibr

insights = kalibr.get_insights(
    goal_prefix="goal_",
    lookback_hours=24,
    group_by="hop"
)

for hop_data in insights.by_hop:
    print(f"Hop {hop_data.hop}: {hop_data.success_rate:.1%} success, "
          f"avg quality {hop_data.avg_quality:.2f}, "
          f"top failures: {hop_data.top_failure_flags[:3]}")

# Outputs something like:
# Hop 0: 94.2% success, avg quality 0.71, top failures: ['hop_0_low_quality:0.38', 'hop_0_exception:APITimeoutError']
# Hop 1: 88.7% success, avg quality 0.65, top failures: ['hop_1_skipped_upstream_quality_too_low', 'hop_1_empty_upstream_input']
# Hop 2: 91.3% success, avg quality 0.73, top failures: ['hop_2_cannot_synthesize', 'hop_2_exception:RateLimitError']

This surfaces where in the pipeline quality is degrading, which failure flags are most common, and which models at which hops are performing best. Without per-goal tracking, you'd have to reconstruct this from disparate request logs.

Key Takeaways

Multi-agent pipelines require observability primitives that don't exist in single-agent setups:

  1. TraceCapsule or equivalent - explicit context propagation across hops
  2. Per-goal outcome tracking - success recorded against the original intent, not each LLM call
  3. Upstream quality as a routing input - don't spend tokens synthesizing bad input
  4. Hop-level failure flags - propagate failure signals forward so downstream agents can decide

The failure mode you really want to avoid is the pipeline that looks fine in your request logs and costs you full token spend but produces wrong answers at a 30% rate. That failure is invisible without goal-level tracking.

For more on how Thompson Sampling applies to routing decisions at each hop, see Stop Hardcoding Your AI Model Selection.
