DEV Community

Michael Bogan
When AI Agents Get It Wrong: The Accountability Crisis in Multi-Agent Systems

In the world of security and DevOps, AI agents are being pushed from demos into production quickly. They triage security alerts, coordinate incident response, provision infrastructure, and decide which remediation playbooks to run.

When it all works, everyone is happy. It’s a force multiplier. But when an AI agent fails … who do you blame?

It can be hard to tell who made the call, why it happened, or what evidence exists to explain it. This is even more difficult in multi-agent systems, where responsibility is distributed across models, tools, orchestrators, and human operators. This distribution is powerful, but it also creates an accountability gap that most teams aren’t prepared for.

This is not a theoretical issue. Regulators and standards bodies are converging on real expectations for governance, traceability, and auditability. The NIST AI Risk Management Framework's GOVERN and MAP functions explicitly call for documented roles, risk ownership, and decision provenance for AI systems. The EU AI Act goes further: systems that affect safety or critical infrastructure are classified as high-risk, triggering mandatory requirements for logging, human oversight, and traceability.

These signals mean one thing: accountability can no longer be optional or implicit.

Accountability must be designed into how agentic systems make decisions and how those decisions are recorded and governed.

The point of this post is practical, not academic. If you’re building multi-agent systems in security, DevOps, or observability, you need a clear answer to three questions:

  • What can go wrong?
  • How will you know?
  • Who is accountable when it does?

The good news is that you can make these systems trustworthy without slowing them down. The key is to treat accountability as a product feature, not a compliance afterthought. And you probably already have the tools (hint: it’s your analytics platform) to do just that.

The Accountability Gap

Single-agent systems are already complex, but they at least have a central decision point. Multi-agent systems take it a step further, distributing decisions across multiple components.

One agent classifies the alert, another correlates with telemetry, a third chooses the response.

If the output is wrong, there may be no obvious root cause. Was it the classification prompt? The tool that provided stale data? The orchestrator that weighted the wrong agent? Or a handoff that silently dropped a warning? When your system looks like a team, it can also fail like one, with responsibility spread across roles.

This is where the accountability gap shows up in real teams. Ask a group of engineers, security analysts, and platform owners who is responsible when an agent misses an incident. You’ll get a mix of answers: the model team, the product team, the on-call team, or the vendor.

At the end of the day, you must be able to assign accountability in order to fix problems and improve the system. This means moving from "the algorithm did it" to named responsibility and documented evidence.

In other words, if an agent makes a call that leads to a bad outcome, there must be an identifiable person, team, or role that can explain the decision, the data that informed it, and the controls that were in place.

This shift is already happening inside compliance programs and audit expectations, and it’s coming to product and engineering teams next.

One solution is using your observability and analytics program as an accountability program. When agent decisions are logged with the same rigor as infrastructure events, you can connect outcomes to evidence. You know what was decided and why. That makes accountability real rather than rhetorical.

Of course, that raises the question: how do disagreements in multi-agent systems work? Let's look at that next.

Multi-Agent Conflict Resolution

When multiple agents evaluate the same input (which is common for high-stakes decisions), disagreements are inevitable. You are running multiple models, each with partial context and different heuristics. The important questions are how the system handles these disagreements and whether it makes the conflicts visible to your analytics.

Voting
The simplest pattern is voting: each agent returns a decision, and the majority wins. This is fast, but it can be brittle. A correlated error across two agents can drown out the one that is correct. Voting also makes it easy to hide the disagreement entirely, which is the worst possible outcome from an accountability perspective.

The disagreement itself is a risk signal. You want it recorded and reviewable later.

Here's what a recorded voting disagreement might look like in practice:

{
    "timestamp": "2025-06-14T03:22:19.007Z",
    "trace_id": "abc-7f3a-...",
    "event": "agent_vote_conflict",
    "alert_id": "SEC-90471",
    "votes": [
      {"agent": "classifier-v2", "decision": "suppress", "confidence": 0.72},
      {"agent": "correlation-agent", "decision": "suppress", "confidence": 0.65},
      {"agent": "anomaly-detector", "decision": "escalate", "confidence": 0.88}
    ],
    "resolution": "majority_vote",
    "outcome": "suppress",
    "dissenting_agents": ["anomaly-detector"],
    "dissent_confidence_gap": 0.19
}

Notice the dissent_confidence_gap. The dissenting agent was actually the most confident.

Now it’s time to take advantage of our analytics platform. For this (and other examples in this article), we’ll use Sumo Logic, a cloud analytics platform. In Sumo Logic we set up a scheduled search to alert when a high-confidence agent is outvoted:

_sourceCategory=agents
| json "event", "dissent_confidence_gap"
| where event = "agent_vote_conflict" and dissent_confidence_gap > 0.15

Now the silent disagreement is a flag you can review before it becomes an incident.

Negotiation
Negotiation-based systems are more flexible. One agent can propose a remediation, another bids to handle it, and the orchestrator chooses.

This approach is based on early multi-agent research, but in production it should be grounded in clear criteria and recorded choices. If a lower-cost or lower-confidence agent is chosen as the winner, that decision needs to be visible later in an incident review.
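As a minimal sketch, bid selection with a recorded choice might look like the following. The agent names, bid fields, and the confidence-first/cost-tiebreaker criteria are all illustrative assumptions, not a prescribed policy:

```python
import json
from datetime import datetime, timezone

def choose_bid(alert_id, bids, log=print):
    """Select a winning bid and record the choice for later review.

    `bids` is a list of dicts like {"agent": ..., "confidence": ..., "cost": ...}.
    The selection criteria (highest confidence, lowest cost as tiebreaker)
    are illustrative assumptions.
    """
    winner = max(bids, key=lambda b: (b["confidence"], -b["cost"]))
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": "agent_bid_selected",
        "alert_id": alert_id,
        "bids": bids,
        "winner": winner["agent"],
        "losing_agents": [b["agent"] for b in bids if b["agent"] != winner["agent"]],
    }
    log(json.dumps(record))  # ship this line to your log pipeline
    return winner
```

Because every bid, including the losing ones, lands in the record, an incident review can later see exactly which alternatives were on the table.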

Mediator
Mediator or arbitrator agents resolve conflicts when other agents disagree; usually the mediator is a more capable model. This can work well, but it changes the accountability picture: the mediator becomes a critical decision point, and its reasoning must be traceable.

Importantly, you need to know why the mediator made the decision it made. If you can’t explain why the arbitrator overruled a security warning, you haven't actually improved safety. You’ve just moved the black box.
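One way to keep the mediator out of black-box territory is to force its rationale into the same evidence trail as every other decision. A hedged sketch, where the arbiter's interface (a callable returning a decision and a rationale string) is an assumption:

```python
def mediate(conflict_event, arbiter):
    """Resolve a conflict with a mediator and record its rationale.

    `arbiter` is any callable (for example, a wrapper around a stronger
    model) that returns a (decision, rationale) pair — its interface is
    an assumption for this sketch.
    """
    decision, rationale = arbiter(conflict_event)
    return {
        "event": "mediator_resolution",
        "source_conflict": conflict_event.get("event"),
        "outcome": decision,
        "rationale": rationale,  # without this field, you have only moved the black box
    }
```

The design point is the `rationale` field: a mediator that returns only a decision cannot be audited later.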

In practice, you don’t need academic consensus protocols to get this right. You need simple rules: define how disagreement is detected, set thresholds for escalation, and make the disagreement and resolution visible in logs.

That last part is crucial. Without it, you are left with a clean output but no evidence or audit trail.
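Those simple rules fit in a few lines of code. A sketch, reusing the vote shape from the earlier event; the gap definition (dissenter confidence minus mean majority confidence) and the 0.15 threshold are illustrative assumptions:

```python
from collections import Counter

def resolve_votes(votes, gap_threshold=0.15):
    """Majority vote that surfaces high-confidence dissent instead of hiding it.

    `votes` is a list of dicts like
    {"agent": ..., "decision": ..., "confidence": ...}.
    """
    tally = Counter(v["decision"] for v in votes)
    outcome = tally.most_common(1)[0][0]
    majority = [v for v in votes if v["decision"] == outcome]
    dissenters = [v for v in votes if v["decision"] != outcome]
    mean_majority_conf = sum(v["confidence"] for v in majority) / len(majority)
    dissent_conf = max((v["confidence"] for v in dissenters), default=0.0)
    gap = dissent_conf - mean_majority_conf
    return {
        "event": "agent_vote_conflict" if dissenters else "agent_vote_unanimous",
        "outcome": outcome,
        "dissenting_agents": [v["agent"] for v in dissenters],
        "dissent_confidence_gap": round(gap, 2),
        "needs_review": bool(dissenters) and gap > gap_threshold,
    }
```

Emit the returned event to your log pipeline on every vote, unanimous or not, and the disagreement rate itself becomes a metric you can trend.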

Here is a diagram that demonstrates all the methods:

Let's consider how to turn these ideas into product-level controls.

Process Frameworks for Production

The most important move you can make is to treat agent workflows like production systems, not experiments. That means clear ownership, controlled changes, and reliable telemetry.

Governance
Start with governance. An AI Quality Control function does not have to be a new department. It can be a lightweight set of responsibilities: who approves changes to prompts and thresholds, who reviews the impact of those changes, and who owns the system-level outcomes. If the system is making high-stakes decisions, those roles need to be explicit.

Decision Record
Next, define a decision record. For each agent action, capture the inputs, tool calls, outputs, confidence, and any thresholds or policies applied. A readable summary is useful for humans, but it’s not enough. You need the raw evidence. This is where analytics platforms are extremely useful.

Here's what a structured decision record looks like when ingested into Sumo Logic:

{
    "timestamp": "2025-06-14T03:22:18.441Z",
    "trace_id": "abc-7f3a-...",
    "agent_id": "triage-classifier-v2",
    "action": "classify_alert",
    "inputs": {
      "alert_id": "SEC-90471",
      "source": "waf-east-1",
      "raw_severity": "medium"
    },
    "tool_calls": [
      {"tool": "lookup_ioc_feed", "result": "no_match", "latency_ms": 340},
      {"tool": "get_recent_alerts", "result": "3 similar in 24h", "latency_ms": 122}
    ],
    "output": {"classification": "low", "confidence": 0.72},
    "threshold_applied": "suppress_below_0.80",
    "escalated": false,
    "model": "gpt-5.2",
    "prompt_version": "classifier-v2.4.1"
}

When every agent decision emits a record like this, you can correlate agent actions with infrastructure events. A query as simple as:

_sourceCategory=agents
| json "output.confidence" as confidence, "agent_id"
| where confidence < 0.80
| count by agent_id

surfaces every low-confidence decision across your fleet, creating a searchable evidence trail.
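Producing these records consistently is easier with one shared helper than with ad-hoc logging in each agent. A sketch, where the field names mirror the example record above and the transport (printing a JSON line for a log shipper to pick up) is an assumption:

```python
import json
from datetime import datetime, timezone

def emit_decision_record(trace_id, agent_id, action, inputs, tool_calls,
                         output, threshold_applied, escalated, log=print):
    """Build and ship one structured decision record per agent action.

    `trace_id` should be shared by every agent handling the same alert so
    their records correlate. Swap `log` for your own collector.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trace_id": trace_id,
        "agent_id": agent_id,
        "action": action,
        "inputs": inputs,
        "tool_calls": tool_calls,
        "output": output,
        "threshold_applied": threshold_applied,
        "escalated": escalated,
    }
    log(json.dumps(record))
    return record
```

Because the helper both logs and returns the record, agents can pass it downstream while the evidence trail fills itself in.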

Instrument Everything
Finally, instrument everything. Observability is the bridge between "we think" and "we know." If your agents call tools, read from data stores, and write outputs, those actions should be traced end to end. OpenTelemetry is a practical, vendor-neutral way to make that happen across services and tools.

In practice, wrapping an agent decision in an OpenTelemetry span takes very little code:

from opentelemetry import trace

tracer = trace.get_tracer("agent.triage")

async def classify_alert(alert):
    with tracer.start_as_current_span("classify_alert") as span:
        span.set_attribute("agent.id", "triage-classifier-v2")
        span.set_attribute("agent.prompt_version", "classifier-v2.4.1")
        span.set_attribute("alert.id", alert["id"])

        result = await run_classification(alert)

        span.set_attribute("agent.confidence", result["confidence"])
        span.set_attribute("agent.decision", result["classification"])
        span.set_attribute("agent.escalated", result["confidence"] >= 0.80)
        return result

Each agent in the pipeline emits its own span under the same trace, so you get a full causal chain from alert ingestion to final decision. Sumo Logic's OpenTelemetry integration can then ingest these traces directly, letting you query across agent spans, tool calls, and infrastructure events in one place.

Next let's look at a hypothetical, but very plausible, failure of process that happens when observability is weak.

A realistic failure story (and how it happens)

Imagine a security operations team running a multi-agent triage system. One agent classifies incoming alerts, a second agent correlates with recent logs, and a third decides whether to open a ticket.

A genuine intrusion alert arrives. The classification agent labels it as low priority. The correlation agent flags a weak anomaly but sees no matching indicator. The decision agent chooses to suppress the alert. Hours later, a breach is discovered.

When the incident review begins, the team tries to answer a simple question: why was the alert suppressed?

The logs show the final decision but not the intermediate reasoning. It turns out the correlation tool was operating on stale data due to a delayed pipeline. The classification prompt had been tuned the prior week to reduce noise. The decision agent gave extra weight to the classification agent because it was historically more accurate. The system made a rational choice given its inputs. The problem is that no one can reproduce those inputs or see the disagreement that occurred.

This is the core accountability gap. The organization does not just lack a fix. It lacks a coherent explanation. And without an explanation, it can neither learn nor prove that the system is safe enough to keep in production. That is why analytics and evidence are not nice-to-haves. They are the difference between a system you can trust and one you cannot.

Now imagine the same scenario, but instrumented. The team opens Sumo Logic and runs:

_sourceCategory=agents "abc-7f3a"
| json "agent_id", "action", "tool_calls", "output.confidence", "inputs", "trace_id"
| where trace_id matches "abc-7f3a-*"
| sort by _messageTime asc

They immediately see the full decision chain:

  • The classifier's 0.72 confidence
  • The correlation agent's lookup_ioc_feed call, which returned no_match against data that was 6 hours stale
  • The decision agent's suppression, chosen despite the anomaly detector's 0.88 confidence flag, because the classifier's low-priority label and the correlation agent's no-match result both pointed toward suppression
  • The prompt_version field, which shows the classifier prompt was updated two days ago

Most importantly, they can set a rule so it never happens the same way again: alert when tool_calls reference a data source with freshness older than a threshold, or when a high-confidence dissent is overridden on a security-tagged alert.
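Both rules can run as a scheduled search, or as a lightweight post-decision check in code. A sketch of the latter; the field names (`data_age_hours`, `overridden_dissent_confidence`, `tags`) extend the earlier decision record and are illustrative assumptions, as are both thresholds:

```python
def audit_decision(record, max_staleness_hours=4, dissent_threshold=0.85):
    """Flag decisions built on stale data or overriding high-confidence dissent.

    Returns a list of finding strings; an empty list means the decision
    passed both checks.
    """
    findings = []
    # Rule 1: any tool call that read data older than the freshness threshold
    for call in record.get("tool_calls", []):
        if call.get("data_age_hours", 0) > max_staleness_hours:
            findings.append(f"stale_data:{call['tool']}")
    # Rule 2: a high-confidence dissent overridden on a security-tagged alert
    if ("security" in record.get("tags", [])
            and record.get("overridden_dissent_confidence", 0) >= dissent_threshold):
        findings.append("high_confidence_dissent_overridden")
    return findings
```

Wired into the decision path, the returned findings become alertable events of their own, so the same failure mode trips a flag before it becomes a breach.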

The hours-long investigation now just takes minutes.

Why Does This Matter?

Business stakeholders, developers, and operators care about operational outcomes: fewer false positives, faster triage, and better reliability. Multi-agent systems can deliver those outcomes, but only if the team can trust them.

And trust is not a feeling…or at least it shouldn't be. It’s a property of the system. It comes from being able to answer questions like: What did the agent see? What tools did it call? What did it ignore? Why was that alert suppressed? Who changed the thresholds last week?

This is exactly where observability helps. Your observability and analytics platform probably already collects and correlates logs and metrics at scale. The opportunity is to extend that same rigor to agentic workflows: treat agent decisions as first-class telemetry, and connect them to the infrastructure and security signals they depend on. When you do that, you can move from a black-box system to a transparent one, without sacrificing speed.

Conclusion

Multi-agent systems will become a standard part of modern operations. The teams that win with them will be the teams that treat accountability as a feature, not a burden. They will know who owns what, they will be able to trace decisions, and they will have the evidence to explain outcomes when things go wrong, including the complete trajectory of every agent and all cross-agent communication. That is what trust looks like, and it is what regulators, customers, and internal stakeholders are looking for.

If you are already investing in deep observability, you have most of the building blocks. The next step is to apply them to agentic systems. When AI agents get it wrong, the most important thing is not that they were wrong. It is whether you can prove what happened, learn from it, and show that your system is accountable. This also opens the door for quick improvement, so the system doesn't repeat past mistakes.
