When we started building a multi-agent compliance system, we thought the hard part would be making agents accurate. We were wrong. The hard part is making them auditable.
This post covers the architectural patterns we discovered while running 347 production AI agents across regulated industries — financial services, healthcare, and government contracting. If you're building multi-agent systems that need to survive a compliance audit, this is for you.
The Multi-Agent Orchestration Problem
Single-agent architectures are straightforward to reason about. One model, one prompt, one output, one audit trail. The moment you introduce multiple agents — each with different specializations, competing recommendations, and varying confidence levels — you create a compliance nightmare.
Here's why: regulators don't care about your architecture diagram. They care about who decided what, when, and why. In a multi-agent system, "who" is ambiguous. Did the summarization agent make the call? The risk-scoring agent? The orchestrator that chose between them?
We identified three core challenges that every multi-agent compliance system must solve:
1. Decision Attribution
When Agent A provides input to Agent B, which then triggers Agent C to produce a final recommendation — who owns the decision? Traditional audit trails capture the final output. Compliance requires capturing the decision chain.
The pattern that works: treat every agent interaction as a signed transaction. Each agent emits a structured decision record containing its input context, reasoning trace, confidence score, and output. The orchestrator maintains a directed acyclic graph (DAG) of these records. Any compliance query — "why did the system recommend X?" — becomes a graph traversal.
```typescript
interface DecisionRecord {
  agent_id: string;
  timestamp: string;         // ISO 8601
  input_hash: string;        // SHA-256 of the input context
  reasoning_trace: string[];
  confidence: number;        // 0.0 to 1.0
  output_hash: string;       // SHA-256 of the output
  parent_records: string[];  // IDs of upstream records (the DAG edges)
}
```
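To make the traversal concrete, here's a minimal sketch of a backward trace over that DAG. It assumes records live in an in-memory map keyed by an `id` field; in production this would be a database-backed graph query, and the names are illustrative:

```typescript
// Backward trace: from a final record, walk parent_records until every
// contributing upstream decision has been visited. Iterative, with a
// seen-set, so shared ancestors in the DAG are only visited once.
function backwardTrace(
  startId: string,
  records: Map<string, DecisionRecord & { id: string }>,
): DecisionRecord[] {
  const chain: DecisionRecord[] = [];
  const seen = new Set<string>();
  const stack = [startId];
  while (stack.length > 0) {
    const id = stack.pop()!;
    if (seen.has(id)) continue;
    seen.add(id);
    const record = records.get(id);
    if (!record) continue; // missing parent: flag this in a real audit
    chain.push(record);
    stack.push(...record.parent_records);
  }
  return chain;
}
```

A forward trace is the same walk in the other direction, over an inverted index from record ID to child records.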
This isn't optional decoration. Under SOC 2 Type II, HIPAA, and SEC Rule 206(4)-7, you need to demonstrate that automated decisions are traceable to their inputs. A flat log file doesn't cut it.
2. Compliance Verification at Inference Time
Most teams bolt compliance checks on after the agent produces output. This is backwards. By the time you're checking whether an output violates a regulation, you've already spent the compute, introduced latency, and created a failure mode where the check itself can silently fail.
The pattern that works: compliance gates embedded in the inference pipeline. Before an agent's output is accepted by the orchestrator, it passes through a lightweight verification layer that checks:
- Regulatory boundary violations: Does this output reference data the agent isn't authorized to access under the applicable regulation?
- Confidence thresholds: Is the agent's confidence above the minimum required for this decision category?
- Contradiction detection: Does this output contradict another agent's output that's already been accepted in the same decision chain?
- PII leakage scanning: Does the output contain personally identifiable information that shouldn't propagate downstream?
The key insight: these gates must execute in under 20ms to avoid degrading the user experience. That rules out calling another LLM for verification. We use a combination of rule engines, embedding similarity checks, and pre-compiled regulatory boundary maps.
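Here's a sketch of what that gate layer can look like. The `AgentOutput` shape, the `GateCheck` signature, and the example check are illustrative assumptions, standing in for the rule engines and precompiled boundary maps:

```typescript
interface AgentOutput {
  agentId: string;
  text: string;
  confidence: number;        // agent's self-reported confidence
  accessedSources: string[]; // data sources the agent read from
}

type GateResult = { pass: boolean; reason?: string };
type GateCheck = (output: AgentOutput) => GateResult;

// Run every gate; the first failure rejects the output before it
// reaches the orchestrator. Each check must be a fast, deterministic
// function: no LLM calls inside the gate.
function runGates(output: AgentOutput, checks: GateCheck[]): GateResult {
  for (const check of checks) {
    const result = check(output);
    if (!result.pass) return result;
  }
  return { pass: true };
}

// Example gate: confidence threshold for a decision category.
const confidenceGate =
  (min: number): GateCheck =>
  (output) =>
    output.confidence >= min
      ? { pass: true }
      : { pass: false, reason: `confidence ${output.confidence} < ${min}` };
```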
At Sturna, we've gotten this down to sub-18ms per gate by using what we call a biomimetic auction architecture — agents compete rather than collaborate, and compliance verification happens as part of the auction scoring, not as a separate step. The result is that non-compliant outputs never even enter the candidate pool. More on the architecture in our technical whitepaper.
3. The Audit Trail Architecture
Here's where most teams get it wrong: they think "audit trail" means "logging." It doesn't. An audit trail for a multi-agent system needs to support four distinct query patterns:
Forward trace: Given an input, show every agent that touched it and what they did.
Backward trace: Given an output, show the complete chain of decisions that produced it.
Temporal query: Show me everything the system did between T1 and T2 for entity X.
Counterfactual query: If agent A had produced output Y instead of Z, would the final decision have changed?
The last one is what separates "we have logs" from "we're audit-ready." Regulators increasingly want to understand not just what happened, but what would have happened under different conditions. This is especially critical for financial services under the SEC's new AI guidance.
The architecture pattern: an append-only event store with materialized views for each query pattern. Every agent interaction produces an immutable event. Views are rebuilt on read, never mutated. This gives you:
- Tamper-evident history (append-only = no retroactive edits)
- Efficient querying (materialized views optimized per pattern)
- Reproducibility (replay events to verify any historical decision)
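Here's a minimal in-memory sketch of that shape. The event kinds and field names are illustrative, and a production deployment would back this with a WORM store or an event-sourcing database rather than an array:

```typescript
interface AuditEvent {
  seq: number;       // assigned on append, strictly increasing
  timestamp: string; // ISO 8601 in UTC, so string comparison orders correctly
  entityId: string;
  agentId: string;
  kind: "gate_pass" | "gate_fail" | "score" | "interaction";
  payload: unknown;
}

class EventStore {
  private events: AuditEvent[] = [];

  // Append-only: no update or delete path exists, by construction.
  append(event: Omit<AuditEvent, "seq">): AuditEvent {
    const stored: AuditEvent = { ...event, seq: this.events.length };
    this.events.push(stored);
    return stored;
  }

  // Temporal query: everything the system did for one entity in [t1, t2].
  temporal(entityId: string, t1: string, t2: string): AuditEvent[] {
    return this.events.filter(
      (e) => e.entityId === entityId && e.timestamp >= t1 && e.timestamp <= t2,
    );
  }

  // Materialized views are pure folds over the log, rebuilt on read.
  // Replaying the same events always reproduces the same view.
  fold<T>(reduce: (acc: T, e: AuditEvent) => T, initial: T): T {
    return this.events.reduce(reduce, initial);
  }
}
```

Counterfactual queries fall out of the same design: fork the log at the event you want to change, substitute the hypothetical output, and replay forward to see whether the final decision changes.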
The GSAR Pipeline
We formalized our approach as the GSAR pipeline: Gate, Score, Audit, Report.
Gate: Compliance verification at inference time. Non-compliant outputs are rejected before they enter the decision chain.
Score: Every accepted output receives a composite compliance score based on confidence, regulatory alignment, and consistency with the decision chain.
Audit: The append-only event store captures every gate pass/fail, every score, and every agent interaction. Immutable, tamper-evident, queryable.
Report: Automated compliance report generation. Given a time range and entity, produce a complete decision history with regulatory citations.
The GSAR pipeline runs at every agent interaction, not as a batch process. Real-time compliance is the only kind that matters when you're making decisions at inference speed.
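Stitched together with the sketches above, one GSAR pass per interaction looks roughly like this. The scoring weights and the alignment stub are placeholders, not production values:

```typescript
// One Gate -> Score -> Audit pass per agent interaction. Reuses the
// runGates and EventStore sketches above; Report is a read-side query
// over the store, so it doesn't appear in the hot path.
function gsarStep(
  entityId: string,
  output: AgentOutput,
  checks: GateCheck[],
  store: EventStore,
): { output: AgentOutput; score: number } | null {
  const gate = runGates(output, checks);
  store.append({
    timestamp: new Date().toISOString(),
    entityId,
    agentId: output.agentId,
    kind: gate.pass ? "gate_pass" : "gate_fail",
    payload: gate,
  });
  if (!gate.pass) return null; // rejected before entering the decision chain

  // Composite compliance score; the 0.6/0.4 weights are illustrative.
  const regulatoryAlignment = 1.0; // stub: a rule-engine score in practice
  const score = 0.6 * output.confidence + 0.4 * regulatoryAlignment;
  store.append({
    timestamp: new Date().toISOString(),
    entityId,
    agentId: output.agentId,
    kind: "score",
    payload: { score },
  });
  return { output, score };
}
```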
What We Learned from 347 Agents
Running this at scale taught us things the architecture diagrams don't show:
Agent drift is the silent killer. An agent that was compliant on day one gradually drifts as its context window accumulates edge cases. We now run weekly compliance regression tests — replay historical inputs and verify outputs still pass all gates.
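A sketch of that weekly regression loop, assuming a corpus of saved historical inputs and a `runAgent` stand-in for however the agent is actually invoked:

```typescript
// Drift regression: replay saved inputs through the current agent and
// confirm every output still clears every gate. Any failure is a drift
// alert, even if the new output "looks" reasonable.
async function complianceRegression(
  historicalInputs: string[],
  runAgent: (input: string) => Promise<AgentOutput>,
  checks: GateCheck[],
): Promise<Array<{ input: string; reason?: string }>> {
  const failures: Array<{ input: string; reason?: string }> = [];
  for (const input of historicalInputs) {
    const output = await runAgent(input);
    const gate = runGates(output, checks);
    if (!gate.pass) failures.push({ input, reason: gate.reason });
  }
  return failures;
}
```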
Confidence calibration is harder than accuracy. An agent that's 90% accurate but thinks it's 99% confident is more dangerous than an agent that's 80% accurate and knows it.
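A crude but useful check for this, assuming a labeled evaluation set (real calibration work would bin by confidence, e.g. expected calibration error):

```typescript
// Positive gap = overconfident agent: it claims more certainty than its
// accuracy earns. The 90%-accurate, 99%-confident agent shows up here
// long before it shows up in an incident report.
function calibrationGap(
  results: Array<{ confidence: number; correct: boolean }>,
): number {
  const meanConfidence =
    results.reduce((sum, r) => sum + r.confidence, 0) / results.length;
  const accuracy =
    results.filter((r) => r.correct).length / results.length;
  return meanConfidence - accuracy;
}
```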
The audit trail is the product. We initially built the audit system for regulators. Turns out, our customers use it more than we do.
Latency budgets are non-negotiable. Compliance verification that adds 500ms per request will be turned off by engineering within a month. Our 18ms budget is the maximum tolerable overhead.
Try It Yourself
If you're building multi-agent systems for regulated industries, the patterns above will save you months. The implementation details are in our technical whitepaper.
For reproducible benchmarks, see sturna.ai/benchmarks-vs.
Built by the team at Sturna.ai — compliance intelligence for AI agent systems.