When we started routing real compliance workloads through multi-agent systems, we expected the hard problems to be latency, cost, and model selection. We were wrong. The hardest problem was proving what happened — and when it happened, and which model made which decision, and why.
This is the problem that most multi-agent frameworks treat as an afterthought. It is not an afterthought. For regulated industries, it is the entire product.
The Orchestration Problem Nobody Talks About
Most discussions of multi-agent AI focus on capability: can you chain agents together, can you parallelize tasks, can you route to the right specialist? These are solvable problems. The frameworks are mature.
What they do not solve: compliance verification at inference time.
When you run 20 competing agents against a single compliance question — say, "Does this AI system fall under the EU AI Act's high-risk classification?" — you get 20 different answers, 20 different confidence scores, and 20 different reasoning paths. To pick the winner, you need more than a voting mechanism. You need a verifiable selection protocol that you can explain to an auditor six months later.
This is where most teams reach for logs. Logs are insufficient. Logs tell you what happened. They do not tell you why a specific agent's output was selected, what alternatives were considered, or whether the selection criterion was applied consistently. These are the questions regulators actually ask.
What Compliance Verification at Inference Time Requires
We identified four requirements that any serious compliance AI architecture must satisfy:
1. Decision chain visibility — Every agent invocation must produce an immutable record: which model, which version, which prompt, what input, what output, what latency, what cost. Not a summary — the full artifact.
2. Selection auditability — When multiple agents compete on the same query, the selection mechanism must be deterministic and explainable. "The highest-confidence response won" is not an explanation. The selection criteria, the scoring rubric, and the inputs to the scoring function must all be captured.
3. Contradiction detection — When two agents produce contradictory outputs on a compliance question, that contradiction is itself a compliance signal. A robust system should surface these disagreements rather than silently resolving them through averaging or majority vote.
4. Replay capability — An auditor should be able to replay any historical query against the same agents, with the same prompts, and verify that the output is stable. This requires versioning not just the data, but the agent configurations.
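To make the first and fourth requirements concrete, here is a minimal sketch of what one immutable invocation record might look like. The field names and the content-hash method are illustrative choices, not a prescribed schema:

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class AgentRunRecord:
    """One immutable record per agent invocation (requirement 1)."""
    agent_id: str
    model_name: str
    model_version: str
    prompt_template_version: str  # versioned configuration is what makes replay (requirement 4) possible
    input_text: str
    output_text: str
    latency_ms: float
    cost_usd: float

    def fingerprint(self) -> str:
        """Content hash so later tampering with a stored record is detectable."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()
```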
The GSAR Pipeline Architecture
One architectural pattern that addresses these requirements is what we call a GSAR pipeline: Generate, Score, Audit, Return.
Generate: Submit the query to N specialized agents in parallel. Each agent is domain-scoped — one trained on EU AI Act provisions, one on HIPAA, one on SOC 2 controls, and so on. Parallelization keeps latency under control; you are not serializing 20 LLM calls.
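A minimal sketch of that fan-out, assuming each agent object exposes an async `run(query)` method:

```python
import asyncio


async def generate_all(query: str, agents: list) -> list:
    """Fan the query out to every domain-scoped agent concurrently.

    With N agents, wall-clock latency is roughly one agent call rather than
    N sequential calls; each result is later paired with its run record.
    """
    return await asyncio.gather(*(agent.run(query) for agent in agents))
```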
Score: Apply a deterministic scoring function to the N outputs. The scoring function is itself versioned and auditable. It considers semantic consistency, source citation quality, confidence calibration, and response completeness. The scores are stored alongside the raw outputs.
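One way to keep the scoring function deterministic and auditable is to pin the rubric weights to a version string and return that version with every score. The weights and component names below are placeholders, not a production rubric:

```python
RUBRIC_VERSION = "scoring-rubric/2.3.0"

# Placeholder weights; a real rubric would be calibrated per compliance domain.
WEIGHTS = {
    "semantic_consistency": 0.35,
    "citation_quality": 0.30,
    "confidence_calibration": 0.20,
    "completeness": 0.15,
}


def score_output(components: dict[str, float]) -> dict:
    """Deterministically combine per-output component scores (each in [0, 1]).

    The rubric version and the raw component scores are returned alongside the
    total so the audit trail shows exactly how the number was produced.
    """
    total = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return {
        "rubric_version": RUBRIC_VERSION,
        "components": components,
        "score": round(total, 6),
    }
```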
Audit: Before returning any output, run it through a compliance gate. The gate checks: Does this output contain legally actionable language without appropriate disclaimers? Does it contradict established regulatory guidance in this jurisdiction? Has this specific agent configuration been flagged for quality issues in the past 30 days?
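A minimal sketch of such a gate, assuming the individual checks are supplied as named predicates over the candidate output (the predicates themselves, such as disclaimer detection or guidance lookups, are whatever classifiers the deployment uses):

```python
from typing import Callable


def audit_gate(output: dict, checks: dict[str, Callable[[dict], bool]],
               flagged_agent_ids: set[str]) -> dict:
    """Run a candidate output through the compliance gate before release.

    Each check returns True when the output should be flagged; the gate
    records every failure label rather than stopping at the first one.
    """
    failures = [label for label, check in checks.items() if check(output)]
    if output.get("agent_id") in flagged_agent_ids:  # flagged for quality issues recently
        failures.append("agent_recently_flagged")
    return {"passed": not failures, "failures": failures}
```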
Return: The winning output, plus the full audit trail: all competing outputs, all scores, the scoring rationale, the audit gate decision, and a provenance hash that links this response to the specific agent configurations and model versions used.
The audit trail is not optional. It is the product.
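A sketch of the return envelope. The provenance hash here is simply a digest over the agent configurations and model versions used, which is one simple way to bind a response to the exact setup that produced it:

```python
import hashlib
import json


def build_response(winner: dict, competing_outputs: list[dict], scores: list[dict],
                   gate_result: dict, config_versions: dict[str, str]) -> dict:
    """Assemble the winning output together with its full audit trail."""
    provenance_hash = hashlib.sha256(
        json.dumps(config_versions, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return {
        "answer": winner,
        "audit_trail": {
            "competing_outputs": competing_outputs,
            "scores": scores,
            "audit_gate": gate_result,
            "provenance_hash": provenance_hash,
        },
    }
```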
Biomimetic Auction: A Different Selection Model
Traditional agent orchestration frameworks pick winners through static routing rules or simple quality metrics. Neither approach scales to high-stakes compliance scenarios.
A more robust model is competitive selection — letting agents bid for the query based on their self-assessed competence, then validating those bids against historical performance. This is analogous to how biological immune systems work: specialized cells compete to respond to a specific antigen, and the most relevant responders are selected through a competitive process rather than a predetermined hierarchy.
The practical implementation requires each agent to maintain a performance model: how often has this agent type been correct on queries in this specific domain, at this confidence level, in the past N days? Agents with higher historical accuracy are given more weight.
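A sketch of how a bid might be validated against that performance model; shrinking thinly sampled agents toward a neutral prior is an illustrative choice, not a fixed formula:

```python
def validated_bid(self_assessed_confidence: float, historical_accuracy: float,
                  sample_size: int, min_samples: int = 25) -> float:
    """Discount an agent's self-assessed bid by its verified track record.

    Agents with little history in this domain are shrunk toward a neutral
    prior of 0.5, so a new or overconfident agent cannot win on bravado alone.
    """
    trust = min(sample_size / min_samples, 1.0)
    accuracy_estimate = trust * historical_accuracy + (1 - trust) * 0.5
    return self_assessed_confidence * accuracy_estimate
```

In this scheme, the query goes to whichever agent has the highest validated bid, and both the raw and validated numbers are written to the audit trail.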
In production systems running at sub-18ms latency, the overhead of this competitive selection mechanism is consistently under 4ms — well within acceptable bounds for synchronous compliance queries.
Building the Audit Trail
The core challenge is immutability under update. When a compliance finding is later revised — because a regulation changed, or because a model error was identified — you need to update the finding without losing history. This requires treating compliance outputs as event records, not mutable state.
The pattern:
Compliance events table:
- query_id (immutable after creation)
- agent_run_id (links to full agent execution record)
- output (verbatim response, never updated)
- selection_score (frozen at generation time)
- audit_gate_result (pass/fail/flag, frozen)
- superseded_by (null while current; set to the revising record's query_id once superseded)
- created_at (immutable)
Revisions create new records rather than modifying old ones; the superseded record's superseded_by field is then set to point at its replacement, and that pointer is the only thing that ever changes after creation. You never update a compliance output; you supersede it. For regulated industries, this is the minimum viable audit trail.
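A minimal sketch of the supersede operation, using an in-memory SQLite table with illustrative column types. The only statement that ever touches an existing row is the one that sets its superseded_by pointer:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE compliance_events (
        query_id TEXT PRIMARY KEY,
        agent_run_id TEXT NOT NULL,
        output TEXT NOT NULL,
        selection_score REAL NOT NULL,
        audit_gate_result TEXT NOT NULL,
        superseded_by TEXT,               -- NULL while this record is current
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")


def record_finding(agent_run_id: str, output: str, score: float, gate: str) -> str:
    """Append a new compliance event; nothing here is ever updated in place."""
    query_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO compliance_events "
        "(query_id, agent_run_id, output, selection_score, audit_gate_result) "
        "VALUES (?, ?, ?, ?, ?)",
        (query_id, agent_run_id, output, score, gate),
    )
    return query_id


def supersede_finding(old_query_id: str, agent_run_id: str, output: str,
                      score: float, gate: str) -> str:
    """Insert the revision, then set the old record's forward pointer."""
    new_query_id = record_finding(agent_run_id, output, score, gate)
    conn.execute(
        "UPDATE compliance_events SET superseded_by = ? "
        "WHERE query_id = ? AND superseded_by IS NULL",
        (new_query_id, old_query_id),
    )
    return new_query_id
```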
Safety Gates at Inference Time
Safety gates for compliance AI should check at minimum:
- Jurisdictional scope: Is this output scoped to the correct legal jurisdiction?
- Confidence thresholds: Low-confidence outputs on compliance questions should be flagged, not silently returned.
- Regulatory currency: Has the underlying regulation been updated since this agent's training cutoff?
- Contradiction with established guidance: Does this output contradict published regulatory guidance?
Implementing these gates creates the artifact that makes compliance AI defensible: a system that can demonstrate, for every output it has ever produced, that it passed a documented validation process.
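As a sketch, the four checks above can be expressed as named predicates and fed to the audit_gate function from the GSAR section. The field names, threshold, and cutoff date are placeholders for whatever configuration and regulation-tracking feed a deployment actually uses:

```python
from datetime import date

CONFIDENCE_FLOOR = 0.70              # placeholder threshold
TRAINING_CUTOFF = date(2024, 6, 1)   # placeholder training-data cutoff

gate_checks = {
    "wrong_jurisdiction":   lambda out: out["jurisdiction"] != out["required_jurisdiction"],
    "low_confidence":       lambda out: out["confidence"] < CONFIDENCE_FLOOR,
    "stale_regulation":     lambda out: out["regulation_last_amended"] > TRAINING_CUTOFF,
    "contradicts_guidance": lambda out: out["conflicts_with_published_guidance"],
}

# Usage with the earlier sketch:
#   result = audit_gate(candidate_output, gate_checks, flagged_agent_ids=set())
```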
What 347 Production Agents Taught Us
Confidence calibration degrades faster than accuracy. A model that produces correct outputs 90% of the time may still be dangerous if its confidence scores are miscalibrated — if it expresses high confidence on the 10% of cases where it is wrong. Monitoring confidence calibration as a first-class metric is essential.
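One way to treat calibration as a first-class metric is to compute a binned calibration error over a rolling window of resolved queries, as sketched below; the bin count and windowing policy are illustrative:

```python
def calibration_gap(confidences: list[float], correct: list[bool], bins: int = 10) -> float:
    """Expected calibration error: bin-weighted average of |confidence - accuracy|.

    A model can hold 90% accuracy while this number drifts upward, which is
    exactly the miscalibration failure described above.
    """
    totals = [0] * bins
    hits = [0.0] * bins
    confs = [0.0] * bins
    for conf, ok in zip(confidences, correct):
        b = min(int(conf * bins), bins - 1)
        totals[b] += 1
        hits[b] += ok
        confs[b] += conf
    n = len(confidences)
    return sum(
        (totals[b] / n) * abs(confs[b] / totals[b] - hits[b] / totals[b])
        for b in range(bins)
        if totals[b]
    )
```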
Inter-agent disagreement is a leading indicator of regulatory change. When agents that previously agreed on a question start disagreeing, it usually means the underlying regulatory landscape has shifted. This is a valuable signal most frameworks throw away.
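A simple way to keep that signal is to track, per question, the share of agents giving the modal answer and alert when it drops well below its historical baseline. The drop threshold below is an illustrative choice:

```python
from collections import Counter


def agreement_rate(answers: list[str]) -> float:
    """Fraction of agents that gave the modal answer for a single query."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def disagreement_alert(current_rate: float, baseline_rate: float,
                       drop_threshold: float = 0.2) -> bool:
    """Flag a question when agreement falls well below its historical baseline."""
    return (baseline_rate - current_rate) >= drop_threshold
```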
Audit trail completeness has a direct effect on enterprise sales velocity. Enterprises in regulated industries do not buy compliance AI that cannot produce an audit trail. The audit trail is not a feature; it is table stakes for the conversation.
For the technical deep dive on our architecture: sturna.ai/whitepaper. For reproducible benchmark results: sturna.ai/benchmarks-vs.