How I Built a Self-Correcting Multi-Agent System for Healthcare — and Why Standard ML Metrics Failed Me
Tags: ai, python, healthcare, machinelearning
Cover image: 04_argus_correction.png
I have been building production AI systems for 28 years. At UnitedHealth Group I ran a 20,000-node Big Data Platform. At R1 RCM I was inside the $4.1B Cloudmed acquisition. At Duke Energy I run AI and product engineering for critical infrastructure.
None of that experience prepared me for the specific engineering problem of building a reliable multi-agent system for healthcare revenue cycle management.
This post is about what I learned, what broke, and what I had to invent to make it work. The code is real. The numbers are from production.
The problem with agentic systems in regulated environments
Most agentic system tutorials show you a single agent calling a few tools and returning a result. That is fine for demos. It is not fine when the agent is making claims submission decisions on a $300M annual revenue stream for a hospital system.
The core issues I ran into, in order of how badly they burned me:
1. LLMs are not deterministic enough for sequential RCM workflows
Give the same clinical note to the same model twice and you will get subtly different ICD-10 code recommendations. In a classification task that is fine — you measure accuracy across a test set. In an agent that is making 14 sequential decisions across a claims workflow, small inconsistencies compound. A slightly different coding recommendation in step 3 changes the prior authorization requirement in step 5, which changes the denial probability score in step 8.
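The compounding effect is easy to quantify. A quick sketch (the per-step consistency figures here are illustrative, not measured numbers):

```python
# If each of 14 sequential decisions independently reproduces the
# previous run's decision with probability p, the chance the whole
# workflow is reproduced end-to-end is p**14.
def sequence_consistency(per_step: float, steps: int = 14) -> float:
    return per_step ** steps

print(f"{sequence_consistency(0.99):.1%}")  # ~86.9%
print(f"{sequence_consistency(0.95):.1%}")  # ~48.8%
```

Even 99% per-step agreement leaves more than one in eight workflows diverging somewhere along the path.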
2. Standard metrics do not capture agentic failure modes
Precision and recall tell you nothing about whether the agent followed the right path to get to a correct answer. An agent that approves the right claim after six wrong turns is not a success — it is a future liability. I needed metrics that measured the sequential behavior of the agent across a workflow, not just the final output.
3. PHI in prompts is a HIPAA violation waiting to happen
This one is obvious in theory and surprisingly hard in practice. The moment you build a multi-agent system where context is passed between agents, you have to be extremely deliberate about what is in that context. A naive implementation will leak PHI into prompt context within the first week of real data.
4. There was no observability framework built for agents
Datadog, Arize, WhyLabs — all excellent for ML model monitoring. None of them answer the questions I needed answered: Is this agent's output grounded in the source data? Is it consistent across similar inputs? Is it recovering from failures autonomously or silently degrading?
What I built: ARIA and the frameworks around it
ARIA is a hierarchical multi-agent system: one Supervisor agent orchestrating 10 specialist agents across the full RCM workflow. I will not walk through all 11 agents here — the full architecture is in the Medium article linked at the end. What I want to focus on are the three engineering innovations that made it reliable enough for production healthcare.
Innovation 1: G-ARVIS — a 6-dimension observability framework for agents
I defined G-ARVIS to answer the specific observability questions that no existing tool addressed. Six dimensions, scored per agent execution, in real time.
```python
from dataclasses import dataclass

@dataclass
class GARVISScore:
    groundedness: float    # Output traceable to source data (0-1)
    accuracy: float        # Factual correctness of output (0-1)
    reliability: float     # Consistency across similar inputs (0-1)
    variance: float        # Stability under edge cases (0-1)
    inference_cost: float  # Token efficiency per correct output (0-1)
    safety: float          # PHI enforcement, HIPAA compliance (0-1)

    @property
    def composite(self) -> float:
        return (
            self.groundedness * 0.20 +
            self.accuracy * 0.20 +
            self.reliability * 0.18 +
            self.variance * 0.17 +
            self.inference_cost * 0.10 +
            self.safety * 0.15
        )

    @property
    def is_production_ready(self) -> bool:
        # Safety is a hard gate — any PHI violation fails immediately
        if self.safety < 1.0:
            return False
        return self.composite >= 0.85
```
The weighting is intentional. Groundedness and Accuracy carry the most weight because in healthcare, a hallucinated output is not an annoyance — it is a compliance event. Safety carries 15% but is also a hard gate: any execution that touches PHI in the prompt context fails immediately regardless of the composite score.
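To make the gate concrete, here is a standalone check (GARVISScore is re-stated compactly so the snippet runs on its own; the example scores are invented):

```python
from dataclasses import dataclass

@dataclass
class GARVISScore:
    groundedness: float
    accuracy: float
    reliability: float
    variance: float
    inference_cost: float
    safety: float

    @property
    def composite(self) -> float:
        return (self.groundedness * 0.20 + self.accuracy * 0.20
                + self.reliability * 0.18 + self.variance * 0.17
                + self.inference_cost * 0.10 + self.safety * 0.15)

    @property
    def is_production_ready(self) -> bool:
        # Safety gate first, then the composite threshold
        return self.safety >= 1.0 and self.composite >= 0.85

# Near-perfect quality, but one tokenization miss (safety 0.99):
leaky = GARVISScore(0.99, 0.99, 0.99, 0.99, 0.99, 0.99)
print(f"{leaky.composite:.3f}", leaky.is_production_ready)  # 0.990 False

# Lower quality scores, but perfect safety:
clean = GARVISScore(0.90, 0.90, 0.88, 0.87, 0.95, 1.00)
print(f"{clean.composite:.3f}", clean.is_production_ready)  # 0.911 True
```

A run with a 0.990 composite still fails if safety is anything below 1.0; that asymmetry is the point.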
Why Variance is the hardest dimension to score
Variance measures output stability under edge cases — ambiguous clinical notes, incomplete payer data, conflicting authorization histories. The challenge is that you can only measure it retrospectively across a population of similar inputs. We use a sliding window of the last 200 similar executions and measure the coefficient of variation on key output fields.
```python
import numpy as np
from collections import deque

class VarianceMonitor:
    def __init__(self, window_size: int = 200):
        self.window = deque(maxlen=window_size)

    def record(self, output_vector: list[float]):
        self.window.append(output_vector)

    def score(self) -> float:
        if len(self.window) < 10:
            return 1.0  # insufficient data, assume stable
        arr = np.array(list(self.window))
        # Coefficient of variation per output dimension
        cv = np.std(arr, axis=0) / (np.mean(arr, axis=0) + 1e-8)
        # Score: 1.0 = perfectly stable, 0.0 = completely unstable
        return float(np.clip(1.0 - np.mean(cv), 0.0, 1.0))
```
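To see how the scoring behaves, here is the same coefficient-of-variation formula run in isolation on synthetic data (the distributions are made up for illustration):

```python
import numpy as np

def cv_stability_score(arr: np.ndarray) -> float:
    # Same formula as VarianceMonitor.score: mean coefficient of
    # variation across output dimensions, inverted and clipped to [0, 1]
    cv = np.std(arr, axis=0) / (np.mean(arr, axis=0) + 1e-8)
    return float(np.clip(1.0 - np.mean(cv), 0.0, 1.0))

rng = np.random.default_rng(42)
# 200 executions x 4 output fields, tightly clustered around 0.9
stable = rng.normal(loc=0.9, scale=0.01, size=(200, 4))
# Same mean, but edge-case noise widens the spread
drifty = rng.normal(loc=0.9, scale=0.15, size=(200, 4))

print(f"stable: {cv_stability_score(stable):.3f}")  # roughly 0.99
print(f"drifty: {cv_stability_score(drifty):.3f}")  # roughly 0.83
```

Because the score is driven by relative spread, a wide-but-rare tail of edge-case outputs drags it down quickly, which is why this dimension moves slowly.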
Current production Variance score: 91.7%. This is the dimension I am least satisfied with and where most of our active engineering effort is focused. Target is 95%+.
Innovation 2: Three new agentic metrics
I defined these three metrics because nothing like them existed anywhere I could find, and I needed them.
Action Sequence Fidelity (ASF)
What percentage of agent execution paths match the optimal RCM workflow path? This requires defining the optimal path — which we did by analyzing 50,000 adjudicated claims and extracting the decision sequence that led to first-pass approval with minimum rework.
```python
from difflib import SequenceMatcher

class ASFCalculator:
    def __init__(self, optimal_paths: dict[str, list[str]]):
        # optimal_paths: claim_type -> sequence of agent actions
        self.optimal_paths = optimal_paths

    def calculate(self, claim_type: str, actual_path: list[str]) -> float:
        optimal = self.optimal_paths.get(claim_type, [])
        if not optimal:
            return 1.0  # no baseline, assume correct
        return SequenceMatcher(None, optimal, actual_path).ratio()

    def batch_asf(self, executions: list[dict]) -> float:
        scores = [
            self.calculate(e["claim_type"], e["path"])
            for e in executions
        ]
        return sum(scores) / len(scores) if scores else 0.0
```
Current production ASF: 91.4%.
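SequenceMatcher.ratio() is 2M/T, where M is the number of matching actions and T is the combined length of both paths, so a single wrong turn on a five-step path costs roughly a tenth of the score (the action names below are hypothetical; the real optimal paths come from the adjudicated-claims analysis above):

```python
from difflib import SequenceMatcher

optimal = ["validate", "code", "auth_check", "denial_score", "submit"]
# One extraneous recoding step inserted mid-path:
actual = ["validate", "code", "recode", "auth_check", "denial_score", "submit"]

ratio = SequenceMatcher(None, optimal, actual).ratio()
print(f"{ratio:.3f}")  # 2*5 / (5+6) = 0.909
```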
Error Recovery Rate (ERR)
When an agent encounters a failure, how often does it recover autonomously? This is straightforward to measure — you track every exception event and whether it resolved within the ARGUS correction loop or escalated to human review.
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ExecutionEvent:
    execution_id: str
    agent_id: str
    timestamp: datetime
    exception_type: Optional[str]
    resolved_autonomously: Optional[bool]
    attempts: int

class ERRTracker:
    def __init__(self):
        self.events: list[ExecutionEvent] = []

    def record(self, event: ExecutionEvent):
        self.events.append(event)

    def calculate_err(self, window_hours: int = 24) -> float:
        cutoff = datetime.now().timestamp() - (window_hours * 3600)
        recent = [
            e for e in self.events
            if e.timestamp.timestamp() > cutoff
            and e.exception_type is not None
        ]
        if not recent:
            return 1.0
        autonomous = sum(
            1 for e in recent
            if e.resolved_autonomously is True
        )
        return autonomous / len(recent)
```
Current production ERR: 87.3%.
Cost Per Correct Sequence (CPCS)
Total LLM inference cost for one complete, correct RCM workflow execution. This is your unit economics metric. If CPCS exceeds the margin on the claim being processed, the system is not profitable to operate regardless of how accurate it is.
```python
from dataclasses import dataclass

@dataclass
class SequenceCost:
    execution_id: str
    total_input_tokens: int
    total_output_tokens: int
    model_rates: dict  # model_id -> (input_rate, output_rate) per 1M tokens
    was_correct: bool
    attempts: int

    def total_cost_usd(self) -> float:
        # NB: applies the full token totals against every model's rates;
        # exact for single-model sequences, an upper bound for mixed ones
        input_cost = sum(
            (self.total_input_tokens / 1_000_000) * rate[0]
            for rate in self.model_rates.values()
        )
        output_cost = sum(
            (self.total_output_tokens / 1_000_000) * rate[1]
            for rate in self.model_rates.values()
        )
        return input_cost + output_cost

class CPCSCalculator:
    def __init__(self):
        self.sequences: list[SequenceCost] = []

    def record(self, seq: SequenceCost):
        self.sequences.append(seq)

    def calculate_cpcs(self) -> float:
        correct = [s for s in self.sequences if s.was_correct]
        if not correct:
            return float('inf')
        total_cost = sum(s.total_cost_usd() for s in correct)
        return total_cost / len(correct)
```
Current production CPCS: $0.023 per claim end-to-end.
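For intuition on what drives a number like that, here is a back-of-the-envelope with illustrative token counts and rates (these are assumptions for the sketch, not our actual volumes or contract pricing):

```python
# Hypothetical per-call footprint for one agent step:
input_tokens, output_tokens = 8_000, 1_500
input_rate, output_rate = 0.15, 0.60  # USD per 1M tokens (illustrative)

cost_per_call = (
    (input_tokens / 1e6) * input_rate
    + (output_tokens / 1e6) * output_rate
)
agent_calls = 11  # supervisor plus 10 specialists, one call each

print(f"${cost_per_call * agent_calls:.4f} per claim")
```

At these assumed rates, eleven calls land in the low-single-cent range per claim; retries from the correction loop multiply the per-call cost, which is why CPCS counts attempts, not just successes.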
Innovation 3: ARGUS — autonomous self-correction
ARGUS is the layer that makes the system reliable enough for production. The core insight: instead of trying to make an LLM deterministically correct on the first attempt, you build a reflection loop that detects failure, analyzes the failure mode by G-ARVIS dimension, and generates a corrected prompt.
```python
from dataclasses import dataclass
from typing import Any, Awaitable, Callable

@dataclass
class CorrectionResult:
    output: Any
    score: GARVISScore
    attempts: int
    corrected: bool
    escalated: bool

class ARGUSGuard:
    def __init__(
        self,
        max_attempts: int = 3,
        target_composite: float = 0.85,
        safety_threshold: float = 1.0,  # hard gate
        domain: str = "healthcare_rcm",
        phi_safe: bool = True
    ):
        self.max_attempts = max_attempts
        self.target_composite = target_composite
        self.safety_threshold = safety_threshold
        self.domain = domain
        self.phi_safe = phi_safe

    async def execute_with_correction(
        self,
        agent_fn: Callable[..., Awaitable[Any]],
        task: dict,
        scorer: "GARVISScorer"
    ) -> CorrectionResult:
        attempt = 0
        current_task = task.copy()
        while attempt < self.max_attempts:
            output = await agent_fn(current_task)
            score = await scorer.score(output, self.domain)

            # PHI hard gate — fail immediately, do not retry
            if score.safety < self.safety_threshold:
                return CorrectionResult(
                    output=None,
                    score=score,
                    attempts=attempt + 1,
                    corrected=False,
                    escalated=True
                )

            if score.composite >= self.target_composite:
                return CorrectionResult(
                    output=output,
                    score=score,
                    attempts=attempt + 1,
                    corrected=attempt > 0,
                    escalated=False
                )

            # Score below threshold — reflect and refine
            current_task = self._reflect_and_refine(
                original_task=task,
                failed_output=output,
                score=score,
                attempt=attempt
            )
            attempt += 1

        # All attempts exhausted — escalate to human review
        return CorrectionResult(
            output=output,
            score=score,
            attempts=attempt,
            corrected=False,
            escalated=True
        )

    def _reflect_and_refine(
        self,
        original_task: dict,
        failed_output: Any,
        score: GARVISScore,
        attempt: int
    ) -> dict:
        # Identify the weakest dimensions and generate
        # a dimension-specific correction signal
        weak_dims = self._weakest_dimensions(score)
        correction_prompt = self._build_correction_prompt(
            original_task, failed_output, weak_dims, attempt
        )
        refined = original_task.copy()
        refined["correction_context"] = correction_prompt
        refined["attempt"] = attempt + 1
        return refined

    def _weakest_dimensions(self, score: GARVISScore) -> list[str]:
        dims = {
            "groundedness": score.groundedness,
            "accuracy": score.accuracy,
            "reliability": score.reliability,
            "variance": score.variance,
            "inference_cost": score.inference_cost,
        }
        # Return dimensions below 0.85, sorted weakest first
        return sorted(
            [k for k, v in dims.items() if v < 0.85],
            key=lambda k: dims[k]
        )
```
The _build_correction_prompt method is proprietary — that is where the domain-specific healthcare knowledge lives. But the structure above is fully open in the ARGUS SDK.
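For a sense of the shape, here is a hypothetical sketch of a dimension-specific correction builder. The dimension names come from G-ARVIS above; the hint text and function are invented for illustration and are not the proprietary implementation:

```python
# Hypothetical per-dimension correction hints (illustrative only)
CORRECTION_HINTS = {
    "groundedness": "Cite the source field for every claim element; drop anything not traceable.",
    "accuracy": "Re-verify each code against the supplied reference data before answering.",
    "reliability": "Follow the canonical decision order; do not reorder steps.",
    "variance": "Prefer the conservative interpretation when the note is ambiguous.",
    "inference_cost": "Return the structured fields only; no narrative.",
}

def build_correction_prompt(weak_dims: list[str], attempt: int) -> str:
    # One targeted hint per failing dimension, weakest first
    hints = "\n".join(f"- {CORRECTION_HINTS[d]}" for d in weak_dims)
    return (
        f"Previous attempt {attempt + 1} scored below threshold on: "
        f"{', '.join(weak_dims)}.\n"
        f"Before retrying, apply these corrections:\n{hints}"
    )

print(build_correction_prompt(["groundedness", "variance"], attempt=0))
```

The design choice that matters is that the correction signal is targeted at the failing dimension, not a generic "try again".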
The PHI tokenization architecture
This is the part that took the longest to get right. The requirement: agents need full clinical context to make good RCM decisions, but no PHI can appear in any LLM prompt.
```python
import hashlib
import hmac
import re

class PHITokenizer:
    # Patterns for common PHI types
    PHI_PATTERNS = {
        "MRN": r"\bMRN[-:\s]?\d{6,10}\b",
        "DOB": r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b",
        "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
        "NAME": r"\b[A-Z][a-z]+\s[A-Z][a-z]+\b",
        "NPI": r"\bNPI[-:\s]?\d{10}\b",
    }

    def __init__(self, secret_key: bytes):
        self.secret_key = secret_key
        self._token_map: dict[str, str] = {}
        self._reverse_map: dict[str, str] = {}

    def _generate_token(self, phi_value: str, phi_type: str) -> str:
        # Deterministic: same PHI always maps to same token
        raw = f"{phi_type}:{phi_value}"
        token_bytes = hmac.new(
            self.secret_key,
            raw.encode(),
            hashlib.sha256
        ).hexdigest()[:16]
        return f"[{phi_type}_TOKEN_{token_bytes.upper()}]"

    def tokenize(self, text: str) -> str:
        tokenized = text
        for phi_type, pattern in self.PHI_PATTERNS.items():
            for match in re.findall(pattern, tokenized):
                token = self._generate_token(match, phi_type)
                self._token_map[match] = token
                self._reverse_map[token] = match
                tokenized = tokenized.replace(match, token)
        return tokenized

    def rehydrate(self, tokenized_text: str) -> str:
        result = tokenized_text
        for token, phi_value in self._reverse_map.items():
            result = result.replace(token, phi_value)
        return result

    def is_phi_clean(self, text: str) -> bool:
        return not any(
            re.search(pattern, text)
            for pattern in self.PHI_PATTERNS.values()
        )
```
Every prompt that goes to an LLM passes through tokenize() first. Every output that gets committed to the RCM state machine passes through rehydrate() inside the secure perimeter. The is_phi_clean() check is what the G-ARVIS Safety dimension calls before every inference.
Production Safety score: 100%. Zero PHI exposure events.
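The property the HMAC construction buys you is determinism without reversibility: the same PHI value always tokenizes to the same placeholder, so cross-agent references stay consistent, but the token cannot be inverted without the secret key. A standalone check of that property (demo key and MRN values only):

```python
import hashlib
import hmac

SECRET = b"demo-key-do-not-use-in-production"

def phi_token(value: str, phi_type: str) -> str:
    # Same construction as PHITokenizer._generate_token above
    digest = hmac.new(SECRET, f"{phi_type}:{value}".encode(), hashlib.sha256)
    return f"[{phi_type}_TOKEN_{digest.hexdigest()[:16].upper()}]"

a = phi_token("MRN-4417823", "MRN")
b = phi_token("MRN-4417823", "MRN")  # same PHI, same token
c = phi_token("MRN-4417824", "MRN")  # different PHI, different token
print(a == b, a == c)  # True False
```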
Install and get started
The ARGUS SDK — G-ARVIS scoring, ASF/ERR/CPCS calculators, PHITokenizer base class, and ARGUSGuard correction loop — is open-core and on PyPI.
```shell
pip install argus-ai
```

```python
from argus_ai import ARGUSGuard, GARVISScorer, PHITokenizer
from argus_ai.metrics import ASFCalculator, ERRTracker, CPCSCalculator

# Wrap any async agent function with self-correction
guard = ARGUSGuard(
    max_attempts=3,
    target_composite=0.85,
    domain="healthcare_rcm",
    phi_safe=True
)

# Inside an async context:
result = await guard.execute_with_correction(
    agent_fn=my_denial_predictor,
    task=claim_task,
    scorer=GARVISScorer()
)

print(f"Score: {result.score.composite:.1%}")
print(f"Attempts: {result.attempts}")
print(f"Escalated: {result.escalated}")
```
Production results
These are from the live ARIA system, 24-hour rolling average:
| Metric | Value |
|---|---|
| G-ARVIS composite | 93.9% |
| Groundedness | 96.2% |
| Accuracy | 94.8% |
| Reliability | 93.1% |
| Variance | 91.7% |
| Inference Cost | 95.3% |
| Safety | 100% |
| Action Sequence Fidelity | 91.4% |
| Error Recovery Rate | 87.3% |
| Cost Per Correct Sequence | $0.023 |
| Denial rate reduction | 38% |
What is open vs proprietary
Open (argus-ai on PyPI + GitHub):
- ARGUSGuard correction loop
- GARVISScorer base framework
- PHITokenizer base class
- ASF, ERR, CPCS calculators
- PulseFlow MLOps pipeline
Proprietary (the ARIA product):
- 11-agent supervisor hierarchy with RCM domain specialization
- Payer policy RAG with live contract updates
- Predictive denial scoring model
- RCM domain knowledge engine
- Multi-tenant deployment infrastructure
Links
- GitHub: github.com/anilatambharii/argus-ai
- PyPI: pypi.org/project/argus-ai
- Platform: ambharii.com/RCM
- Full architecture article: medium.com/p/9d0c9f8d662a
- Questions or contributions: anil@ambharii.com
If you are building agentic systems in regulated industries and running into the same observability and reliability problems — I would genuinely like to hear from you. The metrics definitions are public. Use them, improve them, tell me what is wrong with them.
Anil Prasad — Founder, Ambharii Labs · Head of Engineering & Product, Duke Energy · Top 100 AI Leaders USA 2024
#HumanWritten #ExpertiseFromField