Anil Prasad

Posted on • Originally published at Medium

How I Built a Self-Correcting Multi-Agent System for Healthcare - and Why Standard ML Metrics Failed Me


Tags: ai, python, healthcare, machinelearning



I have been building production AI systems for 28 years. At UnitedHealth Group I ran a 20,000-node Big Data Platform. At R1 RCM I was inside the $4.1B Cloudmed acquisition. At Duke Energy I run AI and product engineering for critical infrastructure.

None of that experience prepared me for the specific engineering problem of building a reliable multi-agent system for healthcare revenue cycle management.

This post is about what I learned, what broke, and what I had to invent to make it work. The code is real. The numbers are from production.


The problem with agentic systems in regulated environments

Most agentic system tutorials show you a single agent calling a few tools and returning a result. That is fine for demos. It is not fine when the agent is making claims submission decisions on a $300M annual revenue stream for a hospital system.

The core issues I ran into, in order of how badly they burned me:

1. LLMs are not deterministic enough for sequential RCM workflows

Give the same clinical note to the same model twice and you will get subtly different ICD-10 code recommendations. In a classification task that is fine — you measure accuracy across a test set. In an agent that is making 14 sequential decisions across a claims workflow, small inconsistencies compound. A slightly different coding recommendation in step 3 changes the prior authorization requirement in step 5, which changes the denial probability score in step 8.
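To put a number on that compounding: if each step reproduces its own decision 98% of the time (an illustrative figure, not a measured one), a 14-step workflow agrees with itself end-to-end only about three times out of four.

```python
# Illustrative: per-step output consistency compounds multiplicatively
# across a sequential workflow. Both numbers are hypothetical.
per_step_consistency = 0.98   # same input -> same decision, per step
steps = 14                    # sequential decisions in the claims workflow

end_to_end = per_step_consistency ** steps
print(f"End-to-end consistency: {end_to_end:.1%}")  # ~75.4%
```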

2. Standard metrics do not capture agentic failure modes

Precision and recall tell you nothing about whether the agent followed the right path to get to a correct answer. An agent that approves the right claim after six wrong turns is not a success — it is a future liability. I needed metrics that measured the sequential behavior of the agent across a workflow, not just the final output.

3. PHI in prompts is a HIPAA violation waiting to happen

This one is obvious in theory and surprisingly hard in practice. The moment you build a multi-agent system where context is passed between agents, you have to be extremely deliberate about what is in that context. A naive implementation will leak PHI into prompt context within the first week of real data.

4. There was no observability framework built for agents

Datadog, Arize, WhyLabs — all excellent for ML model monitoring. None of them answer the questions I needed answered: Is this agent's output grounded in the source data? Is it consistent across similar inputs? Is it recovering from failures autonomously or silently degrading?


What I built: ARIA and the frameworks around it

ARIA is a hierarchical multi-agent system: one Supervisor agent orchestrating 10 specialist agents across the full RCM workflow. I will not walk through all 11 agents here — the full architecture is in the Medium article linked at the end. What I want to focus on are the three engineering innovations that made it reliable enough for production healthcare.


Innovation 1: G-ARVIS — a 6-dimension observability framework for agents

I defined G-ARVIS to answer the specific observability questions that no existing tool addressed. Six dimensions, scored per agent execution, in real time.

from dataclasses import dataclass

@dataclass
class GARVISScore:
    groundedness: float    # Output traceable to source data (0-1)
    accuracy: float        # Factual correctness of output (0-1)
    reliability: float     # Consistency across similar inputs (0-1)
    variance: float        # Stability under edge cases (0-1)
    inference_cost: float  # Token efficiency per correct output (0-1)
    safety: float          # PHI enforcement, HIPAA compliance (0-1)

    @property
    def composite(self) -> float:
        return (
            self.groundedness * 0.20 +
            self.accuracy     * 0.20 +
            self.reliability  * 0.18 +
            self.variance     * 0.17 +
            self.inference_cost * 0.10 +
            self.safety       * 0.15
        )

    @property
    def is_production_ready(self) -> bool:
        # Safety is a hard gate — any PHI violation fails immediately
        if self.safety < 1.0:
            return False
        return self.composite >= 0.85

The weighting is intentional. Groundedness and Accuracy carry the most weight because in healthcare, a hallucinated output is not an annoyance — it is a compliance event. Safety carries 15% but is also a hard gate: any execution that touches PHI in the prompt context fails immediately regardless of the composite score.
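To make the gate concrete, here is the dataclass from above (repeated so the snippet runs standalone) exercised with illustrative scores: a composite that clears the 0.85 bar still fails production-readiness the moment Safety dips below 1.0.

```python
from dataclasses import dataclass

# GARVISScore repeated from the article so this snippet is self-contained.
@dataclass
class GARVISScore:
    groundedness: float
    accuracy: float
    reliability: float
    variance: float
    inference_cost: float
    safety: float

    @property
    def composite(self) -> float:
        return (self.groundedness * 0.20 + self.accuracy * 0.20
                + self.reliability * 0.18 + self.variance * 0.17
                + self.inference_cost * 0.10 + self.safety * 0.15)

    @property
    def is_production_ready(self) -> bool:
        if self.safety < 1.0:   # hard gate: any PHI finding fails outright
            return False
        return self.composite >= 0.85

# Illustrative values: a strong composite cannot rescue a safety violation.
leaky = GARVISScore(0.97, 0.96, 0.94, 0.92, 0.95, safety=0.98)
print(f"{leaky.composite:.3f}")    # ~0.954, well above the 0.85 bar
print(leaky.is_production_ready)   # False: the safety gate trips first
```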

Why Variance is the hardest dimension to score

Variance measures output stability under edge cases — ambiguous clinical notes, incomplete payer data, conflicting authorization histories. The challenge is that you can only measure it retrospectively across a population of similar inputs. We use a sliding window of the last 200 similar executions and measure the coefficient of variation on key output fields.

import numpy as np
from collections import deque

class VarianceMonitor:
    def __init__(self, window_size: int = 200):
        self.window = deque(maxlen=window_size)

    def record(self, output_vector: list[float]):
        self.window.append(output_vector)

    def score(self) -> float:
        if len(self.window) < 10:
            return 1.0  # insufficient data, assume stable
        arr = np.array(list(self.window))
        # Coefficient of variation per output dimension
        cv = np.std(arr, axis=0) / (np.mean(arr, axis=0) + 1e-8)
        # Score: 1.0 = perfectly stable, 0.0 = completely unstable
        return float(np.clip(1.0 - np.mean(cv), 0.0, 1.0))

Current production Variance score: 91.7%. This is the dimension I am least satisfied with and where most of our active engineering effort is focused. Target is 95%+.
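As a sanity check on the scoring formula, here is a pure-Python restatement of the same coefficient-of-variation math (no numpy dependency), fed one perfectly stable output stream and one flip-flopping one. The output vectors are made up.

```python
import statistics

# Pure-Python restatement of the VarianceMonitor scoring math above.
def variance_score(window: list[list[float]]) -> float:
    if len(window) < 10:
        return 1.0  # insufficient data, assume stable
    dims = list(zip(*window))  # transpose: one tuple per output dimension
    cvs = [
        statistics.pstdev(d) / (statistics.fmean(d) + 1e-8)
        for d in dims
    ]
    mean_cv = statistics.fmean(cvs)
    return max(0.0, min(1.0, 1.0 - mean_cv))

stable = [[0.8, 0.3]] * 20              # identical outputs -> no variance
print(variance_score(stable))           # 1.0

noisy = [[0.8, 0.3], [0.2, 0.9]] * 10   # flip-flopping outputs
print(variance_score(noisy))            # well below 1.0
```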


Innovation 2: Three new agentic metrics

I defined these because ASF, ERR, and CPCS did not exist anywhere I could find, and I needed them.

Action Sequence Fidelity (ASF)

What percentage of agent execution paths match the optimal RCM workflow path? This requires defining the optimal path — which we did by analyzing 50,000 adjudicated claims and extracting the decision sequence that led to first-pass approval with minimum rework.

from difflib import SequenceMatcher

class ASFCalculator:
    def __init__(self, optimal_paths: dict[str, list[str]]):
        # optimal_paths: claim_type -> sequence of agent actions
        self.optimal_paths = optimal_paths

    def calculate(
        self,
        claim_type: str,
        actual_path: list[str]
    ) -> float:
        optimal = self.optimal_paths.get(claim_type, [])
        if not optimal:
            return 1.0  # no baseline, assume correct

        matcher = SequenceMatcher(
            None,
            optimal,
            actual_path
        )
        return matcher.ratio()

    def batch_asf(self, executions: list[dict]) -> float:
        scores = [
            self.calculate(e["claim_type"], e["path"])
            for e in executions
        ]
        return sum(scores) / len(scores) if scores else 0.0

Current production ASF: 91.4%.
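The ratio is worth building intuition for. SequenceMatcher.ratio() returns 2M/T, where M is the number of matched elements and T is the combined length of the two paths, so even one detour in a short workflow costs noticeably. The action names below are hypothetical, not ARIA's actual labels.

```python
from difflib import SequenceMatcher

# How SequenceMatcher scores an executed path against the optimal one.
optimal = ["verify_eligibility", "assign_codes", "check_auth", "submit"]
actual  = ["verify_eligibility", "assign_codes",
           "recheck_eligibility",          # one unnecessary detour
           "check_auth", "submit"]

# 4 matched elements, 4 + 5 = 9 total elements -> ratio = 8/9
ratio = SequenceMatcher(None, optimal, actual).ratio()
print(f"ASF for this execution: {ratio:.4f}")
```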

Error Recovery Rate (ERR)

When an agent encounters a failure, how often does it recover autonomously? This is straightforward to measure — you track every exception event and whether it resolved within the ARGUS correction loop or escalated to human review.

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ExecutionEvent:
    execution_id: str
    agent_id: str
    timestamp: datetime
    exception_type: Optional[str]
    resolved_autonomously: Optional[bool]
    attempts: int

class ERRTracker:
    def __init__(self):
        self.events: list[ExecutionEvent] = []

    def record(self, event: ExecutionEvent):
        self.events.append(event)

    def calculate_err(
        self,
        window_hours: int = 24
    ) -> float:
        cutoff = datetime.now().timestamp() - (window_hours * 3600)
        recent = [
            e for e in self.events
            if e.timestamp.timestamp() > cutoff
            and e.exception_type is not None
        ]
        if not recent:
            return 1.0

        autonomous = sum(
            1 for e in recent
            if e.resolved_autonomously is True
        )
        return autonomous / len(recent)

Current production ERR: 87.3%.

Cost Per Correct Sequence (CPCS)

Total LLM inference cost for one complete, correct RCM workflow execution. This is your unit economics metric. If CPCS exceeds the margin on the claim being processed, the system is not profitable to operate regardless of how accurate it is.

from dataclasses import dataclass

@dataclass
class SequenceCost:
    execution_id: str
    total_input_tokens: int
    total_output_tokens: int
    model_rates: dict  # model_id -> (input_rate, output_rate) per 1M tokens
    was_correct: bool
    attempts: int

    def total_cost_usd(self) -> float:
        input_cost = sum(
            (self.total_input_tokens / 1_000_000) * rate[0]
            for rate in self.model_rates.values()
        )
        output_cost = sum(
            (self.total_output_tokens / 1_000_000) * rate[1]
            for rate in self.model_rates.values()
        )
        return input_cost + output_cost

class CPCSCalculator:
    def __init__(self):
        self.sequences: list[SequenceCost] = []

    def record(self, seq: SequenceCost):
        self.sequences.append(seq)

    def calculate_cpcs(self) -> float:
        correct = [s for s in self.sequences if s.was_correct]
        if not correct:
            return float('inf')
        total_cost = sum(s.total_cost_usd() for s in correct)
        return total_cost / len(correct)

Current production CPCS: $0.023 per claim end-to-end.
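As a back-of-envelope check on the unit economics, here is the cost arithmetic for one sequence under hypothetical token counts and per-million-token rates; the real figures depend on your model mix.

```python
# Hypothetical token counts and rates -- not ARIA's actual model pricing.
input_tokens = 60_000          # total across all agents in one sequence
output_tokens = 8_000
input_rate = 0.15              # dollars per 1M input tokens
output_rate = 0.60             # dollars per 1M output tokens

cost = (input_tokens / 1_000_000) * input_rate \
     + (output_tokens / 1_000_000) * output_rate
print(f"Cost per sequence: ${cost:.4f}")  # $0.0138 under these assumptions
```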


Innovation 3: ARGUS — autonomous self-correction

ARGUS is the layer that makes the system reliable enough for production. The core insight: instead of trying to make an LLM deterministically correct on the first attempt, you build a reflection loop that detects failure, analyzes the failure mode by G-ARVIS dimension, and generates a corrected prompt.

from dataclasses import dataclass
from typing import Any, Awaitable, Callable

@dataclass
class CorrectionResult:
    output: Any
    score: GARVISScore
    attempts: int
    corrected: bool
    escalated: bool

class ARGUSGuard:
    def __init__(
        self,
        max_attempts: int = 3,
        target_composite: float = 0.85,
        safety_threshold: float = 1.0,  # hard gate
        domain: str = "healthcare_rcm",
        phi_safe: bool = True
    ):
        self.max_attempts = max_attempts
        self.target_composite = target_composite
        self.safety_threshold = safety_threshold
        self.domain = domain
        self.phi_safe = phi_safe

    async def execute_with_correction(
        self,
        agent_fn: Callable[..., Awaitable[Any]],
        task: dict,
        scorer: "GARVISScorer"
    ) -> CorrectionResult:

        attempt = 0
        current_task = task.copy()

        while attempt < self.max_attempts:
            output = await agent_fn(current_task)
            score = await scorer.score(output, self.domain)

            # PHI hard gate — fail immediately, do not retry
            if score.safety < self.safety_threshold:
                return CorrectionResult(
                    output=None,
                    score=score,
                    attempts=attempt + 1,
                    corrected=False,
                    escalated=True
                )

            if score.composite >= self.target_composite:
                return CorrectionResult(
                    output=output,
                    score=score,
                    attempts=attempt + 1,
                    corrected=attempt > 0,
                    escalated=False
                )

            # Score below threshold — reflect and refine
            current_task = self._reflect_and_refine(
                original_task=task,
                failed_output=output,
                score=score,
                attempt=attempt
            )
            attempt += 1

        # All attempts exhausted — escalate to human review
        return CorrectionResult(
            output=output,
            score=score,
            attempts=attempt,
            corrected=False,
            escalated=True
        )

    def _reflect_and_refine(
        self,
        original_task: dict,
        failed_output: Any,
        score: GARVISScore,
        attempt: int
    ) -> dict:
        # Identify the weakest dimension and generate
        # a dimension-specific correction signal
        weak_dims = self._weakest_dimensions(score)
        correction_prompt = self._build_correction_prompt(
            original_task,
            failed_output,
            weak_dims,
            attempt
        )
        refined = original_task.copy()
        refined["correction_context"] = correction_prompt
        refined["attempt"] = attempt + 1
        return refined

    def _weakest_dimensions(
        self,
        score: GARVISScore
    ) -> list[str]:
        dims = {
            "groundedness": score.groundedness,
            "accuracy": score.accuracy,
            "reliability": score.reliability,
            "variance": score.variance,
            "inference_cost": score.inference_cost,
        }
        # Return dimensions below 0.85, sorted weakest first
        return sorted(
            [k for k, v in dims.items() if v < 0.85],
            key=lambda k: dims[k]
        )

The _build_correction_prompt method is proprietary — that is where the domain-specific healthcare knowledge lives. But the structure above is fully open in the ARGUS SDK.


The PHI tokenization architecture

This is the part that took the longest to get right. The requirement: agents need full clinical context to make good RCM decisions, but no PHI can appear in any LLM prompt.

import hashlib
import hmac
import re

class PHITokenizer:
    # Patterns for common PHI types
    PHI_PATTERNS = {
        "MRN":   r"\bMRN[-:\s]?\d{6,10}\b",
        "DOB":   r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b",
        "SSN":   r"\b\d{3}-\d{2}-\d{4}\b",
        "NAME":  r"\b[A-Z][a-z]+\s[A-Z][a-z]+\b",
        "NPI":   r"\bNPI[-:\s]?\d{10}\b",
    }

    def __init__(self, secret_key: bytes):
        self.secret_key = secret_key
        self._token_map: dict[str, str] = {}
        self._reverse_map: dict[str, str] = {}

    def _generate_token(self, phi_value: str, phi_type: str) -> str:
        # Deterministic: same PHI always maps to same token
        raw = f"{phi_type}:{phi_value}"
        token_bytes = hmac.new(
            self.secret_key,
            raw.encode(),
            hashlib.sha256
        ).hexdigest()[:16]
        return f"[{phi_type}_TOKEN_{token_bytes.upper()}]"

    def tokenize(self, text: str) -> str:
        tokenized = text
        for phi_type, pattern in self.PHI_PATTERNS.items():
            matches = re.findall(pattern, tokenized)
            for match in matches:
                token = self._generate_token(match, phi_type)
                self._token_map[match] = token
                self._reverse_map[token] = match
                tokenized = tokenized.replace(match, token)
        return tokenized

    def rehydrate(self, tokenized_text: str) -> str:
        result = tokenized_text
        for token, phi_value in self._reverse_map.items():
            result = result.replace(token, phi_value)
        return result

    def is_phi_clean(self, text: str) -> bool:
        for pattern in self.PHI_PATTERNS.values():
            if re.search(pattern, text):
                return False
        return True

Every prompt that goes to an LLM passes through tokenize() first. Every output that gets committed to the RCM state machine passes through rehydrate() inside the secure perimeter. The is_phi_clean() check is what the G-ARVIS Safety dimension calls before every inference.
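Here is the whole round trip in miniature, reduced to just the SSN pattern; the key and the clinical note are made up.

```python
import hashlib
import hmac
import re

# Minimal round-trip demo of tokenize -> infer -> rehydrate, SSN only.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
key = b"demo-secret-key"       # in production this would live in a KMS
reverse: dict[str, str] = {}   # token -> original PHI value

def tokenize(text: str) -> str:
    for match in SSN.findall(text):
        # Deterministic: the same PHI always maps to the same token
        digest = hmac.new(key, f"SSN:{match}".encode(),
                          hashlib.sha256).hexdigest()[:16]
        token = f"[SSN_TOKEN_{digest.upper()}]"
        reverse[token] = match
        text = text.replace(match, token)
    return text

def rehydrate(text: str) -> str:
    for token, value in reverse.items():
        text = text.replace(token, value)
    return text

note = "Patient SSN 123-45-6789, admitted for observation."
safe = tokenize(note)
assert not SSN.search(safe)      # nothing PHI-shaped leaves the perimeter
assert rehydrate(safe) == note   # lossless inside the secure perimeter
print(safe)
```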

Production Safety score: 100%. Zero PHI exposure events.


Install and get started

The ARGUS SDK — G-ARVIS scoring, ASF/ERR/CPCS calculators, PHITokenizer base class, and ARGUSGuard correction loop — is open-core and on PyPI.

pip install argus-ai
from argus_ai import ARGUSGuard, GARVISScorer, PHITokenizer
from argus_ai.metrics import ASFCalculator, ERRTracker, CPCSCalculator

# Wrap any async agent function with self-correction
guard = ARGUSGuard(
    max_attempts=3,
    target_composite=0.85,
    domain="healthcare_rcm",
    phi_safe=True
)

result = await guard.execute_with_correction(
    agent_fn=my_denial_predictor,
    task=claim_task,
    scorer=GARVISScorer()
)

print(f"Score: {result.score.composite:.1%}")
print(f"Attempts: {result.attempts}")
print(f"Escalated: {result.escalated}")

Production results

These are from the live ARIA system, 24-hour rolling average:

Metric                        Value
-----------------------------------
G-ARVIS composite             93.9%
Groundedness                  96.2%
Accuracy                      94.8%
Reliability                   93.1%
Variance                      91.7%
Inference Cost                95.3%
Safety                        100%
Action Sequence Fidelity      91.4%
Error Recovery Rate           87.3%
Cost Per Correct Sequence     $0.023
Denial rate reduction         38%

What is open vs proprietary

Open (argus-ai on PyPI + GitHub):

  • ARGUSGuard correction loop
  • GARVISScorer base framework
  • PHITokenizer base class
  • ASF, ERR, CPCS calculators
  • PulseFlow MLOps pipeline

Proprietary (the ARIA product):

  • 11-agent supervisor hierarchy with RCM domain specialization
  • Payer policy RAG with live contract updates
  • Predictive denial scoring model
  • RCM domain knowledge engine
  • Multi-tenant deployment infrastructure

Links

  • GitHub: github.com/anilatambharii/argus-ai
  • PyPI: pypi.org/project/argus-ai
  • Platform: ambharii.com/RCM
  • Full architecture article: medium.com/p/9d0c9f8d662a
  • Questions or contributions: anil@ambharii.com

If you are building agentic systems in regulated industries and running into the same observability and reliability problems — I would genuinely like to hear from you. The metrics definitions are public. Use them, improve them, tell me what is wrong with them.


Anil Prasad — Founder, Ambharii Labs · Head of Engineering & Product, Duke Energy · Top 100 AI Leaders USA 2024

#HumanWritten #ExpertiseFromField
