Aniket Hingane
Routing Medical Claims with an Intelligent Agent: Deterministic Logic Meets Structured AI Output


How I Built a Claims Prioritization Engine Using Agent Workflows, Tool Calls, and Pydantic-Validated Outputs

ClaimsRouter-AI Animation


TL;DR

Hospital billing departments receive hundreds — sometimes thousands — of insurance claims per day. Every one of them needs to be triaged, sorted, and assigned to the right specialist before money can flow. In this experimental project, I built an agentic pipeline called ClaimsRouter-AI that handles this automatically. The agent runs a chain of deterministic, rule-based tools first (no LLM involved at the computation stage), then hands the results to a Gemini model for structured reasoning, and finally validates every output through a Pydantic v2 schema. The result is a system that routes a batch of 50 synthetic claims in under a second, assigns each to the correct revenue cycle queue, and produces a complete audit trail of every decision — including which tool computed what, and why.

The full code is on GitHub: github.com/aniket-work/claimsrouter-ai


Introduction

When I first started poking at the revenue cycle management space, the friction was immediately clear. A billing specialist in a mid-size hospital receives a claim, manually reads the denial code, checks how long the claim has been sitting in accounts receivable, cross-references the payer's contracted rates, and then — based on years of experience — decides whether to appeal, escalate, or write it off. That process, done manually for hundreds of claims per day, is exactly the kind of thing that agent-based systems are quietly starting to replace.

What caught my attention was not the AI angle specifically. It was the hybrid nature of the problem. Some parts of it — aging bands, financial impact scoring, payer-specific appeal windows — are pure deterministic computation. There is no ambiguity: a claim that is 95 days old is in the "90+ DAYS" bucket. That is not a judgment call. It is a lookup. But other parts of the problem — synthesizing the right rationale, recommending specific actions for a specialist, deciding whether to escalate a $95,000 claim to a supervisor — benefit enormously from language model reasoning.

In my opinion, one of the biggest mistakes builders make with agentic workflows is letting the LLM do everything. The model calls a tool, the model computes numbers, the model decides priority, the model writes the report. That is fragile, expensive, and difficult to audit. The more interesting design — and the one I settled on for this experiment — separates the concerns cleanly: tools compute, the LLM synthesizes, and Pydantic validates.

Revenue cycle management gave me a clean, real-world domain to test this pattern. The stakes are real: a misrouted claim might miss its appeal window, costing a hospital tens of thousands of dollars. The rules are specific: appeal windows, timely filing limits, denial code categories, and payer contract terms are all well-defined. And the outputs need to be auditable — someone in compliance will eventually ask why a particular $80,000 claim was written off rather than appealed.


What's This Article About?

This article walks through the design and implementation of ClaimsRouter-AI, a claims routing and prioritization agent for hospital revenue cycle departments. The system takes a batch of medical insurance claims, runs each through a multi-tool deterministic pipeline, synthesizes the results with a language model, and returns a structured RoutingDecision object for each claim that specifies:

  1. The target specialist queue (DENIAL_APPEAL, UNDERPAYMENT_RECOVERY, COB_RESOLUTION, WRITE_OFF, COLLECTOR_FOLLOWUP, AUDIT_REVIEW)
  2. A composite priority score (0-100)
  3. Estimated recoverable dollar value
  4. Step-by-step recommended actions
  5. A supervisor escalation flag for high-complexity cases
  6. A complete trace of every tool call made during processing

The full pipeline generates an aggregated ClaimsRoutingReport after processing the batch — with queue distribution, total AR value, and average priority scores per queue — displayed as a formatted ASCII table in the terminal.
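To make that contract concrete, here is the shape of a single routing decision as described above, with made-up values (a sketch only; the real object is a validated Pydantic model):

```python
import json

# Illustrative values only -- not output from the real system.
decision = {
    "claim_id": "CLM-000123AB",
    "assigned_queue": "DENIAL_APPEAL",   # one of the six specialist queues
    "priority_score": 71.3,              # composite score, 0-100
    "financial_impact_usd": 13800.00,    # estimated recoverable value
    "recommended_actions": [
        "Verify the CARC denial code against the payer's denial letter",
        "File a first-level appeal within the payer's appeal window",
    ],
    "escalate_to_supervisor": False,
    "tool_call_trace": [],               # one entry per tool call, in order
}

print(json.dumps(decision, indent=2))
```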


Tech Stack

Component           Technology
Agent Orchestrator  Python 3.10+
LLM                 Google Gemini 2.0 Flash (google-generativeai)
Structured Output   Pydantic v2
Synthetic Data      Faker
CLI Reporting       Tabulate, Rich
Environment Config  python-dotenv

The choice to use Gemini 2.0 Flash was deliberate. At temperature 0.1, it behaves nearly deterministically — which is what this use case needs. The synthesis stage is not creative writing. It is structured reasoning over pre-computed numbers, and a Flash-class model handles that well at low temperature without the cost overhead of a larger model.


Why Read It?

From my experience building agent-based prototypes, the discussion usually centers on one of two themes: either "how do I connect my agent to more tools" or "how do I improve my prompts." In my opinion, neither of those is the most important question for systems intended to operate adjacent to real workflows.

The most important question is: where do you draw the boundary between deterministic code and LLM reasoning?

That boundary question is what this article is actually about. The revenue cycle routing problem gave me a clean real-world context to explore it, because the domain naturally splits into rule-based territory (compliance deadlines, contractual rates, denial code categories) and contextual reasoning territory (which specific action is most appropriate for this claim, given its combination of ICD-10 codes, payer, and aging). By the end of this article, I think you will have a clear mental model of how to make that split in your own domain.


Let's Design

The architecture has three distinct layers. Here is how I thought about it:

Architecture Diagram

The input is a ClaimRecord Pydantic model — a fully typed representation of an insurance claim with fields like days_in_ar, charged_amount, payer_type, denial_code, and claim_status.

The middle layer is the tool chain. Five deterministic functions run in sequence; the first four compute numeric scores and the fifth applies routing rules:

  1. compute_aging_band() classifies the claim into one of four HFMA-standard AR buckets
  2. score_financial_impact() estimates the recoverable dollar value as a 0-100 score
  3. apply_payer_rules() applies payer-specific denial risk factors (Medicare, Medicaid, Commercial, Managed Care, Self-Pay each have distinct configurations)
  4. calculate_priority_score() combines the three scores into a weighted composite
  5. preselect_queue() applies deterministic business rules to choose a routing queue

Each tool returns a ToolCallResult Pydantic object — not a raw number. This gives every computation a name, a label, a confidence score, and a summary of the inputs used. That trace becomes part of the final RoutingDecision, so any downstream system or auditor can see exactly what happened and why.
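As a sketch of that result object (the project defines it as a Pydantic model; this stdlib dataclass stand-in just shows the fields the article describes):

```python
from dataclasses import dataclass

# Stand-in for the project's Pydantic ToolCallResult model.
# Field names follow the trace output shown later in the article.
@dataclass(frozen=True)
class ToolCallResult:
    tool_name: str       # which function ran
    input_summary: str   # human-readable description of the inputs
    result_value: float  # the numeric output
    result_label: str    # human-readable interpretation
    confidence: float    # 1.0 = fully deterministic, <1.0 = involves estimation

r = ToolCallResult("compute_aging_band", "days_in_ar=67", 70.0, "61-90 DAYS", 1.0)
print(f"[{r.tool_name}] {r.input_summary} -> {r.result_label} (confidence={r.confidence:.2f})")
```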

The top layer is the LLM. It receives the claim data and all five tool call results, then returns a small JSON object with the final queue assignment, rationale, recommended actions, and escalation flag. That JSON is then assembled into a validated RoutingDecision model.

Here is the processing sequence in full detail:

Sequence Diagram

And the routing decision flow through each conditional:

Flow Diagram

One design decision I thought hard about: the LLM receives a pre-selected queue from preselect_queue() as a strong prior. The system prompt instructs the model to adjust that selection only if it has compelling contextual reasoning. In my testing, the model agreed with the rule-based pre-selection around 85% of the time — and the overrides tended to occur in cases where the ICD-10 and CPT code combination suggested a clinical context that changed the routing calculus. That is exactly the kind of signal a language model is well-suited to reason about and that deterministic rules cannot easily capture.


Let's Get Cooking

Step 1: The Pydantic Data Contracts

The first file I wrote was src/models.py. In my view, starting with data models is the right discipline for any agentic system. Once you define what your inputs and outputs look like, the implementation becomes much clearer.

class ClaimRecord(BaseModel):
    """
    A single insurance claim record ingested from the billing system.
    Represents the structured input to the ClaimsRouter agent.
    """
    claim_id: str = Field(..., description="Unique claim identifier", min_length=6)
    patient_id: str = Field(..., description="Anonymized patient reference ID")
    service_date: date = Field(..., description="Date medical service was rendered")
    submission_date: date = Field(..., description="Date claim was submitted to payer")
    charged_amount: float = Field(..., gt=0, description="Total amount billed to payer (USD)")
    allowed_amount: Optional[float] = Field(None, ge=0, description="Amount payer approved")
    paid_amount: Optional[float] = Field(None, ge=0, description="Amount payer actually paid")
    payer_type: PayerType = Field(..., description="Insurance payer category")
    denial_code: Optional[str] = Field(None, description="CARC denial reason code if denied")
    claim_status: ClaimStatus = Field(..., description="Current claim lifecycle status")
    cpt_codes: List[str] = Field(..., description="Procedure codes billed on this claim")
    icd10_codes: List[str] = Field(..., description="Diagnosis codes on this claim")
    days_in_ar: int = Field(..., ge=0, description="Days claim has been in accounts receivable")

    @property
    def underpayment_amount(self) -> float:
        if self.allowed_amount is not None and self.paid_amount is not None:
            return max(0.0, self.allowed_amount - self.paid_amount)
        return 0.0

    @property
    def is_denied(self) -> bool:
        return self.claim_status == ClaimStatus.DENIED

What This Does:
ClaimRecord is the single source of truth for claim data entering the pipeline. Pydantic's field validators enforce non-negative financials and minimum length on identifiers at instantiation time — before any tool ever sees the data. If a claim record arrives with a negative paid amount (data quality issue), it fails at the schema layer rather than propagating through to corrupt a routing decision.

Why I Structured It This Way:
The underpayment_amount and is_denied computations live as @property methods rather than computed fields. This keeps the model lean and avoids storing derived values in serialized form. The underpayment calculation is simple enough that it does not need a dedicated tool call — it is just arithmetic that belongs on the model itself.

What I Learned:
Starting with the output models first — RoutingDecision and ClaimsRoutingReport — forced me to think backwards from what a billing department actually needs to see. That reverse-engineering of the output schema made every other design decision cleaner. I recommend this approach for any domain where the output has regulatory or audit implications.


Step 2: The Deterministic Tool Chain

This is where I spent the most time thinking. Each function in src/tools.py is pure, stateless, and deterministic. There are no API calls inside any of these functions. No randomness. No LLM. They are just carefully parameterized computation.

def compute_aging_band(claim: ClaimRecord) -> ToolCallResult:
    """
    Classify claim into an aging band based on days in accounts receivable.
    Aging bands follow HFMA standard AR management buckets.
    """
    days = claim.days_in_ar

    if days <= 30:
        band = AgingBand.CURRENT
        urgency_score = 20.0
    elif days <= 60:
        band = AgingBand.FIRST_BUCKET
        urgency_score = 45.0
    elif days <= 90:
        band = AgingBand.SECOND_BUCKET
        urgency_score = 70.0
    else:
        band = AgingBand.CRITICAL
        urgency_score = min(100.0, 70.0 + (days - 90) * 0.3)

    return ToolCallResult(
        tool_name="compute_aging_band",
        input_summary=f"days_in_ar={days}",
        result_value=urgency_score,
        result_label=band.value,
        confidence=1.0,
    )

What This Does:
Translates days_in_ar into a scored, labeled bucket. The 90+ bucket has a continuous scale — 0.3 points per additional day beyond 90 — rather than a flat score. A 92-day claim and a 175-day claim should not get identical urgency scores. The latter is approaching most payers' timely filing limits (typically 365 days for Medicare, 180 days for Medicaid), at which point the claim becomes completely unrecoverable regardless of merit.

Why I Chose the Confidence of 1.0:
Confidence=1.0 signals that this is truly deterministic computation with no estimation involved. Other tools have lower confidence values (0.85, 0.90) because they involve estimates — "60% of charged amount is recoverable on appeal" is an industry benchmark, not a certainty. Having explicit confidence scores on every ToolCallResult gives the LLM synthesis layer a signal about which inputs to treat as hard facts versus soft estimates.
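To make the continuous 90+ scale concrete, here is the band logic condensed into a standalone function (my own restatement of the listing above):

```python
def aging_urgency(days_in_ar: int) -> float:
    """Urgency score per the HFMA-style bands shown above."""
    if days_in_ar <= 30:
        return 20.0
    if days_in_ar <= 60:
        return 45.0
    if days_in_ar <= 90:
        return 70.0
    # Continuous scale past 90 days: +0.3 per additional day, capped at 100
    return min(100.0, 70.0 + (days_in_ar - 90) * 0.3)

print(aging_urgency(92))   # 70.6
print(aging_urgency(175))  # 95.5
print(aging_urgency(250))  # 100.0 (capped)
```

The 92-day and 175-day claims land 24.9 urgency points apart, which is the point of the continuous scale.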

The priority score calculation is where the three tool outputs come together into a single signal:

def calculate_priority_score(
    aging_result: ToolCallResult,
    financial_result: ToolCallResult,
    payer_result: ToolCallResult,
) -> ToolCallResult:
    """
    Combine three tool outputs into a single composite priority score.

    Weights:
      Aging urgency:    35%  (time-sensitivity is paramount in AR management)
      Financial impact: 45%  (revenue recovery drives queue prioritization)
      Denial risk:      20%  (inversely weighted - high risk lowers recovery priority)
    """
    W_AGING = 0.35
    W_FINANCIAL = 0.45
    W_RISK = 0.20

    # Denial risk is inversely applied: low risk = higher recovery priority
    risk_inverse_score = (1.0 - payer_result.result_value) * 100.0

    composite = (
        aging_result.result_value * W_AGING
        + financial_result.result_value * W_FINANCIAL
        + risk_inverse_score * W_RISK
    )

    return ToolCallResult(
        tool_name="calculate_priority_score",
        input_summary=(
            f"aging={aging_result.result_value:.1f}, "
            f"financial={financial_result.result_value:.1f}, "
            f"risk_inv={risk_inverse_score:.1f}"
        ),
        result_value=min(100.0, max(0.0, composite)),
        result_label=f"Priority: {composite:.1f}/100",
        confidence=min(aging_result.confidence, financial_result.confidence, payer_result.confidence),
    )

What I Learned:
The weighting decision (aging 35%, financial 45%, risk 20%) came from thinking about what a billing director actually optimizes for. Revenue recovery is the primary objective, so financial impact carries the heaviest weight. Aging matters because claims approaching timely filing limits become permanently unrecoverable — urgency is not optional. Denial risk is inversely weighted because a claim with very high denial risk should not be the top priority for active recovery — it should flow toward write-off processing, which the preselect_queue() tool handles through separate business rule logic.
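A quick worked example with those weights (the function name and input values here are mine, purely illustrative):

```python
W_AGING, W_FINANCIAL, W_RISK = 0.35, 0.45, 0.20

def composite_priority(aging: float, financial: float, denial_risk: float) -> float:
    """Weighted composite as described above; denial risk is inversely applied."""
    risk_inverse = (1.0 - denial_risk) * 100.0
    score = aging * W_AGING + financial * W_FINANCIAL + risk_inverse * W_RISK
    return min(100.0, max(0.0, score))

# An aged, low-risk claim outranks a fresh, high-risk one of equal value:
print(round(composite_priority(aging=71.5, financial=60.0, denial_risk=0.30), 1))  # 66.0
print(round(composite_priority(aging=20.0, financial=60.0, denial_risk=0.90), 1))  # 36.0
```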


Step 3: The Payer Configuration System

One of the more nuanced aspects of src/tools.py is the PAYER_CONFIG dictionary. Payer-specific rules vary enormously in revenue cycle management, and treating all payers as equivalent would produce incorrect routing.

PAYER_CONFIG: Dict[PayerType, Dict] = {
    PayerType.MEDICARE: {
        "appeal_window_days": 120,
        "timely_filing_limit_days": 365,
        "high_denial_codes": ["CO-4", "CO-11", "CO-16", "CO-50", "CO-97"],
        "base_risk_multiplier": 0.75,  # Medicare denials are often reversible
    },
    PayerType.MEDICAID: {
        "appeal_window_days": 90,
        "timely_filing_limit_days": 180,
        "high_denial_codes": ["CO-4", "CO-197", "CO-236"],
        "base_risk_multiplier": 0.85,
    },
    PayerType.COMMERCIAL: {
        "appeal_window_days": 180,
        "timely_filing_limit_days": 365,
        "high_denial_codes": ["CO-11", "CO-16", "CO-45"],
        "base_risk_multiplier": 0.60,
    },
    PayerType.SELF_PAY: {
        "appeal_window_days": 0,
        "timely_filing_limit_days": 0,
        "high_denial_codes": [],
        "base_risk_multiplier": 0.95,
    },
}

Why I Designed It This Way:
Medicare's Redetermination process overturns denials at a substantially higher rate than Managed Medicaid or Self-Pay. Commercial payers have the most flexible appeal windows. I captured those differences in the base_risk_multiplier — a lower value means the denial is more likely to be recoverable. The apply_payer_rules() function then adjusts this multiplier based on whether the specific denial code present on the claim is a known high-risk code for that payer.

This configuration could be loaded from a database or YAML file in a real implementation, allowing the risk multipliers to be updated as new data about payer appeal outcomes becomes available — without touching the core routing logic.
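A sketch of that externalization (the file name, format, and trimmed-down keys are my assumptions):

```python
import json
from pathlib import Path

# Hypothetical: the PAYER_CONFIG table serialized to JSON so risk
# multipliers can be tuned without touching the routing code.
DEFAULTS = {
    "MEDICARE": {"appeal_window_days": 120, "base_risk_multiplier": 0.75},
}

def load_payer_config(path: str = "payer_config.json") -> dict:
    """Load payer rules from disk, falling back to in-code defaults."""
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text())
    return DEFAULTS

cfg = load_payer_config()
print(cfg["MEDICARE"]["base_risk_multiplier"])
```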


Step 4: The Agentic Orchestrator

src/claims_router.py is the heart of the system. The ClaimsRouter.route_claim() method runs the full three-stage pipeline:

def route_claim(self, claim: ClaimRecord) -> RoutingDecision:
    """Execute the full routing pipeline for a single claim."""

    # STAGE 1: Deterministic Tool Chain
    aging_result = compute_aging_band(claim)
    financial_result = score_financial_impact(claim)
    payer_result = apply_payer_rules(claim)
    priority_result = calculate_priority_score(aging_result, financial_result, payer_result)
    queue_result = preselect_queue(claim, payer_result.result_value)
    tool_trace = [aging_result, financial_result, payer_result, priority_result, queue_result]

    # STAGE 2: LLM Synthesis (graceful fallback to rule-only if no API key)
    if self._use_llm:
        queue, rationale, actions, escalate = self._synthesize_with_llm(
            claim, tool_trace, priority_result, payer_result
        )
    else:
        queue, rationale, actions, escalate = self._fallback_synthesis(
            claim, queue_result, priority_result, payer_result
        )

    # STAGE 3: Pydantic Assembly and Validation
    decision = RoutingDecision(
        claim_id=claim.claim_id,
        assigned_queue=queue,
        aging_band=_resolve_aging_band(claim.days_in_ar),
        priority_score=round(priority_result.result_value, 2),
        financial_impact_usd=round(get_estimated_recoverable(claim), 2),
        denial_risk_score=round(payer_result.result_value, 3),
        routing_rationale=rationale,
        recommended_actions=actions,
        tool_call_trace=tool_trace,
        escalate_to_supervisor=escalate,
        routed_at=datetime.utcnow(),
    )
    return decision

What This Does:
Three stages, three responsibilities. Stage 1 is all mathematics — deterministic, reproducible, zero LLM calls. Stage 2 is reasoning — the LLM reads the tool outputs and produces a JSON routing decision. Stage 3 is assembly and validation — the decision is built into a typed Pydantic model that enforces all business constraints. The stages never mix their concerns.

Why I Built It This Way:
From my experience with agentic systems, what makes them brittle is coupling computation with reasoning. When a language model both performs arithmetic and synthesizes a recommendation in the same response, two problems emerge: the numbers become inconsistent across runs even at low temperature, and the audit trail becomes a narrative blob that cannot be programmatically parsed. Keeping the stages separate means every number in the RoutingDecision is traceable to a specific, reproducible tool call.

The LLM synthesis prompt enforces this discipline from the model's side:

SYSTEM_PROMPT = """You are a specialized healthcare revenue cycle expert agent.

Your task is to review a medical insurance claim and the results of five deterministic
computational tools (aging analysis, financial scoring, payer rules, priority scoring,
and queue pre-selection),
then produce a structured routing decision in strict JSON format.

RULES:
1. You MUST use the tool outputs as the authoritative source of numeric data.
2. You may adjust the pre-selected queue ONLY if there is strong clinical or procedural reasoning.
3. The routing_rationale must be a clear, professional explanation (2-3 sentences).
4. recommended_actions must be 2-4 specific, actionable items for the specialist.
5. escalate_to_supervisor = true if charged_amount > 100000 OR denial_risk > 0.90.

OUTPUT FORMAT (return ONLY valid JSON - no markdown, no prose):
{
  "assigned_queue": "<QUEUE_NAME>",
  "routing_rationale": "<2-3 sentence explanation>",
  "recommended_actions": ["<action1>", "<action2>", "<action3>"],
  "escalate_to_supervisor": <true|false>
}
"""

What I Learned:
Rule 2 — "you may adjust the pre-selected queue ONLY if there is strong clinical or procedural reasoning" — is what makes the LLM a genuine refinement layer rather than a noisy override layer. Without that constraint, earlier versions of the prompt produced overrides on about 40% of claims — most of which had no clinical justification. With the constraint, the override rate dropped to around 15%, and the overrides that did occur were genuinely meaningful: cases where the ICD-10 code combination implied a specific clinical scenario that changed the appropriate routing.
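Measuring that override rate is trivial once both the pre-selection and the final decision are structured fields (toy data below; in the real pipeline the pairs would come from preselect_queue() and RoutingDecision.assigned_queue):

```python
# Each pair: (queue pre-selected by rules, queue in the final LLM decision).
# Made-up sample for illustration.
pairs = [
    ("DENIAL_APPEAL", "DENIAL_APPEAL"),
    ("COLLECTOR_FOLLOWUP", "COLLECTOR_FOLLOWUP"),
    ("WRITE_OFF", "DENIAL_APPEAL"),   # an LLM override
    ("AUDIT_REVIEW", "AUDIT_REVIEW"),
]

override_rate = sum(pre != final for pre, final in pairs) / len(pairs)
print(f"override rate: {override_rate:.0%}")  # 25% on this toy sample
```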


Step 5: Synthetic Data Simulator

Rather than working with static fixtures, I generated realistic synthetic claims using the Faker library in src/simulator.py.

def generate_claim(claim_index: int = 0) -> ClaimRecord:
    """Generate a single synthetic ClaimRecord with realistic field distributions."""
    payer_type = random.choices(
        list(PayerType),
        weights=[0.35, 0.25, 0.15, 0.15, 0.10],
        k=1,
    )[0]

    days_in_ar = random.choices(
        [
            random.randint(1, 30),
            random.randint(31, 60),
            random.randint(61, 90),
            random.randint(91, 180),
        ],
        weights=[0.40, 0.25, 0.20, 0.15],
        k=1,
    )[0]

    charged_amount = round(random.uniform(500, 120_000), 2)
    # Status, denial codes, CPT codes, ICD-10 codes assigned based on status...

What I Learned:
The weighted distributions matter for realistic output. A real hospital AR report is not uniformly distributed. Most claims are current (0-30 days). Fewer are in the critical aging bucket. Most payers are Commercial or Medicare. Self-pay is a smaller fraction. Getting these distributions right makes the queue output meaningful — and gives the priority score algorithm realistic variance to work with.


Step 6: Analytics and Reporting

src/analytics.py aggregates routing decisions into a summary report with per-queue breakdowns:

from collections import defaultdict

def build_report(decisions: List[RoutingDecision], processing_duration: float = 0.0) -> ClaimsRoutingReport:
    """Aggregate routing decisions into a structured summary report."""
    total = len(decisions)
    queue_counts: Dict[str, int] = defaultdict(int)
    queue_financial: Dict[str, float] = defaultdict(float)
    queue_priority_sum: Dict[str, float] = defaultdict(float)
    total_financial = 0.0
    escalation_count = 0

    for d in decisions:
        q = d.assigned_queue.value
        queue_counts[q] += 1
        queue_financial[q] += d.financial_impact_usd
        queue_priority_sum[q] += d.priority_score
        total_financial += d.financial_impact_usd
        if d.escalate_to_supervisor:
            escalation_count += 1

    queue_distribution = {
        q_name: {
            "count": count,
            "percentage": round(count / total * 100, 1),
            "total_financial_usd": round(queue_financial[q_name], 2),
            "avg_priority_score": round(queue_priority_sum[q_name] / count, 2),
        }
        for q_name, count in sorted(queue_counts.items(), key=lambda x: -x[1])
    }
    return ClaimsRoutingReport(...)

The print_ascii_summary() function renders this data as a formatted terminal table, which also appears in the animated GIF above.


The Philosophy Behind the Tool-LLM Boundary

This is the part of this experimental work that I find myself thinking about the most. When builders say "I built an agent," they often mean: I connected an LLM to a set of functions it can call. The model decides what to call, calls it, reads the result, decides what to call next, and eventually returns an answer.

That design works well when the domain is exploratory — when the sequence of calls is not known in advance. But claims routing is not exploratory. There is a specific, defined sequence: compute aging, compute financial impact, apply payer rules, compute priority, pre-select queue. That sequence is invariant for every claim. The outcome changes based on the inputs, but the sequence does not.

What I observed in early iterations was something instructive: when I gave the LLM control over the sequence, it would occasionally skip steps, combine steps, or call them in a different order. Sometimes the output was still reasonable. But the audit trail became inconsistent — some decisions had three tool calls, some had five, some had two. That inconsistency is unacceptable in a domain where someone will eventually ask "why was this claim routed to WRITE_OFF instead of DENIAL_APPEAL?"

By fixing the sequence in code and letting the LLM only handle the synthesis stage, every single routing decision has exactly the same five tool calls in exactly the same order. The tool trace is always complete. The numbers are always there. From my perspective, this is the more professionally responsible design for any system whose outputs will be acted upon by humans making consequential decisions.

The cost argument is also compelling. With a fixed tool chain, the LLM call for a single claim is predictable: roughly 500-600 tokens total (input + output) at low temperature. Scaling to 10,000 claims per day is a straightforward cost estimation exercise. With a free-form agentic loop, each claim might require 2-5 LLM calls of unpredictable length, making cost forecasting much harder.
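The arithmetic is simple enough to sketch (the per-token price below is a placeholder, not a real Gemini rate):

```python
# Token-budget arithmetic for the fixed-chain design described above.
claims_per_day = 10_000
tokens_per_claim = 600      # upper end of the 500-600 estimate above
price_per_million = 0.50    # PLACEHOLDER $/1M tokens, not an actual price

daily_tokens = claims_per_day * tokens_per_claim
daily_cost = daily_tokens / 1_000_000 * price_per_million
print(f"{daily_tokens:,} tokens/day -> ${daily_cost:.2f}/day")
```

With a free-form agentic loop, neither factor in that multiplication is fixed, which is exactly why the forecast falls apart.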


Edge Cases and Failure Modes

Building this pipeline over several iterations surfaced several edge cases worth documenting.

Self-pay with mixed signals. A self-pay claim that is only 15 days old does not have 0.95 denial risk — it might still have a realistic collection path if the patient has some limited coverage. The apply_payer_rules() function only pushes denial risk to 0.97 when the claim is both self-pay AND over 90 days. Before that threshold, the 0.95 base multiplier applies, which the priority calculator inversely weights appropriately.

LLM JSON parse failures. In _synthesize_with_llm(), I strip markdown fences before attempting JSON parse:

raw_text = re.sub(r"```(?:json)?\s*", "", raw_text).strip("`").strip()
parsed = json.loads(raw_text)

Even at temperature 0.1, language models occasionally wrap their JSON in code fences despite explicit instructions not to. If anything else fails — API timeout, rate limit, malformed response — the except block routes to the rule-based fallback, ensuring no claim gets stuck.
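Condensed, the strip-and-fallback pattern looks roughly like this (parse_llm_json is my name for it, not the project's exact function):

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built up to keep this listing clean

def parse_llm_json(raw_text: str, fallback: dict) -> dict:
    """Strip markdown code fences, then parse; return the fallback on any failure."""
    cleaned = re.sub(FENCE + r"(?:json)?\s*", "", raw_text).strip("`").strip()
    try:
        return json.loads(cleaned)
    except (json.JSONDecodeError, ValueError):
        return fallback

fenced = FENCE + 'json\n{"assigned_queue": "DENIAL_APPEAL"}\n' + FENCE
print(parse_llm_json(fenced, fallback={"assigned_queue": "AUDIT_REVIEW"}))
```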

High-value claims with ambiguous status. A $95,000 claim with status UNDER_REVIEW and no denial code does not clearly belong to the first four branches in preselect_queue(). The charged amount threshold ($75,000) triggers AUDIT_REVIEW in Rule 5. The LLM then has the opportunity to review the ICD-10 and CPT code context and potentially override to a more appropriate queue — which is exactly the kind of contextual reasoning that justifies the LLM layer's existence.


The Role of Pydantic in Agent Output Reliability

Pydantic's role in this design goes beyond type checking. The RoutingDecision model has field constraints (ge=0.0, le=100.0 for priority score; min_length=1 for recommended_actions) that enforce business rules at the schema level rather than the application level.

This means that even if the LLM synthesis stage returns something nonsensical — a priority score of 150, or an empty actions list — the Pydantic validation catches it before the decision reaches downstream systems. That validation happens inside the RoutingDecision(...) constructor call in Stage 3 of route_claim().

During my testing, this never actually triggered. But the validation layer gives confidence that edge cases in the LLM response will be caught rather than silently corrupting a specialist's work queue.
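A stdlib stand-in for what those constraints enforce (the real model declares them with Pydantic's Field(ge=0.0, le=100.0) and min_length=1; the class name here is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class RoutingDecisionGuard:
    """Mimics the schema-level checks the Pydantic model performs at construction."""
    priority_score: float
    recommended_actions: list

    def __post_init__(self):
        if not 0.0 <= self.priority_score <= 100.0:
            raise ValueError(f"priority_score out of range: {self.priority_score}")
        if len(self.recommended_actions) < 1:
            raise ValueError("recommended_actions must not be empty")

try:
    # A nonsensical LLM response is rejected before it reaches any queue.
    RoutingDecisionGuard(priority_score=150.0, recommended_actions=["appeal"])
except ValueError as e:
    print(f"caught: {e}")
```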

The ClaimsRoutingReport model provides a similar benefit at the batch level: aggregations are typed, the queue_distribution dict has a consistent structure, and the processing_duration_seconds field is always present. These guarantees make the analytics module independently testable — synthetic RoutingDecision objects can be constructed to verify the aggregation logic without running the full pipeline.


Observability and the Tool Call Trace

The tool_call_trace field in RoutingDecision deserves its own discussion. Every routing decision contains an ordered list of ToolCallResult objects — one for each tool called during processing. Each result carries:

  • tool_name — which function ran
  • input_summary — a human-readable description of what was passed in
  • result_value — the numeric output
  • result_label — a human-readable interpretation
  • confidence — a 0.0-1.0 weight signaling estimation certainty

For a sample claim, the trace looks like:

1. [compute_aging_band]        input=(days_in_ar=67)         -> "61-90 DAYS" (confidence=1.00)
2. [score_financial_impact]    input=(charged=$34,500)        -> "Recoverable est. $13,800.00" (confidence=0.85)
3. [apply_payer_rules]         input=(payer=COMMERCIAL)       -> "Risk: 0.612" (confidence=0.90)
4. [calculate_priority_score]  input=(aging=70.0, fin=48.5)   -> "Priority: 56.2/100" (confidence=0.85)
5. [preselect_queue]           input=(status=DENIED, CO-16)   -> "Queue: DENIAL_APPEAL" (confidence=0.80)

A billing manager looking at a routing decision does not need to trust the system blindly. They can see the exact evidence trail. If a claim was routed to WRITE_OFF but the specialist thinks it should be appealed, they can look at the trace and see: "the system computed denial risk at 0.93 because it is a self-pay claim 111 days overdue." The specialist can then agree or override with full context, and log that override for future training.

That level of transparency is, in my opinion, what separates a useful agent tool from a black box.


Let's Setup

Setting up ClaimsRouter-AI takes about three minutes from clone to first run.

Step-by-step details can be found at: github.com/aniket-work/claimsrouter-ai

The quick version:

git clone https://github.com/aniket-work/claimsrouter-ai.git
cd claimsrouter-ai
python3 -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.template .env
# Optionally add GEMINI_API_KEY for LLM synthesis
# Without it, the system runs in full rule-based fallback mode

The GEMINI_API_KEY is optional by design. Without it, ClaimsRouter falls back to _fallback_synthesis() — a structured rule-based function that still produces fully valid RoutingDecision outputs. From my experience, this kind of graceful degradation is what separates a useful prototype from a demo that only works under ideal conditions.
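The gating itself is a one-liner (the function and return values are illustrative, not the project's exact names):

```python
def synthesis_mode(env: dict) -> str:
    """Pick the synthesis stage based on whether an API key is configured."""
    return "llm" if env.get("GEMINI_API_KEY") else "rule_fallback"

# In the real code, env would be os.environ (loaded via python-dotenv).
print(synthesis_mode({}))                               # rule_fallback
print(synthesis_mode({"GEMINI_API_KEY": "my-key"}))     # llm
```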


Let's Run

python main.py              # Process 50 claims (default)
python main.py --count 100  # Process 100 claims

The terminal output shows a live progress bar with queue assignment and priority score for each claim as it processes, followed by the full routing report:

========================================================================
  ClaimsRouter-AI  |  Intelligent Medical Claims Routing Agent
========================================================================

  Processing 50 claims through routing pipeline:

  [####################] ( 50/50) CLM-960501B5 -> DENIAL_APPEAL  Priority= 71.3

  All 50 claims processed in 0.51 seconds.

========================================================================
 ClaimsRouter-AI  |  Routing Report Summary
========================================================================

  Claims Processed   : 50
  Total AR Value     : $   487,293.47
  Avg Priority Score :      43.79 / 100
  Escalations        : 12

  Queue Distribution:

 Queue                     Claims   Share     Financial Value ($)   Avg Priority
 COLLECTOR_FOLLOWUP            19   38.0%     $  185,221.43              40.7
 DENIAL_APPEAL                 11   22.0%     $  131,809.30              61.2
 UNDERPAYMENT_RECOVERY          9   18.0%     $   87,419.88              43.1
 AUDIT_REVIEW                   6   12.0%     $   58,463.21              56.7
 COB_RESOLUTION                 3    6.0%     $   16,732.90              38.5
 WRITE_OFF                      2    4.0%     $    7,646.75              15.3

From my observation, the distribution makes intuitive sense. COLLECTOR_FOLLOWUP is the largest queue because most claims are pending payment with nothing inherently wrong — they just need a follow-up call. DENIAL_APPEAL gets the highest average priority because denied claims have the clearest recovery path and carry significant dollar values. WRITE_OFF gets the lowest priority and lowest financial value because the recovery probability is near zero — those claims are there for documentation purposes more than active recovery.


Closing Thoughts

In my view, the real learning from this experiment is not about healthcare revenue cycle management specifically. It is about what makes an agentic workflow trustworthy versus what makes it a liability.

The pattern that worked well here — deterministic tools compute, LLM synthesizes, Pydantic validates — is broadly applicable. Think of it as a separation of intelligence from arithmetic. Arithmetic belongs in code where it is reproducible, testable, and auditable. Intelligence belongs with the language model where the ambiguous, contextual, and narrative work happens. And validation belongs with your schema layer where business constraints are enforced programmatically.
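The whole pattern fits in a few lines. This is a deliberately minimal sketch — the class and function names are illustrative, not the repo's actual API, and the synthesis step is a placeholder for the Gemini call:

```python
from pydantic import BaseModel, Field

class RoutingDecision(BaseModel):
    """Validation layer: business constraints enforced programmatically
    (illustrative subset of the real schema)."""
    claim_id: str
    queue: str
    priority_score: float = Field(ge=0.0, le=100.0)
    rationale: str

def aging_band(days: int) -> str:
    """Deterministic tool: a pure lookup, no LLM involved."""
    if days >= 90:
        return "90+ DAYS"
    return "61-90 DAYS" if days > 60 else "0-60 DAYS"

def synthesize(claim: dict) -> dict:
    """Stand-in for the LLM synthesis step: turns tool outputs into
    a candidate decision payload."""
    return {
        "claim_id": claim["id"],
        "queue": "COLLECTOR_FOLLOWUP",
        "priority_score": 40.7,
        "rationale": f"Claim in {aging_band(claim['days_in_ar'])} band; routine follow-up.",
    }

# compute -> synthesize -> validate
decision = RoutingDecision(**synthesize({"id": "CLM-0001", "days_in_ar": 95}))
print(decision.queue, decision.priority_score)
```

If the synthesis step ever emits a priority of 140 or drops the rationale, construction fails loudly at the validation boundary instead of corrupting a work queue downstream.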

One thing I kept returning to while building this: the tool_call_trace field in RoutingDecision. Every decision the system makes is fully auditable. You can look at any routing decision and see exactly which aging score, which financial estimate, which payer risk factor was fed into the LLM's reasoning. In healthcare, where writing off a $50,000 claim has regulatory implications, that audit trail is not a nice-to-have. It is the fundamental justification for deploying the system at all.

What I would explore next, from an experimental standpoint, is a feedback loop. When specialists override routing decisions and those overrides prove correct, that correction could feed back into the payer configuration weights. Over time, the tool chain would become more accurate for specific payer-denial code combinations. That would transform this from a static rules engine into a continuously improving one — which is a much more interesting system to think about.
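The feedback loop itself could start very simply. The project has no such loop today, so this is purely a hypothetical sketch of nudging a per-(payer, denial code) weight toward specialist feedback:

```python
def update_payer_weight(weights: dict, key: tuple, agent_was_correct: bool,
                        lr: float = 0.1) -> dict:
    """Hypothetical feedback step: move the routing confidence for a
    (payer, denial_code) pair toward 1.0 when specialists confirm the
    agent's decision, toward 0.0 when they override it."""
    current = weights.get(key, 0.5)          # start from a neutral prior
    target = 1.0 if agent_was_correct else 0.0
    weights[key] = current + lr * (target - current)
    return weights

weights: dict = {}
# A specialist overrides the agent's routing for this payer/code pair:
update_payer_weight(weights, ("ACME_HEALTH", "CO-16"), agent_was_correct=False)
print(weights)
```

Even this crude exponential update would let the deterministic tool chain drift toward observed reality per payer-denial combination, without touching the LLM at all.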

The full code, diagrams, and setup instructions are at: github.com/aniket-work/claimsrouter-ai


Disclaimer

The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented.
