Aniket Hingane
Building a Risk-Aware AI Agent with Internal Critic, Self-Consistency Reasoning, and Uncertainty Estimation


How I Automated Insurance Claims Risk Analysis Using Multi-Agent Patterns

TL;DR

Applying generative AI to high-stakes sectors like insurance or finance requires more than a simple API call. I built an autonomous risk evaluation agent that processes ambiguous auto insurance claims using three distinct resilience patterns: Self-Consistency Reasoning (generating multiple independent thought paths), an Internal Critic (a separate, adversarial review step), and Uncertainty Estimation (mathematically penalizing confidence based on disagreement and critique severity). Chained together, these patterns reduce false-positive approvals of complex claims: the system automatically flags out-of-bounds uncertainty for human review instead of guessing.

Code for this experiment is available on my GitHub.

Introduction

I’ve observed that many early implementations of AI agents fail spectacularly when exposed to the real world. Why? Because they operate with blind confidence. An LLM instructed to "approve or deny this insurance claim" will usually do exactly that, even if the data provided is highly ambiguous, contradictory, or statistically anomalous.

In my opinion, if we want agents to handle sensitive business tasks—like adjudicating a $12,500 auto collision claim—we need to design them with self-doubt. They must be capable of realizing when a decision is too close to call and handing it off to a human expert.

I wrote this article to demonstrate a PoC (Proof of Concept) I built that tackles this exact problem. I incorporated a multi-step orchestration pipeline leveraging self-consistency, adversarial criticism, and uncertainty quantification.

What's This Article About?

This article breaks down a Python-based autonomous agent I designed to evaluate insurance claims. We will explore how to move beyond basic zero-shot prompting by implementing:

  1. Self-Consistency Reasoning: Generating multiple parallel assessments focusing on different aspects of the same claim.
  2. Consensus Synthesis: Analyzing the divergence across those parallel tracks to reach a majority decision.
  3. The Internal Critic: An adversarial review step that explicitly hunts for flaws or blind spots in the consensus.
  4. Uncertainty Estimation: A final calculation that dynamically lowers confidence limits and overrides automatic approvals if risk thresholds are breached.

Tech Stack

From my experience, keeping the orchestration layer lightweight is crucial for experimentation.

  • Python 3.12+: The core execution environment.
  • Pydantic: For strict data structuring (although omitted in this simplified code version for clarity).
  • Mermaid.js: For generating the architectural flow diagrams seen in this post.
  • Raw Output / Simulated LLM: For the purposes of this PoC and code clarity, the reasoning paths are simulated using dynamic Python calculations, though the architecture mirrors real API calls to models like GPT-4 or Claude.

Why Read It?

If you are an engineer tasked with integrating AI into business-critical paths, you simply cannot afford hallucinations or unfounded confidence. In my view, the value of an AI agent is not just in its ability to automate, but in its ability to know when not to automate. This article provides a practical blueprint for structuring robust agentic workflows that inherently prioritize safety, compliance, and risk management.

Let's Design

Before diving into the code, I thought it was essential to visualize the flow. Here is the System Architecture diagram I generated:

Architecture

The logic operates in a distinct pipeline. Data arrives and immediately diverges into multiple separate reasoning agents. These agents do not communicate with each other; they independently evaluate the claim based on randomized focus angles (e.g., medical history vs. reporting timelines).

Once they return their results, a Synthesizer tallies the "votes." This consensus is then passed to a completely separate layer: the Critic. The Critic's only job is to break the consensus. It adversarially checks the claim data against the decision.

Finally, the Uncertainty Estimator weighs the critic's findings against the initial consensus agreement percentage to produce a final confidence score. If this score is below a certain threshold, the system halts and escalates the claim.

Process Flow
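In a real deployment, the fan-out of independent reasoning paths can run concurrently. Here is a minimal sketch using Python's standard library; `assess` and `fan_out` are my own stand-ins (an actual system would make an LLM call per path):

```python
from concurrent.futures import ThreadPoolExecutor

def assess(claim: dict, focus: str) -> str:
    # Stand-in for a real LLM call; in production this would hit an API
    # with a focus-specific system prompt.
    return f"Assessment of {claim['id']} focusing on {focus}"

def fan_out(claim: dict, focuses: list[str]) -> list[str]:
    # Each path runs independently -- no shared state between them.
    with ThreadPoolExecutor(max_workers=len(focuses)) as pool:
        return list(pool.map(lambda f: assess(claim, f), focuses))

paths = fan_out({"id": "CLM-2026-X99"}, ["medical_history", "time_of_reporting"])
print(len(paths))  # → 2
```

Because the paths never see each other's output, a slow or failed path can simply be dropped from the vote rather than poisoning the consensus.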

Let’s Get Cooking

Let's look at the core implementation details. In my Python project, I separated the workflow into dedicated classes representing each agent role.

1. Generating Reasoning Paths

First, we need to generate variance. In a real LLM scenario, you might achieve this by increasing the temperature parameter or passing slightly different system prompts (e.g., "Analyze this claim as a medical expert" vs. "Analyze this as a fraud investigator").
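For reference, here is a hedged sketch of how that variance could be produced against a real chat-style API. The persona wording and the `build_requests` helper are my own illustration; no network call is made, and the payloads simply follow the common chat-messages shape:

```python
# Hypothetical persona prompts -- the exact wording is mine, not from the repo.
PERSONAS = [
    "Analyze this claim as a medical expert.",
    "Analyze this claim as a fraud investigator.",
    "Analyze this claim as a compliance auditor.",
]

def build_requests(claim_text: str, num_paths: int = 3, temperature: float = 0.9):
    """Build one request payload per reasoning path.

    Variance comes from two knobs: a distinct system persona per path and a
    high sampling temperature. The payload shape matches chat-style APIs,
    but this sketch stops short of actually sending anything.
    """
    requests = []
    for i in range(num_paths):
        requests.append({
            "temperature": temperature,
            "messages": [
                {"role": "system", "content": PERSONAS[i % len(PERSONAS)]},
                {"role": "user", "content": claim_text},
            ],
        })
    return requests

reqs = build_requests("Auto collision, $12,500, reported late at night.")
print(len(reqs))  # → 3
```
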

import random
from typing import List, Dict, Any

class AgentThoughts:
    """Simulates generating multiple reasoning paths (Self-Consistency)"""
    @staticmethod
    def generate_initial_assessments(claim_data: Dict[str, Any], num_paths: int = 3) -> List[str]:
        assessments = []
        base_factors = ["medical_history", "claim_amount", "time_of_reporting", "provider_reputation"]

        for i in range(num_paths):
            focus = random.choice(base_factors)
            risk_score = random.uniform(0.2, 0.8)

            # Simple bias injection for our simulation
            if claim_data.get("time_of_incident") == "late_night" and focus == "time_of_reporting":
                risk_score += 0.2

            assessment = f"Path {i+1}: Focusing on {focus}. Calculated Risk Score: {risk_score:.2f}. "
            if risk_score > 0.6:
                assessment += "Recommendation: FLAG FOR MANUAL REVIEW."
            else:
                assessment += "Recommendation: AUTO-APPROVE."
            assessments.append(assessment)

        return assessments

What I learned here: By forcing the system to explicitly declare its "focus," we make the variance traceable. If the model later flags a claim, we can point directly to the reasoning path that triggered the flag.
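That traceability can also be exploited mechanically. A small helper (my own addition, keyed to the assessment string format above) extracts exactly which paths flagged and what they were focusing on:

```python
import re

def flagged_paths(assessments: list[str]) -> list[tuple[int, str]]:
    """Return (path number, focus) for every path that raised a flag."""
    hits = []
    for a in assessments:
        if "FLAG" in a:
            m = re.search(r"Path (\d+): Focusing on (\w+)", a)
            if m:
                hits.append((int(m.group(1)), m.group(2)))
    return hits

sample = [
    "Path 1: Focusing on medical_history. Calculated Risk Score: 0.61. Recommendation: FLAG FOR MANUAL REVIEW.",
    "Path 2: Focusing on claim_amount. Calculated Risk Score: 0.35. Recommendation: AUTO-APPROVE.",
]
print(flagged_paths(sample))  # → [(1, 'medical_history')]
```
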

2. Synthesizing Consensus

Next, I put together the Synthesizer. It acts as the "manager" of the parallel tracks.

class Synthesizer:
    """Synthesizes the generated reasoning paths into a consensus"""
    @staticmethod
    def synthesize_consensus(assessments: List[str]) -> Dict[str, Any]:
        flag_count = sum(1 for a in assessments if "FLAG" in a)
        approve_count = sum(1 for a in assessments if "AUTO-APPROVE" in a)

        consensus = "FLAG FOR MANUAL REVIEW" if flag_count > approve_count else "AUTO-APPROVE"
        confidence = max(flag_count, approve_count) / len(assessments)

        return {
            "decision": consensus,
            "consistency_confidence": confidence,
            "summary": f"Consensus reached based on {max(flag_count, approve_count)}/{len(assessments)} paths agreeing."
        }

My reasoning: The consistency_confidence variable is critical. If 5 out of 5 paths say "Approve", the confidence is 1.0 (100%). If 3 say Approve and 2 say Flag, the decision is still Approve, but the confidence drops to 0.6 (60%). This numeric divergence is the foundation of our uncertainty metric.
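The vote math is easy to verify in isolation. This standalone helper (my own name, same formula as the Synthesizer) reproduces the two scenarios above, assuming strictly binary APPROVE/FLAG votes:

```python
def consistency_confidence(votes: list[str]) -> float:
    """Majority share of the vote -- the same math the Synthesizer uses."""
    flags = sum(1 for v in votes if v == "FLAG")
    approvals = len(votes) - flags
    return max(flags, approvals) / len(votes)

print(consistency_confidence(["APPROVE"] * 5))                 # → 1.0
print(consistency_confidence(["APPROVE"] * 3 + ["FLAG"] * 2))  # → 0.6
```
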

3. The Adversarial Critic

This is where it gets interesting. I observed that consensus mechanisms can sometimes suffer from groupthink, especially if the underlying LLM has a fundamental bias toward approval. Therefore, I built an Internal Critic. This agent operates independently and reviews the previous output.

class InternalCritic:
    """Actively reviews the consensus to find flaws or missed risk factors"""
    @staticmethod
    def review(claim_data: Dict[str, Any], consensus_dict: Dict[str, Any], assessments: List[str]) -> Dict[str, Any]:
        criticisms = []
        severity = 0.0

        if consensus_dict["decision"] == "AUTO-APPROVE" and claim_data.get("claim_amount", 0) > 10000:
            criticisms.append("Critic Note: Auto-approving high-value claim ($10k+) without secondary documentation check.")
            severity += 0.6

        if consensus_dict["consistency_confidence"] < 1.0:
            criticisms.append("Critic Note: Disagreement in initial reasoning paths indicates underlying ambiguity.")
            severity += 0.3

        return {
            "criticisms": criticisms,
            "critic_severity": min(severity, 1.0)
        }

Why this works: The critic doesn't just evaluate the claim; it evaluates the decision. It acts as a final sanity check against hardcoded business rules (like high-value claim limits) and logical consistency.
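If the list of critic checks grows, a rule table keeps them maintainable. This is my own refactor of the same two checks, not the repo's code; severity is rounded here for display:

```python
from typing import Any, Callable, Dict, List, Tuple

# Each rule: (predicate over claim + consensus, critique text, severity weight).
Rule = Tuple[Callable[[Dict[str, Any], Dict[str, Any]], bool], str, float]

RULES: List[Rule] = [
    (lambda claim, cons: cons["decision"] == "AUTO-APPROVE"
         and claim.get("claim_amount", 0) > 10000,
     "Auto-approving high-value claim ($10k+) without secondary documentation check.",
     0.6),
    (lambda claim, cons: cons["consistency_confidence"] < 1.0,
     "Disagreement in initial reasoning paths indicates underlying ambiguity.",
     0.3),
]

def review(claim: Dict[str, Any], consensus: Dict[str, Any]) -> Dict[str, Any]:
    hits = [(note, sev) for pred, note, sev in RULES if pred(claim, consensus)]
    return {
        "criticisms": [note for note, _ in hits],
        # Capped at 1.0 like the original; rounded for readability.
        "critic_severity": round(min(sum(sev for _, sev in hits), 1.0), 2),
    }

out = review({"claim_amount": 12500},
             {"decision": "AUTO-APPROVE", "consistency_confidence": 0.6})
print(out["critic_severity"])  # → 0.9
```

New business rules then become one-line additions to `RULES` rather than another `if` block inside the class.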

4. Estimating Uncertainty

Finally, the Uncertainty Estimator takes the base confidence and applies the critic's severity penalty.

class UncertaintyEstimator:
    """Quantifies overall uncertainty and adjusts decision"""
    @staticmethod
    def estimate_uncertainty(consensus_dict: Dict[str, Any], critic_review: Dict[str, Any]) -> Dict[str, Any]:
        base_confidence = consensus_dict["consistency_confidence"]
        critic_penalty = critic_review["critic_severity"] * 0.5

        final_confidence = max(0.0, base_confidence - critic_penalty)
        final_decision = consensus_dict["decision"]

        # Override logic
        if final_confidence < 0.6 and final_decision == "AUTO-APPROVE":
            final_decision = "FLAG FOR MANUAL REVIEW (Overridden by low confidence)"

        return {
            "final_decision": final_decision,
            "final_confidence_score": final_confidence,
            "requires_human": final_decision.startswith("FLAG") or final_confidence < 0.7
        }
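To see how the override and escalation thresholds interact, here is a standalone copy of the estimator's math (`finalize` is my own helper name; confidence is rounded for readability):

```python
def finalize(base_confidence: float, critic_severity: float, decision: str):
    """Standalone copy of the estimator's formula, rounded for display."""
    confidence = round(max(0.0, base_confidence - critic_severity * 0.5), 4)
    # Override: a weak approval is never allowed through automatically.
    if confidence < 0.6 and decision == "AUTO-APPROVE":
        decision = "FLAG FOR MANUAL REVIEW (Overridden by low confidence)"
    requires_human = decision.startswith("FLAG") or confidence < 0.7
    return decision, confidence, requires_human

# Even a unanimous consensus (confidence 1.0) gets overridden when the
# critic is harsh enough:
print(finalize(1.0, 0.9, "AUTO-APPROVE"))
# → ('FLAG FOR MANUAL REVIEW (Overridden by low confidence)', 0.55, True)
```

Note the two separate thresholds: 0.6 flips the decision itself, while 0.7 merely requires a human in the loop even when the decision stands.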

Sequence

Let's Setup

If you want to run this experimental PoC yourself, I've pushed the code to my GitHub.

  1. Clone the repository: git clone https://github.com/aniket-work/insurance-risk-assessor.git
  2. Navigate to the directory: cd insurance-risk-assessor
  3. Set up the Python environment: python3 -m venv venv && source venv/bin/activate

Let's Run

When you execute python main.py, the orchestrator tests the pipeline against an ambiguous claim ($12,500 Auto Collision occurring late at night).

The resulting output demonstrates the uncertainty calculation perfectly:

[1] Processing Claim: CLM-2026-X99 - Amount: $12500
[2] Generating Independent Reasoning Paths (Self-Consistency)...
    - Path 1: Focusing on medical_history. Calculated Risk Score: 0.61. Recommendation: FLAG FOR MANUAL REVIEW.
    ...
[3] Synthesizing Consensus...
    Consensus Decision: FLAG FOR MANUAL REVIEW
    Consistency Agreement: 60%

[4] Engaging Internal Critic Review...
    ! CRITIQUE: Critic Note: Auto-approving high-value claim ($10k+) without secondary documentation check.
    ! CRITIQUE: Critic Note: Disagreement in initial reasoning paths indicates underlying ambiguity.

[5] Estimating Overall Uncertainty...

==================================================
FINAL RISK ASSESSMENT REPORT
==================================================
{
  "Claim ID": "CLM-2026-X99",
  "Initial Consensus": "FLAG FOR MANUAL REVIEW",
  "Final Decision": "FLAG FOR MANUAL REVIEW",
  "Confidence Score": "15.0%",
  "Requires Human Adjudicator": true
}

By the end of the run, the critic's penalization of the conflicting initial paths and the high dollar value dragged the confidence score down to a mere 15.0%: a 0.60 base agreement minus a 0.45 penalty (0.9 severity × 0.5) leaves 0.15. The system correctly recognized its own inability to adjudicate the claim and mandated human intervention.

Closing Thoughts

I observed that structuring AI agents with enforced skepticism fundamentally changes how reliable they become. When building tools for enterprise logic—whether it's insurance claims, financial lending, or HR screening—we cannot rely on a single, confident API response.

By demanding self-consistency, actively soliciting adversarial critiques, and mathematically computing uncertainty, we can build agents that operate safely within their bounds. In my opinion, the future of autonomous workflows isn't about making agents capable of deciding everything; it's about making them smart enough to know when to ask for help.


Disclaimer

The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented.
