Adamo Software

Posted on Mar 23

How we designed an AI Agent workflow with fallback chains and human-in-the-loop

#ai #programming #python #devops

If you've shipped an AI agent to production, you already know the uncomfortable truth: the demo works great, but real users find every edge case your prompt didn't anticipate. We ran into this exact problem when building an internal document processing agent for a healthcare client. The agent worked fine 85% of the time. The other 15% ranged from "slightly wrong" to "confidently hallucinated a patient ID that doesn't exist."

This post walks through the fallback architecture we built to handle those failures gracefully, without turning every request into a human review bottleneck.

The problem with linear agent workflows

Our first version was straightforward: a user uploads a document, the LLM extracts structured fields, and the pipeline validates them against a schema and writes them to the database. A single chain, no branching logic.

The failure math killed us. If each step in a 5-step workflow has 90% reliability, your end-to-end success rate drops to about 59%. Add more steps, and it gets worse fast.
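
The 59% figure is just per-step reliability multiplied across the steps. Here is a quick sketch of that arithmetic, assuming step failures are independent (real pipelines often have correlated failures, so treat it as a rough estimate). The function name is ours, for illustration only.

```python
def end_to_end_success(step_reliability: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent failures."""
    return step_reliability ** steps


for steps in (1, 3, 5, 10):
    rate = end_to_end_success(0.90, steps)
    print(f"{steps:>2} steps at 90% each -> {rate:.1%} end to end")
```

At 5 steps this gives roughly 59%, and at 10 steps it falls to about 35%.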

We needed a system where failures at any step could be caught, rerouted, and resolved without restarting the entire pipeline.

The fallback chain pattern

Instead of a single LLM call per step, we implemented a tiered fallback chain. The concept is simple: try the primary approach, and if confidence drops below a threshold, cascade to the next option.

Here's the core logic in Python:

from dataclasses import dataclass
from typing import Any, Optional
import logging

logger = logging.getLogger(__name__)

@dataclass
class AgentResult:
    output: Any
    confidence: float
    model_used: str
    fallback_triggered: bool = False

class FallbackChain:
    def __init__(self, confidence_threshold: float = 0.7):
        self.threshold = confidence_threshold
        self.chain = []

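    def add_handler(self, name: str, handler, min_confidence: float = 0.0):
        # Note: min_confidence is stored per step, but execute() below
        # compares every result against the chain-wide threshold only.
        self.chain.append({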
            "name": name,
            "handler": handler,
            "min_confidence": min_confidence
        })
        return self

    async def execute(self, input_data: dict) -> AgentResult:
        for i, step in enumerate(self.chain):
            try:
                result = await step["handler"](input_data)

                if result.confidence >= self.threshold:
                    result.fallback_triggered = (i > 0)
                    logger.info(
                        f"Step '{step['name']}' succeeded "
                        f"(confidence: {result.confidence:.2f})"
                    )
                    return result

                logger.warning(
                    f"Step '{step['name']}' below threshold "
                    f"({result.confidence:.2f} < {self.threshold})"
                )

            except Exception as e:
                logger.error(f"Step '{step['name']}' failed: {e}")
                continue

        # All automated steps failed, escalate to human
        return AgentResult(
            output=None,
            confidence=0.0,
            model_used="human_escalation",
            fallback_triggered=True
        )

We typically set up three tiers:

Primary model (e.g., GPT-4o or Claude) with a specialized prompt. Fast, cost-effective for straightforward cases.
Enhanced model with additional context injection. We pull in RAG-retrieved examples of similar documents and few-shot them into the prompt.
Human escalation. The request lands in a review queue with full context: the original input, what each model attempted, and where confidence dropped.

# Setting up the chain for document extraction
extraction_chain = FallbackChain(confidence_threshold=0.75)

extraction_chain.add_handler(
    name="primary_extraction",
    handler=primary_llm_extract
)
extraction_chain.add_handler(
    name="enhanced_extraction_with_rag",
    handler=rag_enhanced_extract
)
extraction_chain.add_handler(
    name="human_review",
    handler=queue_for_human_review
)
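
The handlers themselves are not shown above, so here is a minimal sketch of the contract they have to satisfy: an async callable that takes the input dict and returns an AgentResult. The stub bodies, the confidence values, and the first_confident helper (a condensed stand-in for FallbackChain.execute) are made up for illustration, and no LLM is called.

```python
import asyncio
from dataclasses import dataclass
from typing import Any


@dataclass
class AgentResult:  # same shape as the dataclass defined earlier
    output: Any
    confidence: float
    model_used: str
    fallback_triggered: bool = False


# Hypothetical stand-ins for the real handlers.
async def primary_llm_extract(input_data: dict) -> AgentResult:
    # Pretend the primary model returned a shaky answer.
    return AgentResult({"patient_id": "P-001"}, confidence=0.55, model_used="primary")


async def rag_enhanced_extract(input_data: dict) -> AgentResult:
    # Pretend the extra context lifted confidence above the threshold.
    return AgentResult({"patient_id": "P-001"}, confidence=0.82, model_used="primary+rag")


async def first_confident(handlers, input_data: dict, threshold: float) -> AgentResult:
    """Condensed version of the fallback loop, for illustration only."""
    for i, handler in enumerate(handlers):
        try:
            result = await handler(input_data)
        except Exception:
            continue  # a crashing handler counts as a miss
        if result.confidence >= threshold:
            result.fallback_triggered = i > 0
            return result
    # Nothing cleared the threshold: signal human escalation
    return AgentResult(None, 0.0, "human_escalation", fallback_triggered=True)


result = asyncio.run(
    first_confident([primary_llm_extract, rag_enhanced_extract], {"text": "..."}, 0.75)
)
print(result.model_used, result.fallback_triggered)
```

Any handler registered with add_handler, including queue_for_human_review, has to follow the same contract, because execute reads result.confidence on whatever the handler returns.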

Confidence scoring: the hard part

The fallback chain is useless without reliable confidence signals. LLM token probabilities alone are not enough. A model can be confidently wrong. Anthropic published a practical guide on evaluating agent outputs that covers calibration in more depth.

We use a composite confidence score built from three signals:

def compute_confidence(
    llm_output: dict,
    schema: dict,
    historical_outputs: list[dict]
) -> float:
    # 1. Schema compliance: does the output match expected types/formats?
    schema_score = validate_against_schema(llm_output, schema)

    # 2. Self-consistency: we run the same input 3 times and pass the
    #    other runs in as historical_outputs, then measure how closely
    #    this output agrees with them
    consistency_score = measure_output_consistency(
        llm_output, historical_outputs
    )

    # 3. Field-level heuristics: known patterns for dates, IDs, codes
    heuristic_score = run_field_heuristics(llm_output)

    # Weighted combination
    return (
        0.3 * schema_score 
        + 0.4 * consistency_score 
        + 0.3 * heuristic_score
    )

Schema compliance catches obvious failures like missing required fields or wrong data types. Self-consistency catches the subtler ones. If you run the same extraction three times and get three different patient names, something is off.
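
One way a self-consistency score could be computed is field-level agreement across the repeated runs. This is a sketch under our own assumptions, not the measure_output_consistency used above: the function name, the scoring rule (share of runs matching the most common value, averaged over fields), and the sample data are all hypothetical.

```python
from collections import Counter


def field_agreement(runs: list[dict]) -> float:
    """Average, over fields, of the share of runs agreeing with the most common value."""
    if not runs:
        return 0.0
    fields = set().union(*(run.keys() for run in runs))
    if not fields:
        return 0.0
    total = 0.0
    for field in fields:
        # repr() makes unhashable values (lists, dicts) countable;
        # a field missing from a run is counted as None
        values = [repr(run.get(field)) for run in runs]
        most_common_count = Counter(values).most_common(1)[0][1]
        total += most_common_count / len(runs)
    return total / len(fields)


# Made-up outputs from three runs over the same document
runs = [
    {"patient_name": "Ana Ruiz", "dob": "1984-02-11"},
    {"patient_name": "Ana Ruiz", "dob": "1984-02-11"},
    {"patient_name": "Anna Cruz", "dob": "1984-02-11"},
]
print(round(field_agreement(runs), 3))
```

Here the date of birth agrees in all three runs and the name in two of three, so the score lands between the two. Exact string matching is strict; a production version would probably normalize whitespace and casing first.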

The heuristic layer handles domain-specific validation. In healthcare, that means checking date formats, verifying that ICD codes match known patterns, and flagging values that fall outside clinical ranges.
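
As an illustration of what such a heuristic layer might look like, here is a small sketch. The field names (visit_date, icd_code, heart_rate_bpm), the numeric bounds, and the scoring rule are assumptions for the example, and the ICD pattern is a simplified ICD-10-CM shape check that does not confirm a code actually exists.

```python
import re
from datetime import datetime

# Simplified ICD-10-CM shape: letter, digit, alphanumeric,
# then an optional "." followed by 1-4 alphanumerics. Format only.
ICD10_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")


def is_iso_date(value) -> bool:
    """True if the value parses as a real calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (ValueError, TypeError):
        return False


def in_range(value, low: float, high: float) -> bool:
    return isinstance(value, (int, float)) and low <= value <= high


def run_field_heuristics(output: dict) -> float:
    """Fraction of applicable checks that pass (1.0 if no check applies)."""
    checks = []
    if "visit_date" in output:
        checks.append(is_iso_date(output["visit_date"]))
    if "icd_code" in output:
        checks.append(bool(ICD10_SHAPE.match(str(output["icd_code"]))))
    if "heart_rate_bpm" in output:
        # Illustrative plausibility bounds, not clinical guidance
        checks.append(in_range(output["heart_rate_bpm"], 20, 300))
    return sum(checks) / len(checks) if checks else 1.0


# Month 13 fails the date check; the other two checks pass
score = run_field_heuristics(
    {"visit_date": "2024-13-01", "icd_code": "E11.9", "heart_rate_bpm": 72}
)
print(score)
```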

Where human-in-the-loop actually fits

The biggest mistake we made early on was treating HITL as a binary switch: either the agent handles it or a human does. In practice, you need multiple levels of human involvement.

We settled on three escalation tiers:

Tier 1: Async review. The agent completed the task but confidence was borderline (0.6 to 0.75). A human reviewer sees the output alongside the original document and either approves or corrects it. This handles about 10% of requests and adds 2 to 4 hours of latency, which was acceptable for our use case.

Tier 2: Real-time intervention. Confidence dropped below 0.6, or the agent hit a known ambiguity pattern (e.g., handwritten notes, poor scan quality). The workflow pauses, and the request routes to an available specialist through a Slack notification. We used LangGraph's interrupt() pattern for this:

from langgraph.types import interrupt

async def extraction_node(state: dict) -> dict:
    result = await extraction_chain.execute(state["document"])

    if result.model_used == "human_escalation":
        # Pause the workflow and wait for human input
        human_response = interrupt({
            "reason": "Low confidence extraction",
            "document_id": state["document_id"],
            "attempted_output": result.output,
            "confidence": result.confidence
        })
        return {"extracted_data": human_response}

    return {"extracted_data": result.output}

For a step-by-step walkthrough of interrupts and commands in LangGraph, this tutorial covers the basics well.

Tier 3: Full manual processing. The document type is entirely outside the agent's training distribution. This happens maybe 2% of the time. The system logs the case as a training candidate for future model improvement.
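
The three tiers can be read as a small routing function. The sketch below uses the thresholds from the text (0.6 and 0.75); the enum, the function name, and the two boolean flags are hypothetical, since the post does not describe how ambiguity patterns or unsupported document types are detected.

```python
from enum import Enum


class ReviewTier(Enum):
    AUTO_ACCEPT = "auto_accept"
    ASYNC_REVIEW = "tier_1_async_review"
    REALTIME = "tier_2_realtime_intervention"
    MANUAL = "tier_3_manual_processing"


def route(
    confidence: float,
    known_ambiguity: bool = False,
    supported_doc_type: bool = True,
    accept_at: float = 0.75,
    realtime_below: float = 0.6,
) -> ReviewTier:
    """Map a confidence score and two context flags to an escalation tier."""
    if not supported_doc_type:
        return ReviewTier.MANUAL
    if known_ambiguity or confidence < realtime_below:
        return ReviewTier.REALTIME
    if confidence < accept_at:
        return ReviewTier.ASYNC_REVIEW
    return ReviewTier.AUTO_ACCEPT


for score in (0.9, 0.7, 0.5):
    print(score, route(score).value)
```

Checking the document type first means an out-of-distribution document never reaches the confidence comparison, where a model might report a misleadingly high score.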

Circuit breakers for cascading failures

One thing we learned the hard way: when an upstream model starts degrading (rate limits, API instability, model drift), it can poison every downstream step. A hallucinated field in step 1 becomes a corrupted database entry by step 4.

We added circuit breakers that monitor rolling error rates per step. The simplified version below trips after a fixed number of failures:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: int = 60):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        # closed = normal, open = blocking, half-open = probing after timeout
        self.state = "closed"
        self.last_failure_time = None

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.threshold:
            self.state = "open"
            logger.critical("Circuit breaker OPEN. Routing to fallback.")

    def record_success(self):
        # A successful call (including a half-open probe) closes the circuit;
        # without this reset the breaker would never return to "closed"
        self.failure_count = 0
        self.state = "closed"

    def can_execute(self) -> bool:
        if self.state == "closed":
            return True
        # Check if enough time has passed to retry
        if time.time() - self.last_failure_time > self.reset_timeout:
            self.state = "half-open"
            return True
        return False

When the circuit opens, all requests for that step skip directly to the next fallback tier. This prevents a degraded model from wasting tokens and time on requests it's going to fail anyway.
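
The class above counts failures rather than tracking a rate. A rolling error-rate variant could look like the sketch below. The class name, the window size, the 50% trip point, and the injectable clock are our assumptions for illustration, not the production implementation.

```python
import time
from collections import deque


class RollingCircuitBreaker:
    """Opens when the failure rate over the last `window` calls reaches `max_error_rate`."""

    def __init__(self, window: int = 20, max_error_rate: float = 0.5,
                 reset_timeout: float = 60.0, clock=time.monotonic):
        self.outcomes = deque(maxlen=window)  # True = failure, False = success
        self.max_error_rate = max_error_rate
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests do not need to sleep
        self.opened_at = None  # None means the circuit is closed

    def record(self, failed: bool) -> None:
        if self.opened_at is not None and not failed:
            # A success while open (normally the half-open probe) closes the circuit
            self.opened_at = None
            self.outcomes.clear()
            return
        self.outcomes.append(failed)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        if window_full and sum(self.outcomes) / len(self.outcomes) >= self.max_error_rate:
            self.opened_at = self.clock()  # open, or re-open after a failed probe

    def can_execute(self) -> bool:
        if self.opened_at is None:
            return True
        # After the timeout, let probe requests through (half-open)
        return self.clock() - self.opened_at >= self.reset_timeout


# Usage with a fake clock so the example runs instantly
now = [0.0]
breaker = RollingCircuitBreaker(window=4, max_error_rate=0.5,
                                reset_timeout=60, clock=lambda: now[0])
for failed in (False, True, False, True):
    breaker.record(failed)
print(breaker.can_execute())  # False: 2 of the last 4 calls failed
now[0] = 61.0
print(breaker.can_execute())  # True: timeout elapsed, probe allowed
breaker.record(failed=False)  # probe succeeded, circuit closes
```

Waiting for a full window before tripping avoids opening the circuit on the first failed request after a restart, at the cost of reacting a little more slowly.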

What we measured

After running this architecture for three months on the healthcare document processing pipeline:

End-to-end accuracy went from 85% to 97.3%
Average latency increased by 400ms for the primary path (acceptable)
Human review volume dropped from ~30% of all documents to ~12%, because the enhanced RAG fallback caught most borderline cases
Zero hallucinated patient IDs made it to the database (previously ~2 per week)

The biggest win was not the accuracy improvement itself. It was the fact that we could now quantify exactly where the system was failing and allocate human attention to the cases that actually needed it.

Key takeaways

Design for failure from day one. If your agent workflow has no fallback path, you're building a demo, not a production system.
Confidence scoring needs multiple signals. Token probabilities are not enough. Combine schema validation, self-consistency checks, and domain heuristics.
HITL is a spectrum, not a switch. Different confidence levels should trigger different levels of human involvement. Not every edge case needs real-time intervention.
Monitor the monitors. Circuit breakers and rolling error rates prevent cascading failures from eating your entire pipeline.

Wrapping up

Building reliable AI agent workflows is less about picking the right model and more about designing the right failure modes. The fallback chain pattern gave us a structured way to degrade gracefully, and the tiered HITL approach kept humans involved where they add value without turning them into full-time babysitters for the AI.

If you want to dive deeper into interrupt mechanics, the LangGraph team wrote a solid overview of the pattern.

If you're building something similar, start with the confidence scoring. Everything else follows from having a reliable signal for "how much should I trust this output."

I'm a software engineer at Adamo Software, where we build custom AI and healthcare platforms for enterprise clients.

DEV Community