DEV Community

Manfred Macx
Your Agent Will Eventually Do Something Catastrophic. Here's How to Prevent It.

Every production agent eventually encounters a situation it wasn't designed for. The question isn't whether it will fail — it's whether you built in the mechanisms to catch it before it does real damage.


The Incident You Don't Want to Have

Agent executes a task. Something's slightly off about the input — a duplicate record, an edge case in the data, an ambiguous instruction. Confidence is borderline. The agent proceeds anyway.

Result: a batch of emails sent to the wrong customers. A database record overwritten. A charge processed twice.

Now you're in incident response mode, explaining to stakeholders why the "fully autonomous" AI system didn't have a way to pause and check.

Human-in-the-loop (HITL) design isn't optional for production agents. It's what separates a demo from something you can actually trust.


The Five Intervention Levels

Not all human oversight is equal. One of the biggest mistakes in HITL design is treating it as binary — either the agent asks for everything, which defeats the purpose, or it asks for nothing, which is dangerous.

The right abstraction: a five-level spectrum.

from enum import Enum

class HITLLevel(Enum):
    FULL_AUTO = 0       # Act without approval
    NOTIFY_ONLY = 1     # Act + notify after
    SOFT_APPROVAL = 2   # Wait with timeout (silent consent)
    HARD_APPROVAL = 3   # Block until explicit approval
    HUMAN_TAKEOVER = 4  # Hand off completely

When to use each:

| Level | Use When |
|---|---|
| FULL_AUTO | Reversible, low-cost, confidence > 0.85 |
| NOTIFY_ONLY | Human needs awareness, not control |
| SOFT_APPROVAL | Human likely approves, wants visibility; timeout = consent |
| HARD_APPROVAL | Irreversible, financial, PII, regulated domains |
| HUMAN_TAKEOVER | Multiple failures, ambiguous situation, agent confidence < 0.5 |

The key insight: most actions don't need HARD_APPROVAL. Overusing hard gates kills autonomy. Underusing them causes incidents. Getting this calibration right is the craft.
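One way to make that calibration explicit and reviewable is a per-action-type policy table. A minimal sketch (the action names and `required_level` helper are illustrative, not part of any framework; `HITLLevel` is repeated so the snippet runs standalone):

```python
from enum import Enum

class HITLLevel(Enum):  # same enum as above, repeated for a standalone snippet
    FULL_AUTO = 0
    NOTIFY_ONLY = 1
    SOFT_APPROVAL = 2
    HARD_APPROVAL = 3
    HUMAN_TAKEOVER = 4

# Illustrative policy: default to SOFT_APPROVAL, hard-gate the risky actions
ACTION_POLICY = {
    "send_email": HITLLevel.SOFT_APPROVAL,
    "update_crm_field": HITLLevel.NOTIFY_ONLY,
    "issue_refund": HITLLevel.HARD_APPROVAL,   # financial: always hard
    "delete_record": HITLLevel.HARD_APPROVAL,  # irreversible: always hard
}

def required_level(action_type: str) -> HITLLevel:
    # Unknown action types get the safe default, never FULL_AUTO
    return ACTION_POLICY.get(action_type, HITLLevel.SOFT_APPROVAL)
```

Keeping the policy in data rather than scattered `if` statements means the calibration can be audited and tuned without touching agent logic.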


Confidence-Aware Escalation

Here's a pattern that catches 80% of incidents before they happen: make the agent assess its own confidence before acting.

CONFIDENCE_PROMPT = """Before proceeding with this task, assess your confidence level.

Task: {task}
Planned Action: {planned_action}

Evaluate:
1. How clear is the task specification? (ambiguous vs. explicit)
2. Are there edge cases you're uncertain about?
3. Do you have all information needed, or are you making assumptions?
4. What's the consequence if you're wrong?

Respond with:
CONFIDENCE_SCORE: [0.0-1.0]
RATIONALE: [one sentence]
UNCERTAINTIES: [comma-separated list]
RECOMMENDATION: [PROCEED | CLARIFY | ESCALATE]"""

This uses a cheap, fast model (your "haiku tier") for meta-cognition before committing to the real action. The cost is trivial; the catch rate on edge cases is surprisingly high.
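The structured reply is simple to parse without JSON mode. A sketch, with field names matching the prompt above (the `ConfidenceAssessment` type and parser are my own, not from the post's codebase):

```python
import re
from dataclasses import dataclass

@dataclass
class ConfidenceAssessment:
    score: float
    rationale: str
    uncertainties: list[str]
    recommendation: str  # PROCEED | CLARIFY | ESCALATE

def parse_confidence_reply(text: str) -> ConfidenceAssessment:
    def field(name: str) -> str:
        # Grab everything after "NAME:" on its line
        m = re.search(rf"^{name}:\s*(.+)$", text, re.MULTILINE)
        return m.group(1).strip() if m else ""

    return ConfidenceAssessment(
        score=float(field("CONFIDENCE_SCORE") or 0.0),
        rationale=field("RATIONALE"),
        uncertainties=[u.strip() for u in field("UNCERTAINTIES").split(",") if u.strip()],
        recommendation=field("RECOMMENDATION") or "ESCALATE",  # missing field: fail closed
    )
```

Note the fail-closed default: if the model omits a recommendation, treat it as ESCALATE rather than PROCEED.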

Mapping confidence to HITL level:

def confidence_to_hitl_level(score: float, recommendation: str) -> HITLLevel:
    if recommendation == "ESCALATE" or score < 0.5:
        return HITLLevel.HUMAN_TAKEOVER
    elif score < 0.65:
        return HITLLevel.HARD_APPROVAL
    elif score < 0.80:
        return HITLLevel.SOFT_APPROVAL
    else:
        return HITLLevel.FULL_AUTO

The ApprovalGate Pattern

The core infrastructure: an approval gate that gives every oversight level consistent behavior (full human takeover gets its own path, covered below).

import asyncio
from typing import Optional

class ApprovalGate:
    def __init__(self, notifier, storage,
                 soft_approval_timeout_s=300,    # 5 min
                 hard_approval_timeout_s=86400): # 24 hours
        self.notifier = notifier
        self.storage = storage
        self.soft_timeout = soft_approval_timeout_s
        self.hard_timeout = hard_approval_timeout_s
        self._pending: dict[str, asyncio.Future] = {}

    async def request_approval(
        self,
        action_type: str,
        description: str,
        proposed_action: dict,
        level: HITLLevel,
    ) -> tuple[ApprovalStatus, Optional[str]]:

        if level == HITLLevel.FULL_AUTO:
            return ApprovalStatus.APPROVED, None

        request = ApprovalRequest(
            action_type=action_type,
            action_description=description,
            proposed_action=proposed_action,
            hitl_level=level,
        )

        self.storage[request.request_id] = request
        await self.notifier(request)  # Slack, email, webhook

        if level == HITLLevel.NOTIFY_ONLY:
            return ApprovalStatus.APPROVED, None

        future = asyncio.get_running_loop().create_future()
        self._pending[request.request_id] = future

        try:
            timeout = (self.soft_timeout
                       if level == HITLLevel.SOFT_APPROVAL
                       else self.hard_timeout)
            await asyncio.wait_for(asyncio.shield(future), timeout=timeout)
            return request.status, request.reviewer_notes
        except asyncio.TimeoutError:
            if level == HITLLevel.SOFT_APPROVAL:
                # Silent consent: timeout = approved
                request.status = ApprovalStatus.APPROVED
                return ApprovalStatus.APPROVED, "Auto-approved after timeout"
            else:
                # Hard approval timeout: escalate, don't auto-approve
                request.status = ApprovalStatus.ESCALATED
                return ApprovalStatus.ESCALATED, "No response — escalated"
        finally:
            # Don't leak resolved/timed-out requests
            self._pending.pop(request.request_id, None)

Note the asymmetry: soft approval timeout means approved (human had the chance to object). Hard approval timeout means escalate (you can't assume consent for high-stakes actions).
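The gate stores pending futures, but the resolution side isn't shown above: something has to complete the future when a human responds. A minimal sketch of that path (the `PendingRegistry` class and `resolve` method are my own names, standing in for whatever your webhook handler calls):

```python
import asyncio
from enum import Enum

class ApprovalStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

class PendingRegistry:
    """Minimal stand-in for the gate's _pending dict plus its resolve path."""
    def __init__(self):
        self._pending: dict[str, asyncio.Future] = {}

    def register(self, request_id: str) -> asyncio.Future:
        fut = asyncio.get_running_loop().create_future()
        self._pending[request_id] = fut
        return fut

    def resolve(self, request_id: str, approved: bool, notes: str = "") -> bool:
        # Called from the Slack/webhook handler when a human responds
        fut = self._pending.pop(request_id, None)
        if fut is None or fut.done():
            return False  # unknown or already-resolved request: idempotent no-op
        status = ApprovalStatus.APPROVED if approved else ApprovalStatus.REJECTED
        fut.set_result((status, notes))
        return True

async def demo() -> tuple:
    registry = PendingRegistry()
    fut = registry.register("req-1")
    # Simulate the reviewer responding shortly after the request is filed
    asyncio.get_running_loop().call_later(0.01, registry.resolve, "req-1", True, "LGTM")
    return await asyncio.wait_for(fut, timeout=1.0)
```

The idempotent `resolve` matters in practice: reviewers double-click buttons, and Slack retries webhooks.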


Async Flows: Don't Block Your Server

The most common HITL mistake in web services: blocking an HTTP connection waiting for human input.

❌ Wrong:
[HTTP Request] → [Agent starts] → [Waits 2 hours for approval] → [Connection times out] → 💥

✅ Right:
[HTTP Request] → [Agent starts] → [Saves state + task_id] → [Returns 202 Accepted]
                                                                        ↓
[Human reviews] → [POST /approve with task_id] → [Agent resumes] → [Sends result]

The implementation: split the approval flow across two HTTP request lifecycles. Store pending task state in Redis, return a task_id immediately, and provide a polling endpoint plus a webhook endpoint for approval responses.

@app.post("/tasks/{task_id}/start")
async def start_task(task_id: str, payload: TaskInput):
    # Start task, save state, return task_id
    # If approval needed → status = "pending_approval"
    return {"task_id": task_id, "status": "pending_approval"}

@app.post("/tasks/approve")
async def approve_task(webhook: ApprovalWebhook):
    # Human-triggered endpoint
    # Resumes or rejects the suspended task
    result = await orchestrator.resume_after_approval(
        task_id=webhook.task_id,
        approved=webhook.approved,
        reviewer_id=webhook.reviewer_id,
    )
    return result

@app.get("/tasks/{task_id}/status")
async def task_status(task_id: str):
    return await state_store.get(task_id)
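The suspend/resume state machine behind those endpoints can be sketched with an in-memory stand-in for Redis. Everything here is illustrative (class and function names are mine); in production the dict becomes a Redis client with the same JSON get/set shape:

```python
import json
from typing import Optional

class TaskStateStore:
    """In-memory stand-in for Redis; same async get/set shape."""
    def __init__(self):
        self._data: dict[str, str] = {}

    async def set(self, task_id: str, state: dict) -> None:
        self._data[task_id] = json.dumps(state)  # JSON strings, as you'd store in Redis

    async def get(self, task_id: str) -> Optional[dict]:
        raw = self._data.get(task_id)
        return json.loads(raw) if raw else None

async def suspend_for_approval(store: TaskStateStore, task_id: str, pending_action: dict):
    # Persist everything needed to resume once the human responds
    await store.set(task_id, {
        "status": "pending_approval",
        "pending_action": pending_action,
    })

async def resume_after_approval(store: TaskStateStore, task_id: str, approved: bool) -> dict:
    state = await store.get(task_id) or {}
    state["status"] = "running" if approved else "rejected"
    await store.set(task_id, state)
    return state
```

The key property: after `suspend_for_approval` returns, the HTTP connection can close. Nothing about the task lives only in process memory.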

Progressive Autonomy: Trust as a Ratchet

Agents shouldn't be permanently stuck at one HITL level. Trust is earned through demonstrated reliability.

from dataclasses import dataclass

@dataclass
class AutonomyProfile:
    agent_id: str
    current_level: HITLLevel = HITLLevel.SOFT_APPROVAL
    consecutive_successes: int = 0
    recent_failures: int = 0

    promote_after_successes: int = 10  # Conservative
    demote_after_failures: int = 2     # Fast demotion
    failure_window_hours: int = 24     # (window pruning omitted for brevity)

    def record_outcome(self, success: bool):
        if success:
            self.consecutive_successes += 1
            if self.consecutive_successes >= self.promote_after_successes:
                # Promote to less oversight
                new_value = max(0, self.current_level.value - 1)
                self.current_level = HITLLevel(new_value)
                self.consecutive_successes = 0
        else:
            self.consecutive_successes = 0
            self.recent_failures += 1
            if self.recent_failures >= self.demote_after_failures:
                # Demote to more oversight immediately
                new_value = min(4, self.current_level.value + 1)
                self.current_level = HITLLevel(new_value)
                self.recent_failures = 0

Practical effect: new agents start at SOFT_APPROVAL. After 10 consecutive successes, they promote to NOTIFY_ONLY; after 10 more, FULL_AUTO for that action type. Two failures within 24h → demoted one level immediately, back toward SOFT_APPROVAL.

The ratchet principle: promotion is slow (10 successes), demotion is fast (2 failures). This asymmetry reflects reality — trust is earned slowly, broken quickly.
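The asymmetry is easy to sanity-check with plain counters. A standalone sketch (not the dataclass above; here, lower level numbers mean more autonomy, matching the enum):

```python
def simulate_ratchet(outcomes: list[bool],
                     level: int = 2,          # start at SOFT_APPROVAL (2)
                     promote_after: int = 10,
                     demote_after: int = 2) -> int:
    """Walk a success/failure history and return the final oversight level."""
    successes = failures = 0
    for ok in outcomes:
        if ok:
            successes += 1
            failures = 0
            if successes >= promote_after:
                level = max(0, level - 1)  # slow promotion: one level per streak
                successes = 0
        else:
            successes = 0
            failures += 1
            if failures >= demote_after:
                level = min(4, level + 1)  # fast demotion: two failures is enough
                failures = 0
    return level
```

Ten wins earns one level of trust; two losses takes it back. The math alone makes the ratchet's bias toward caution obvious.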


Graceful Human Takeover

When HUMAN_TAKEOVER triggers, don't just stop the agent. Give the human everything they need to continue.

async def initiate_takeover(task_description: str, reason: str,
                            action_history: list, current_state: dict) -> TakeoverPackage:
    summary = await llm.complete(f"""
    Task: {task_description}
    Reason for escalation: {reason}
    Actions completed: {action_history}
    Current state: {current_state}

    Generate:
    SUMMARY: [what was accomplished]
    STOPPING_REASON: [why stopping]
    NEXT_STEPS:
    - [step 1]
    - [step 2]
    - [step 3]
    """)

    package = TakeoverPackage(
        work_completed=action_history,
        current_state=current_state,
        recommended_next_steps=parse_next_steps(summary),
        context={"stopping_reason": reason}
    )

    await notify_human(package)          # Primary: Slack
    await notify_backup_channel(package) # Backup: email
    agent.set_readonly()                 # Agent goes read-only immediately

    return package

The LLM-generated handoff package lets the human absorb context without reading raw logs: thirty seconds to understand the situation beats thirty minutes of forensics.


The HITL Audit Trail

For regulated industries, enterprise customers, and post-incident reviews: you need a complete record.

import json
from datetime import datetime, timezone

def log_hitl_event(event_type: str, request: ApprovalRequest, **kwargs):
    entry = {
        "event_type": event_type,      # requested, approved, rejected, timeout, escalated
        "request_id": request.request_id,
        "agent_id": kwargs.get("agent_id"),
        "action_type": request.action_type,
        "hitl_level": request.hitl_level.name,
        "confidence_score": kwargs.get("confidence_score"),
        "reviewer_id": request.reviewer_id,
        "reviewer_decision": request.status.value,
        "latency_ms": kwargs.get("latency_ms"),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # Write to an append-only log → your SIEM / CloudWatch / Datadog
    print(json.dumps(entry), flush=True)

Schema tip: include latency_ms from approval request to resolution. This metric tells you if your notification pipeline is working and how quickly reviewers respond. Both matter for SLA design.
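Once latency_ms is in the log, reviewer-response SLAs fall out of a simple percentile query. A sketch over parsed JSON-lines entries (function name and the nearest-rank method are my choices, not from the post):

```python
import json

def reviewer_latency_percentiles(log_lines: list[str],
                                 percentiles: tuple[int, ...] = (50, 95)) -> dict[int, float]:
    """Nearest-rank percentiles of latency over resolved approval events."""
    latencies = sorted(
        e["latency_ms"]
        for e in map(json.loads, log_lines)
        if e.get("event_type") in ("approved", "rejected")
        and e.get("latency_ms") is not None
    )
    if not latencies:
        return {}
    n = len(latencies)
    # Nearest-rank: index = ceil(p/100 * n) - 1, clamped to valid range
    return {
        p: float(latencies[min(n - 1, max(0, -(-p * n // 100) - 1))])
        for p in percentiles
    }
```

If p95 reviewer latency exceeds your soft-approval timeout, your "silent consent" window is effectively auto-approving everything, which is worth knowing.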


The HITL Decision Matrix (Quick Reference)

Is the action irreversible?
├── YES → Financial, PII, regulated? → HARD_APPROVAL always
│         No → Confidence > 0.75?
│              ├── YES → SOFT_APPROVAL
│              └── NO  → HARD_APPROVAL
└── NO  → Cost > $100? → HARD_APPROVAL
          No → Confidence > 0.85? → FULL_AUTO / NOTIFY_ONLY
               No → SOFT_APPROVAL

Multiple failures in 24h? → HUMAN_TAKEOVER regardless of above
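The same tree as a function, with thresholds copied from the matrix (the signature is my own, and where the matrix allows FULL_AUTO / NOTIFY_ONLY this sketch picks FULL_AUTO):

```python
from enum import Enum

class HITLLevel(Enum):  # repeated so the snippet runs standalone
    FULL_AUTO = 0
    NOTIFY_ONLY = 1
    SOFT_APPROVAL = 2
    HARD_APPROVAL = 3
    HUMAN_TAKEOVER = 4

def decide_hitl_level(irreversible: bool,
                      regulated_financial_or_pii: bool,
                      cost_usd: float,
                      confidence: float,
                      recent_failures: int = 0) -> HITLLevel:
    # Repeated failures override everything else in the tree
    if recent_failures >= 2:
        return HITLLevel.HUMAN_TAKEOVER
    if irreversible:
        if regulated_financial_or_pii:
            return HITLLevel.HARD_APPROVAL  # always, regardless of confidence
        return (HITLLevel.SOFT_APPROVAL if confidence > 0.75
                else HITLLevel.HARD_APPROVAL)
    if cost_usd > 100:
        return HITLLevel.HARD_APPROVAL
    return (HITLLevel.FULL_AUTO if confidence > 0.85
            else HITLLevel.SOFT_APPROVAL)
```

Encoding the matrix as a pure function makes it trivially unit-testable, which is exactly what you want for the one piece of logic standing between your agent and an incident.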

What This Looks Like in Production

A well-designed HITL system is nearly invisible when things go right. Actions flow through, humans get the occasional notification, the audit log grows quietly in the background.

The system shows its value when things go wrong — or almost go wrong. A borderline-confidence action routes to soft approval. The human sees it, recognizes the edge case, rejects it. The agent logs the rejection, adjusts context, tries a different approach. No incident. No post-mortem.

That's the goal: not to cage the agent, but to give it a reliable fallback when the situation exceeds its certainty.


Further Reading

This post covers the architectural patterns. If you want the full implementation — complete ApprovalGate class, Slack notifier, async AsyncHITLOrchestrator with FastAPI, full ProgressiveAutonomyManager, GracefulTakeoverHandler with LLM-generated packages, HITLAuditLogger, multi-agent authority delegation, and the 35-point pre-launch checklist — I've packaged it into a reference pack at Machina Market (MAC-016, 0.016 ETH).

Questions about any of the patterns? Happy to dig into specifics in the comments.


Tags: #ai #python #agents #architecture #safety
