Human-in-the-Loop Design for Email Agents

#ai #agents #ux #email

A refund request lands in your support agent's queue. The knowledge-base match comes back at 0.91 confidence — comfortably above your drafting threshold, with a clean article on refund policy attached. The agent should still not send that reply. If that sentence sounds wrong to you, this post is the argument for why it's right.

Autonomy is a dial, not a switch

Most teams frame human-in-the-loop as a binary: either the agent sends email on its own or a human approves everything. The framing fails in both directions. Full-auto on everything means the one bad reply ends up screenshotted in a board deck; full-review on everything means you've built an expensive draft generator that saves nobody time.

The better model is a dial set per message type, not per agent. The same support agent can run fully autonomous on order-status lookups, draft-and-approve on billing questions, and hands-off-escalate on anything with legal weight. The email support agent recipe implements exactly this, with two independent gates deciding where the dial sits for each message.

Gate one: confidence

The first gate is mechanical — how sure is the knowledge-base match? The recipe's thresholds:

Confidence	Action
≥ 0.85	Draft directly from the matched article
0.60 – < 0.85	Draft conservatively and cite the source article inline so the reviewer can verify
< 0.60	Don't draft — flag for manual review with the best-guess article attached

The middle tier is the clever part. Citation-required drafts let reviewers calibrate their scrutiny: trust the high-confidence pile, check the cited ones, write the low-confidence ones themselves. That's what keeps review from becoming a second full-time job.
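The table above translates almost directly into a function. A minimal sketch, assuming confidence arrives as a float in [0, 1]; the `Action` enum and the `confidence_action` name are illustrative, not taken from the recipe:

```python
from enum import Enum


class Action(Enum):
    DRAFT_DIRECT = "draft_direct"
    DRAFT_WITH_CITATION = "draft_with_citation"
    FLAG_FOR_REVIEW = "flag_for_review"


# Thresholds copied from the table above.
DIRECT_THRESHOLD = 0.85
DRAFT_THRESHOLD = 0.60


def confidence_action(conf: float) -> Action:
    """Map a KB match confidence in [0, 1] to a drafting action."""
    if not 0.0 <= conf <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {conf}")
    if conf >= DIRECT_THRESHOLD:
        return Action.DRAFT_DIRECT
    if conf >= DRAFT_THRESHOLD:
        return Action.DRAFT_WITH_CITATION
    return Action.FLAG_FOR_REVIEW
```

Keeping the thresholds as named constants means a later tuning pass changes two numbers rather than the branching logic.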

Gate two: risk — and it overrides

The second gate is about consequences, and it doesn't care what gate one said:

Low — password resets, FAQ-shaped questions → draft, human approves.
Medium — refunds, account changes, anything touching billing → draft, human approves with extra scrutiny.
High — legal threats, regulatory matters, fraud reports → no draft at all; escalate immediately to a person with full context.

This is why the 0.91-confidence refund reply doesn't go out unreviewed: refunds are medium-risk regardless of match quality. Confidence measures "do we know the answer?"; risk measures "what happens if we're wrong?". The two questions are orthogonal, and conflating them is how an agent ends up committing your company to something. The widely cited example is Moffatt v. Air Canada (2024): the airline's chatbot described a refund policy that didn't exist, and a Canadian tribunal held the company to it.
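To make the tiers concrete, here is a deliberately naive risk classifier. It is a sketch under loud assumptions: the keyword lists are invented for illustration, it takes the message body as a plain string, and a production classifier would need far broader coverage (and probably a model rather than substring matching). The one property worth copying is the order of checks: high-risk terms are tested first, so a message that mentions both a refund and a lawyer lands in the high tier.

```python
# Hypothetical keyword lists, for illustration only.
HIGH_RISK_TERMS = ("lawyer", "attorney", "lawsuit", "legal action", "regulator", "fraud")
MEDIUM_RISK_TERMS = ("refund", "billing", "invoice", "charge", "cancel my account")


def classify_risk(text: str) -> str:
    """Return "high", "medium", or "low" for a message body."""
    lowered = text.lower()
    # Check the most severe tier first so it wins when tiers overlap.
    if any(term in lowered for term in HIGH_RISK_TERMS):
        return "high"
    if any(term in lowered for term in MEDIUM_RISK_TERMS):
        return "medium"
    return "low"
```

Note that nothing here looks at confidence. The classifier answers only the consequences question, which is what lets it override the first gate.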

The choke point

In code, the whole policy is small:

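def handle(msg):
    # Helper functions and the kb client are assumed to be defined elsewhere.
    question = extract_question(msg)
    article, conf = kb.search(question)

    # Gate two is checked first because it overrides: high risk never gets a draft.
    if classify_risk(msg) == "high":
        escalate_to_human(msg, reason="high-risk topic")
        return
    # Gate one: below 0.60, hand over the best-guess article instead of drafting.
    if conf < 0.60:
        flag_for_review(msg, article)
        return

    # Drafts below 0.85 cite the article inline; every draft waits for approval.
    draft = generate_draft(msg, article, cite_inline=(conf < 0.85))
    queue_for_approval(msg, draft, article)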

Notice what's absent: a send call. queue_for_approval is the choke point — in production it drops drafts into Slack or a review tool, never directly into the outbox. The recipe states the load-bearing constraint outright: always show the draft before sending, never auto-send. Every other rule can be tuned; remove that one and you no longer have a gate, you have a delay.
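One way to make the choke point structural rather than a convention is to give the send capability only to the approval step. A minimal in-memory sketch: `ApprovalQueue`, `PendingDraft`, and the injected `send` callable are hypothetical names, not the recipe's implementation, and the signature is simplified from the `queue_for_approval(msg, draft, article)` call above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class PendingDraft:
    msg_id: str
    body: str
    approved_by: Optional[str] = None


@dataclass
class ApprovalQueue:
    # `send` is injected here so the agent code never holds it directly.
    send: Callable[[str, str], None]
    pending: Dict[str, PendingDraft] = field(default_factory=dict)

    def queue_for_approval(self, msg_id: str, body: str) -> None:
        """Agent-facing: store the draft. Nothing is sent here."""
        self.pending[msg_id] = PendingDraft(msg_id, body)

    def approve(self, msg_id: str, reviewer: str,
                edited_body: Optional[str] = None) -> None:
        """Human-facing: the only code path that reaches `send`."""
        if not reviewer:
            raise ValueError("approval requires a named reviewer")
        draft = self.pending.pop(msg_id)  # KeyError if nothing is pending
        if edited_body is not None:
            draft.body = edited_body
        draft.approved_by = reviewer
        self.send(draft.msg_id, draft.body)
```

Because the agent only ever sees `queue_for_approval`, removing the gate would take a deliberate code change rather than a forgotten flag.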

If you're giving the agent its own mailbox to run this from, Agent Accounts — currently in beta — are the natural home for a support@ identity the agent owns end-to-end. They also give the choke point a native implementation: the Drafts API supports full CRUD, so the agent can create a draft in its own mailbox, a reviewer can amend it, and approval sends the existing draft — the send action on a draft behaves exactly like a regular send. The pending reply lives where the conversation lives, instead of in a screenshot pasted into Slack.

Declaring the policy instead of coding it

The same gates work as configuration. If the agent runs on a skill-file platform, the recipe shows the whole policy as a SKILL.md rather than code:

# Support agent

## Reply style
- Replies are under 120 words.
- Cite KB articles inline: [KB-1234](https://kb.example.com/1234).
- Match the tone of the inbound message.

## Drafting rules
- Always show the draft before sending. Never auto-send.
- If confidence < 0.6, do not draft — flag for human.
- Refunds, account changes, legal threats: never draft. Escalate.

## Polling
- Check the support inbox every 10 minutes.
- Process at most 5 tickets per cycle while the agent is in shakedown.

Notice the policy reads like an onboarding doc for a junior hire, which is the right intuition. The drafting rules section is the dial from earlier, written in plain English and set one notch stricter: here refunds and account changes escalate alongside legal threats instead of going to draft-and-approve. "Always show the draft before sending" appears here too because it survives every refactor: it's a property of the system, not of any particular implementation.

Loosening the dial, slowly

Human-in-the-loop isn't a permanent tax; it's how you earn the data to automate more. The recipe's shakedown numbers: process 5 tickets per cycle while tuning the matcher and risk classifier, bump to 20 once the false-positive rate is acceptable. Polling every 5–15 minutes is plenty for support latency — and simpler than webhook fan-out on a shared inbox.
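The per-cycle cap is easy to enforce in the polling loop itself. A minimal sketch, assuming a hypothetical `fetch_unread` callable that yields tickets and a `handle` callable like the one shown earlier; a scheduler (cron, a worker loop with a sleep) would invoke `run_cycle` every 5 to 15 minutes:

```python
from typing import Callable, Iterable, List


def run_cycle(fetch_unread: Callable[[], Iterable[str]],
              handle: Callable[[str], None],
              max_tickets: int = 5) -> List[str]:
    """Process at most `max_tickets` tickets and return the ones handled."""
    processed: List[str] = []
    for ticket in fetch_unread():
        if len(processed) >= max_tickets:
            break  # Leave the rest for the next cycle.
        handle(ticket)
        processed.append(ticket)
    return processed
```

Raising the cap from 5 to 20 then becomes a single argument change, which keeps the shakedown setting visible in one place.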

Two practices make the loosening defensible. Log everything — every classification, KB lookup, and approval decision — so when you propose moving a message type from draft-and-approve to full-auto, you argue from a thousand logged approvals, not vibes. And track what the agent can't match: those tickets are your map of missing KB articles.
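The "argue from logged approvals" step can be as simple as a per-type tally. A sketch, assuming each review decision is logged with a message type and an outcome; the `Decision` record and the outcome labels ("approved", "edited", "rejected") are hypothetical, not the recipe's schema:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Decision:
    message_type: str
    outcome: str  # "approved", "edited", or "rejected" (assumed labels)


def approval_stats(log: Iterable[Decision]) -> Dict[str, Tuple[int, float]]:
    """Return {message_type: (reviewed_count, unedited_approval_rate)}."""
    totals: Dict[str, int] = defaultdict(int)
    clean: Dict[str, int] = defaultdict(int)
    for decision in log:
        totals[decision.message_type] += 1
        # Only drafts approved without edits count as clean.
        if decision.outcome == "approved":
            clean[decision.message_type] += 1
    return {t: (n, clean[t] / n) for t, n in totals.items()}
```

Reporting the count alongside the rate matters: a 100% approval rate over six tickets is not an argument for full-auto, while the same rate over a thousand might be.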

The counterargument deserves its hearing: review fatigue is real, and a human rubber-stamping 200 drafts a day is barely a gate. Two answers. First, tier aggressively — automate the genuinely safe tiers so human attention concentrates where it changes outcomes, and batch the repetitive stuff: when three "where's my receipt?" tickets arrive in a row, that's one KB article, one draft template, one reviewer pass, not three. Second, remember what the gate is actually defending against. The recipe puts it bluntly: even at 99% accuracy, the 1% that makes legal commitments destroys trust faster than the 99% builds it. Fatigue is a workload-design problem with workload-design fixes. The 1% is not.
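The batching idea above can be sketched as a grouping step ahead of review. This assumes each ticket has already been matched to a KB article id; the `(ticket_id, article_id)` pair shape and the `batch_by_article` name are illustrative:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def batch_by_article(matches: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group ticket ids by matched KB article so each group gets one review pass."""
    batches: Dict[str, List[str]] = defaultdict(list)
    for ticket_id, article_id in matches:
        batches[article_id].append(ticket_id)
    return dict(batches)
```

A reviewer then approves one draft template per article and skims the tickets in that batch for anything that does not fit it.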

Map your own agent's message types into the three risk tiers this week, and set the dial per tier. Which message type are you currently over-reviewing — and which one, honestly, shouldn't be on full-auto?