The agent was working as intended. That was the problem.
A founder on r/ChatGPT posted about their RunLobster setup: agent reads incoming support emails, drafts replies, drops them in Gmail with an [Agent-drafted] tag for review before sending. Clean workflow. 95% accuracy rate, by their own count. Then Wednesday happened.
An angry customer sent an angry email. The agent, trained on the founder's previous replies, drafted a response that matched the tone it detected. Clipped sentences. Short answers. The emotional temperature of someone who was done explaining things.
The founder sent it. The customer escalated. The founder had to apologize for their own software being rude on their behalf.
What the Agent Actually Did Wrong
Nothing, technically. It read previous replies, identified patterns, and reproduced them under similar-seeming conditions. The problem is that "similar-seeming" is doing enormous work in that sentence.
The agent saw: frustrated customer, short reply needed, direct language appropriate. What it missed: this particular customer had been with the company for two years. She wasn't just angry. She was disappointed. Those require completely different responses. One calls for efficiency. The other calls for acknowledgment before any explanation starts.
No model trained on past emails can reliably detect the difference between "this person wants me to get to the point" and "this person needs to feel heard before I say anything else." That distinction lives in context the agent didn't have access to, and frankly, context that's hard to encode.
This isn't a training failure. It's a category error. The agent was solving the wrong problem.
The 95% Trap
Here's the thing about 95% accuracy in a support queue: it sounds good until you think about what's in the other 5%.
If you have 50 customers and handle maybe 10 support interactions per week, a 5% miss rate works out to roughly one bad draft every two weeks. Manageable. But these aren't random errors distributed evenly. Bad drafts cluster around the worst moments: the angriest customers, the most complex situations, the edge cases that fall outside the training distribution. The agent is most likely to fail exactly when the stakes are highest.
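A quick back-of-envelope sketch, using the post's own numbers (10 interactions a week, 5% misses), makes the rate concrete. The one assumption baked in, that errors land independently and evenly, is exactly what the next paragraph argues doesn't hold.

```python
# Back-of-envelope failure math for a "95% accurate" support queue.
# Numbers come from the scenario above; nothing here is measured.
interactions_per_week = 10
error_rate = 0.05  # the other 5%

bad_drafts_per_week = interactions_per_week * error_rate
print(f"Expected bad drafts per week: {bad_drafts_per_week}")              # 0.5
print(f"Roughly one bad draft every {1 / bad_drafts_per_week:.0f} weeks")  # 2

# Caveat: this treats every interaction as an independent coin flip.
# In practice the misses cluster on the hardest tickets, so the
# "one every two weeks" figure understates the damage per miss.
```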
That's not a quirk of this particular setup. That's the physics of how these systems work. They're good at average. They struggle at the tails. And customer relationships live at the tails.
What Human Review Actually Requires
The founder had review baked in. The [Agent-drafted] tag was right there. They still sent it.
This is the part nobody talks about when they sell you on "human in the loop" workflows. Reviewing AI output is its own cognitive task. When you're context-switching between your actual work and a queue of agent-drafted emails, the ones that look fine get approved fast. The ones that look wrong are easy to catch. The dangerous ones are the ones that look mostly right but are wrong in a way you'd only notice if you were thinking carefully about this specific customer at this specific moment.
The founder wasn't wrong to trust the system 95% of the time. They were wrong to think that approving a draft is the same as writing a reply.
How Human Pages Fits Here
The founder needed a second brain on that Wednesday email. Not a different model, not better prompting. A person who could read the customer's history, recognize the emotional subtext, and either rewrite the draft or flag it for a different approach entirely.
This is exactly the kind of task that runs on Human Pages. An agent handles the support queue Monday through Friday, drafting replies and organizing tickets. When a draft gets flagged, manually or by a confidence threshold in the workflow, it routes to a human on Human Pages who handles escalations. That person gets paid in USDC, per task, no overhead. They review the draft, check the customer's history, and either approve it, rewrite it, or escalate further.
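For anyone wiring this up, here's a minimal sketch of the routing step. It's illustrative only: `hp_client.create_task`, the `confidence` score, and the `payout_usdc` field are hypothetical stand-ins, not the actual Human Pages API or your agent framework's real output.

```python
ESCALATION_THRESHOLD = 0.80  # assumed cutoff; tune against your own queue

def route_draft(draft: str, confidence: float, customer_id: str, hp_client) -> dict:
    """Route an agent-drafted reply: founder approval queue, or paid human review."""
    if confidence >= ESCALATION_THRESHOLD:
        # High confidence: stays in the normal [Agent-drafted] review flow.
        return {"action": "queue_for_founder_approval", "draft": draft}

    # Low confidence: pay a person to read this one email carefully.
    # create_task() is a hypothetical client method, not a documented API.
    task = hp_client.create_task(
        kind="support_draft_review",
        payload={"draft": draft, "customer_id": customer_id},
        payout_usdc=5.00,  # assumed per-task rate, within the range discussed below
    )
    return {"action": "escalated_to_human", "task_id": task.id}
```

The threshold is the whole design decision: set it too high and you're paying for reviews you didn't need, too low and the Wednesday emails slip through.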
The agent does the volume work. The human catches the Wednesday emails.
The cost difference is significant. Hiring a part-time support person to review every draft doesn't make sense at 50 customers. Paying someone $3-8 per escalated ticket, only when escalation is needed, does. The founder in that Reddit thread was already doing human review. They just needed a better human review process, with someone whose only job in that moment was reading that one email carefully.
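The break-even math is blunt enough to fit in a few lines. Every figure here is an assumption for illustration (the hourly rate especially), not a quote:

```python
# Rough weekly cost: per-ticket escalation vs. a part-time reviewer.
escalations_per_week = 0.5                 # ~1 every two weeks at this volume
cost_per_escalation = 8.00                 # top of the assumed $3-8 range
part_time_hours, hourly_rate = 15, 20.00   # assumed part-time hire

per_task_weekly = escalations_per_week * cost_per_escalation
hire_weekly = part_time_hours * hourly_rate
print(f"Per-task review: ${per_task_weekly:.2f}/week")   # $4.00
print(f"Part-time hire:  ${hire_weekly:.2f}/week")       # $300.00
```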
The Apologizing-For-Your-Own-Software Problem
There's something specific about having to apologize for an agent that's worth sitting with. It's different from a software bug. The software didn't crash. It made a social judgment call and made it badly. The founder's name was on the email. The relationship damage was real.
This is the accountability gap in agentic systems right now. When an agent sends a bad reply, the human still owns it. The customer doesn't care that an agent drafted it. They care that your company sent it. The reputational cost is 100% human. The decision that caused it was made by a model.
That asymmetry is going to create pressure on every founder running agentic workflows to be more careful about which decisions they actually delegate. Not fewer agents. More honest accounting of where human judgment needs to stay in the loop, and what that loop actually costs to maintain properly.
Agents Draft. Humans Deliver.
The RunLobster setup was sound. The mistake was treating "human in the loop" as a formality rather than a function. When the loop is a founder clicking approve between Slack messages, it stops being a loop.
The more interesting question isn't whether AI can handle customer support. It can handle most of it. The question is whether the infrastructure exists to catch the part it can't handle, consistently, without burning out the founder or blowing up the relationship.
Right now, mostly no. The tooling for agentic workflows is maturing fast. The human layer that catches the edge cases is still improvised, usually someone's personal attention being stretched across too many things at once.
That's a solvable problem. The solution probably isn't better AI. It's better access to humans who can step in at the right moments, at the right cost, without requiring you to hire someone full-time to review drafts that are correct 95% of the time.
The agent matching the customer's angry tone wasn't a malfunction. It was the system working exactly as designed, in a situation where the design wasn't enough. Those situations aren't rare. They're just unevenly distributed across your calendar.