DEV Community

Ryan Carter

Posted on • Originally published at stormcloudy.com

I Was a Human AI Agent Before It Was Cool

In 2026, what I'm about to describe would be a three-tool agent loop. An LLM instance with computer use, a couple of MCP servers, maybe a router for the edge cases. A weekend project.

When I actually built it, none of that existed. What I had was Puppeteer, three vastly different internal systems with no usable APIs, and a two-month backlog of customer escalations my team couldn't clear because each ticket required logging into all three systems, correlating what you found, and making a judgment call. We were a team of five. The backlog was hundreds deep and growing.

I cleared it in a week.

This isn't a post about how clever the script was; it wasn't, particularly. It's a post about what I learned doing the work of an AI agent by hand, before the agents showed up in droves. Because looking back, the code was the easy part. The hard part was the same part agents are still tripping over today: knowing which discrepancy means something and which one is just a guy on a business trip.

The tickets that broke tier-1

Our tier-1 support team was good. They weren't stuck because they lacked skill. They were stuck because each escalated ticket required something support tooling isn't built for: holding context from three different systems in your head at the same time and reasoning about whether the inconsistencies between them mattered. They did their job. The issues just extended into territory they weren't tasked to handle.

A typical ticket looked like this. A customer can't log in. Tier-1 checks the auth system; password reset went through fine, no lockouts, account looks healthy. Case closed? Not quite. Because in the billing system, the customer's last login came from a different state than their billing address. And in the activity logs, there's a flag from two days ago that tier-1 doesn't have permission to interpret. So they send the issue off down a kiddie slide to a pile of "someone else's problem" at the bottom. Guess who my team was. Someone else, correct.

So which is it? Compromised account? Fraud? Or is the customer just on a business trip, logging in from a hotel, and hitting a security check that quietly broke their session and denied them access to something they paid for?

I couldn’t answer that without all three views. And even with all three views, you can't answer it without knowing which combinations of signals are expected and which ones aren't. A login from a new location plus a recent password reset plus an activity flag is one story. A login from a new state plus a clean auth history plus a flag from a known maintenance window is a completely different story. Same data points, opposite conclusions.

Tier-1 had access to all three systems. What they didn't have was twenty minutes per ticket to log into each one, run the right queries, and cross-reference to make a call. It wasn't their job. Multiply that by hundreds of tickets and you get a two-month backlog and no plan to fix it.

This is the part I want to sit with for a second, because it's the part that matters.

What tier-1 was being asked to do was a tool-use loop. Observe state in System A. Use that state to query System B. Use those results to formulate a query against System C. Synthesize. Decide. Act. That's not a support workflow; that's an agent trace. That's an AI automation workflow waiting to happen. We just didn't have the vocabulary for it yet, and we definitely didn't have anything that could run it.

So the work fell to whoever could do it. Which, on a busy day, was nobody.

What I had to work with

Three internal web apps. No APIs. SSO that made scripted access painful. One of the systems was where the activity logs lived, queried through a UI nobody loved but everybody used. The other two were a customer database and a backend logging system, both with their own quirks and their own session timeouts.

If I'd had API access, obviously I would have used it. I asked; no dice. So I thought about what *could* access those systems and get the data I needed, without giving me carpal tunnel from hundreds of repeated copy, paste, and pray operations. I reached for Puppeteer.

The choice wasn't clever. It was the only thing I could think of at the time that could log in like a human, click through pages, copy data out of one tab into the search bar of another, and do it without complaining or getting distracted. Which is exactly what we'd now call a computer-use agent. The shape of the problem hasn't changed; the tooling has. And it was a lot faster than I could be, which matters when the backlog only gets shorter by brute force.

This wasn't beautiful architecture. It was a script. A long one, with a lot of await page.waitForSelector and a lot of try/catch blocks. If you're imagining a clean modular system with a queue and retry logic and structured logs, lower your expectations. It was held together by stubbornness and the fact that I could re-run the whole thing in about four minutes if anything broke. And boy did I rerun that thing.
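A representative step looked something like this. This is a sketch, not the original code: the selector, URL, and function names are illustrative, and `page` is assumed to be a Puppeteer page object.

```javascript
// One step of the script: navigate, wait for the data to render, scrape it.
// If anything times out or breaks, return null so the ticket gets flagged
// for a human instead of getting a made-up answer.
async function readCustomerId(page, ticketUrl) {
  try {
    await page.goto(ticketUrl);
    await page.waitForSelector('#customer-id', { timeout: 10_000 });
    const raw = await page.$eval('#customer-id', el => el.textContent);
    return raw.trim();
  } catch (err) {
    return null; // flag for human review rather than guessing
  }
}
```

Multiply that pattern by three systems and a few dozen page states, and you have the long script with all the try/catch blocks.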

What it did, abstractly, was this:

  1. Pull the queue of escalated tickets from a spreadsheet.
  2. For each ticket, grab the relevant context from System A: the customer ID.
  3. Use that context to query System B for the rest of the data the rules needed.
  4. Run a parameterized lookup against System C to find specific scenarios in the logs: what happened, how many times, and why. This was the messiest part.
  5. Apply a small set of rules to the combined picture and produce a verdict: incorrect password or username, access denied due to location, or inconclusive.
  6. Write the verdict back to the spreadsheet where the team would see it.
```javascript
// Simplified agent loop: each ticket looked like this
for (const ticket of escalatedTickets) {
  const customerData = await getFromSystemA(ticket.customerId);
  const billingData  = await getFromSystemB(customerData.accountId);
  const activityLog  = await getFromSystemC(customerData.userId, { days: 7 });

  const verdict = applyRules(customerData, billingData, activityLog);

  // verdict: 'wrong_credentials' | 'location_block' | 'inconclusive'
  await writeVerdict(ticket.id, verdict);
}
```

That's it. That's the whole thing. The part that took the week wasn't the structure. It was getting the rules right.

The rules were the work

This is where most automation posts wave their hands.

The rules weren't a config file. They were driven by one goal: give the customer what they need. Things like: if the auth system shows a password reset within the last 48 hours AND the new login is from a location the customer has logged in from before, it's almost certainly the customer. Or: if the activity log shows the user wasn't at their usual location, they shouldn't have had access. And sometimes the inverse: the user should have had access at their usual location but didn't, because we'd incorrectly cached their hotel location instead.
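To make that concrete, here's the shape the rule layer took, reduced to a sketch. The field names, threshold, and flag values are illustrative, not the originals:

```javascript
// Each rule maps a combination of cross-system signals to a verdict
// plus a human-readable reason the team could audit.
function applyRules({ passwordResetHoursAgo, knownLocation, activityFlag }) {
  // Recent self-service reset from a familiar location: almost certainly
  // the real customer mistyping their new password.
  if (passwordResetHoursAgo != null && passwordResetHoursAgo <= 48 && knownLocation) {
    return { verdict: 'wrong_credentials', reason: 'recent reset, familiar location' };
  }
  // New location plus a tripped security check: the location-block story.
  if (!knownLocation && activityFlag === 'location_check') {
    return { verdict: 'location_block', reason: 'new location tripped a security check' };
  }
  // Anything else goes to a human, with the signals attached.
  return { verdict: 'inconclusive', reason: 'signal combination outside the rule set' };
}
```

The reason string mattered as much as the verdict: it's what let a reviewer disagree with the script quickly instead of re-deriving the whole picture.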

Each rule sounds obvious in isolation. The work was in finding all of them, and in figuring out which combinations mattered. There was no playbook, no one to ask, no docs to draw from. Sometimes those are the best problems; sometimes they're the kind that burn bad UIs into your retinas forever. I suggest frequent coffee breaks.

Customer satisfaction and proper access was the goal. The Puppeteer script was the deliverable. The rules were the project.

This is the part I think about a lot now, watching everyone build agents. Everyone wants to skip to the model. The model is the easy part. The hard part is that nobody has written down what good judgment in your domain actually looks like, and until somebody does, the agent has nothing to imitate.

This isn't just about building agents. It's about anyone using AI to do real work. If you can't say what "good" is in your domain, specifically enough that you'd catch the model getting it wrong, then you're not using AI. You're hoping.

This is the thing we still need humans for. Not the typing. Not the clicking. Not even most of the deciding. The judgment. Knowing what good looks like, and being able to tell when something isn't it. That's the part that doesn't automate, and I don't think it's going to anytime soon. Every useful AI system I've seen, including the one I built with Puppeteer in a week before it was cool, works because someone did the unglamorous work of getting that judgment out of human heads and into something checkable. That work is still ours. It might be the most human work there is.

The gotchas

Enough philosophy. Here's what actually broke. Some of these will sound familiar if you've built anything that talks to systems you don't own.

Sessions and auth. Three systems, three session lifetimes, none of them aligned. I ended up with a small login-recovery routine that detected "you've been logged out" pages by their distinctive shape and re-authenticated mid-run. It was ugly. It worked.
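The recovery routine amounted to something like this sketch, assuming a Puppeteer-like `page` and a per-system login helper (the selector is illustrative):

```javascript
// Detect the "you've been logged out" page by its shape (the login form
// reappearing) and re-authenticate mid-run before retrying the step.
async function ensureLoggedIn(page, loginFn) {
  const loggedOut = (await page.$('form#login')) !== null;
  if (loggedOut) {
    await loginFn(page); // per-system login routine
  }
  return loggedOut; // true if we had to recover
}
```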

Selectors that broke at the worst possible time. One of the internal apps got a UI update on day three. A <div> became a <section> and half my selectors went dark. I fixed it in fifteen minutes and added a "did this page render the way I expected?" check at the top of each step. That check caught two more silent failures over the following days.
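The check itself can be tiny. A sketch, with illustrative selectors:

```javascript
// Fail fast and loudly if the page doesn't contain the elements the next
// step depends on; much better than silently scraping the wrong thing.
async function checkPageShape(page, requiredSelectors) {
  const missing = [];
  for (const sel of requiredSelectors) {
    if ((await page.$(sel)) === null) missing.push(sel);
  }
  if (missing.length > 0) {
    throw new Error(`Page shape check failed, missing: ${missing.join(', ')}`);
  }
}
```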

My own auth system thought I was a bot. Because, well, I was one: real anatomy and muscles cosplaying as a boring, repetitive agent. The pattern of logins triggered a soft lockout more than once. I solved it the boring way: spaced the runs out, added jitter, and honestly prayed. A lot. I offered fresh pots of coffee to the servers in reverent ritual sacrifice and backed away slowly. It worked.
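The spacing-out was nothing fancier than this (the numbers are illustrative):

```javascript
// Wait a base interval plus random jitter between runs, so the login
// pattern doesn't look like a metronome to the lockout heuristics.
function jitteredDelayMs(baseMs, jitterMs) {
  return baseMs + Math.floor(Math.random() * jitterMs);
}

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));
// usage between tickets: await sleep(jitteredDelayMs(30_000, 15_000));
```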

The cases the script couldn't solve. This is the gotcha that matters most. Maybe 15% of tickets didn't fit the rules cleanly: combinations of signals my supervisor would look at and say "I need to see this one." For those, the script tagged the ticket with a note about why it was punting. That tagging was as important as the resolutions themselves. An agent that confidently resolves a ticket it shouldn't have touched is worse than no agent at all.

That last point connects directly to the hardest unsolved problem in agent design today: knowing when to stop. The script knew when to stop because I built it to know. Most agents don't, yet. The temptation to make the model decide everything, including whether it has enough information to decide, is real, and it's why so many agent demos look great and so many agent deployments quietly fail.

What happened

The backlog cleared in a week. Five days of running the script, fixing things it surfaced, expanding the rule set, tweaking for more reliable selectors, and re-running. By the next Monday we were caught up to present-day tickets and the team had time to actually look at the ones the script had flagged for human review. Turns out the answer was some of each: a mix of legitimate users blocked by location caching, a handful of real fraud cases, and a long tail of edge cases that didn't fit any neat category.

The team of five didn't shrink. They got their time back. They were surprised. Tickets that used to take nearly an hour in some cases now took less than a minute of confirming what the script had already figured out, or fifteen minutes on the genuinely weird cases the script knew it couldn't handle. The customers got answers in hours instead of weeks. As a product developer, that was the real win. I know how it feels to be locked out of something you paid for, and I don't want anyone stuck there.

Nobody ever asked me to productionize it. It ran on my laptop for as long as I worked there, kicked off by a cron job, and as far as I know it kept running for a while after I left. That's the other thing about this kind of work. Sometimes the right shape for an automation is "the thing that lives on one person's machine and saves a team from drowning." Not everything needs to be a service. Not everything needs DevOps. At least not right away.

What I'd build today

If I had this problem in 2026, here's what I'd reach for.

An LLM agent with computer use (probably Claude), three MCP servers (one for each internal system, or one for whichever ones now have APIs), and a small router that decides which sub-agent handles which ticket type. The agent reads the queue, takes a ticket, logs into what it needs, gathers the cross-system view, applies a prompted set of decision rules, and writes its verdict and reasoning back to the spreadsheet. The hand-off rules, "stop and ask a human when X", go in the system prompt and get tightened over time based on what the human reviewers actually push back on.

That system would take a weekend. Maybe less.

But here's the thing: the hard part would be exactly the same. I'd still need to work out what the issues are and build a mental model of what has to happen to resolve them. I'd still need to figure out which combinations of signals mean what. I'd still need to design the "I don't know, please look at this" path carefully, because that's the path that determines whether the agent is trustworthy or dangerous.

The model doesn't change the project. It changes the speed of the deliverable. The project is, and has always been, getting good judgment from human heads and into something that runs at scale. Leveraging our intelligence to inform the LLM’s.

The experience of having done this by hand, with Puppeteer and stubbornness, in the years before AI tooling existed, is a story I tell because it reminded me where the actual work is. And the actual work hasn't moved.

That is the promise of AI today. Used correctly, with its output verified by a smart engineer who stays in control of the outcome, it makes you more valuable, not less.
