Praveen Govindaraj

The Plumbing Beneath the Magic


Photo by Compagnons on Unsplash

What it actually takes to run an AI agent in production

This is a follow-up to “Process Automation in the Agentic Era.” The short version of that piece: 71% of organisations are using AI agents, only 11% have anything in production, and the gap is being closed by a quiet rebuild of the infrastructure layer underneath. This piece picks up where that one left off.

A friend who runs operations at an insurance company told me something a few weeks ago that has stayed with me.

“We don’t have an AI problem. We have a who-just-did-that problem.”

Their claims process used to be entirely human. A claim came in. A handler read it. A supervisor approved it. A check went out. If anything went wrong, you could trace it back to a person, a date, a signature.

Then they piloted an AI agent. Same workflow, mostly. The agent reads the claim, summarises it, recommends an action, escalates to a human on edge cases. It works beautifully — about 80% of the time. The other 20% is the problem. Not because the agent is wrong; because when the agent is right, nobody can tell how it got there. And when it is wrong, nobody can prove they would have caught it.

Her team did not roll back the pilot. But they did not expand it either. They got stuck — like most enterprises — in the no-man’s-land between interesting and trustworthy.

This is the plumbing problem. Let us talk through it with some illustrations.

What an agent actually does, in shape

Before we talk about the infrastructure, here is what a single agent doing a single task actually looks like — at least in the way most platforms model it today.

Figure 1. The bare-minimum picture of an agent task.

That is it. That is the whole picture. An input arrives. An agent — which is really a model holding a loop, allowed to call some tools — chews on it until it produces an output.

This is what most demos show you. And if you only had to run one of these, in a sandbox, with no auditor watching, you would be done. The model is smart enough. The tools work. Job over.
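
That loop is small enough to write down. Here is a minimal sketch, assuming a call_model function that stands in for whatever LLM API a platform actually uses; nothing in it is a real product's interface.

```python
from typing import Callable

# Minimal agent: a model holding a loop, allowed to call some tools.
# call_model is a placeholder for any LLM API that returns either a
# final answer or a tool invocation.
def run_agent(task: str,
              call_model: Callable[[list], dict],
              tools: dict[str, Callable]) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(20):                    # a bare step limit, nothing more
        action = call_model(history)       # the model decides what happens next
        if action["type"] == "final":
            return action["content"]       # output produced; the loop ends
        result = tools[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent exceeded its step limit")
```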

But a business does not run one of these. A business runs ten thousand of them per day, in dozens of variations, embedded in workflows that touch real money and real people. And once you do that, the simple picture above is wildly insufficient.

What was missing, and what is now being added

The agent in the box above does not know it is part of a process. It does not know its budget. It does not know whose approval is required before it can act. It does not know how to escalate. It does not know what to log. It does not know what its peers are doing. It does not know what to do if it succeeds — or fails — or just hangs.

All of these are the responsibility of the layer around the box. And that layer is what is being built right now.

Here is the simplest sketch of it.

Figure 2. The agent at the centre, surrounded by the infrastructure layer being built across the industry.

The agent is in the middle. Around it are six things, none of which existed as a discrete concept in the AI agent world until recently, and each of which is now being built — or has already been built — by every serious platform.

Reading clockwise from the top: there is process orchestration — the thing that knows the workflow has eight steps and what order they go in. There is observability — every model call, every tool invocation, every decision recorded with timestamps. There are human checkpoints — formal places where the workflow pauses and waits for a named approver. There is the audit trail — a record built for the regulator, not the engineer. There are budgets and gates — economic guardrails that prevent a workflow from costing $40 when it should cost $4. And there is the process registry — versioned, signed artifacts that compliance can sign off on once and not have to re-review every Tuesday.
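
To make one of these concrete: observability is, at its core, an append-only stream of structured events. A toy sketch follows; the field names are my assumption, not any platform's schema.

```python
import json
import time
import uuid

def trace_event(instance_id: str, step: str, kind: str, payload: dict) -> dict:
    """Record one thing the agent did: a model call, a tool call, a decision."""
    event = {
        "event_id": str(uuid.uuid4()),
        "instance_id": instance_id,   # which workflow run this belongs to
        "step": step,                 # the named process step
        "kind": kind,                 # "model_call" | "tool_call" | "decision"
        "ts": time.time(),            # the timestamp the auditor will ask about
        "payload": payload,
    }
    print(json.dumps(event))          # in production: an append-only audit store
    return event

trace_event("claim-8841", "fraud_check", "tool_call", {"tool": "policy_lookup"})
```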

None of these are exotic. Every one of them existed in classical enterprise software. The trick — and the thing the new generation of platforms is figuring out — is reassembling them in a way that does not crush the agent’s productivity in the process.

The shape of a real workflow

Let us make this less abstract. Here is a real one — slightly stylised, but representative — from the kind of process my insurance friend deals with every day.

Figure 3. A claim-processing workflow with two agent tasks, a decision gate, and a human checkpoint for high-value claims.

A claim arrives. An agent extracts the structured data. A second agent runs a fraud check. Then there is a decision — is the claim over $50,000? If so, it pauses and waits for a human adjuster. If not, the agent proceeds to payout.

Notice what this picture has that the first one did not.
It has named steps. Each step has a place in a workflow that an analyst can read without seeing any code. It has a decision with explicit branching logic — not a probabilistic “the agent decides what to do next,” but a deterministic gate. It has a human checkpoint — formal, mandatory, gated by an authority threshold. And it has an end — a clear point at which the work is done.

This is not an AI invention. This is how every business process has been drawn for the last twenty years. The notation is called BPMN. It was standardised in 2011. The reason it is suddenly relevant is that this is the only notation that the people who own these processes — claims supervisors, compliance officers, auditors — can read. The agent went from being the whole story to being a participant in a story that pre-existed it.
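
The same shape is easy to express in code. Here is the Figure 3 workflow as a minimal Python sketch; the $50,000 threshold comes from the example above, but every function is a stub of mine, not any platform's API.

```python
from dataclasses import dataclass

HIGH_VALUE_THRESHOLD = 50_000   # the deterministic gate, not a model decision

@dataclass
class Claim:
    claim_id: str
    amount: float
    text: str

# Stand-ins for the two agent tasks; a real system would run an LLM here.
def extract_structured_data(claim: Claim) -> dict:
    return {"claim_id": claim.claim_id, "amount": claim.amount}

def looks_fraudulent(data: dict) -> bool:
    return False                 # placeholder fraud signal

def process_claim(claim: Claim) -> str:
    data = extract_structured_data(claim)       # agent task 1: extraction
    if looks_fraudulent(data):                  # agent task 2: fraud check
        return "escalated_to_investigation"
    if claim.amount > HIGH_VALUE_THRESHOLD:
        # Process-level checkpoint: the instance pauses here until a
        # named human adjuster approves. Nothing probabilistic about it.
        return "waiting_for_adjuster"
    return "payout_issued"                      # a clear end state

print(process_claim(Claim("C-1042", 72_000.0, "rear-end collision")))
# -> waiting_for_adjuster
```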

The two checkpoints

If you take only one technical idea away from this piece, make it this one. There are two kinds of human-in-the-loop, and most platforms only ship one of them.

Figure 4. Process-level checkpoints gate a whole workflow step. Tool-level gates lock a single tool while the agent uses other tools freely.

The first is the process-level checkpoint — a whole step in the workflow pauses and waits for a human. The insurance claim over $50,000, paused for an adjuster. A loan over $750,000, paused for a senior underwriter. The contract change, paused for legal. These are old. Banks have done this for decades. AI agents inherit them naturally.

The second one is newer, and it is the one that actually lets you trust the agent to do unsupervised work. It is the tool-level gate — and the difference is subtle but enormous.

Imagine you give an agent ten tools. Read this database. Search this knowledge base. Send this email. Delete this customer record. You do not want the agent to ask before reading. You do not want it to ask before searching. You probably do not want it to ask before sending a routine confirmation email. But the delete? The delete you want gated. Always. No matter what the agent thinks. No matter how confident the model is.

The clever way platforms have started shipping this is to attach the gate to the tool, not the agent. The agent does its thing autonomously — until the moment it tries to invoke the gated tool, at which point a human approval request fires off and the agent freezes. The instant the human approves (or denies), the agent resumes. The agent does not even know there is a human in the loop; it just sees a tool call that took eleven minutes to return.

This sounds like a small distinction. It is not. The reason most enterprise AI pilots stall is that the only tool constraining the agent has been the prompt. You would write a prompt that said, in earnest English, “if you are about to delete a record, ask the user first.” And then sometimes the model would do that, and sometimes — at three in the morning, on Tuesday, after a context switch — it would not. And you would have a deleted record and an apologetic post-mortem.

A tool-level gate makes that conversation impossible. Not “please ask before deleting.” The delete tool will not function until a named human approves.

Mechanical. Auditable. Boring. Exactly what regulators want to see, and what most AI products still do not have.
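
Mechanically, the whole idea can be as small as a decorator that blocks the calling thread until an approval arrives. A sketch, assuming a thread-per-agent runtime; ApprovalQueue is a stand-in for a real platform's approval inbox.

```python
import threading

class ApprovalQueue:
    """Stand-in for a platform's approval inbox."""
    def __init__(self):
        self._pending: dict[str, threading.Event] = {}
        self._verdicts: dict[str, bool] = {}

    def request(self, key: str) -> bool:
        self._pending[key] = threading.Event()
        self._pending[key].wait()        # the agent's thread freezes here
        return self._verdicts[key]

    def resolve(self, key: str, approved: bool) -> None:
        self._verdicts[key] = approved
        self._pending[key].set()         # the frozen agent resumes

approvals = ApprovalQueue()

def gated(tool):
    """Attach the gate to the tool, not the agent."""
    def wrapper(*args, **kwargs):
        if not approvals.request(tool.__name__):
            raise PermissionError(f"{tool.__name__} denied by approver")
        return tool(*args, **kwargs)
    return wrapper

@gated
def delete_customer_record(record_id: str) -> str:
    return f"deleted {record_id}"

# Ungated tools (read, search, send_email) stay plain functions the agent
# calls freely. Only the destructive one blocks. To the agent, approval is
# just a tool call that took a while to return:
threading.Timer(0.5, approvals.resolve,
                args=("delete_customer_record", True)).start()
print(delete_customer_record("c-42"))    # blocks ~0.5s, then "deleted c-42"
```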

What the user actually sees

The thing that strikes you, when you watch an experienced operations manager use one of the new platforms, is how unspectacular it looks.

The fancy AI demo where an agent books your trip is not, it turns out, what people want. What they want is something that looks like a project management dashboard from 2014. Rows of running instances. Status badges. Filters. A list of approvals waiting for you. A cost ticker that you can drill into. A search box.

The magic, when there is any, is that the rows are AI agents — not humans clicking buttons. But the interface is deliberately unmagical. Because the people who run these systems are not impressed by magic. They are impressed by predictability.

This is, in the end, the philosophical accommodation we are making.

In the autocomplete era we agreed that intelligence could flow through a person and still belong to them. In the agentic era, we are agreeing that intelligence can flow past a person — but only inside a structure that makes the flow visible, attributable, and reversible. The agent gets autonomy. The infrastructure gets accountability. Each gives up something the other needed.

The boring interface is not a failure of imagination. It is the price.

What is still missing

If I were betting on what the next two years bring, it would be these.

Process mining for agentic workflows. Right now we design the process and then run the agents through it. Soon we will do it backwards — let the agents run, mine the actual paths they took, and let the system propose the workflow that fits what is actually happening. Bottom-up rather than top-down. The classical BPM world has done this for years; the agentic world is about to inherit it.
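
The classical technique is worth a sketch, because it is simpler than it sounds: count which step directly follows which across observed runs, and the high-frequency edges become the workflow the system proposes back to the analyst. The trace data here is made up.

```python
from collections import Counter

# Toy process mining: three observed runs of the claims process.
traces = [
    ["intake", "extract", "fraud_check", "payout"],
    ["intake", "extract", "fraud_check", "human_review", "payout"],
    ["intake", "extract", "fraud_check", "payout"],
]

# Count directly-follows transitions between steps.
edges = Counter((a, b) for t in traces for a, b in zip(t, t[1:]))

for (a, b), n in edges.most_common():
    print(f"{a} -> {b}: observed {n}x")
```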

Federation across enterprises. Today every platform has its own process registry. Tomorrow we will need cross-organisation registries — when an agent at your bank talks to an agent at my insurer, both processes need to interoperate without leaking data. The standards work has not started yet. It will.

Cost-aware autonomy. Today’s gates are static — “do not promote if cost rises by more than 20%.” Tomorrow they will be dynamic — “if you are approaching the budget envelope on this instance, switch from the expensive model to the cheap one and notify ops.” Agents that route themselves through cost-quality tradeoffs in real time. Vellum hints at this. Nobody ships it yet.
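
As a guess at what that could look like in code: a gate that routes by remaining budget rather than a static rule. Both the model names and the 25% threshold are inventions of mine.

```python
def pick_model(spent_usd: float, budget_usd: float) -> str:
    """Route the next call by remaining budget instead of a static rule."""
    remaining = budget_usd - spent_usd
    if remaining <= 0:
        raise RuntimeError("budget envelope exhausted; pause instance for ops")
    if remaining < 0.25 * budget_usd:          # approaching the envelope
        print(f"ops notified: ${remaining:.2f} left, switching models")
        return "cheap-model"
    return "expensive-model"

print(pick_model(spent_usd=3.20, budget_usd=4.00))   # -> cheap-model
```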

A genuinely shared visual language. BPMN is great at workflows but not at the agentic specifics — tool gates, knowledge pipelines, agent skills. Either BPMN gets extended (the OMG is glacial), or a sibling notation emerges. Either way, by 2028 we will have a way for an analyst at a bank, an engineer at a startup, and an auditor at a regulator to look at the same diagram and agree on what it says. We do not have that today.

Why this is worth caring about

It is tempting to read this and think: well, that is a lot of plumbing for a space that is mostly hype. Maybe that is true. Maybe agents fizzle and we go back to writing scripts.

But the thing my insurance friend said keeps coming back. We do not have an AI problem. We have a who-just-did-that problem.

That problem is not AI-specific. It is at least as old as the printing press. Every time intelligence becomes easier and cheaper to produce, the question of accountability gets harder, and a new layer of infrastructure has to be invented to answer it. Notaries. Signatures. Receipts. Audit logs. Version control. Every one of these was once a clever new thing; now they are invisible furniture.

The plumbing being built right now — process registries and tool gates and cost SLOs and trace replays — is just the next set of furniture. In ten years we will not talk about it any more than we talk about HTTPS today. It will simply be the layer that makes intelligence runnable in places where intelligence could not safely run before.

But it is the layer that takes the 11% to 60%.

Note: Diagrams in this article are simplified for clarity. Real-world implementations involve more components, more edge cases, and more subtle interactions than any single illustration can capture. The principles are the same; the wiring is messier.
