Nick Talwar

Posted on May 19

6 Things Your AI Agents Need That You're Probably Not Building

#ai #aiagents #enterprisetechnology #softwareengineering

The infrastructure that separates agents that demo well from agents that actually run

You would never bring a new hire onto your team without performance feedback, escalation paths, or a way to know when they're struggling. Yet that's exactly how most organizations deploy AI agents. MIT Sloan and BCG's 2025 research found that 76% of executives now describe agents as coworkers rather than tools, but almost none of them are managing agents that way. They ship the agent and move on.

Deciding to call your agents “coworkers” is easy. Setting up the feedback loops, escalation paths, and failure signals that actually let you manage an agent like one is where teams stall. It's almost entirely an infrastructure problem, and these are the six pieces most teams skip.

1. Evaluation Frameworks

A working agent and a reliable agent are two different things. Evaluation frameworks give you the ability to measure the difference before your users discover it for you. This means building structured test suites that run against your agent's outputs on a regular cadence, scoring for accuracy, relevance, and task completion across a range of realistic scenarios.

Good evaluation suites include both deterministic checks (did the agent call the right tool with the right parameters?) and judgment-based scoring (was the response actually useful to the person asking?).
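As a rough illustration of those two kinds of check, here is a minimal sketch in Python. Everything in it is hypothetical: `AgentResult`, `EvalCase`, `fake_agent`, and `fake_judge` are stand-ins invented for the example, and in a real suite the judge would be a model-graded or human-graded scorer rather than a one-line stub.

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    """What a (hypothetical) agent returned for one test prompt."""
    tool_name: str
    tool_args: dict
    response: str


@dataclass
class EvalCase:
    prompt: str
    expected_tool: str
    expected_args: dict


def deterministic_check(case, result):
    """Pass only if the agent called the expected tool with the expected arguments."""
    return result.tool_name == case.expected_tool and result.tool_args == case.expected_args


def run_suite(cases, agent, judge, min_judge_score=0.7):
    """Run every case and report pass rates for both kinds of check."""
    if not cases:
        raise ValueError("evaluation suite is empty")
    tool_passes = 0
    judge_passes = 0
    for case in cases:
        result = agent(case.prompt)
        if deterministic_check(case, result):
            tool_passes += 1
        # Judgment-based scoring: the judge returns a usefulness score in [0, 1].
        if judge(case.prompt, result.response) >= min_judge_score:
            judge_passes += 1
    total = len(cases)
    return {"tool_accuracy": tool_passes / total, "judge_pass_rate": judge_passes / total}


# Stand-ins for a real agent and a real judge (assumptions for the demo only).
def fake_agent(prompt):
    if "refund" in prompt:
        return AgentResult("issue_refund", {"order_id": "A1"}, "Refund issued for order A1.")
    return AgentResult("search_docs", {"query": prompt}, "")


def fake_judge(prompt, response):
    return 1.0 if response else 0.0


cases = [
    EvalCase("refund order A1", "issue_refund", {"order_id": "A1"}),
    EvalCase("reset password", "search_docs", {"query": "reset password"}),
]
report = run_suite(cases, fake_agent, fake_judge)
```

Reporting the two rates separately is a deliberate choice in this sketch: an agent can call the right tool every time and still give answers nobody finds useful, and a single blended score would hide that.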

The key is that evaluation has to be continuous, running in CI/CD pipelines and against live traffic, because agent behavior shifts as underlying models update and data distributions change. The LLMs that underpin agents are probabilistic at their core: their outputs are drawn from an often opaque statistical distribution, and that distribution can shift over time, taking performance and accuracy with it.

Anthropic's engineering team has written publicly about maintaining evaluation suites as living artifacts, with dedicated teams owning the infrastructure while domain experts contribute tasks and run the tests themselves.

2. Fallback and Escalation Logic

Every agent will encounter situations it cannot handle. The question is whether you've decided in advance what happens next, or whether the agent improvises.

Fallback logic defines the boundaries. When confidence drops below a threshold, when a tool call returns unexpected data, when the task exceeds the agent's defined scope, the system needs a predetermined path. That path might route to a simpler deterministic process, a different model, or a human operator. Escalation logic layers on top of that by adding severity awareness.

Without explicit escalation tiers, every failure gets the same treatment, which means either everything gets flagged (and humans stop paying attention) or nothing does (and real problems slip through). The organizations successfully scaling agents build these paths before deployment, treating them as load-bearing architecture.
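One way to picture a predetermined path with severity tiers is a small routing function. The sketch below is an assumption-laden example, not a prescribed design: the `Route` and `Severity` names, the 0.8 confidence threshold, and the mapping from severity to destination are all made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PROCEED = "proceed"
    DETERMINISTIC_FALLBACK = "deterministic_fallback"
    HUMAN_REVIEW = "human_review"
    PAGE_ON_CALL = "page_on_call"


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class StepOutcome:
    confidence: float      # scored confidence for this step, 0..1
    tool_data_valid: bool  # did the tool response look the way we expected?
    in_scope: bool         # is the task inside the agent's defined scope?
    severity: Severity     # how bad it is if this step goes wrong


def route(outcome, min_confidence=0.8):
    """Pick the predetermined path for a step; thresholds here are illustrative."""
    failed = (
        outcome.confidence < min_confidence
        or not outcome.tool_data_valid
        or not outcome.in_scope
    )
    if not failed:
        return Route.PROCEED
    # Escalation tiers: the same failure is handled differently by severity.
    if outcome.severity is Severity.HIGH:
        return Route.PAGE_ON_CALL
    if outcome.severity is Severity.MEDIUM:
        return Route.HUMAN_REVIEW
    return Route.DETERMINISTIC_FALLBACK


low_confidence_refund = StepOutcome(0.55, True, True, Severity.HIGH)
decision = route(low_confidence_refund)
```

Because the tiers are plain data and a pure function, they can be reviewed and tested before the agent ever ships, which is the point of deciding in advance.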

3. Monitoring for Drift

AI agents degrade quietly. Model updates, shifts in input data, changes to upstream APIs, and seasonal variation in user behavior can each erode agent performance without triggering a single error.

Drift monitoring tracks the gap between how your agent performed when you validated it and how it performs now. This includes statistical monitoring of output distributions, latency tracking across individual tool calls, and automated quality scoring against baseline benchmarks. In practice, effective drift detection requires capturing baseline metrics during your evaluation phase and then running the same scoring pipeline against production traffic on an ongoing basis. When scores diverge from your baseline by more than an acceptable margin, you have a concrete signal to investigate rather than a vague feeling that things seem off.
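The baseline-versus-production comparison can be sketched in a few lines of Python. This is a deliberately simple version that compares mean quality scores; the function name `detect_drift`, the 0.05 margin, and the sample scores are assumptions for the example, and a production setup would likely also look at the shape of the distribution rather than the mean alone.

```python
from statistics import mean


def detect_drift(baseline_scores, production_scores, max_gap=0.05):
    """Compare mean quality scores and flag a gap larger than the accepted margin."""
    if not baseline_scores or not production_scores:
        raise ValueError("both score lists must be non-empty")
    baseline_mean = mean(baseline_scores)
    production_mean = mean(production_scores)
    delta = production_mean - baseline_mean  # negative means quality dropped
    return {
        "baseline_mean": baseline_mean,
        "production_mean": production_mean,
        "delta": delta,
        "drifted": abs(delta) > max_gap,
    }


# Scores captured during the evaluation phase (made-up numbers).
baseline = [0.92, 0.88, 0.90, 0.90]
# Scores from the same scoring pipeline run against recent production traffic.
this_week = [0.81, 0.79, 0.80, 0.80]

report = detect_drift(baseline, this_week)
if report["drifted"]:
    print(f"Quality moved by {report['delta']:+.2f} against baseline; investigate.")
```

The important property is that both lists come from the same scoring pipeline, so a gap reflects a change in the agent or its inputs rather than a change in how you measure.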

4. Human-in-the-Loop Checkpoints

Full autonomy sounds efficient until you realize what it costs when the agent is wrong. Human-in-the-loop checkpoints create structured moments where a person reviews, approves, or redirects agent output before it reaches the end user or triggers a downstream action.

The design challenge is placement. Too many checkpoints and you've built an expensive autocomplete system. Too few and you've handed off accountability to a system that can't actually hold it. The right approach maps checkpoints to consequence.

Low-risk, reversible actions can run autonomously. High-stakes decisions, anything involving money, legal exposure, or customer-facing commitments, need a human gate. As agents take on more complex workflows, these checkpoints also become your training data pipeline. Every human correction is a signal about where the agent needs improvement, but only if you're logging it (which brings us to the next point).
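Mapping checkpoints to consequence might look something like the following sketch. The tag names, the `Action` class, and the rule that irreversible actions always need a gate are assumptions chosen for illustration; the in-memory `corrections_log` stands in for whatever durable store you would actually use.

```python
from dataclasses import dataclass, field

# Categories treated as high-stakes in this sketch (illustrative names).
HIGH_STAKES_TAGS = {"money", "legal", "customer_commitment"}


@dataclass
class Action:
    name: str
    reversible: bool
    tags: set = field(default_factory=set)


def needs_human_gate(action):
    """Gate anything irreversible or tagged high-stakes; let the rest run autonomously."""
    return (not action.reversible) or bool(action.tags & HIGH_STAKES_TAGS)


corrections_log = []  # in-memory stand-in for a durable log


def record_review(action, proposed, final):
    """Log the reviewer's decision so corrections can feed later evaluation and tuning."""
    entry = {
        "action": action.name,
        "proposed": proposed,
        "final": final,
        "corrected": proposed != final,
    }
    corrections_log.append(entry)
    return entry


refund = Action("issue_refund", reversible=False, tags={"money"})
draft = Action("draft_internal_note", reversible=True)

if needs_human_gate(refund):
    # A reviewer lowers the amount the agent proposed; the edit is the signal.
    record_review(refund, proposed={"amount": 120}, final={"amount": 100})
```

Recording the proposed and final versions side by side is what turns a review step into a source of improvement data rather than just a delay.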

5. Logging for Auditability

When an agent makes a decision, you need to be able to reconstruct exactly how it got there. Full execution logging captures the chain of reasoning, tool invocations, retrieved context, intermediate outputs, and final actions across every run.

This serves three purposes simultaneously:

First, debugging. When something goes wrong, you need the trace, not a guess.

Second, compliance. Regulated industries require demonstrable decision trails, and even unregulated ones are moving in that direction.

Third, improvement. Logged executions become the dataset you use to identify failure patterns, tune prompts, and build better evaluation suites.

The tooling for this has matured significantly. OpenTelemetry-based tracing, structured span capture, and production replay capabilities now exist across multiple frameworks. The infrastructure cost is low relative to the cost of operating an agent you cannot inspect.
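To make the idea of structured span capture concrete, here is a minimal sketch that uses only the Python standard library. It is not the OpenTelemetry API; `RunTrace` and its field names are hypothetical, and a real deployment would more likely export spans through an established tracing library than through a hand-rolled class like this.

```python
import json
import time
import uuid
from contextlib import contextmanager


class RunTrace:
    """Collects one structured record per step of a single agent run."""

    def __init__(self, run_id=None):
        self.run_id = run_id or str(uuid.uuid4())
        self.spans = []

    @contextmanager
    def span(self, kind, name, **attributes):
        record = {
            "run_id": self.run_id,
            "kind": kind,              # e.g. "retrieval", "tool_call", "final_action"
            "name": name,
            "attributes": attributes,  # inputs to the step
            "start": time.time(),
            "status": "ok",
        }
        try:
            yield record               # the caller can attach outputs to the record
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            record["end"] = time.time()
            self.spans.append(record)

    def to_jsonl(self):
        """Serialize the run as JSON lines, one span per line."""
        return "\n".join(json.dumps(span, default=str) for span in self.spans)


trace = RunTrace(run_id="demo-run")
with trace.span("retrieval", "fetch_policy_docs", query="refund policy") as step:
    step["output"] = ["doc-12", "doc-40"]
with trace.span("tool_call", "issue_refund", order_id="A1") as step:
    step["output"] = {"refunded": True}
```

Each record carries the run ID, the inputs, the outputs, and the status, so a single run can be reconstructed step by step or filtered for failures later.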

6. A Defined Handoff Protocol

Agents rarely operate in isolation. They pass work to other agents, to human operators, to downstream systems, and occasionally back to the user. Every one of those transitions is a potential failure point.

A handoff protocol specifies what information transfers with the task, what context the receiving party needs, what constitutes a successful handoff versus a dropped one, and who owns the outcome after the transition.

This gets more complex in multi-agent systems where one agent's output becomes another agent's input. If the first agent summarizes a customer issue and strips out a critical detail before passing it along, the second agent makes a decision on incomplete information. Neither agent has failed individually, but the system has failed completely.

Without this kind of structural clarity, you get the agent equivalent of a game of telephone. Context gets lost between steps, responsibilities blur, and when something fails mid-workflow, nobody can pinpoint where.
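A handoff protocol can start as something as small as a typed envelope plus a validation step. The sketch below is one possible shape under stated assumptions: the `Handoff` class, the required context fields, and the rule that the post-handoff owner must be the sender or the receiver are all invented for the example.

```python
from dataclasses import dataclass, field

# Context the receiver must get in this sketch (illustrative field names).
# Carrying the original message next to the summary guards against a summary
# that silently drops a critical detail.
REQUIRED_CONTEXT = ("customer_id", "issue_summary", "original_message")


@dataclass
class Handoff:
    task_id: str
    sender: str
    receiver: str
    owner_after: str  # who owns the outcome once the handoff completes
    context: dict = field(default_factory=dict)


def validate_handoff(handoff, required=REQUIRED_CONTEXT):
    """Return a list of problems; an empty list means the handoff is acceptable."""
    problems = [
        f"missing context field: {key}"
        for key in required
        if not handoff.context.get(key)
    ]
    if handoff.owner_after not in (handoff.sender, handoff.receiver):
        problems.append("owner_after must be the sender or the receiver")
    return problems


incomplete = Handoff(
    task_id="T-1",
    sender="triage_agent",
    receiver="billing_agent",
    owner_after="billing_agent",
    context={"customer_id": "C-9", "issue_summary": "Customer was double charged."},
)
problems = validate_handoff(incomplete)  # flags the missing original message
```

Rejecting the handoff at the boundary gives you a specific place where the failure was detected, instead of a wrong decision several steps downstream.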

The Management Layer You Can't Skip

These six elements share a common thread. They're all infrastructure that exists to manage the agent after it's built.

The agent itself (the model, the prompts, the tool integrations) is maybe 40% of what a production deployment actually requires. The other 60% is the system that keeps the agent honest, visible, and recoverable when things go sideways.

Organizations that treat agent deployment as a build-and-ship exercise will spend the next six months doing manual cleanup on failures they could have prevented. The ones that invest in this management layer first will find that their agents get better over time instead of quietly getting worse.

The technology is mature enough. The question is whether your operational infrastructure is ready to match it.

…

Nick Talwar is a CTO, ex-Microsoft, and a hands-on AI engineer who supports executives in navigating AI adoption. He shares insights on AI-first strategies to drive bottom-line impact.


Top comments (1)

Harjot Singh • May 31

Things your agents need that you're probably not building is the right lens, because the gap between an agent demo and an agent product is almost entirely this unglamorous list, the stuff that's invisible on the happy path and load-bearing everywhere else. The ones I'd bet are on your list (and that teams consistently skip): observability so you can debug an opaque chain of decisions, validation between steps so a bad output doesn't silently poison the next, bounded loops and a kill switch so a confused agent can't burn budget forever, durable/resumable state so a crash doesn't restart from zero, idempotent side effects so retries don't double-charge, and human gates on the irreversible actions. None of these make the agent smarter, all of them make it trustworthy, and trustworthy is what ships. The throughline is that you're building the harness around the model, the model is the easy part now, and the reason people don't build these is they don't show up until production bites, by which point it's an incident, not a backlog item. Build the boring 80% on purpose, before it's an outage. That the-unbuilt-stuff-is-the-real-product instinct is core to how I think about Moonshift. Of your six, which one do you find teams resist building until it burns them, the validation layer, or the state/recovery?