I've spent 15+ years building enterprise security infrastructure. SSO, SCIM provisioning, zero-trust networking, AI-powered threat detection. The kind of systems where a failure at 2 AM means someone's getting paged and something important is broken.
Over the past year, I've watched the same pattern repeat across engineering teams building with AI agents: the agent works great in development, passes all the evals, gets shipped to production, and then quietly starts doing things nobody expected.
Not crashing. Not throwing errors. Just... drifting.
The problem nobody talks about
Traditional monitoring tools are designed for deterministic systems. A request comes in, code executes, a response goes out. If something breaks, you get a stack trace. You know exactly what happened and where.
AI agents don't work that way. They make decisions. They chain together multiple LLM calls, pick tools, reason through multi-step workflows, and produce outputs that can vary every single time. When something goes wrong, your logs show HTTP 200 across the board. Everything looks healthy. But your agent just confidently gave a customer the wrong answer, or took an action it was never supposed to take, and you have no trace of the reasoning that led there.
I've seen this firsthand with teams running agents in production. The failure mode isn't a crash. It's a slow behavioral shift that looks like success until someone notices the downstream impact.
Why standard observability falls short
If you're running agents in production today, you're probably using some combination of application performance monitoring, log aggregation, and maybe some custom dashboards. Those tools are great at telling you that something happened. They're terrible at telling you why an agent made a specific decision.
Here's what's actually different about agent systems:
Non-deterministic execution. The same input can produce different outputs depending on model state, context window contents, and tool availability. You can't just replay a request and expect the same result.
Multi-step reasoning chains. A single user request might trigger 10 or 15 LLM calls across multiple models, each building on the output of the previous step. A subtle error in step 3 can cascade through the entire chain, and by step 12 you're looking at completely corrupted output with no obvious point of failure.
Semantic failures. The response is syntactically valid, grammatically correct, and formatted perfectly. It's also wrong. Traditional monitoring has no way to catch this because the system metrics all look normal.
Behavioral drift over time. Model updates, prompt changes, shifts in input distribution. Your agent's behavior today might be measurably different from its behavior last month, and you'd never know unless you're specifically tracking it.
What production agent monitoring actually requires
After building security systems for over a decade, I think about agent monitoring through the same lens I think about security: defense in depth. You need multiple layers, each catching a different class of failure.
Behavioral baselines and drift detection
You need to establish what "normal" looks like for your agent and continuously measure against that baseline. Not just latency and throughput, but the actual decisions the agent makes. Which tools does it call? How often does it escalate? What's the distribution of response types?
When that distribution shifts, even if every individual response looks reasonable, something has changed. You want to know about it before your users do.
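As a concrete illustration, here's a minimal drift check that compares the agent's tool-call mix this week against a baseline from last month. This is a sketch, not a production implementation: the tool names, the sample data, and the threshold are all made up, and a real system would also weight by traffic volume and track more than tool names. KL divergence is one reasonable distance measure; population stability index or chi-square tests work too.

```python
import math
from collections import Counter

def distribution(events):
    """Turn a list of tool-call names into a probability distribution."""
    counts = Counter(events)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

def kl_divergence(baseline, current, epsilon=1e-9):
    """KL divergence of the current distribution from the baseline.
    Categories unseen in one side get a tiny epsilon so the sum stays finite."""
    names = set(baseline) | set(current)
    return sum(
        current.get(n, epsilon) * math.log(current.get(n, epsilon) / baseline.get(n, epsilon))
        for n in names
    )

# Hypothetical tool-call logs: last month's traffic vs. this week's.
baseline = distribution(["search", "search", "lookup_order", "escalate"])
current = distribution(["search", "refund", "refund", "escalate"])

DRIFT_THRESHOLD = 0.5  # workload-specific; tune against historical variance
if kl_divergence(baseline, current) > DRIFT_THRESHOLD:
    print("behavioral drift detected: tool-call mix has shifted")
```

The point isn't the specific statistic. It's that you're measuring the shape of the agent's behavior, not just its uptime.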
Runtime guardrails, not just eval-time checks
Evals are important. But evals run before deployment. They test against a fixed set of scenarios. Production is where your agent encounters the inputs nobody anticipated.
Runtime guardrails enforce policies at execution time. They sit between the agent's decision and the actual action, checking whether the intended behavior falls within acceptable bounds. If an agent tries to access data it shouldn't, or takes an action that violates a business rule, the guardrail catches it before any damage is done.
This is the same principle as network firewalls. You don't just audit your code and hope for the best. You put controls at the boundary.
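In code, a guardrail is just a policy check that runs between the agent's intended action and the tool that would execute it. The sketch below uses invented policy names and an invented action shape; a real enforcement layer would pull policies from config, log every block, and support allow/deny/escalate outcomes rather than a hard exception.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Action:
    tool: str
    args: dict = field(default_factory=dict)

class GuardrailViolation(Exception):
    pass

# Policies are plain predicates over the intended action: True = allowed.
def no_database_deletes(action: Action) -> bool:
    return not (action.tool == "database" and action.args.get("op") == "delete")

def refund_cap(action: Action) -> bool:
    return not (action.tool == "issue_refund" and action.args.get("amount", 0) > 500)

POLICIES: list[Callable[[Action], bool]] = [no_database_deletes, refund_cap]

def execute(action: Action, tool_impls: dict):
    """Check every policy before the tool actually runs, not after."""
    for policy in POLICIES:
        if not policy(action):
            raise GuardrailViolation(f"{policy.__name__} blocked {action.tool}")
    return tool_impls[action.tool](**action.args)
```

The key design choice is that `execute` is the only path to a tool. If the agent can call tools directly, the guardrail is decoration, not a boundary.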
Cryptographic audit trails
In regulated environments, "we have logs" isn't good enough. Auditors want to know that logs haven't been tampered with. They want a chain of custody for every decision an agent made.
A cryptographic audit trail hashes each entry in a chain so that any modification to a historical record is detectable. This isn't just a compliance checkbox. It's the difference between "we think the agent did X" and "we can prove the agent did X, and nobody has altered that record."
Compliance reporting that doesn't require a separate workstream
If your agents handle any kind of sensitive data (and they probably do), you're going to face compliance requirements. SOC 2, HIPAA, GDPR, PCI DSS, SOX. Each one has specific requirements around data handling, access controls, and audit capabilities.
Most teams treat compliance as a separate project that runs parallel to engineering. You build the thing, then you spend weeks generating evidence for auditors. That approach doesn't scale when you're shipping agents continuously.
The better approach is building compliance into the operational layer so that the evidence is generated automatically as a byproduct of normal operations.
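One way to make that concrete: tag operational events with the compliance controls they evidence at the moment they're logged, so an auditor's report is a query rather than a project. The event types and control IDs below are illustrative only, not a real control mapping.

```python
# Illustrative mapping from operational event types to the controls
# they serve as evidence for. A real mapping comes from your auditors.
CONTROL_TAGS = {
    "access_check": ["SOC2-CC6.1"],
    "data_export": ["GDPR-Art30", "SOC2-CC6.7"],
}

def log_with_controls(event_type: str, detail: dict, sink: list):
    """Record an event once; compliance evidence falls out as a byproduct."""
    sink.append({
        "event": event_type,
        "detail": detail,
        "controls": CONTROL_TAGS.get(event_type, []),
    })

def evidence_for(control_id: str, sink: list):
    """The auditor-facing view: every event that evidences one control."""
    return [rec for rec in sink if control_id in rec["controls"]]
```

Same log, two consumers: engineers debug from `event` and `detail`, auditors filter by `controls`.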
The SDK approach vs. the platform migration trap
One thing I feel strongly about: you shouldn't have to rip out your existing agent stack to get monitoring and governance.
A lot of tools in this space require you to build on their orchestration layer, their workflow engine, their runtime. That creates vendor lock-in and forces a migration that most teams can't justify, especially when they already have agents running in production.
The alternative is an SDK-based approach. You keep your existing agent framework, whatever it is. You add instrumentation with a few lines of code. The SDK captures the telemetry, enforces the guardrails, and streams everything to a monitoring layer without changing how your agent actually works.
This matters because the teams that need monitoring most urgently are the ones that already have agents in production. They can't afford to stop and rebuild.
What to look for
If you're evaluating agent monitoring solutions (or building your own), here's the shortlist of capabilities I'd prioritize:
Real-time behavioral monitoring with drift detection that goes beyond latency and error rates into semantic quality tracking.
Runtime guardrails that can intercept and block agent actions before they execute, not just flag them after the fact.
Immutable audit trails with cryptographic integrity so you can prove what happened, not just assert it.
Multi-environment support so you can track agent behavior across development, staging, and production with promotion workflows.
Framework-agnostic integration that doesn't force you to migrate your existing orchestration.
Automated compliance reporting for whatever regulatory frameworks apply to your business.
The gap is closing, but it's still wide
The AI observability space has matured significantly. There are good tools for tracing, for evaluation, for cost tracking. But there's still a gap between "observability" and "operational governance." Knowing what your agent did is useful. Being able to prevent it from doing the wrong thing in real time, proving that to an auditor, and doing it all without rebuilding your stack... that's the harder problem.
I built NodeLoom because I kept hitting this gap myself. But regardless of what tool you use, the important thing is that you're thinking about this problem before your first production incident forces you to.
Your agents are making decisions on behalf of your business. You should be able to see those decisions, control them, and prove what happened. That's not optional once you're in production. It's the baseline.
I'm Reda Zerrad. I've been building enterprise security systems for 15+ years and I'm now focused on AI agent governance. If you're running agents in production and want to talk about monitoring challenges, I'd love to hear what you're dealing with. Find me on LinkedIn or check out what I'm building at nodeloom.io.