How I made AI agents safe to run on real infrastructure

#agents #ai #infrastructure #security


Everyone can get an LLM agent to do something impressive in a demo. Far fewer can get one to act on live infrastructure — your machines, terminals, files, deployments — without occasionally doing something catastrophic.

That gap is the whole problem. And it's not a model problem. The model is the easy part now. The hard part is making an agent's actions trustworthy enough that you'd let it run unattended against systems that matter.

I built Cmdop around exactly this problem. Here's the architecture and, more importantly, the reliability loop that turned it from "a demo that works once" into a runtime I actually trust.

The setup

Cmdop is an agent → gRPC → server → SDK platform. Agents act on remote machines through a multi-agent runtime: they hand off to each other, call tools, and operate under human-in-the-loop control where it matters. Developers integrate the primitives directly through Node.js, Python, and React SDKs, and the whole thing runs thousands of concurrent agent sessions over persistent gRPC/WebSocket streams.

None of that is the interesting part. Plenty of systems can route an LLM's output to a shell. The interesting part is what happens around every single tool call.

Why "it looked right" isn't good enough

A plausible-looking action is the most dangerous thing an agent produces. rm -rf ./logs and rm -rf /logs differ by a single character: the first deletes a logs directory under the current working directory, the second deletes /logs at the filesystem root. An agent that's "usually right" is, on real infrastructure, a system that will eventually, and confidently, take down production.

So I stopped treating agent output as something to execute and started treating it as something to verify, score, and constrain — before, during, and after execution.
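
To make "constrain" concrete, here is a minimal Python sketch of one such check: resolve the path an agent wants to touch and confirm it stays inside an allowed root. This is an illustration, not Cmdop's actual code. The is_within helper and the /srv/app root are hypothetical names, and the sketch assumes POSIX-style paths and Python 3.9+.

```python
from pathlib import Path


def is_within(root: Path, target: str, cwd: Path) -> bool:
    """Return True if `target`, resolved relative to `cwd`, stays inside `root`.

    An absolute `target` replaces `cwd` when joined, so "/logs" is checked
    as the filesystem-root path it really is. resolve() also collapses ".."
    segments and follows any symlinks that exist on disk.
    """
    resolved = (cwd / target).resolve()
    return resolved.is_relative_to(root.resolve())


if __name__ == "__main__":
    root = cwd = Path("/srv/app")  # hypothetical sandbox root
    for candidate in ("./logs", "/logs", "../etc"):
        verdict = "allow" if is_within(root, candidate, cwd) else "reject"
        print(f"{candidate!r}: {verdict}")
```

With the root and working directory both set to /srv/app, ./logs is allowed while /logs and ../etc are rejected. A string comparison on the raw command would not reliably tell them apart.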

The eval / instrumentation loop

Every tool call in Cmdop runs through the same loop:

1. Structured-output contract, validated before execution. The agent doesn't emit free text that I parse hopefully. It emits a structured contract (intended action, parameters, expected effect), and that contract is validated against a schema before anything runs. Malformed or out-of-policy calls never reach the system.
2. Full trace, logged. Prompt, tool call, result, latency, and which retry/failover path it took are all captured. You cannot improve what you cannot see, and agent failures are subtle: the run "succeeded" but did the wrong thing.
3. Scored on the axes that matter. Not "did it return 200." Each run is scored on tool-call validity, task success, and, the one everyone forgets, unintended side-effects. The side-effect score is what catches the agent that completed the task and deleted something it shouldn't have.
4. Guardrails + automatic retry/failover where evals expose brittleness. When the eval data showed a step was fragile, I didn't just log it; I added a structured guardrail and an automatic retry/failover path. Brittle steps became contained steps.
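
The four steps above can be sketched in a few dozen lines of Python. Treat this as a simplified illustration under stated assumptions rather than Cmdop's implementation. The Action enum, the Trace fields, the snapshot callback (which returns the set of paths that currently exist), and the choice to retry only on TimeoutError are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional, Set


class Action(Enum):
    READ_FILE = "read_file"
    DELETE_PATH = "delete_path"


REQUIRED_PARAMS = {Action.READ_FILE: {"path"}, Action.DELETE_PATH: {"path"}}


@dataclass(frozen=True)
class ToolCall:
    """Structured contract the agent emits instead of free text."""
    action: Action
    params: dict
    expected_effect: str


@dataclass
class Trace:
    """One record per tool call: the request, what happened, how it scored."""
    raw: dict
    attempts: int = 0
    latency_s: float = 0.0
    result: object = None
    error: Optional[str] = None
    scores: dict = field(default_factory=dict)


def validate(raw: dict) -> ToolCall:
    """Step 1: reject malformed or unknown calls before anything executes."""
    action = Action(raw.get("action"))  # ValueError on unknown/missing action
    params = raw.get("params", {})
    missing = REQUIRED_PARAMS[action] - set(params)
    if missing:
        raise ValueError(f"missing params: {sorted(missing)}")
    return ToolCall(action, params, raw.get("expected_effect", ""))


def run(raw: dict,
        execute: Callable[[ToolCall], object],
        snapshot: Callable[[], Set[str]],
        max_attempts: int = 3) -> Trace:
    trace = Trace(raw)  # step 2: every call gets a trace, even rejected ones
    try:
        call = validate(raw)
    except ValueError as exc:
        trace.error = f"rejected: {exc}"
        trace.scores = {"valid": False, "success": False,
                        "unintended_side_effects": []}
        return trace

    before = snapshot()
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):  # step 4: bounded retry
        trace.attempts = attempt
        try:
            trace.result = execute(call)
            trace.error = None
            break
        except TimeoutError as exc:  # assumed transient; other errors propagate
            trace.error = str(exc)
    trace.latency_s = time.monotonic() - start

    # Step 3: score side-effects as paths that changed but were not declared.
    touched = before ^ snapshot()
    declared = ({call.params["path"]}
                if call.action is Action.DELETE_PATH else set())
    trace.scores = {
        "valid": True,
        "success": trace.error is None,
        "unintended_side_effects": sorted(touched - declared),
    }
    return trace
```

A delete that removes its declared target plus one extra path comes back with success set to True and that extra path listed under unintended_side_effects. That is exactly the "completed the task and deleted something it shouldn't have" case. Note that the retry is only safe here if the executed action is idempotent.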

What the loop actually bought me

The payoff wasn't a dashboard. It was autonomy I could widen safely.

Before the loop, every meaningfully risky action needed a human in front of it, because I had no principled way to know which actions were safe to let run. After the loop, I had data: which tool calls were reliably valid, which tasks completed cleanly, which steps produced side-effects. That let me move actions from "human-in-the-loop required" to "autonomous" one measured step at a time, instead of guessing.
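
Here is a rough Python sketch of what "one measured step at a time" can look like as a rule over scored runs. The RunRecord shape, the autonomy_decisions function, and the thresholds (at least 200 runs, 99% clean, zero observed side-effects) are illustrative assumptions, not the numbers Cmdop uses.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class RunRecord:
    action: str        # which tool call this run exercised
    valid: bool        # contract passed validation
    success: bool      # task completed
    side_effects: int  # count of unintended side-effects observed


def autonomy_decisions(records: Iterable[RunRecord],
                       min_runs: int = 200,
                       min_clean_rate: float = 0.99) -> Dict[str, str]:
    """Promote an action to autonomous only when the data supports it."""
    by_action = defaultdict(list)
    for record in records:
        by_action[record.action].append(record)

    decisions = {}
    for action, runs in by_action.items():
        clean = sum(r.valid and r.success and r.side_effects == 0 for r in runs)
        enough_data = len(runs) >= min_runs
        clean_enough = clean / len(runs) >= min_clean_rate
        no_side_effects = all(r.side_effects == 0 for r in runs)
        safe = enough_data and clean_enough and no_side_effects
        decisions[action] = "autonomous" if safe else "human_in_the_loop"
    return decisions
```

The default is the conservative one: an action with too little data, or with even a single recorded side-effect, stays behind a human until the numbers change.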

That's the real lesson, and it generalizes far beyond Cmdop: agent autonomy is earned through evaluation, not granted by confidence. The teams shipping agents into production aren't the ones with the best prompts — they're the ones who instrumented and scored agent behavior until they knew, with data, where autonomy was safe.

Things I'd tell anyone building agents for production

Make tool calls structured and typed. Enums over free text. Machine-readable errors that tell the agent how to recover, not just that it failed. An agent recovers from a typed error; it flails on a stack trace.
Score side-effects, not just success. "The task completed" and "the task completed without breaking anything else" are different measurements. Only the second one keeps you in business.
Design for idempotency and retries. If an action can safely run twice, the agent can retry safely. If it can't, you've built a system where a network blip becomes data loss.
Treat docs and tools as an API for agents. (This is also why I built DjangoCFG with an MCP server — so coding agents query a framework's real capabilities and schemas directly instead of scraping prose.)
Keep the human in the loop until the data says you can remove them. Then remove them one step at a time.
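
Two of these points, typed errors and idempotent retries, are easy to show in code. The Python sketch below is a minimal illustration with hypothetical names (ErrorCode, Recovery, ToolError, IdempotentExecutor). It keeps idempotency keys in memory, whereas a real runtime would presumably persist them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class ErrorCode(Enum):
    PATH_NOT_FOUND = "path_not_found"
    PERMISSION_DENIED = "permission_denied"
    TRANSIENT = "transient"


class Recovery(Enum):
    RETRY = "retry"
    CHOOSE_OTHER_TARGET = "choose_other_target"
    ASK_HUMAN = "ask_human"


@dataclass(frozen=True)
class ToolError:
    """Machine-readable error: what failed and how the agent should recover."""
    code: ErrorCode
    recovery: Recovery
    detail: str


class IdempotentExecutor:
    """Runs an action at most once per key; repeats return the stored result."""

    def __init__(self) -> None:
        self._results: Dict[str, object] = {}

    def run(self, key: str, action: Callable[[], object]) -> object:
        # If the action raises, nothing is stored, so a retry runs it again.
        if key not in self._results:
            self._results[key] = action()
        return self._results[key]
```

If the agent retries with the same key after a network blip, the action does not run twice; the executor hands back the first result. A ToolError carrying Recovery.RETRY or Recovery.ASK_HUMAN gives the agent something to branch on instead of a stack trace to guess at.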

Where this is going

The interesting frontier in agentic AI right now isn't bigger models — it's the reliability layer: evals, guardrails, observability, the harness around the agent. That's the part that decides whether agents stay demos or become infrastructure. It's also, conveniently, the part I find most interesting to build.

If you're working on agent reliability, agent platforms, or making AI safe to run against real systems — I'd love to compare notes.

I'm Mark K. (Igor Korotin), a Principal Product Architect / Technical CPO building applied-AI platforms. More at cmdop.com and djangocfg.com, code at github.com/markolofsen.