milan

Posted on Jun 23

What Happens When Your AI Agent Gets Stuck in Production?

#ai #machinelearning #devops #opensource

The most expensive AI agent failures I've seen weren't model failures.

They were silent failures.

The agent looked healthy. The workflow was still running. Tokens were still being consumed.

But the agent had already stopped making meaningful progress.

Over time I ran into the same production issues repeatedly:

Infinite loops
Retry storms
Silent stalls
Tool failures hidden behind successful responses
Agents drifting away from the original goal
No visibility into what the agent was actually doing

A better prompt never fixed these problems.

The solution ended up being a runtime supervision layer around the agents rather than more workflow logic.

The Problem

Most agent frameworks focus on getting agents to run.

Production teams care about different questions:

Why is this execution stuck?
Is it still making progress?
Can I safely pause it?
Can I resume it later?
Should I terminate it entirely?

Those questions become difficult to answer when the runtime only exposes logs.

Runtime Supervision

One design decision that worked well was separating supervision from agent logic.

Instead of embedding every guardrail directly inside the workflow graph, a dedicated runtime layer observes execution and enforces operational rules.

This keeps agent workflows simple while allowing supervision logic to evolve independently.

The runtime is responsible for:

Loop detection
Retry management
Budget enforcement
Pause and resume operations
Execution checkpoints
Stop reason classification
Live telemetry

The result is a system where operational concerns can change without requiring modifications to agent behavior.
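A minimal sketch of that separation, assuming the agent exposes a single step function and the runtime owns the loop. The names `supervise`, `Rule`, and `max_steps`, the "COMPLETED" status, and the choice to treat a step limit as a budget are illustrative assumptions, not part of any particular framework.

```python
from typing import Callable, Iterable, Optional

# A rule inspects the run context and returns a stop reason, or None to continue.
Rule = Callable[[dict], Optional[str]]


def max_steps(limit: int) -> Rule:
    """Hypothetical budget rule: stop once the run has used `limit` steps."""
    return lambda ctx: "BUDGET_EXCEEDED" if ctx["steps"] >= limit else None


def supervise(step: Callable[[dict], bool], rules: Iterable[Rule]) -> str:
    """Run `step` until it reports completion or a rule stops the run.

    The agent's step function knows nothing about the rules, so rules can
    be added or changed without touching agent logic.
    """
    ctx = {"steps": 0}
    active_rules = list(rules)
    while True:
        for rule in active_rules:
            reason = rule(ctx)
            if reason is not None:
                return reason
        done = step(ctx)
        ctx["steps"] += 1
        if done:
            return "COMPLETED"  # assumed status name, not from the list above
```

Calling `supervise(agent_step, [max_steps(50)])` returns either "COMPLETED" or the reason the runtime stopped the run.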

Explicit Stop Reasons

One lesson I learned quickly:

"Failed" is not a useful status.

Execution stops should explain themselves.

Examples:

LOOP_DETECTED
BUDGET_EXCEEDED
RETRY_LIMIT_REACHED
TOOL_FAILURE
TIMEOUT
USER_PAUSED
USER_KILLED

The recovery path depends on why the execution stopped.

Without that information operators are forced to guess.
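One way to make this concrete is an enum of stop reasons paired with a recovery policy. The reason names come from the list above; the recovery actions in `RECOVERY` and the `recovery_for` helper are hypothetical placeholders that a team would replace with its own policy.

```python
from enum import Enum


class StopReason(Enum):
    LOOP_DETECTED = "LOOP_DETECTED"
    BUDGET_EXCEEDED = "BUDGET_EXCEEDED"
    RETRY_LIMIT_REACHED = "RETRY_LIMIT_REACHED"
    TOOL_FAILURE = "TOOL_FAILURE"
    TIMEOUT = "TIMEOUT"
    USER_PAUSED = "USER_PAUSED"
    USER_KILLED = "USER_KILLED"


# Hypothetical recovery policy: the actions are illustrative assumptions.
RECOVERY = {
    StopReason.LOOP_DETECTED: "hand off to a human with the recent trajectory",
    StopReason.BUDGET_EXCEEDED: "ask an operator to approve more budget",
    StopReason.RETRY_LIMIT_REACHED: "wait for the dependency, then resume",
    StopReason.TOOL_FAILURE: "inspect the failing tool call",
    StopReason.TIMEOUT: "resume from the last checkpoint",
    StopReason.USER_PAUSED: "resume when the operator is ready",
    StopReason.USER_KILLED: "no recovery; start a new execution",
}


def recovery_for(reason: StopReason) -> str:
    """Look up the next operator action for a stop reason."""
    return RECOVERY[reason]
```

Keeping the mapping exhaustive means a newly added stop reason cannot ship without a recovery path.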

Semantic Loop Detection

Most loop detection implementations I've seen rely on step counts.

The problem is that agents can make progress on the wrong objective without technically looping.

An execution might spend twenty steps confidently pursuing a plan that diverged from the original goal.

What worked better was periodically asking:

"Are we meaningfully closer to the goal than we were several steps ago?"

This catches drift before it becomes expensive.
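A rough sketch of that periodic check, assuming there is some way to score how close the current state is to the goal. The `ProgressMonitor` class, the `score_fn` callback, and the `window` and `min_gain` parameters are all hypothetical; in practice the score might come from a model-based judge or a task-specific metric.

```python
from collections import deque
from typing import Callable, Deque


class ProgressMonitor:
    """Flags a run when its goal score has not improved over a window of steps."""

    def __init__(self, score_fn: Callable[[str], float],
                 window: int = 5, min_gain: float = 0.05) -> None:
        self.score_fn = score_fn  # assumed: maps agent state to a goal score
        self.window = window
        self.min_gain = min_gain
        # Keep window + 1 scores so scores[0] is the score `window` steps ago.
        self.scores: Deque[float] = deque(maxlen=window + 1)

    def record(self, state: str) -> bool:
        """Record one step; return True if the run looks stalled or drifting."""
        self.scores.append(self.score_fn(state))
        if len(self.scores) <= self.window:
            return False  # not enough history to compare yet
        return self.scores[-1] - self.scores[0] < self.min_gain
```

The check compares against the score from several steps back rather than the previous step, so a run that makes small but steady gains is not flagged. The hard part, defining `score_fn`, is left open here.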

Pause vs Kill

These are not the same operation.

Pause

Pause preserves execution state.

Execution stops, but the runtime keeps the latest checkpoint.

Resume simply loads the last committed state and continues.

Kill

Kill terminates execution completely.

Active state is removed and the execution cannot continue.

The distinction becomes important when debugging long-running workflows.
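The difference can be sketched with a small in-memory runtime. This is an illustration under simplifying assumptions: the `Runtime` and `Execution` names are hypothetical, and a real runtime would persist checkpoints rather than hold them in a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Execution:
    status: str = "RUNNING"
    checkpoint: Optional[dict] = None  # last committed state, if any


class Runtime:
    """In-memory stand-in for a runtime that supports pause, resume, and kill."""

    def __init__(self) -> None:
        self.runs: Dict[str, Execution] = {}

    def start(self, run_id: str) -> None:
        self.runs[run_id] = Execution()

    def commit(self, run_id: str, state: dict) -> None:
        self.runs[run_id].checkpoint = dict(state)

    def pause(self, run_id: str) -> None:
        # Pause: stop executing but keep the latest checkpoint.
        self.runs[run_id].status = "USER_PAUSED"

    def resume(self, run_id: str) -> dict:
        run = self.runs[run_id]
        if run.status != "USER_PAUSED":
            raise RuntimeError(f"cannot resume a run in status {run.status}")
        run.status = "RUNNING"
        return dict(run.checkpoint or {})  # continue from the committed state

    def kill(self, run_id: str) -> None:
        # Kill: drop active state so the run can never continue.
        run = self.runs[run_id]
        run.status = "USER_KILLED"
        run.checkpoint = None
```

Because `resume` only accepts paused runs, a killed execution fails loudly instead of restarting from stale state.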

Checkpoint Before Action

The runtime creates a checkpoint before every external action:

API calls
Browser interactions
Email delivery
Database writes

A successful action clears its checkpoint.

If the process crashes, the next execution immediately knows what was in flight.

This turned silent failures into recoverable failures.
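A minimal sketch of the checkpoint-before-action pattern, assuming a single in-flight action per process and a local JSON file as the checkpoint store. The `ActionJournal` class and its method names are hypothetical; a production runtime would more likely use a database or durable queue.

```python
import json
import os
from typing import Any, Callable, Optional


class ActionJournal:
    """Records the pending external action on disk before running it."""

    def __init__(self, path: str) -> None:
        self.path = path

    def in_flight(self) -> Optional[dict]:
        """Return the action that was pending when the process stopped, if any."""
        if not os.path.exists(self.path):
            return None
        with open(self.path, encoding="utf-8") as f:
            return json.load(f)

    def run(self, name: str, args: dict, action: Callable[..., Any]) -> Any:
        # 1. Checkpoint before the side effect. Write to a temp file and
        #    rename so a crash never leaves a half-written checkpoint.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump({"action": name, "args": args}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
        # 2. Perform the external action. If it raises or the process dies,
        #    the checkpoint stays on disk for the next execution to inspect.
        result = action(**args)
        # 3. Success clears the checkpoint.
        os.remove(self.path)
        return result
```

On startup, a non-empty `in_flight()` tells the next execution which action may or may not have completed. Whether to retry it safely depends on the action being idempotent, which this sketch does not address.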

Retry Storm Protection

One failed dependency can create thousands of wasted requests.

The pattern that worked best was:

Exponential backoff
Retry budgets
Circuit breakers

Without all three, agents tend to fail repeatedly and burn tokens while making no progress.
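The three pieces can be combined in one wrapper around a dependency call. This is a sketch under assumptions: the `GuardedCaller` class, its parameter names, and the default values are hypothetical, and the jittered delay is one common backoff variant rather than a prescribed one.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpen(Exception):
    """Raised when a call is rejected without touching the dependency."""


class GuardedCaller:
    def __init__(self, max_attempts: int = 4, base_delay: float = 0.5,
                 failure_threshold: int = 5, cooldown: float = 30.0,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_attempts = max_attempts            # retry budget per call
        self.base_delay = base_delay                # backoff base, in seconds
        self.failure_threshold = failure_threshold  # failures before opening
        self.cooldown = cooldown                    # seconds the circuit stays open
        self.sleep, self.clock = sleep, clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpen("dependency marked unhealthy")
            self.opened_at = None  # cooldown elapsed: allow one trial call
        for attempt in range(self.max_attempts):
            try:
                result = fn()
            except Exception:
                self.consecutive_failures += 1
                if self.consecutive_failures >= self.failure_threshold:
                    self.opened_at = self.clock()  # circuit breaker trips
                    raise CircuitOpen("failure threshold reached")
                if attempt == self.max_attempts - 1:
                    raise  # retry budget spent: surface the original error
                # Exponential backoff with random jitter.
                self.sleep(random.uniform(0, self.base_delay * 2 ** attempt))
            else:
                self.consecutive_failures = 0
                return result
        raise RuntimeError("max_attempts must be at least 1")
```

The retry budget bounds the cost of a single call, while the breaker bounds the cost across calls: once it opens, later attempts fail immediately without touching the dependency until the cooldown passes.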

Live Telemetry

Logs tell you what happened.

Operators usually need to know what is happening right now.

The runtime continuously tracks:

Current task
Current step
Active tool
Execution status
Recent transitions

The goal is to make agent execution observable while it is running, not after the incident has already happened.
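A small sketch of what such a live view might hold, assuming the runtime updates it on every state change and operators read snapshots from another thread. The `LiveTelemetry` class and its field names are hypothetical and mirror the list above.

```python
import threading
import time
from collections import deque
from typing import Any, Deque, Dict, Tuple


class LiveTelemetry:
    """Holds the current execution state plus a bounded history of transitions."""

    def __init__(self, max_transitions: int = 20) -> None:
        self._lock = threading.Lock()
        self._state: Dict[str, Any] = {
            "task": None, "step": 0, "tool": None, "status": "IDLE",
        }
        # Bounded so a long-running agent cannot grow memory without limit.
        self._transitions: Deque[Tuple[float, str, Any, Any]] = deque(
            maxlen=max_transitions
        )

    def update(self, **changes: Any) -> None:
        """Apply field changes and record each one as a transition."""
        with self._lock:
            unknown = set(changes) - set(self._state)
            if unknown:
                raise KeyError(f"unknown telemetry fields: {sorted(unknown)}")
            for key, new in changes.items():
                old = self._state[key]
                if old != new:
                    self._transitions.append((time.time(), key, old, new))
                    self._state[key] = new

    def snapshot(self) -> Dict[str, Any]:
        """Return a copy of the current state that is safe to serialize."""
        with self._lock:
            return {**self._state, "recent_transitions": list(self._transitions)}
```

An operator endpoint or dashboard could poll `snapshot()` to see the active tool and the last few transitions while the run is still in progress.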

Final Thoughts

Building AI agents is becoming easier every month.

Building agents that can survive production failures is still difficult.

The most important lesson I learned is that reliability problems usually appear outside the model.

They appear in retries, checkpoints, tool failures, execution control, and supervision.

What has been the hardest production failure you've encountered while running AI agents?

Top comments (10)

Mallory Haigh • Jul 13

This is all agent infrastructure, not per-agent logic! The moment you pull the concerns you've described out of the workflow graph and into a dedicated runtime layer, you've started building a platform, whether or not you call it that.

The three things that make this work are the same three things that pop up as solutions regularly when scaling agents: a harness that handles execution within a turn, engineering loops that govern between turns (which is exactly what semantic drift detection and retry budgets are), and a governance plane that owns identity, observability, and security across every agent running on the substrate.

milan • Jul 13

Thanks! That's exactly the direction I'm aiming for. My goal with AgentPulse is to move runtime concerns like monitoring, guardrails, retries, and governance out of individual agents and into a shared infrastructure layer, so teams can focus on building agent logic while the platform handles production reliability. I appreciate your insights!

Alex Shev • Jun 23

Stuck agents need operational controls, not only better prompts. I would want timeout budgets, loop detection, kill switches, partial-state reporting, and a clear handoff path when the agent cannot finish safely.

milan • Jun 23

Agreed. The handoff path is an important piece that's often overlooked.

Most systems focus on success and failure states, but there's a third state where the agent isn't failing yet can't confidently continue.

That's where timeout budgets, loop detection, pause/kill controls, and human handoff become valuable. In my experience, the hardest production incidents aren't crashes—they're agents that keep running while no longer making meaningful progress.

Alex Shev • Jun 23

Exactly. The stuck-but-not-crashed state is where a lot of production agent risk lives. It needs its own control path: loop detection, timeout budgets, progress signals, and a clean way to hand the run back to a human before the system burns time or trust.

milan • Jun 24

Exactly. Traditional monitoring catches crashes fairly well, but the harder problem is detecting executions that are still running while no longer creating value. That's why I started treating execution state and progress signals as first-class runtime concerns rather than workflow concerns.

Nazar Boyko • Jun 24

The semantic loop detection is the part I keep thinking about. Counting steps only catches an agent spinning in place, but the expensive failure is the one that looks busy and productive while drifting off the actual goal. Asking whether you're closer than you were ten steps ago is a nice way to catch that, though it leans on having a goal you can measure progress against, which isn't always easy to define. The other keeper is taking a checkpoint before every outside action, since that's what turns a silent stall into something you can actually resume from.

milan • Jun 24

I completely agree. The challenge with semantic loop detection is defining progress in a way that works across different tasks. In practice I've found that trajectory drift is often more expensive than obvious loops because the agent still appears productive. Checkpointing before external actions became important for the same reason: it turns silent failures into recoverable ones instead of forcing a full restart.

Richard Smith • Jun 25

I like the idea of treating execution state as a first-class runtime concern. It shifts the question from "is it running?" to "is it actually doing what it should?"

milan • Jun 25

Exactly. "Running" is only one part of the picture. The harder question is whether the execution is still making meaningful progress toward its goal. That's the distinction we've been focusing on while building the runtime.