When Agents Won't Stop: Diagnosing and Fixing Infinite Execution Loops in Nexus Core AI OS
An engineering post-mortem on runaway agent behavior, control flow failures, and the debugging work that led to structured loop prevention.
1. Introduction: What Infinite Agent Loops Actually Look Like in Production
An infinite loop in a traditional program is easy to spot: CPU pegs at 100%, the process hangs, and a stack trace points directly at the offending call. In an agentic AI system, the failure mode is far subtler and significantly more expensive.
In Nexus Core AI OS, agents are designed to pursue goals through iterative reasoning cycles — observe context, select a tool, execute, evaluate the result, and decide on the next step. This loop is not a bug; it is the intended execution model. The problem emerges when the termination condition never fires. The agent doesn't freeze. It keeps working. It calls tools, generates reasoning text, re-evaluates its position, and then — because its internal state hasn't materially changed — arrives at the same decision it made three cycles ago. It calls the same tool again. The cycle repeats.
Unlike a spinning thread, this kind of loop burns tokens, hits rate limits, modifies external state (sometimes repeatedly), and can silently consume significant compute before any monitoring threshold trips an alert. By the time a human notices something is wrong, the agent may have completed forty or sixty reasoning cycles, each one generating plausible-looking output.
This article documents how we identified, diagnosed, and fixed this behavior in Nexus Core. It covers the debugging process in detail — the traces we pulled, the patterns we found, the guardrails that failed, and the control flow redesign that eventually stabilized production behavior.
2. How We First Noticed the Looping Behavior in Production
The first signal wasn't an alert. It was a billing anomaly.
During a routine review of token consumption logs, one of our engineers noticed a specific task class — multi-step file analysis jobs — was generating token counts roughly eight times higher than our p95 baseline. The jobs were completing successfully (returning a final answer), so no error had been thrown. But the cost delta was impossible to explain through normal variance.
We pulled the raw execution logs for three of the flagged jobs. The first thing we noticed was duration: tasks that should take four to six reasoning cycles were running twenty to thirty. The second thing we noticed was that the tool call sequences were not random. They were repetitive in a way that felt almost rhythmic — read_file, analyze_content, search_context, read_file, analyze_content, search_context — with minor variations in parameters but no meaningful change in the outputs being produced.
The agent was not stuck in the traditional sense. It was reasoning. Each cycle produced coherent text. But the reasoning was circular — the agent kept identifying the same gap in its understanding, attempting the same retrieval to fill it, failing to recognize that the retrieval had already been attempted, and re-queuing the same sequence.
We had not built tooling to detect this pattern. We were not looking for it. It was only visible in aggregate, through cost data, and only then because someone looked carefully at the per-task distributions.
That was the first problem to fix: we needed observability before we could fix execution behavior.
3. Symptoms: Runaway Tool Calls, Repeated Reasoning Cycles, Stuck States
Once we knew what to look for, we found the pattern in roughly 3–7% of long-horizon tasks, depending on task complexity and the ambiguity of the goal specification. The symptoms fell into three categories.
Runaway tool calls. The agent would issue bursts of tool calls within a single reasoning cycle — sometimes ten or fifteen — because it was attempting to resolve an ambiguity through brute-force retrieval rather than recognizing that the ambiguity was structural (i.e., the information it needed did not exist in the available sources). No single tool call failed. Each returned a result. But the aggregate information gain from the burst was near zero, and the agent's next reasoning step treated it as insufficient and re-issued the same queries with minor paraphrastic variation.
Repeated reasoning cycles. More common than burst behavior was the slow loop: each reasoning cycle looked normal in isolation, but across cycles the agent was converging on the same plan, executing a subset of it, finding the result unsatisfying, and re-planning. The re-plans were not identical but were semantically equivalent. The agent was rephrasing its intentions without changing them.
Stuck states. A smaller but more serious subset of cases involved genuine fixation. The agent would identify a specific sub-goal — for instance, "confirm that file X contains field Y" — and become unable to exit that sub-goal even after successfully confirming the fact. Each new reasoning cycle would re-examine the confirmation as if it were uncertain, produce another tool call to re-verify, and add the re-verification result to a context window that was steadily filling with redundant information. These cases were the most expensive and the most damaging, because the growing context degraded the quality of subsequent reasoning steps.
4. Why Early Guardrails Failed
We had guardrails. They didn't work.
The first line of defense was a hard token limit per task. When a task's total token consumption crossed a ceiling, the agent was terminated and a failure result was returned. This worked in the sense that it prevented infinite spend — but it produced silent failures. The calling system received an error, retried the task (as designed), and often re-entered the same loop from the beginning. In some cases, our retry logic was actively making the problem worse.
The second guardrail was a cycle counter: a maximum number of reasoning iterations per task. When the limit was hit, the agent was again terminated. Same problem: the limit was set conservatively high to avoid interrupting legitimate long-horizon tasks, so by the time it fired, the damage was already significant. And again, the retry path didn't account for the possibility that the task itself, not the execution environment, was the cause of the loop.
We also had timeout-based termination. A task running longer than N seconds was killed. This caught some of the worst cases but was unreliable because the loop's speed varied significantly with tool latency. Under normal conditions a stuck agent might trip the cycle counter long before the timeout fired; under high-load conditions with slow tool responses, the same agent might time out after only a handful of cycles, and so might a healthy one.
None of these guardrails addressed the root cause. They were exit conditions, not loop detectors. They could terminate a looping agent, but they could not identify why it was looping, prevent it from re-entering the same state on retry, or distinguish a stuck agent from a legitimately slow but progressing one.
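To make the distinction concrete, here is a minimal sketch of the three early guardrails as a single check. The class names and limit values are hypothetical placeholders, not Nexus Core's actual configuration. Every input is a budget counter; nothing inspects what the agent actually did.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Budget:
    # Hypothetical limits, chosen only for illustration.
    max_tokens: int = 200_000
    max_cycles: int = 40
    max_seconds: float = 600.0


@dataclass
class TaskUsage:
    tokens: int = 0
    cycles: int = 0
    started_at: float = field(default_factory=time.monotonic)


def exit_reason(usage: TaskUsage, budget: Budget, now: Optional[float] = None) -> Optional[str]:
    """Return the name of the exceeded limit, or None if the task may continue.

    Nothing here looks at tool calls or results, so a looping task and a
    slow-but-progressing task are indistinguishable to this check.
    """
    now = time.monotonic() if now is None else now
    if usage.tokens > budget.max_tokens:
        return "token_limit"
    if usage.cycles > budget.max_cycles:
        return "cycle_limit"
    if now - usage.started_at > budget.max_seconds:
        return "timeout"
    return None
```

A caller that retries on any non-None reason will restart a looping task from scratch, which is the retry problem described above.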
The deeper failure was architectural: we had designed termination conditions but had not designed execution checkpoints. We knew when to stop, but we had no mechanism for asking whether meaningful progress was being made.
5. Root Cause Analysis: Why the Agent Kept Re-Entering the Same Decision State
After pulling execution traces from forty confirmed loop cases, a consistent causal pattern emerged.
The agent's decision model was stateless between reasoning cycles in a specific and problematic way. Each cycle received the full conversation history as context — so it had access to prior tool calls and results — but it had no structured representation of what it had already tried. The history was there, but reading and interpreting that history was itself part of the reasoning task. When the agent's context window grew large and the reasoning prompt became complex, the model's ability to accurately recall and account for prior attempts degraded.
This meant that as a task grew longer and more complex — exactly the conditions under which loop prevention mattered most — the agent's natural loop-detection ability (reasoning about what it had already done) became less reliable. The problem was self-reinforcing: the longer the loop ran, the larger the context, the harder it was for the agent to recognize the loop.
A secondary factor was goal ambiguity. Several of the stuck-state cases involved tasks with underspecified success criteria. The agent could not determine when it had completed the task because "completion" had not been defined precisely enough. Rather than declaring completion under uncertainty, the agent defaulted to continued investigation. This is arguably correct behavior in isolation, but it meant that ambiguous tasks were systematically more likely to loop.
The third factor was tool result interpretation. In several cases, the agent was receiving tool results that it expected to contain specific information, not finding that information, and concluding that the tool call had "failed" in some semantic sense — even when the tool had returned a valid, well-formed result. It would retry the call with a rephrased query, get back a similar result, and interpret that similarly. The retry logic was baked into the agent's reasoning, not the tool layer, which meant our tool-layer retry limits didn't catch it.
6. The Debugging Process: Logs, Tracing, Execution Graphs
Our existing logging infrastructure captured inputs, outputs, and latencies at the tool call level, but it did not capture reasoning-cycle-level structure. We could see that tool X was called at time T, but we could not easily see which reasoning cycle had generated that call, what the agent's stated intent was at that point, or how the result had been incorporated into the next cycle.
We instrumented the execution framework to emit structured events at cycle boundaries. Each event captured: cycle index, a short hash of the agent's current working state representation, the set of tool calls issued in the cycle, a brief summary of the agent's stated next action (extracted from a structured output field we added to the reasoning prompt), and a flag indicating whether the cycle had produced any new information relative to the prior cycle.
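A cycle-boundary event of this shape could be sketched as follows. The field names, the 12-character hash length, and the use of SHA-256 over canonical JSON are assumptions made for illustration; the text specifies only which pieces of information each event carried.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class CycleEvent:
    cycle_index: int
    state_hash: str            # short hash of the working state
    tool_calls: List[str]      # tools issued during this cycle
    next_action_summary: str   # taken from the structured output field
    new_information: bool      # whether the cycle added anything new


def short_state_hash(working_state: dict, length: int = 12) -> str:
    # Canonical JSON (sorted keys) so equal states always hash identically.
    canonical = json.dumps(working_state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


def emit(event: CycleEvent) -> str:
    # One JSON line per cycle boundary, ready for a log pipeline.
    return json.dumps(asdict(event), sort_keys=True)
```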
That last field — new information delta — required defining what "new information" meant in a computable way. We settled on a pragmatic proxy: if all tool calls in a cycle returned results that were byte-for-byte identical to results returned in any prior cycle within the same task, the delta was zero. This was a coarse approximation, but it was cheap to compute and captured the majority of true-positive loop cases.
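The proxy described above fits in a few lines. This is a sketch under the stated definition only; the class name and the choice of SHA-256 digests (to avoid storing full results) are assumptions.

```python
import hashlib
from typing import List, Set


class InformationDeltaTracker:
    """Tracks digests of every tool result seen so far within one task."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def observe_cycle(self, results: List[bytes]) -> bool:
        """Return True if at least one result in this cycle is new.

        A cycle with no tool calls reports False: with nothing returned,
        every result is trivially a repeat.
        """
        digests = [hashlib.sha256(r).hexdigest() for r in results]
        gained = any(d not in self._seen for d in digests)
        self._seen.update(digests)
        return gained
```

Because the comparison is byte-for-byte, a result that differs only in a timestamp or ordering counts as new, which is one reason the proxy is coarse.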
With this instrumentation in place, we built execution graphs: directed graphs where each node was a reasoning cycle and each edge represented a tool call chain. Loops in the execution graph became visually obvious. Tasks that progressed normally looked like linear chains with occasional branching. Tasks that looped produced tight clusters of near-identical nodes connected by repeated edges.
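One way to approximate this from cycle events is to collapse cycles with the same state hash into a single node and count how often each edge recurs. The input shape (a list of state-hash and tool-chain pairs) and the function names are illustrative assumptions, not the internal tool's format.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[str, str, Tuple[str, ...]]  # (from_state, to_state, tool_chain)


def build_execution_graph(cycles: List[Tuple[str, List[str]]]) -> Counter:
    """Count edges between consecutive cycles.

    Each edge is labelled with the tool chain issued in the source cycle.
    """
    edges: Counter = Counter()
    for (src, chain), (dst, _) in zip(cycles, cycles[1:]):
        edges[(src, dst, tuple(chain))] += 1
    return edges


def repeated_edges(edges: Counter, min_count: int = 2) -> Dict[Edge, int]:
    # Edges traversed more than once are the signature of a loop.
    return {edge: n for edge, n in edges.items() if n >= min_count}
```

A linear task yields no repeated edges; a tight two-cycle loop yields a pair of edges whose counts grow with every lap.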
We built an internal tool to render these graphs from log data. Within a few days of instrumenting production, we had a clear visual taxonomy of loop shapes: tight two-cycle loops, slow drifting loops where the agent made minor variations across five or six cycles, and the stuck-state cases where a single node effectively had a self-loop with cosmetic variation in edge labels.
The graphs also revealed something we hadn't anticipated: many loops weren't infinite in the theoretical sense. They were bounded by the token limit we'd already set. The agent would loop for twenty-five or thirty cycles, hit the limit, terminate, and be retried. Without the execution graph, we had categorized these as task failures due to complexity. With the graph, we could see they were loops that just happened to terminate before becoming obviously infinite.
7. The Fix: Introducing Structured Control Flow and Execution Checkpoints
The core architectural change was introducing execution checkpoints — explicit synchronization points where the agent was required to produce a structured progress report before proceeding to the next reasoning cycle.
The checkpoint schema was minimal by design:
{
  "cycle_index": 14,
  "goal_status": "in_progress",
  "completed_sub_goals": ["read target file", "identify schema fields"],
  "pending_sub_goals": ["validate field types"],
  "tools_called_this_cycle": ["read_file"],
  "information_gained": true,
  "estimated_cycles_remaining": 3
}
This schema served two purposes. First, it forced the agent to explicitly represent its understanding of task progress in a structured, machine-readable form. Second, it gave the execution framework something to inspect programmatically, without needing to parse free-form reasoning text.
The execution framework read the checkpoint after each cycle. If information_gained was false for two consecutive cycles, the framework would inject a prompt into the next cycle explicitly flagging the pattern and asking the agent to either identify why the retrieval was failing or declare the sub-goal unachievable. If information_gained was false for three consecutive cycles, the framework would terminate the current sub-goal forcibly and require the agent to proceed with whatever information it had.
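The escalation rule can be sketched as a small state machine. The class and enum names are hypothetical, and resetting the counter after a forced advance is an assumption; the text does not say how the count behaves once the sub-goal is abandoned.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    INJECT_PROMPT = "inject_prompt"   # flag the pattern to the agent
    FORCE_ADVANCE = "force_advance"   # terminate the current sub-goal


class CheckpointMonitor:
    def __init__(self) -> None:
        self.stalled_cycles = 0

    def review(self, checkpoint: dict) -> Action:
        if checkpoint["information_gained"]:
            self.stalled_cycles = 0
            return Action.CONTINUE
        self.stalled_cycles += 1
        if self.stalled_cycles >= 3:
            # Assumption: the next sub-goal starts with a clean count.
            self.stalled_cycles = 0
            return Action.FORCE_ADVANCE
        if self.stalled_cycles == 2:
            return Action.INJECT_PROMPT
        return Action.CONTINUE
```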
This was not a clean solution. Forcibly advancing past a stuck sub-goal sometimes produced lower-quality final answers. But it was far preferable to a loop that consumed thirty cycles and still produced no answer.
The estimated_cycles_remaining field was used to dynamically adjust the cycle budget. Tasks where the agent consistently estimated low remaining cycles got a tighter ceiling; tasks that appeared to be making steady progress got more room. This reduced false terminations on legitimately complex tasks while tightening the noose on loops.
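The text does not give the adjustment formula, so the following is only one plausible shape for it. The slack of two cycles, the hard maximum of forty, and the rule that the ceiling may grow only while the agent reports progress are all assumptions. It also reacts to a single checkpoint, whereas the description above implies looking at estimates over several cycles.

```python
def adjust_cycle_ceiling(
    cycle_index: int,
    estimated_remaining: int,
    information_gained: bool,
    current_ceiling: int,
    hard_max: int = 40,   # hypothetical absolute cap
    slack: int = 2,       # hypothetical margin over the agent's estimate
) -> int:
    """Return the new cycle ceiling for the task."""
    proposed = cycle_index + estimated_remaining + slack
    if information_gained:
        # Progressing: follow the agent's estimate, up to the hard cap.
        return min(proposed, hard_max)
    # Not progressing: the ceiling may tighten but never grow.
    return min(proposed, current_ceiling, hard_max)
```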
8. Adding Loop Detection Heuristics: State Hashing, Repetition Tracking, Depth Limits
Checkpoints gave us a hook for intervention, but we also built a parallel detection layer that didn't depend on the agent's self-reporting.
State hashing. At each checkpoint, we computed a hash of the agent's current working state: the set of sub-goals marked complete, the set of tool results received, and the current pending sub-goal. We maintained a rolling window of the last ten state hashes. If the current hash matched, or nearly matched, a hash in the window, we flagged it as a detected loop. The hash function had to be tolerant of minor variation, so we used locality-sensitive hashing rather than cryptographic hashing, which let us catch near-duplicate states and not just exact duplicates.
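def compute_state_hash(checkpoint: CheckpointSchema) -> str:
    # Sort each collection so the serialized state is stable across runs;
    # str(frozenset(...)) has no guaranteed element order.
    state_vector = (
        tuple(sorted(checkpoint.completed_sub_goals)),
        checkpoint.pending_sub_goals[0] if checkpoint.pending_sub_goals else None,
        tuple(sorted({r.tool_name for r in checkpoint.tool_results_this_cycle})),
    )
    # simhash is a locality-sensitive hashing helper (not shown here),
    # so near-identical states produce near-identical hashes.
    return simhash(str(state_vector), hashbits=64)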
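The rolling-window check can be sketched on its own. The SimHash below is a minimal, unweighted, standard-library stand-in for whatever locality-sensitive hash is used in practice, and the Hamming-distance threshold of 3 bits is an assumed value; only the window size of ten comes from the text.

```python
import hashlib
from collections import deque
from typing import Iterable


def simhash64(tokens: Iterable[str]) -> int:
    """Minimal 64-bit SimHash: each token votes on every bit position."""
    votes = [0] * 64
    for token in tokens:
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if votes[bit] > 0)


def hamming(a: int, b: int) -> int:
    # Number of bit positions in which the two hashes differ.
    return bin(a ^ b).count("1")


class LoopWindow:
    def __init__(self, size: int = 10, max_distance: int = 3) -> None:
        self.recent = deque(maxlen=size)   # oldest hashes fall off automatically
        self.max_distance = max_distance

    def check_and_add(self, state_hash: int) -> bool:
        """Return True if the hash is within max_distance of any recent hash."""
        looped = any(hamming(state_hash, h) <= self.max_distance for h in self.recent)
        self.recent.append(state_hash)
        return looped
```

The threshold is the tuning knob: too low and rephrased states slip through, too high and distinct states are flagged as repeats.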
Repetition tracking. Separately from state hashing, we tracked tool call sequences at the execution layer without involving the agent's reasoning at all. The execution framework maintained a per-task log of (tool_name, parameter_hash) tuples and checked each new tool call against the log. A tool called with identical parameters more than twice within the same task triggered a warning; more than three times triggered an automatic block and injected a note into the agent's context explaining that the call had already been made and what the result was.
This was the most reliably effective intervention we implemented. Many loop cases involved the agent re-calling the same tools because it hadn't retained or properly processed the prior result. Blocking the redundant call and surfacing the prior result broke the loop in roughly 60% of cases without requiring any other intervention.
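A minimal sketch of that tracker, using the thresholds given above: the third identical call warns and the fourth is blocked, with the prior result returned for injection into context. The class name, the JSON-based parameter hash, and the return shape are assumptions.

```python
import hashlib
import json
from collections import defaultdict
from typing import Any, Optional, Tuple


def parameter_hash(params: dict) -> str:
    # Canonical JSON so key order does not change the hash.
    canonical = json.dumps(params, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RepetitionTracker:
    WARN_AFTER = 2    # more than twice  -> warning
    BLOCK_AFTER = 3   # more than three times -> block

    def __init__(self) -> None:
        self.counts = defaultdict(int)
        self.last_result = {}

    def check(self, tool_name: str, params: dict) -> Tuple[str, Optional[Any]]:
        """Return (verdict, prior_result); prior_result is set only on block."""
        key = (tool_name, parameter_hash(params))
        self.counts[key] += 1
        attempts = self.counts[key]
        if attempts > self.BLOCK_AFTER:
            return "block", self.last_result.get(key)
        if attempts > self.WARN_AFTER:
            return "warn", None
        return "allow", None

    def record_result(self, tool_name: str, params: dict, result: Any) -> None:
        self.last_result[(tool_name, parameter_hash(params))] = result
```

Because this runs in the execution layer, it works even when the agent's own account of what it has tried is wrong.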
Depth limits. For tasks that decomposed into sub-tasks, we implemented depth limits on the decomposition tree. An agent that recursively broke a goal into sub-goals, and those sub-goals into further sub-goals, could create arbitrarily deep call trees that were difficult to reason about and expensive to execute. We capped decomposition depth at four levels and required any agent attempting to exceed that depth to instead reformulate the problem as a flat sequence of steps.
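A depth cap like this can be enforced at the point of decomposition. In the sketch below the names are hypothetical and counting the root goal as level 1 is an assumption. The `flatten` helper shows one mechanical reading of "a flat sequence of steps"; in the system described here, the agent itself does the reformulating.

```python
from dataclasses import dataclass, field
from typing import List

MAX_DEPTH = 4  # cap from the text; root counted as level 1 (assumption)


class DepthLimitExceeded(Exception):
    """Raised when a goal at the maximum depth tries to decompose further."""


@dataclass
class SubGoal:
    description: str
    depth: int = 1
    children: List["SubGoal"] = field(default_factory=list)

    def decompose(self, descriptions: List[str]) -> List["SubGoal"]:
        if self.depth >= MAX_DEPTH:
            raise DepthLimitExceeded(
                f"'{self.description}' is at depth {self.depth}; reformulate as flat steps"
            )
        self.children = [SubGoal(d, self.depth + 1) for d in descriptions]
        return self.children


def flatten(goal: SubGoal) -> List[str]:
    """Return leaf descriptions in depth-first order."""
    if not goal.children:
        return [goal.description]
    steps: List[str] = []
    for child in goal.children:
        steps.extend(flatten(child))
    return steps
```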
9. How We Redesigned Execution Flow to Prevent Recursion Traps
The deeper structural issue was that our agent execution model allowed unrestricted recursion. An agent could invoke a sub-agent to handle a sub-task, and that sub-agent could invoke further sub-agents. In theory, this enabled elegant task decomposition. In practice, it created recursive call chains that were difficult to trace, impossible to checkpoint uniformly, and prone to exponential token consumption if any node in the chain looped.
We redesigned the execution model to enforce a strict two-tier structure: a coordinator agent and worker agents. The coordinator could not recurse. It could delegate sub-tasks to workers, but workers could not spawn further workers — they could only call tools. If a worker's task required decomposition, it returned a "decomposition required" signal to the coordinator, which then planned the decomposition explicitly.
Coordinator
├── Worker A (tool calls only, no sub-agents)
├── Worker B (tool calls only)
└── Worker C (tool calls only)
    └── [If C needs decomposition, returns signal to Coordinator]

Coordinator re-plans
├── Worker D
└── Worker E
This structure eliminated the entire class of deep recursion bugs. It also made execution graphs dramatically simpler to reason about: the graph was always at most two levels deep, and loops at the coordinator level were easy to detect because the coordinator's decision set was small and well-defined.
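The two-tier contract can be sketched with plain functions. A worker is a callable that returns either a result or a "decomposition required" signal; it holds no reference to the coordinator, so it has no way to spawn further workers. All names here, and the cap on re-plans, are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Result:
    output: str


@dataclass
class DecompositionRequired:
    sub_tasks: List[str]   # the worker's suggestion; the coordinator decides


Worker = Callable[[str], Union[Result, DecompositionRequired]]


def run_coordinator(task: str, worker: Worker, max_replans: int = 2) -> List[str]:
    """Run a task with a flat work queue; only the coordinator plans."""
    queue = [task]
    outputs: List[str] = []
    replans = 0
    while queue:
        current = queue.pop(0)
        outcome = worker(current)
        if isinstance(outcome, DecompositionRequired):
            if replans >= max_replans:
                raise RuntimeError(f"too many re-plans while handling '{current}'")
            replans += 1
            # Sub-tasks join the coordinator's own queue: no nested agents.
            queue = list(outcome.sub_tasks) + queue
        else:
            outputs.append(outcome.output)
    return outputs
```

Decomposition still happens, but always as an explicit step in one flat queue, which is why the resulting execution graph stays two levels deep.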
The trade-off was expressiveness. Some tasks that had been elegantly handled by recursive decomposition now required explicit multi-stage planning from the coordinator. This added latency in cases where the recursion had been terminating correctly. We accepted this trade-off because the failure cases of unconstrained recursion were too expensive to tolerate.
We also introduced explicit state passing between coordinator and workers. Rather than relying on workers to infer task context from conversation history, the coordinator passed a structured context object containing: the overall task goal, the specific sub-task being delegated, constraints on execution (cycle budget, allowed tools), and a summary of information already gathered. Workers were prohibited from re-gathering information already present in the context object.
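A context object with these four parts might look like the sketch below. The field names and the `authorize` check are hypothetical; in particular, keying already-gathered information by a string label is an assumption about how "already present" would be decided.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class WorkerContext:
    task_goal: str                  # the overall task goal
    sub_task: str                   # the specific sub-task being delegated
    cycle_budget: int               # execution constraint
    allowed_tools: FrozenSet[str]   # execution constraint
    gathered: Dict[str, object] = field(default_factory=dict)  # info already collected

    def authorize(self, tool_name: str, info_key: str) -> str:
        """Decide whether a worker may call a tool to fetch info_key."""
        if tool_name not in self.allowed_tools:
            return "denied: tool not allowed"
        if info_key in self.gathered:
            return "skipped: already in context"
        return "ok"
```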
10. Performance Impact After Introducing Controls
The interventions had measurable effects on both failure rates and nominal performance.
Loop incidence (tasks requiring more than 2x the p50 cycle count) dropped from approximately 5.4% to 0.8% within three weeks of deploying the full control stack. The remaining 0.8% were almost entirely in the "ambiguous goal specification" category — tasks where the success criteria were genuinely unclear, which is a prompt engineering problem rather than an execution control problem.
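The loop-incidence definition used here is easy to compute from per-task cycle counts. The sketch below assumes the p50 is taken over the same set of tasks being scored; the sample counts are made up for illustration.

```python
from statistics import median
from typing import List


def loop_incidence(cycle_counts: List[int], factor: float = 2.0) -> float:
    """Fraction of tasks whose cycle count exceeds factor times the p50."""
    p50 = median(cycle_counts)
    flagged = sum(1 for c in cycle_counts if c > factor * p50)
    return flagged / len(cycle_counts)


# Illustrative data: nine normal tasks and one that ran thirty cycles.
sample = [4, 5, 5, 6, 5, 4, 6, 5, 5, 30]
print(loop_incidence(sample))  # 0.1
```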
Token consumption per task at p95 dropped by approximately 34%. This was larger than expected, suggesting that a significant portion of our p95 tail had been dominated by looping behavior that we hadn't fully characterized.
Nominal task latency increased by 8–12% on non-looping tasks. The checkpoint generation overhead, state hashing, and repetition-tracking lookups added real work to each cycle. We reduced this partially by making checkpoint generation asynchronous with respect to tool execution — the checkpoint for cycle N is written during the tool execution phase of cycle N+1 rather than blocking cycle N+1's start. This recovered about half the latency overhead.
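The overlap can be sketched with `asyncio`: the checkpoint write for cycle N is scheduled alongside the tool phase of cycle N+1 instead of being awaited first. The function names are hypothetical and `asyncio.sleep(0)` stands in for real I/O.

```python
import asyncio
from typing import List, Tuple


async def write_checkpoint(store: List[Tuple[int, dict]], cycle: int, checkpoint: dict) -> None:
    await asyncio.sleep(0)   # stand-in for a storage write
    store.append((cycle, checkpoint))


async def run_tools(cycle: int) -> str:
    await asyncio.sleep(0)   # stand-in for tool execution
    return f"results-{cycle}"


async def run_task(num_cycles: int) -> List[Tuple[int, dict]]:
    store: List[Tuple[int, dict]] = []
    pending_write = None
    for cycle in range(num_cycles):
        tool_phase = run_tools(cycle)
        if pending_write is None:
            results = await tool_phase
        else:
            # Previous cycle's checkpoint is written while this cycle's tools run.
            results, _ = await asyncio.gather(tool_phase, pending_write)
        pending_write = write_checkpoint(store, cycle, {"cycle_index": cycle, "results": results})
    if pending_write is not None:
        await pending_write   # flush the final checkpoint
    return store
```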
Task success rate (defined as returning a valid final answer within budget) improved from 91.2% to 96.7%. The improvement came from two sources: fewer loops consuming entire budgets, and the forced-advancement mechanism converting stuck-state failures into lower-quality-but-valid completions.
The forced-advancement mechanism deserves scrutiny here. Approximately 2% of tasks now complete via forced advancement — meaning the agent was stuck and was pushed past the blockage by the framework. The quality of these completions is measurably lower than unassisted completions on the same task types. We track this separately and are working on better disambiguation prompts to reduce the stuck-state incidence rather than papering over it with forced advancement.
11. Edge Cases That Still Challenge Loop Prevention
The controls we've described handle the majority of loop cases well. Several categories remain difficult.
Semantic loops without syntactic repetition. Our repetition tracker catches identical tool calls. It does not catch cases where the agent calls tool A to get information X, then calls tool B to get information X from a different source, then calls tool C for the same purpose. The tool calls are distinct; the intent is identical. State hashing partially catches this if the sub-goal representation is stable, but agents sometimes rephrase their pending sub-goals in ways that defeat the similarity threshold.
Goal drift during long tasks. In tasks running more than fifteen cycles, we observe a phenomenon where the agent's interpretation of the original goal shifts subtly across cycles. It begins pursuing a slightly different objective than the one specified. When this drift leads the agent toward an achievable goal, it's harmless or even beneficial. When it leads toward an unachievable or poorly defined goal, the agent loops. Detecting goal drift requires comparing the agent's current stated objective against the original task specification semantically, which is expensive and error-prone.
Tool failures that look like successes. Several loop cases involve tools that return plausible-looking but incorrect results — for instance, a search tool that returns results for a slightly different query than the one issued. The agent trusts the result, acts on it, finds the action doesn't resolve its goal, and retries. From the agent's perspective, each attempt is based on new information. From the outside, the agent is looping. Detecting this requires ground-truth validation of tool outputs, which is only possible in domains where we can define correctness externally.
Coordinator-level planning loops. Our two-tier architecture prevents recursive worker spawning, but the coordinator itself can loop if it repeatedly decomposes a goal the same way, finds the decomposition fails, and re-decomposes identically. The coordinator's state space is larger and more complex than a worker's, making state hashing less reliable. We've added explicit plan deduplication at the coordinator level — a plan is blocked if it is structurally identical to a previously attempted plan — but structural comparison of plans is non-trivial.
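One simple form of plan deduplication reduces each plan to a canonical structural key and refuses keys already seen. The plan shape (a list of steps with `tool` and `inputs` fields) and the decision to ignore free-text notes are assumptions. Real plans need richer comparison, which is where the difficulty noted above comes from.

```python
from typing import Dict, List, Set, Tuple

Step = Dict[str, object]
PlanKey = Tuple[Tuple[str, Tuple[str, ...]], ...]


class PlanRegistry:
    def __init__(self) -> None:
        self._attempted: Set[PlanKey] = set()

    @staticmethod
    def canonical(plan: List[Step]) -> PlanKey:
        # Keep step order, tool, and inputs; drop wording such as notes.
        return tuple((str(s["tool"]), tuple(sorted(s["inputs"]))) for s in plan)

    def try_register(self, plan: List[Step]) -> bool:
        """Return False if a structurally identical plan was already attempted."""
        key = self.canonical(plan)
        if key in self._attempted:
            return False
        self._attempted.add(key)
        return True
```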
Interaction effects between loop prevention and valid retry logic. Our tool-layer repetition block prevents the same tool call from being issued more than three times. But some tools are legitimately called multiple times with the same parameters — polling a status endpoint, for example. We've added a tool-level annotation system that marks specific tools as "retry-safe" and exempts them from the repetition block, but maintaining that annotation set adds operational overhead and is prone to omissions.
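A decorator is one lightweight way to carry such an annotation. The registry, decorator, and tool names below are hypothetical, and the limit mirrors the three-call rule described above.

```python
from typing import Callable, Set

RETRY_SAFE_TOOLS: Set[str] = set()


def retry_safe(fn: Callable) -> Callable:
    """Mark a tool as exempt from the repetition block."""
    RETRY_SAFE_TOOLS.add(fn.__name__)
    return fn


@retry_safe
def poll_job_status(job_id: str) -> str:
    # Placeholder body; a real tool would query the job service.
    return "pending"


def should_block(tool_name: str, prior_identical_calls: int, limit: int = 3) -> bool:
    """Block once the same call has already been issued `limit` times."""
    if tool_name in RETRY_SAFE_TOOLS:
        return False
    return prior_identical_calls >= limit
```

Keeping the annotation next to the tool definition makes omissions easier to spot in review, though it does not remove the maintenance cost.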
12. Final Thoughts: Building Predictable Agent Behavior Without Over-Restricting Intelligence
The instinct when debugging pathological agent behavior is to add restrictions: tighter limits, harder caps, more aggressive termination. This instinct is wrong, or at least incomplete. Restrictions that are too aggressive simply transform the failure mode from "infinite loop" to "premature termination" — a different failure, not a solved problem.
What we actually needed was structured observability and structured intervention. The agent needed to externalize its progress representation in a machine-readable form so the execution framework could reason about it. The execution framework needed to distinguish between an agent that was working hard and an agent that was spinning. And both needed to be able to communicate: the framework surfacing detected patterns to the agent, the agent updating its behavior in response.
The checkpoint mechanism was the most important single change we made. Not because it caught loops directly, but because it created a shared representation of execution state that both the agent and the framework could act on. Without it, we were in the position of trying to detect loops by watching the agent from the outside, with no way to ask it what it thought was happening.
The other lesson is that loop prevention needs to be designed in, not bolted on. Our early guardrails (token limits, cycle counts, timeouts) were bolt-ons — exit conditions added after the fact to bound the damage from a failure mode we hadn't fully anticipated. They worked poorly because they weren't integrated with the agent's reasoning process. The controls that worked well were woven into the execution loop itself: checkpoints, state hashing, and repetition tracking that ran in lockstep with agent execution and could influence the next cycle's inputs.
We have not eliminated all loop cases. The edge cases described in section 11 remain active areas of work. But we have reduced the failure rate to a level where individual cases are debuggable rather than endemic. Each new loop case we encounter is now a data point that refines our detection heuristics rather than evidence that the system is fundamentally unpredictable.
Predictable agent behavior doesn't mean constrained or cautious behavior. It means building the scaffolding that allows the agent's reasoning to remain effective over long execution horizons — and intervening precisely when it stops being effective, rather than bluntly when it runs too long.
Nexus Core AI OS is an internal agentic execution framework. Metrics cited are from internal production telemetry. All code samples are illustrative and simplified.






