Why your AI agent code turns into spaghetti — and how to untangle it

Alan West
The 3am pager that changed how I write agents

A few months back, I shipped what I thought was a clean agent for a client. It scraped web pages, summarized them, then routed the results to different downstream tools based on content. Worked great in dev. Worked great for the first week.

Then I got paged at 3am.

The agent had gotten into a loop. One of the tools timed out, returned a partial response, the LLM "decided" the task wasn't done, called the same tool again, got another partial response, and so on. By the time I caught it, we'd burned through about 4,000 API calls overnight.

The fix wasn't fun. The agent logic was scattered across if statements, retry decorators, prompt templates, and a while loop that was supposed to terminate when the LLM said "DONE". Spoiler: it sometimes did not say DONE.

Root cause: imperative code + stochastic calls = chaos

The mistake I keep seeing (and keep making) is treating an LLM call like any other function. It's not. A regular function returns deterministic output for a given input. An LLM call returns probabilistic output, and that output drives control flow.

When you mix:

  • imperative control flow (if/else, while, recursion)
  • stochastic decisions (the model "decides" the next step)
  • side effects (tool calls, DB writes, API requests)

...without any structural boundary between them, you get code where you can't reason about termination, retries, or partial state.

Here's the kind of thing I'm talking about:

def run_agent(task):
    history = [{"role": "user", "content": task}]
    while True:  # the footgun
        response = call_llm(history)
        history.append(response)
        if "DONE" in response["content"]:
            return response
        if response.get("tool_call"):
            result = execute_tool(response["tool_call"])
            history.append({"role": "tool", "content": result})
        # if neither branch hits, we loop forever

The model is both the loop's exit condition and its body. There's no separation between "what step am I in?" and "what does the model want next?". If the model gets confused, your program gets confused.

Step 1: separate the planner from the executor

The first refactor that actually helped: split the model's role into two distinct jobs, and never let them run in the same loop.

# Planner: produces a static plan from the task. One LLM call.
plan = planner_llm(task)  # returns a list of {step, tool, args}

# Executor: walks the plan deterministically.
for step in plan:
    result = run_step(step)
    if not result.ok:
        break  # bail to a reviewer, don't keep guessing

Now the loop is a regular for over a finite list. The model is no longer driving control flow at runtime — it built the plan once, up front. If something goes wrong, you have a concrete plan you can inspect, edit, or re-run.
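
For concreteness, here's roughly what the plan and run_step could look like. The Result type, the TOOLS registry, and the stand-in tools are assumptions I'm making for illustration, not part of any particular framework:

from dataclasses import dataclass

@dataclass
class Result:
    ok: bool
    output: str = ""
    error: str = ""

# Hypothetical tool registry: the planner is only allowed to name these.
TOOLS = {
    "fetch_page": lambda url: f"<html for {url}>",               # stand-in
    "summarize": lambda text, max_words=200: text[:max_words],   # stand-in
}

# A plan is plain data you can print, diff, or hand-edit before running:
# [{"step": 1, "tool": "fetch_page", "args": {"url": "https://example.com"}},
#  {"step": 2, "tool": "summarize", "args": {"text": "...", "max_words": 50}}]

def run_step(step):
    tool = TOOLS.get(step["tool"])
    if tool is None:
        # Planner named a tool we don't have: fail loudly, don't guess.
        return Result(ok=False, error=f"unknown tool {step['tool']!r}")
    try:
        return Result(ok=True, output=tool(**step["args"]))
    except Exception as exc:
        return Result(ok=False, error=str(exc))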

The tradeoff: you lose adaptive replanning. The model can't react to a tool's output mid-flight. For roughly 70% of the agent workloads I've built, this is fine. For the other 30%, you need replanning — which leads to step 2.

Step 2: make the state machine explicit

For the replanning case, the trick is to stop pretending your agent is a chatbot. It's a state machine. Make the states real:

STATES = ["planning", "executing", "reviewing", "done", "failed"]

def step(state, ctx):
    if state == "planning":
        ctx.plan = planner_llm(ctx.task)
        return "executing"
    if state == "executing":
        if ctx.cursor >= len(ctx.plan):
            return "reviewing"
        result = run_step(ctx.plan[ctx.cursor])
        ctx.cursor += 1
        if not result.ok:
            return "reviewing"  # let the reviewer decide what to do
        return "executing"
    if state == "reviewing":
        decision = reviewer_llm(ctx)  # "done" | "replan" | "fail"
        return {"done": "done",
                "replan": "planning",
                "fail": "failed"}[decision]

Now you can:

  • Cap total iterations per state (assert ctx.cursor < MAX_STEPS)
  • Persist ctx between steps so you can resume after a crash
  • Log every transition, which makes 3am debugging tractable
  • Restrict which LLM calls can happen in which state (no surprise tool calls during review)

This is the pattern I wish someone had shown me two years ago. It's the same idea as Erlang's gen_statem, or any workflow engine: separate "what state am I in" from "what should the model do here".
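
Wiring it together is just a bounded loop around step. Here's a minimal driver sketch; MAX_TRANSITIONS, the Ctx container, and the save_ctx hook are placeholders for whatever cap, context object, and persistence layer you already have:

import logging
from dataclasses import dataclass, field

logger = logging.getLogger("agent")

@dataclass
class Ctx:
    task: str
    plan: list = field(default_factory=list)
    cursor: int = 0

MAX_TRANSITIONS = 50  # hard cap across all states, not per state

def run(task, save_ctx=lambda ctx: None):
    ctx = Ctx(task=task)
    state = "planning"
    for i in range(MAX_TRANSITIONS):
        next_state = step(state, ctx)
        logger.info("transition %d: %s -> %s (cursor=%d)",
                    i, state, next_state, ctx.cursor)
        save_ctx(ctx)                      # persist so a crash is resumable
        if next_state in ("done", "failed"):
            return next_state, ctx
        state = next_state
    return "failed", ctx                   # cap exhausted: treat as failure

The terminal states live in the driver, not in step, so there is exactly one place that decides when the agent stops.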

Step 3: constrain the model's output, don't parse it

The other class of bug that ate hours of my life: the model returns something almost right, and the parser either silently fails or extracts a tool call the model never actually intended.

The fix is structured output. Most providers now support a JSON schema constraint at the API level (the exact response_format shape varies by provider). Use it:

schema = {
    "type": "object",
    "properties": {
        "action": {"enum": ["call_tool", "finish", "ask_user"]},
        "tool": {"type": "string"},
        "args": {"type": "object"},
    },
    "required": ["action"],
}

response = call_llm(
    history,
    response_format={"type": "json_schema", "json_schema": schema},
)

# response["action"] is guaranteed to be one of three strings.
# No more "DONE" / "Done" / "done." / "I am done." branching.

If you can't use schema-constrained output (some older models don't support it), at minimum validate with pydantic or zod before doing anything with the result, and treat validation failure as a known state, not an exception.
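
For example, with pydantic (assuming v2; the Action model and parse_action helper here are illustrative, not a library API):

from typing import Literal, Optional
from pydantic import BaseModel, ValidationError

class Action(BaseModel):
    action: Literal["call_tool", "finish", "ask_user"]
    tool: Optional[str] = None
    args: dict = {}

def parse_action(raw: str):
    # Returns (action, None) on success, (None, error) on failure.
    # The caller routes the error into the "reviewing" or "failed" state
    # instead of letting an exception unwind the whole agent.
    try:
        return Action.model_validate_json(raw), None
    except ValidationError as exc:
        return None, str(exc)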

Prevention: a checklist I now run before shipping

After getting bitten enough times, I keep this taped to the side of my monitor:

  • Bounded iterations. Every loop that contains an LLM call has a hard cap. No while True.
  • Explicit states. If I can't draw the state diagram on a napkin, the agent is too complex.
  • Structured output. Every model response that drives control flow is schema-validated.
  • Idempotent tools. Tool calls assume they may be retried. Side effects are keyed by request ID (see the sketch after this list).
  • Observability first. Every state transition is logged with the input/output of the LLM call. If I can't replay it, I can't debug it.
  • Tested failure modes. I have integration tests where the model returns garbage, times out, or returns a tool call to a non-existent tool. The agent should fail gracefully, not loop.
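
The idempotency point is the cheapest one to get wrong. A minimal version, building on run_step from step 1 and assuming you can derive a stable request ID from the step itself (the _seen dict is a stand-in for whatever durable store you use):

import hashlib
import json

_seen = {}  # stand-in for a durable store (Redis, a DB table, ...)

def request_id(step):
    # Stable ID derived from the step contents, so a retry of the same
    # step maps to the same key.
    payload = json.dumps(step, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_step_idempotent(step):
    rid = request_id(step)
    if rid in _seen:
        return _seen[rid]          # retried step: return the recorded result
    result = run_step(step)        # the real side effect happens at most once
    _seen[rid] = result
    return result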

The 3am pager hasn't happened again. The agents look a lot less impressive from the outside — they're boring state machines now instead of dramatic recursive loops — but they actually work. The interesting work moved into the planner and reviewer prompts, which is where it belonged all along.
