What Makes an LLM an Agent: The Model Controls the Control Flow

#ai #agents #python #llm

Book: Agents in Production — Building, Tracing, and Shipping Multi-Step AI You Can Trust
Also by me: Observability for LLM Applications — the companion book in The AI Engineer's Library (2-book series)
Me: xgabriel.com | GitHub

You are the third reviewer on a PR called agent-rewrite. You open the diff and read a class named ResearchAgent. It calls an LLM, parses the response, dispatches to one of six tools, appends the result to a list, and calls the LLM again. There is a max_steps=12. There is a system prompt loaded from YAML. Every line is sensible.

Then you open the old pipeline this PR is replacing. It has a function called classify_intent. It calls an LLM, gets back one of six labels, and a switch routes each label to a handler. No loop. No max_steps. But the LLM still picks what to do, once, at the top.

Two files. Both call an LLM. Both have tools. You cannot say, in one sentence, why one is an "agent" and the other is "just a pipeline." That gap is the whole post.

The one property

An agent is a system where the model owns the control flow.

That is the entire distinction. If the model decides what happens next, you have an agent. If your code decides what happens next and the model fills in the blanks, you have a pipeline with LLM calls in it. Both ship. Both are useful. You cannot operate them the same way.

The line is not mine. It is Anthropic's, from their December 2024 post Building effective agents:

"Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks."

The load-bearing word is dynamically. In a workflow, the shape of the execution exists before the model is called. You can grep for the code paths. You can draw them on a whiteboard, and the drawing matches the run. In an agent, the shape is assembled at inference time, one step at a time, from the tools the model has and what it is observing. There is no execution shape until the model decides one, and the shape it picks today for a given input is not guaranteed to be the one it picks tomorrow.

The two files, stripped down

Here is the old pipeline in skeleton form. This is a router.

def handle(request):
    label = classify(
        request,
        labels=["search", "lookup", "summarize",
                "translate", "escalate", "reject"],
    )
    handler = DISPATCH[label]
    return handler(request)

One LLM call. One branch. The model picks a label from six. Your code picks the handler. You could draw that execution graph on a napkin before the run, and it would be right every time. It has an LLM in it. It is not an agent.
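To make the napkin claim concrete, here is a runnable version of the router with the LLM call swapped for a keyword stub. The body of `classify`, the `make_handler` helper, and `possible_paths` are placeholders I made up for illustration; only the shape matters. The point is that every path the system can take can be listed from `DISPATCH` without running the model once.

```python
# Sketch only: classify() is a keyword stub standing in for the real LLM call,
# and the handlers are hypothetical placeholders.
LABELS = ["search", "lookup", "summarize", "translate", "escalate", "reject"]


def classify(request: str, labels: list[str]) -> str:
    """Stand-in for the LLM: return the first label named in the request."""
    lowered = request.lower()
    for label in labels:
        if label in lowered:
            return label
    return "reject"


def make_handler(label: str):
    """Build a trivial handler that only reports which branch ran."""
    def handler(request: str) -> str:
        return f"{label}: {request}"
    return handler


DISPATCH = {label: make_handler(label) for label in LABELS}


def handle(request: str) -> str:
    label = classify(request, labels=LABELS)  # the model picks a label
    return DISPATCH[label](request)           # your code picks the handler


def possible_paths() -> list[list[str]]:
    """Enumerate every execution path without calling the model once."""
    return [["classify", label] for label in DISPATCH]


print(handle("Please summarize this thread"))
print(len(possible_paths()))
```

Six labels, six paths, all known before the first request arrives.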

Here is the new ResearchAgent, stripped to the same size.

def handle(request):
    state = {"request": request, "history": []}
    for step in range(MAX_STEPS):
        decision = decide(state, tools=TOOLS)
        if decision.done:
            return decision.answer
        result = TOOLS[decision.tool](decision.args)
        state["history"].append((decision, result))
    raise StepBudgetExceeded()

Nearly the same line count. Different universe. The model is called in a loop. Each turn it reads the state, picks a tool or declares it is done, and the loop advances. The graph is not drawn before the run; the graph is what comes out the other end of the run. The model might call one tool. It might call twelve. It might call the same tool three times because it misread the last result. MAX_STEPS is a seatbelt, not a structure.

Ask the one question that matters. In the first file, what picks the next step? DISPATCH does. In the second, what picks the next step? decide does. Different answer, different system.
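The loop can be run without a model to see the same point from the other side. In this sketch the `Decision` dataclass, the `scripted` helper, and the two toy tools are all hypothetical stand-ins, and `decide` is passed in as an argument so a scripted policy can replace the LLM. The loop itself never changes; only the policy does, and the path falls out of the run.

```python
# Sketch only: Decision, the scripted policies and the two tools are
# hypothetical stand-ins so the loop can run without a model.
from dataclasses import dataclass, field

MAX_STEPS = 12


class StepBudgetExceeded(Exception):
    """Raised when the loop ends without the policy declaring it is done."""


@dataclass
class Decision:
    done: bool = False
    answer: str = ""
    tool: str = ""
    args: dict = field(default_factory=dict)


TOOLS = {
    "search": lambda query: f"results for {query}",
    "summarize": lambda text: text[:20],
}


def handle(request, decide):
    """Same loop as above, but it also returns the path that was taken."""
    state = {"request": request, "history": []}
    for _ in range(MAX_STEPS):
        decision = decide(state)
        if decision.done:
            path = [d.tool for d, _ in state["history"]]
            return decision.answer, path
        result = TOOLS[decision.tool](**decision.args)
        state["history"].append((decision, result))
    raise StepBudgetExceeded()


def scripted(*decisions):
    """Build a decide() that replays fixed decisions, one per turn."""
    def decide(state):
        return decisions[len(state["history"])]
    return decide


one_step = scripted(
    Decision(tool="search", args={"query": "agents"}),
    Decision(done=True, answer="ok"),
)
retrying = scripted(
    Decision(tool="search", args={"query": "agents"}),
    Decision(tool="search", args={"query": "llm agents"}),
    Decision(tool="summarize", args={"text": "long text"}),
    Decision(done=True, answer="ok"),
)

print(handle("same request", one_step)[1])
print(handle("same request", retrying)[1])
```

Same request, same loop, two different graphs. With a real model behind `decide`, you do not get to choose which one you get.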

What `decide` actually looks like

That decide call is where the model takes the wheel. With the Anthropic SDK it is a single Messages API call, with the tool definitions attached and the running history replayed each turn.

import anthropic

client = anthropic.Anthropic()

TOOLS = [
    {
        "name": "search_docs",
        "description": "Search the internal docs index.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]

def decide(messages):
    return client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        tools=TOOLS,
        messages=messages,
    )

The model returns either a final text answer or a response containing one or more tool_use blocks, in which case stop_reason is "tool_use". Your code runs each tool, appends the results, and calls decide again. You never wrote the branch that says "after searching, summarize." The model wrote it, at runtime, against a history that includes the output of the previous tool.

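MAX_STEPS = 12

class StepBudgetExceeded(Exception):
    """The loop ran out of steps before the model finished."""

# Maps each tool name in TOOLS to the Python callable that implements it.
# This search_docs body is a placeholder; wire in your real index.
TOOLS_IMPL = {"search_docs": lambda query: f"(no results for {query!r})"}

def run(user_input):
    messages = [{"role": "user", "content": user_input}]
    for _ in range(MAX_STEPS):
        resp = decide(messages)
        # Replay the model's turn verbatim, tool_use blocks included.
        messages.append(
            {"role": "assistant", "content": resp.content}
        )
        calls = [b for b in resp.content
                 if b.type == "tool_use"]
        if not calls:
            return resp  # model decided it was done
        results = []
        for c in calls:
            out = TOOLS_IMPL[c.name](**c.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": c.id,
                # str() keeps the sketch simple for non-string tool outputs.
                "content": str(out),
            })
        # Tool results go back to the model as the next user turn.
        messages.append(
            {"role": "user", "content": results}
        )
    raise StepBudgetExceeded()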

Notice what is missing: nowhere does a human choose what the second tool call should be. That choice happens at inference time, made by a model you did not train, on a prompt you only partly control, reading tool outputs that feed back into the next decision.
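The branch the loop takes each turn hinges entirely on the shape of the response, so it helps to see that shape pulled apart. A minimal sketch: `split_response` is a hypothetical helper name, and the two fake responses are `SimpleNamespace` stand-ins that imitate the attributes the SDK exposes on a message (`stop_reason`, and content blocks with `type`, `text`, `name`, `input`, `id`), so the snippet runs without an API key.

```python
# Sketch only: split_response is a hypothetical helper, and the fake
# responses below imitate the attributes the SDK exposes on a Message
# (stop_reason, content blocks with .type/.text/.name/.input/.id).
from types import SimpleNamespace as NS


def split_response(resp):
    """Return (text, tool_calls) from a Messages API style response."""
    text = "".join(b.text for b in resp.content if b.type == "text")
    calls = [b for b in resp.content if b.type == "tool_use"]
    return text, calls


wants_tool = NS(
    stop_reason="tool_use",
    content=[
        NS(type="text", text="Let me check the docs."),
        NS(type="tool_use", id="toolu_01", name="search_docs",
           input={"query": "step budgets"}),
    ],
)
finished = NS(
    stop_reason="end_turn",
    content=[NS(type="text", text="Here is the answer.")],
)

for resp in (wants_tool, finished):
    text, calls = split_response(resp)
    print(resp.stop_reason, [c.name for c in calls], repr(text))
```

An empty list of calls is the model's exit decision. Nothing in your code produced it.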

Why this breaks your instincts

Every reflex you built debugging pipelines assumes your code chose what happened next. That is what made behavior reproducible, tests meaningful, postmortems possible. You changed an input, read the branch, found the bug.

When the model owns the control flow, that reflex stops working. The same input can produce different execution graphs on different runs. A bug can be invisible on the step where it starts and only surface five steps later, after four tool calls that each looked fine alone. Your handler unit tests can pass at 100% while the system fails 40% of the time end to end, because the failure lives in the order the model picked, not in any single piece.
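One way to make "different execution graphs" visible is to collapse each run into its ordered tool sequence and count the distinct shapes for one input. This is a sketch under assumptions: `run_shape` and `shape_report` are hypothetical names, and the recorded runs are invented data standing in for whatever trace store you already have.

```python
# Sketch only: the recorded runs are invented data; in a real system they
# would come from whatever trace store you already have.
from collections import Counter


def run_shape(tool_calls):
    """Collapse one run into a hashable signature: the ordered tool names."""
    return tuple(name for name, _args in tool_calls)


def shape_report(runs):
    """Count how many runs of the same input took each shape."""
    return Counter(run_shape(r) for r in runs)


# Three hypothetical runs of the same request.
runs = [
    [("search_docs", {"query": "refund policy"})],
    [("search_docs", {"query": "refund policy"}),
     ("search_docs", {"query": "refunds"})],
    [("search_docs", {"query": "refund policy"})],
]

report = shape_report(runs)
for shape, count in report.most_common():
    print(count, " -> ".join(shape))
```

A router produces exactly one shape per label. If this report shows more than one shape for the same input, the model is choosing the path.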

These are not hypothetical. In GitHub issue #44726 on anthropics/claude-code, a user reports sessions with input/output token ratios of 74:1 and 175:1 against a normal 5:1 to 15:1 range, attributed to a compounding loop where accumulated history and file context grew unbounded across tool calls. Other user-filed issues in the same repo report infinite compaction loops and runaway write loops consuming large amounts of memory. None of those failure modes exist in a router. They exist in systems where the model owns the control flow.
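The ratio in that report is cheap to compute yourself, since each Messages API response carries a usage object with input and output token counts. A minimal sketch, assuming you collect those per call: the `Usage` dataclass here only mirrors those two fields, `session_ratio` and `is_runaway` are hypothetical names, and the 15:1 ceiling is borrowed from the range quoted in the issue, not from any official guidance.

```python
# Sketch only: Usage mirrors the input_tokens/output_tokens fields the
# Messages API reports per call; the 15:1 ceiling is taken from the range
# quoted in the issue, not from any official guidance.
from dataclasses import dataclass

RATIO_CEILING = 15.0


@dataclass
class Usage:
    input_tokens: int
    output_tokens: int


def session_ratio(usages):
    """Total input tokens divided by total output tokens for a session."""
    total_in = sum(u.input_tokens for u in usages)
    total_out = sum(u.output_tokens for u in usages)
    if total_out == 0:
        return float("inf")
    return total_in / total_out


def is_runaway(usages, ceiling=RATIO_CEILING):
    """Flag a session whose input/output ratio exceeds the ceiling."""
    return session_ratio(usages) > ceiling


# History is replayed every turn, so input grows while output stays flat.
growing = [Usage(input_tokens=2_000 * (turn + 1), output_tokens=150)
           for turn in range(10)]
steady = [Usage(input_tokens=1_500, output_tokens=150) for _ in range(10)]

print(round(session_ratio(growing), 1), is_runaway(growing))
print(round(session_ratio(steady), 1), is_runaway(steady))
```

Ten turns of linearly growing context is enough to push the made-up session above 70:1, while the flat one sits at 10:1.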

A pipeline can also loop forever, of course — if you write an infinite loop into it. The difference is where the failure lives. In a pipeline it lives at authoring time; you find it in code review and fix it with a break. In an agent the loop is the feature you deliberately wrote, and the thing deciding whether to exit is the model. You cannot code-review your way out of it.
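What you can do is constrain the loop from the outside at runtime. As one narrow illustration, here is a guard that trips when the model issues the same tool call with identical arguments too many times. `RepeatGuard`, `RepeatedCall`, and the limit of 3 are hypothetical choices of mine, and this catches only one failure shape; the model still decides everything else.

```python
# Sketch only: RepeatGuard and its limit are hypothetical; it catches one
# narrow failure (the same call with the same arguments, over and over).
import json
from collections import Counter


class RepeatedCall(Exception):
    """Raised when an identical tool call is seen too many times."""


class RepeatGuard:
    def __init__(self, limit=3):
        self.limit = limit
        self.seen = Counter()

    def check(self, name, args):
        """Record a call; raise once it has been made `limit` times."""
        key = (name, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        if self.seen[key] >= self.limit:
            raise RepeatedCall(f"{name} called {self.seen[key]} times "
                               f"with identical arguments")


guard = RepeatGuard(limit=3)
guard.check("search_docs", {"query": "refunds"})
guard.check("search_docs", {"query": "refunds"})
guard.check("search_docs", {"query": "refund policy"})  # different args
try:
    guard.check("search_docs", {"query": "refunds"})    # third identical call
except RepeatedCall as exc:
    print("stopped:", exc)
```

In a loop like the one above, `guard.check(c.name, c.input)` would sit just before the tool is executed.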

The test for the code you already have

Run the thing your team calls "the agent" through three questions.

Who picks the next step? Trace one real request. For each transition after a tool returns, ask whether your Python or the model chose the next tool. If it is the model every time, agent. If it is the switch in dispatch.py, workflow.

Who decides when to stop? In a workflow, termination is an edge in the graph. In an agent, termination is a decision the model makes each turn, and max_steps is the seatbelt that catches it when the decision does not come. If your seatbelt has ever fired in production, you have an agent.

Could you draw the execution graph before the run? If the shape is a function of your code, workflow. If the best you can do is draw a loop and write "the model picks" over the branches, agent.
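The second question is the easiest of the three to answer with data rather than opinion: count how often the step budget fires. A minimal sketch, assuming your loop raises something like `StepBudgetExceeded`; the `count_seatbelt` decorator, the in-memory `metrics` counter, and `fake_agent` are all hypothetical stand-ins for your real entry point and metrics client.

```python
# Sketch only: fake_agent stands in for your real loop, and the in-memory
# Counter stands in for whatever metrics client you already use.
from collections import Counter
from functools import wraps


class StepBudgetExceeded(Exception):
    """Raised by the loop when MAX_STEPS runs out."""


metrics = Counter()


def count_seatbelt(run_agent):
    """Wrap an agent entry point and count how often the budget fires."""
    @wraps(run_agent)
    def wrapper(request):
        metrics["runs"] += 1
        try:
            return run_agent(request)
        except StepBudgetExceeded:
            metrics["seatbelt_fired"] += 1
            raise
    return wrapper


@count_seatbelt
def fake_agent(request):
    # Hypothetical behaviour: requests containing "loop" never finish.
    if "loop" in request:
        raise StepBudgetExceeded()
    return "done"


for request in ["summarize this", "loop forever", "translate that"]:
    try:
        fake_agent(request)
    except StepBudgetExceeded:
        pass

print(dict(metrics))
```

A nonzero `seatbelt_fired` count in production answers the question: the model, not your code, was deciding when to stop.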

Anthropic, worth noting, is the most conservative voice here. The same post tells you to find the simplest solution first and increase complexity only when needed, which "might mean not building agentic systems at all." Most systems shipped with "agent" in the name should be workflows. Build the agent when the task's shape genuinely is not known in advance. Otherwise do not, and lie to no one about what you shipped.

Open the file your team has been calling an agent. Find the loop. Read the exit condition. Ask who wrote it, then ask who actually decides, at runtime, when it fires.

If you draw the line where Anthropic draws it, the payoff is the failure catalog it predicts: runaway loops, cost tied to context length, tool selection that decays as the registry grows. Agents in Production is the build-and-ship side of that catalog — loop budgets, seatbelts, and the operational habits that keep a model-driven loop honest. Observability for LLM Applications is the other half: tracing those non-deterministic runs and pricing them before finance does. Both books sit in The AI Engineer's Library.