Pramoda Sahu

Posted on Jun 18

Understanding the Agent Loop: How Tool-Using LLM Systems Actually Work

#ai #opensource #python #devops

If you are building with tool-calling models, the most important design decision is often not the prompt. It is the loop around the model.

An LLM can decide it wants to use a tool, but it cannot execute that tool by itself. The surrounding application or SDK has to assemble context, inspect the model response, run tools, append results, and continue until a final answer is produced. That runtime cycle is the agent loop.

This article explains what the agent loop actually is, where the model stops and the harness begins, how tool calling works step by step, and which engineering tradeoffs show up once you move beyond demos.

TL;DR

An agent loop is the execution cycle that lets a model inspect context, request tools, observe results, and continue until it reaches a final answer.
The model is only one part of the system. The harness or SDK owns orchestration: prompt assembly, tool execution, retries, approvals, and termination.
State management matters as much as prompting. If you lose prior tool outputs or conversation continuity, the agent will behave like it forgot what just happened.
Performance depends heavily on prompt growth control, stable prompt prefixes, caching, and bounded tool output.
Safe agent design requires validation, approval gates for side effects, and clear rules for concurrency and history propagation.

The Agent Loop Is the System, Not Just the Model

The core problem is simple: a one-shot model call cannot inspect the world, act on it, and adapt to the result unless something outside the model manages that cycle.

That is the harness's job.

OpenAI's Codex architecture describes a user interaction as a turn, but a single turn may contain multiple internal iterations of model inference and tool execution. The OpenAI Agents SDK describes the same idea directly: invoke the agent, check whether there is final output, handle handoffs if needed, otherwise execute tool calls and re-run.

A practical mental model looks like this:

Build the input state.
Call the model.
Inspect the response.
If the model requested tools, validate and execute them.
Append tool results back into context.
Call the model again.
Stop only when the model returns a final answer.

That means the harness, not the model alone, is responsible for:

Prompt assembly
Message history management
Tool schema registration
Tool execution
Validation and error handling
Retry logic
Approval workflows
State persistence
Loop termination

This is why two systems using the same model can behave very differently. Their harnesses may make different decisions about context, tool ordering, truncation, approvals, and continuation.

What Goes Into a Single Turn

Before the loop can run, the system needs to define what the model sees.

The input state

A typical turn includes:

System or developer instructions
Tool definitions or schemas
Previous messages
Previous tool-call results
The current user request
Sometimes environment state, session metadata, or hidden runtime instructions

This matters because follow-up reasoning depends on prior observations being present. If the model requested a tool in one iteration and the result is not added back correctly, the next iteration cannot build on that work.

Inner loop vs outer loop

There are really two loops to think about:

Inner loop: model inference and tool execution inside a single user turn
Outer loop: the broader multi-turn conversation across user follow-ups

This distinction shows up clearly in Codex-style architectures. A user asks for something once, but the agent may internally perform several tool steps before replying. Then the next user message arrives, and the entire conversation thread continues from that accumulated state.

That is why state continuity is not optional. Without it, the outer loop breaks and the inner loop starts reasoning from an incomplete view of reality.

How the Model Decides Between Text and a Tool Call

Once the harness provides the current turn state, the model has a decision boundary: answer directly, or request one or more tools.

Tool calling works because the model is given structured tool definitions. Instead of producing only natural language, it can emit a structured request indicating which tool it wants and which arguments it wants to pass.

At that point, the model is effectively yielding control back to the application.

With custom tools, the client harness must take over, run the tool, and return the result. With hosted tools, more of that orchestration can happen inside the API itself.

This is an important architectural choice:

Tool type	Who orchestrates execution?	Main tradeoff
Hosted tool	API/runtime handles more of the loop	Simpler orchestration, less direct control
Custom function tool	Client harness executes it	More flexibility, more operational responsibility
MCP tool	Depends on integration and discovery flow	Adds discovery and caching concerns

The advantage of client-side orchestration is control. The cost is that you now own the failure modes.

Tool Execution Mechanics in Practice

Once the model emits a tool request, the harness needs to do more than just run it.

Validate before execution

A safe harness should validate:

Tool name
Argument structure
Argument types
Permission rules
Whether the tool is read-only or mutating

This is not just a security concern. It is also a quality concern. If the model asks for a tool with invalid arguments, returning an explicit tool error often gives it enough signal to self-correct on the next loop iteration.

Return the observation in the right format

The model needs a structured observation that closes the action-observation cycle.

A minimal pattern looks like this:

response = client.responses.create(
    input=initial_question,
    **MODEL_DEFAULTS,
)

while True:
    function_responses = invoke_functions_from_response(response)

    if len(function_responses) == 0:
        print(response.output_text)
        break

    print("More reasoning required, continuing...")
    response = client.responses.create(
        input=function_responses,
        previous_response_id=response.id,
        **MODEL_DEFAULTS,
    )

The key detail is not just the loop itself. It is that the next request continues from the previous response and includes the tool outputs produced by the harness.

A more explicit observation payload looks like this:

context.append({
    "type": "function_call_output",
    "call_id": tool_call.call_id,
    "output": str(result),
})

response_2 = client.responses.create(
    model="o3",
    input=context,
    tools=tools,
    store=False,
    include=["reasoning.encrypted_content"],
)

print(response_2.output_text)

That function_call_output item is the observation that lets the model continue reasoning with the tool result now available in context.

State Management Patterns: Where Many Agents Fail

One of the easiest ways to break an agent is to lose state continuity.

Common state strategies

There are several patterns in current OpenAI tooling:

Full history replay managed by the client
previous_response_id for server-managed continuation
conversation_id for conversation continuity
SDK-managed session persistence

Each approach has tradeoffs.

Full replay vs server-managed continuation

With full replay, the client sends all prior messages and tool results every time. This is simple to reason about, but payload size grows quickly.

With server-managed continuation, the client can send the new input along with a continuation identifier such as previous_response_id. That reduces payload size and offloads some history management.

This example from the Agents SDK shows response chaining:

from agents import Agent, Runner

async def main():
    agent = Agent(name="Assistant", instructions="Reply very concisely.")
    previous_response_id = None

    while True:
        user_input = input("You: ")

        # Setting auto_previous_response_id=True enables response chaining
        # automatically for the first turn, even when there is no actual
        # previous response ID yet.
        result = await Runner.run(
            agent,
            user_input,
            previous_response_id=previous_response_id,
            auto_previous_response_id=True,
        )

        previous_response_id = result.last_response_id
        print(f"Assistant: {result.final_output}")

This is convenient, but you still need to choose a consistent state strategy.

Do not mix incompatible modes

The Agents SDK documentation explicitly warns against combining session persistence with conversation_id, previous_response_id, or auto_previous_response_id in the same run path.

That is a practical design rule: pick one continuity model per call flow.

If you mix them, debugging becomes much harder because it is no longer obvious which state the model is actually seeing.

Prompt Growth, Caching, and Why Stable Prefixes Matter

As the loop continues, context grows.

Every new model call may include prior instructions, tool schemas, user messages, and tool outputs. If you simply keep appending everything forever, the number of bytes sent over the lifetime of a conversation can grow quickly.

Why Codex emphasizes prompt prefixes

The Codex architecture discussion highlights a useful principle: keep old prompt content as an exact prefix of the new prompt whenever possible. That improves prompt-cache reuse.

In practical terms, stable ordering matters for:

System instructions
Tool definitions
Environment metadata
Prior messages

If these move around between calls, cacheability drops. The same issue affects reproducibility. Even tool-definition ordering bugs can introduce cache misses and inconsistent behavior.

Compaction strategies

A production harness usually needs some combination of:

Truncating verbose tool output
Summarizing old history
Keeping static instructions stable and early
Bounding shell or retrieval output
Preserving only the most relevant observations verbatim

This matters even more for shell, retrieval, or computer-use tasks, where output can become noisy very quickly.

The goal is not just lower cost. It is maintaining a usable reasoning substrate for the model.

Safety and Control in the Loop

The more powerful the tools, the more important the harness becomes.

Approval gates and side effects

Read-only tool calls are different from side-effectful operations.

For example:

Fetching documentation is relatively low risk
Sending an email, editing a file, or executing a deployment is high risk

Mutating actions should often be:

Serialized instead of run concurrently
Approval-gated
Sandboxed when possible
Logged with enough metadata for auditability

This is one reason agent frameworks expose concurrency settings and approval workflows.

Validate arguments, not intentions

You cannot safely assume that a tool request is correct just because it came from the model. Validate the arguments before execution, and return structured error feedback when something is wrong.

That gives the loop a chance to recover without silently doing the wrong thing.

Do not over-prompt reasoning models

OpenAI's function-calling guidance for reasoning models notes that you should not force extra "think more before every function call" prompting. Reasoning models already perform internal reasoning, and excessive prompting can degrade performance.

That is a useful reminder that harness quality is often more important than prompt verbosity.

Multi-Agent Extensions and Their Tradeoffs

Once a single-agent loop works, teams often add handoffs or agent-as-tool patterns.

Conceptually, the loop stays the same:

Invoke one agent.
Detect whether it produced final output, a tool request, or a handoff.
Route execution accordingly.
Continue until termination.

The Agents SDK summarizes the semantics clearly:

The agent will run in a loop until a final output is generated. The loop runs like so:

1. The agent is invoked with the given input.
2. If there is a final output (i.e. the agent produces something of type `agent.output_type`), the loop terminates.
3. If there's a handoff, we run the loop again, with the new agent.
4. Else, we run tool calls (if any), and re-run the loop.

The tricky part is not the idea of handoffs. It is history propagation.

Recent community discussions show that when one agent is exposed as a tool to another, developers are often unsure how much history is forwarded automatically. In practice, this means you should not assume that all relevant context follows the handoff unless your framework explicitly guarantees it.

For multi-agent systems, explicit context composition is often safer than implicit inheritance.

Common Failure Modes and Debugging Strategies

Most agent bugs look obvious in hindsight.

Failure mode 1: losing continuity

Symptoms:

The agent repeats itself
It forgets prior tool results
MCP tool discovery keeps happening again

Check whether you are correctly passing previous_response_id, conversation_id, or full message history.

Failure mode 2: context flooding

Symptoms:

Long, low-quality responses
Poor tool selection
The model misses relevant facts

Check whether tool output is too verbose. Cap output size, summarize logs, and keep only useful observations.

Failure mode 3: unstable prompt construction

Symptoms:

Cache misses
Inconsistent behavior across similar runs
Higher token usage than expected

Check the ordering of instructions, tool schemas, and environment metadata.

Failure mode 4: unsafe tool execution

Symptoms:

Invalid API calls
Accidental side effects
Hard-to-reproduce failures

Validate tool names and arguments before execution. Treat tool requests as proposals, not commands.

Failure mode 5: incorrect concurrency

Symptoms:

Race conditions
Conflicting writes
Non-deterministic outcomes

Run read-only operations concurrently only when safe. Serialize or approval-gate mutating operations.

Practical Architecture Takeaways

The recent OpenAI ecosystem changes make one thing clear: the important boundary is no longer just model prompting. It is orchestration design.

The Responses API, Agents SDK, MCP integrations, and Codex harness examples all point to the same execution model:

The model chooses actions
The harness controls reality
State continuity determines coherence
Prompt discipline determines scalability
Safety controls determine whether the system is usable in practice

If you are building an agent today, the fastest path to a better system is often not a new prompt. It is a better loop.

Key Takeaways

The agent loop is the action-observation cycle that makes tool-using LLM systems possible.
The harness owns orchestration: context assembly, tool execution, validation, retries, approvals, and termination.
State continuity is critical. Losing prior responses or tool outputs breaks reasoning quality quickly.
Server-managed continuation can simplify history handling, but you should choose one state strategy consistently.
Prompt growth is an engineering problem. Stable prefixes, truncation, compaction, and bounded tool output all matter.
Hosted tools and custom tools shift the orchestration boundary in different ways.
Multi-agent patterns introduce history propagation and control-flow complexity that should be designed explicitly.
Safe execution requires argument validation, side-effect controls, and careful concurrency handling.

Conclusion

An agent loop is not a small implementation detail. It is the core runtime pattern that turns a model into a working system.

Once you see the loop clearly, many design decisions make more sense: why history management matters, why tool output must be bounded, why prompt ordering affects cacheability, and why side effects need approval and validation.

If you are building with tool-calling models, make the loop explicit first. Define how state is carried forward, how tools are validated, how observations are appended, and how the run terminates. In practice, that foundation will usually improve reliability more than any prompt tweak.

DEV Community

Understanding the Agent Loop: How Tool-Using LLM Systems Actually Work

TL;DR

The Agent Loop Is the System, Not Just the Model

What Goes Into a Single Turn

The input state

Inner loop vs outer loop

How the Model Decides Between Text and a Tool Call

Tool Execution Mechanics in Practice

Validate before execution

Return the observation in the right format

State Management Patterns: Where Many Agents Fail

Common state strategies

Full replay vs server-managed continuation

Do not mix incompatible modes

Prompt Growth, Caching, and Why Stable Prefixes Matter

Why Codex emphasizes prompt prefixes

Compaction strategies

Safety and Control in the Loop

Approval gates and side effects

Validate arguments, not intentions

Do not over-prompt reasoning models

Multi-Agent Extensions and Their Tradeoffs

Common Failure Modes and Debugging Strategies

Failure mode 1: losing continuity

Failure mode 2: context flooding

Failure mode 3: unstable prompt construction

Failure mode 4: unsafe tool execution

Failure mode 5: incorrect concurrency

Practical Architecture Takeaways

Key Takeaways

Further Reading

Conclusion

Top comments (0)