If you are building with tool-calling models, the most important design decision is often not the prompt. It is the loop around the model.
An LLM can decide it wants to use a tool, but it cannot execute that tool by itself. The surrounding application or SDK has to assemble context, inspect the model response, run tools, append results, and continue until a final answer is produced. That runtime cycle is the agent loop.
This article explains what the agent loop actually is, where the model stops and the harness begins, how tool calling works step by step, and which engineering tradeoffs show up once you move beyond demos.
TL;DR
- An agent loop is the execution cycle that lets a model inspect context, request tools, observe results, and continue until it reaches a final answer.
- The model is only one part of the system. The harness or SDK owns orchestration: prompt assembly, tool execution, retries, approvals, and termination.
- State management matters as much as prompting. If you lose prior tool outputs or conversation continuity, the agent will behave like it forgot what just happened.
- Performance depends heavily on prompt growth control, stable prompt prefixes, caching, and bounded tool output.
- Safe agent design requires validation, approval gates for side effects, and clear rules for concurrency and history propagation.
The Agent Loop Is the System, Not Just the Model
The core problem is simple: a one-shot model call cannot inspect the world, act on it, and adapt to the result unless something outside the model manages that cycle.
That is the harness's job.
OpenAI's Codex architecture describes a user interaction as a turn, but a single turn may contain multiple internal iterations of model inference and tool execution. The OpenAI Agents SDK describes the same idea directly: invoke the agent, check whether there is final output, handle handoffs if needed, otherwise execute tool calls and re-run.
A practical mental model looks like this:
- Build the input state.
- Call the model.
- Inspect the response.
- If the model requested tools, validate and execute them.
- Append tool results back into context.
- Call the model again.
- Stop only when the model returns a final answer.
That means the harness, not the model alone, is responsible for:
- Prompt assembly
- Message history management
- Tool schema registration
- Tool execution
- Validation and error handling
- Retry logic
- Approval workflows
- State persistence
- Loop termination
This is why two systems using the same model can behave very differently. Their harnesses may make different decisions about context, tool ordering, truncation, approvals, and continuation.
What Goes Into a Single Turn
Before the loop can run, the system needs to define what the model sees.
The input state
A typical turn includes:
- System or developer instructions
- Tool definitions or schemas
- Previous messages
- Previous tool-call results
- The current user request
- Sometimes environment state, session metadata, or hidden runtime instructions
This matters because follow-up reasoning depends on prior observations being present. If the model requested a tool in one iteration and the result is not added back correctly, the next iteration cannot build on that work.
Inner loop vs outer loop
There are really two loops to think about:
- Inner loop: model inference and tool execution inside a single user turn
- Outer loop: the broader multi-turn conversation across user follow-ups
This distinction shows up clearly in Codex-style architectures. A user asks for something once, but the agent may internally perform several tool steps before replying. Then the next user message arrives, and the entire conversation thread continues from that accumulated state.
That is why state continuity is not optional. Without it, the outer loop breaks and the inner loop starts reasoning from an incomplete view of reality.
How the Model Decides Between Text and a Tool Call
Once the harness provides the current turn state, the model has a decision boundary: answer directly, or request one or more tools.
Tool calling works because the model is given structured tool definitions. Instead of producing only natural language, it can emit a structured request indicating which tool it wants and which arguments it wants to pass.
At that point, the model is effectively yielding control back to the application.
With custom tools, the client harness must take over, run the tool, and return the result. With hosted tools, more of that orchestration can happen inside the API itself.
This is an important architectural choice:
| Tool type | Who orchestrates execution? | Main tradeoff |
|---|---|---|
| Hosted tool | API/runtime handles more of the loop | Simpler orchestration, less direct control |
| Custom function tool | Client harness executes it | More flexibility, more operational responsibility |
| MCP tool | Depends on integration and discovery flow | Adds discovery and caching concerns |
The advantage of client-side orchestration is control. The cost is that you now own the failure modes.
Tool Execution Mechanics in Practice
Once the model emits a tool request, the harness needs to do more than just run it.
Validate before execution
A safe harness should validate:
- Tool name
- Argument structure
- Argument types
- Permission rules
- Whether the tool is read-only or mutating
This is not just a security concern. It is also a quality concern. If the model asks for a tool with invalid arguments, returning an explicit tool error often gives it enough signal to self-correct on the next loop iteration.
Return the observation in the right format
The model needs a structured observation that closes the action-observation cycle.
A minimal pattern looks like this:
response = client.responses.create(
input=initial_question,
**MODEL_DEFAULTS,
)
while True:
function_responses = invoke_functions_from_response(response)
if len(function_responses) == 0:
print(response.output_text)
break
print("More reasoning required, continuing...")
response = client.responses.create(
input=function_responses,
previous_response_id=response.id,
**MODEL_DEFAULTS,
)
The key detail is not just the loop itself. It is that the next request continues from the previous response and includes the tool outputs produced by the harness.
A more explicit observation payload looks like this:
context.append({
"type": "function_call_output",
"call_id": tool_call.call_id,
"output": str(result),
})
response_2 = client.responses.create(
model="o3",
input=context,
tools=tools,
store=False,
include=["reasoning.encrypted_content"],
)
print(response_2.output_text)
That function_call_output item is the observation that lets the model continue reasoning with the tool result now available in context.
State Management Patterns: Where Many Agents Fail
One of the easiest ways to break an agent is to lose state continuity.
Common state strategies
There are several patterns in current OpenAI tooling:
- Full history replay managed by the client
-
previous_response_idfor server-managed continuation -
conversation_idfor conversation continuity - SDK-managed session persistence
Each approach has tradeoffs.
Full replay vs server-managed continuation
With full replay, the client sends all prior messages and tool results every time. This is simple to reason about, but payload size grows quickly.
With server-managed continuation, the client can send the new input along with a continuation identifier such as previous_response_id. That reduces payload size and offloads some history management.
This example from the Agents SDK shows response chaining:
from agents import Agent, Runner
async def main():
agent = Agent(name="Assistant", instructions="Reply very concisely.")
previous_response_id = None
while True:
user_input = input("You: ")
# Setting auto_previous_response_id=True enables response chaining
# automatically for the first turn, even when there is no actual
# previous response ID yet.
result = await Runner.run(
agent,
user_input,
previous_response_id=previous_response_id,
auto_previous_response_id=True,
)
previous_response_id = result.last_response_id
print(f"Assistant: {result.final_output}")
This is convenient, but you still need to choose a consistent state strategy.
Do not mix incompatible modes
The Agents SDK documentation explicitly warns against combining session persistence with conversation_id, previous_response_id, or auto_previous_response_id in the same run path.
That is a practical design rule: pick one continuity model per call flow.
If you mix them, debugging becomes much harder because it is no longer obvious which state the model is actually seeing.
Prompt Growth, Caching, and Why Stable Prefixes Matter
As the loop continues, context grows.
Every new model call may include prior instructions, tool schemas, user messages, and tool outputs. If you simply keep appending everything forever, the number of bytes sent over the lifetime of a conversation can grow quickly.
Why Codex emphasizes prompt prefixes
The Codex architecture discussion highlights a useful principle: keep old prompt content as an exact prefix of the new prompt whenever possible. That improves prompt-cache reuse.
In practical terms, stable ordering matters for:
- System instructions
- Tool definitions
- Environment metadata
- Prior messages
If these move around between calls, cacheability drops. The same issue affects reproducibility. Even tool-definition ordering bugs can introduce cache misses and inconsistent behavior.
Compaction strategies
A production harness usually needs some combination of:
- Truncating verbose tool output
- Summarizing old history
- Keeping static instructions stable and early
- Bounding shell or retrieval output
- Preserving only the most relevant observations verbatim
This matters even more for shell, retrieval, or computer-use tasks, where output can become noisy very quickly.
The goal is not just lower cost. It is maintaining a usable reasoning substrate for the model.
Safety and Control in the Loop
The more powerful the tools, the more important the harness becomes.
Approval gates and side effects
Read-only tool calls are different from side-effectful operations.
For example:
- Fetching documentation is relatively low risk
- Sending an email, editing a file, or executing a deployment is high risk
Mutating actions should often be:
- Serialized instead of run concurrently
- Approval-gated
- Sandboxed when possible
- Logged with enough metadata for auditability
This is one reason agent frameworks expose concurrency settings and approval workflows.
Validate arguments, not intentions
You cannot safely assume that a tool request is correct just because it came from the model. Validate the arguments before execution, and return structured error feedback when something is wrong.
That gives the loop a chance to recover without silently doing the wrong thing.
Do not over-prompt reasoning models
OpenAI's function-calling guidance for reasoning models notes that you should not force extra "think more before every function call" prompting. Reasoning models already perform internal reasoning, and excessive prompting can degrade performance.
That is a useful reminder that harness quality is often more important than prompt verbosity.
Multi-Agent Extensions and Their Tradeoffs
Once a single-agent loop works, teams often add handoffs or agent-as-tool patterns.
Conceptually, the loop stays the same:
- Invoke one agent.
- Detect whether it produced final output, a tool request, or a handoff.
- Route execution accordingly.
- Continue until termination.
The Agents SDK summarizes the semantics clearly:
The agent will run in a loop until a final output is generated. The loop runs like so:
1. The agent is invoked with the given input.
2. If there is a final output (i.e. the agent produces something of type `agent.output_type`), the loop terminates.
3. If there's a handoff, we run the loop again, with the new agent.
4. Else, we run tool calls (if any), and re-run the loop.
The tricky part is not the idea of handoffs. It is history propagation.
Recent community discussions show that when one agent is exposed as a tool to another, developers are often unsure how much history is forwarded automatically. In practice, this means you should not assume that all relevant context follows the handoff unless your framework explicitly guarantees it.
For multi-agent systems, explicit context composition is often safer than implicit inheritance.
Common Failure Modes and Debugging Strategies
Most agent bugs look obvious in hindsight.
Failure mode 1: losing continuity
Symptoms:
- The agent repeats itself
- It forgets prior tool results
- MCP tool discovery keeps happening again
Check whether you are correctly passing previous_response_id, conversation_id, or full message history.
Failure mode 2: context flooding
Symptoms:
- Long, low-quality responses
- Poor tool selection
- The model misses relevant facts
Check whether tool output is too verbose. Cap output size, summarize logs, and keep only useful observations.
Failure mode 3: unstable prompt construction
Symptoms:
- Cache misses
- Inconsistent behavior across similar runs
- Higher token usage than expected
Check the ordering of instructions, tool schemas, and environment metadata.
Failure mode 4: unsafe tool execution
Symptoms:
- Invalid API calls
- Accidental side effects
- Hard-to-reproduce failures
Validate tool names and arguments before execution. Treat tool requests as proposals, not commands.
Failure mode 5: incorrect concurrency
Symptoms:
- Race conditions
- Conflicting writes
- Non-deterministic outcomes
Run read-only operations concurrently only when safe. Serialize or approval-gate mutating operations.
Practical Architecture Takeaways
The recent OpenAI ecosystem changes make one thing clear: the important boundary is no longer just model prompting. It is orchestration design.
The Responses API, Agents SDK, MCP integrations, and Codex harness examples all point to the same execution model:
- The model chooses actions
- The harness controls reality
- State continuity determines coherence
- Prompt discipline determines scalability
- Safety controls determine whether the system is usable in practice
If you are building an agent today, the fastest path to a better system is often not a new prompt. It is a better loop.
Key Takeaways
- The agent loop is the action-observation cycle that makes tool-using LLM systems possible.
- The harness owns orchestration: context assembly, tool execution, validation, retries, approvals, and termination.
- State continuity is critical. Losing prior responses or tool outputs breaks reasoning quality quickly.
- Server-managed continuation can simplify history handling, but you should choose one state strategy consistently.
- Prompt growth is an engineering problem. Stable prefixes, truncation, compaction, and bounded tool output all matter.
- Hosted tools and custom tools shift the orchestration boundary in different ways.
- Multi-agent patterns introduce history propagation and control-flow complexity that should be designed explicitly.
- Safe execution requires argument validation, side-effect controls, and careful concurrency handling.
Further Reading
If you want to go deeper, these resources are worth reading next:
- OpenAI function-calling guide
- OpenAI reasoning function-calls cookbook
- OpenAI Agents SDK running agents documentation
- OpenAI's Codex architecture write-up on the agent loop
- OpenAI MCP tool guide
Conclusion
An agent loop is not a small implementation detail. It is the core runtime pattern that turns a model into a working system.
Once you see the loop clearly, many design decisions make more sense: why history management matters, why tool output must be bounded, why prompt ordering affects cacheability, and why side effects need approval and validation.
If you are building with tool-calling models, make the loop explicit first. Define how state is carried forward, how tools are validated, how observations are appended, and how the run terminates. In practice, that foundation will usually improve reliability more than any prompt tweak.
Top comments (0)