Before You Add More Agents, Design the Control Plane

#agents #architecture #openai #systemdesign

OpenAI Agents Python (the OpenAI Agents SDK for Python) makes it easy to describe agents, connect tools, define handoffs, and run agentic workflows. That is useful, but it also creates a trap: teams start adding agents before they define the operational boundaries that make those agents safe to use in a real repository.

The hard part is usually not getting the first demo to run. The hard part is knowing when an agent should start, what it is allowed to touch, what evidence it must leave behind, when it can hand work to another agent, and how the team will recover when the workflow fails.

For production use, I would start with a control plane.

This does not need to be a heavy platform. A Markdown checklist, a JSON policy file, and a trace log can be enough for the first version. The key is that the rules exist outside the model's temporary reasoning. They become part of the workflow, not something you hope the model remembers.

1. Define the task entry contract

An agent should not start from a vague instruction like "fix this feature" or "improve this repo." That may be fine for a toy demo, but it is too wide for real work.

A task entry contract should answer five questions:

What is the goal?
What input is trusted?
What files, services, or systems are in scope?
What is the acceptance standard?
When should the agent stop instead of improvising?

For example, a safer engineering task might say:

Read only the package under src/connectors. Modify only connector_policy.py and related tests. Keep the resulting git diff intact so it can be reviewed. Run the connector test suite. If the requested behavior conflicts with an existing policy rule, stop and return the conflict instead of rewriting the policy.

That kind of instruction is not just prompt polish. It reduces the agent's degrees of freedom. It turns an open-ended request into an executable contract.
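To make "executable contract" concrete, here is a minimal sketch of the example above as plain Python data, checked by the surrounding system rather than by the model. It does not use any OpenAI Agents SDK API. The class name, the field names, the goal text, and the test file location are all assumptions made for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Tuple


@dataclass(frozen=True)
class TaskContract:
    """Hypothetical entry contract; every field name here is illustrative."""

    goal: str                          # What is the goal?
    trusted_inputs: Tuple[str, ...]    # What input is trusted?
    readable_roots: Tuple[str, ...]    # What is in scope for reading?
    writable_files: Tuple[str, ...]    # What is in scope for modification?
    acceptance: str                    # What is the acceptance standard?
    stop_conditions: Tuple[str, ...]   # When should the agent stop?

    def can_read(self, path: str) -> bool:
        """Allow reads only inside a declared root; reject '..' traversal."""
        p = PurePosixPath(path)
        if ".." in p.parts:
            return False
        return any(
            root == path or PurePosixPath(root) in p.parents
            for root in self.readable_roots
        )

    def can_write(self, path: str) -> bool:
        """Allow writes only to files listed explicitly (exact match)."""
        return path in self.writable_files


# The example instruction from the text, restated as data.
CONNECTOR_TASK = TaskContract(
    goal="Implement the requested connector policy change",  # illustrative
    trusted_inputs=("the task description",),
    readable_roots=("src/connectors",),
    writable_files=(
        "src/connectors/connector_policy.py",
        "src/connectors/tests/test_connector_policy.py",  # assumed location
    ),
    acceptance="connector test suite passes",
    stop_conditions=(
        "requested behavior conflicts with an existing policy rule",
    ),
)
```

A file tool wrapper could call can_read and can_write before touching disk, so an out-of-scope edit is refused by the system even if the model proposes it.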

The business value is simple: fewer surprising edits, fewer review cycles, and less time spent asking why an agent touched something unrelated.

2. Separate tools by risk

Tool access should not be binary. "The agent can use tools" is too broad. Searching files does not carry the same risk as deleting a directory, publishing an article, or calling a production API.

I prefer three buckets.

Low-risk tools can run directly. Examples: read a file, search for symbols, inspect documentation, list a directory, or open a local artifact.

Medium-risk tools can run if they leave evidence. Examples: modify a draft, generate a patch, run tests, create a report, or produce a migration plan. The output should be inspectable.

High-risk tools require an explicit gate. Examples: destructive git commands, deleting files, pushing to a remote, publishing content, spending money, modifying production infrastructure, or calling external APIs with side effects.

OpenAI Agents Python gives you a framework for building the workflow. It does not automatically know your risk model. That risk model belongs in your engineering system.

If your agent can publish content, the publication action should not be treated the same way as writing a local draft. If your agent can modify code, the modification should not be treated the same way as reading code. If your agent can call production services, the system needs a gate before side effects happen.
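One way to express the three buckets is a small authorization function that every tool call passes through. This is a sketch under stated assumptions, not part of the SDK: the tool names, the Risk enum, and the authorize signature are hypothetical, and unknown tools are denied by default as a design choice of this example.

```python
from enum import Enum
from typing import Optional, Tuple


class Risk(Enum):
    LOW = "low"        # may run directly
    MEDIUM = "medium"  # may run if it leaves inspectable evidence
    HIGH = "high"      # requires an explicit approval gate


# Hypothetical tool names; a real registry would list the tools you expose.
TOOL_RISK = {
    "read_file": Risk.LOW,
    "search_symbols": Risk.LOW,
    "generate_patch": Risk.MEDIUM,
    "run_tests": Risk.MEDIUM,
    "git_push": Risk.HIGH,
    "publish_article": Risk.HIGH,
}


def authorize(
    tool: str,
    evidence_path: Optional[str] = None,
    approved_by: Optional[str] = None,
) -> Tuple[bool, str]:
    """Return (allowed, reason) for a proposed tool call."""
    risk = TOOL_RISK.get(tool)
    if risk is None:
        # Tools that were never classified are refused, not guessed at.
        return False, "unknown tool: denied by default"
    if risk is Risk.LOW:
        return True, "low risk: allowed"
    if risk is Risk.MEDIUM:
        if evidence_path:
            return True, f"medium risk: evidence at {evidence_path}"
        return False, "medium risk: no evidence destination given"
    # Remaining case is Risk.HIGH.
    if approved_by:
        return True, f"high risk: approved by {approved_by}"
    return False, "high risk: explicit approval required"
```

With this shape, writing a local draft and publishing it go through the same function but hit different branches, which is the distinction the paragraph above asks for.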

This is where many agent workflows become fragile. The model may be capable, but the surrounding system has no authority model.

3. Make handoffs evidence-based

Multi-agent workflows are attractive because they map nicely to human roles: researcher, planner, coder, reviewer, publisher. But every handoff creates a new failure point.

A handoff table should define:

When a handoff is allowed
Which agent receives the task
What evidence must be passed along
Which cases block the handoff

A research agent should not hand work to a writing agent by saying "I found the sources." It should pass source links, key claims, contradictions, uncertain points, and the reason those sources are relevant.

A coding agent should not hand work to a release agent by saying "fixed." It should pass the diff, tests run, tests skipped, remaining risk, and rollback path.

That evidence is the difference between agentic collaboration and a chain of guesses.

The more agents you add, the more important this becomes. Without evidence-based handoffs, every downstream agent has to infer what the upstream agent meant. That makes failures harder to debug and easier to repeat.
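A handoff table can be as small as a dictionary that names the required evidence per route. The sketch below is plain Python with hypothetical agent role names and evidence keys taken from the two examples above. It is not the SDK's handoff mechanism, only a check that could run before one is triggered.

```python
from typing import Any, Dict, List

# (sender, receiver) -> evidence keys that must accompany the handoff.
REQUIRED_EVIDENCE = {
    ("researcher", "writer"): {
        "source_links",
        "key_claims",
        "contradictions",
        "uncertain_points",
        "relevance",
    },
    ("coder", "release"): {
        "diff",
        "tests_run",
        "tests_skipped",
        "remaining_risk",
        "rollback_path",
    },
}


def check_handoff(sender: str, receiver: str, evidence: Dict[str, Any]) -> List[str]:
    """Return the reasons a handoff is blocked; an empty list means allowed."""
    required = REQUIRED_EVIDENCE.get((sender, receiver))
    if required is None:
        return [f"no handoff rule from {sender} to {receiver}"]
    # A key must be present and not None. An empty list is accepted because
    # "no tests skipped" is itself a statement the receiver can rely on.
    missing = sorted(key for key in required if evidence.get(key) is None)
    return [f"missing evidence: {key}" for key in missing]
```

A coding agent that passes only a diff would be blocked with four named gaps, which is easier to debug than a release agent guessing what "fixed" meant.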

4. Treat trace as a product feature

When an agent workflow fails, the least useful conclusion is "the model was unreliable." That may be true, but it does not tell you what to improve.

A useful trace should capture:

The task goal
The input and source material
The rules that were active
The tools that were called
The files or external systems touched
The verification result
The failure reason
The rule or workflow change suggested for next time

You do not need a complex observability backend on day one. A structured Markdown worklog or JSONL trace can be enough. What matters is that each failure is recorded in a form that can later be turned into a better rule.
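As a rough illustration of the JSONL option, the sketch below stores one trace record per line using only the standard library. The record fields mirror the list above. The class name, field names, and default values are assumptions of this example, and the SDK's own tracing features are not used here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class TraceRecord:
    goal: str
    inputs: List[str]
    active_rules: List[str]
    tools_called: List[str] = field(default_factory=list)
    touched: List[str] = field(default_factory=list)  # files or external systems
    verification: str = "not run"
    failure_reason: Optional[str] = None
    suggested_change: Optional[str] = None


def to_jsonl(record: TraceRecord) -> str:
    """Serialize one record as a single JSON line (no trailing newline)."""
    return json.dumps(asdict(record), ensure_ascii=False)


def append_trace(path: str, record: TraceRecord) -> None:
    """Append the record to a JSONL file, one record per line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(to_jsonl(record) + "\n")
```

Because each line is independent JSON, the file can be filtered with ordinary text tools long before a dedicated backend exists.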

If a failure came from a vague task, improve the task entry contract. If a failure came from excessive permission, tighten the tool policy. If a failure came from a weak handoff, change the handoff table. If a failure came from missing verification, add a test or preflight check.
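The mapping in the previous paragraph can be written down directly. In this sketch the failure cause labels are hypothetical tags a reviewer might attach to trace records. The function only counts which part of the control plane the recorded failures point at.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical cause labels -> the control-plane component to change.
FIX_TARGET = {
    "vague_task": "task entry contract",
    "excess_permission": "tool policy",
    "weak_handoff": "handoff table",
    "missing_verification": "tests or preflight checks",
}


def fix_targets(failure_causes: Iterable[str]) -> List[Tuple[str, int]]:
    """Count failures per component, most frequent first."""
    counts = Counter(
        FIX_TARGET.get(cause, "unclassified: review manually")
        for cause in failure_causes
    )
    return counts.most_common()
```

Run over a week of traces, a summary like this suggests which rule to tighten first instead of leaving the team with "the model was unreliable."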

This is how an agent workflow gets more reliable over time. Not by hoping the next model will magically be better, but by converting failures into rules.

5. Start with one real workflow

The wrong move is to design a giant multi-agent platform first. The better move is to choose one low-risk but real workflow.

Good first workflows include:

Updating documentation after a code change
Reviewing a pull request for missing tests
Classifying issues into actionable buckets
Producing a release note from a verified diff
Preparing a technical article draft with source links and disclosure

For each workflow, define the entry contract, tool policy, handoff table, and trace format. Then run it repeatedly. The goal is not to prove that agents are impressive. The goal is to prove that the workflow reduces repeated coordination while preserving reviewability and rollback.
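A simple preflight check can refuse to run a workflow until all four parts are defined. The following is a minimal sketch: the workflow dictionary layout and the file paths in it are invented for illustration.

```python
from typing import Any, Dict, List

REQUIRED_PARTS = ("entry_contract", "tool_policy", "handoff_table", "trace_format")


def missing_parts(workflow: Dict[str, Any]) -> List[str]:
    """List the control-plane parts the workflow has not defined yet."""
    return [part for part in REQUIRED_PARTS if not workflow.get(part)]


# Hypothetical first workflow; the paths are placeholders, not real files.
docs_update = {
    "name": "update-docs-after-code-change",
    "entry_contract": "contracts/docs_update.md",
    "tool_policy": "policies/docs_update.json",
    "handoff_table": None,  # not written yet, so the check reports it
    "trace_format": "jsonl",
}
```

Here missing_parts(docs_update) reports the handoff table, so the gap shows up before the first run rather than in a failure trace.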

If a small workflow becomes stable, expand it. If it keeps failing, the trace should tell you whether the problem is the task, the permissions, the handoff, the verification, or the model.

The practical takeaway

OpenAI Agents Python is a useful foundation for building agent workflows. But the production value comes from the control plane around it.

Before adding more agents, define:

How tasks enter the system
Which tools are allowed under which conditions
What evidence is required for handoff
How traces feed back into better rules

That is less exciting than a flashy demo, but it is the difference between an agent that merely runs and an agent workflow that a team can actually trust.

Disclosure: this is an unofficial Doramagic technical note. It is not an official OpenAI publication and does not represent the upstream project unless explicitly stated by that project.