Ricardo Lara
Part 2: what changed when I stopped treating my multi-agent system as an idea and started running it for real

In the first part, I explained why I ended up building a multi-agent flow instead of continuing to push everything into a single conversation. The idea still made sense: separating responsibilities, using different models depending on the phase, and keeping human approval before implementation gave me more order, better cost control, and less noise.

But at that stage I was still solving the conceptual problem.

This second stage was different. It was no longer about defending the idea, but about actually running it. And that was where the problems appeared that do not show up in a diagram or in a strong narrative: permissions, AI systems that do not behave the same way, processes that need a real interactive terminal, configuration that ages badly, and orchestration decisions that sound good on paper but do not hold up in practice.

The biggest change: I stopped thinking in terms of a pipeline and started thinking in terms of a runtime

I think the best way to explain this evolution is this: agentflow stopped being just a way to organize prompts, files, and steps, and started becoming an explicit runtime.

That changed the way I saw it.

Before, the configuration described more of what had to be generated. Now it describes how each role runs: which provider it uses, which model, which effort level, which sandbox, and which prompt governs it. It is no longer just a tool to assemble a flow, but a base for running real roles with more control.
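As a rough illustration of that shift, a runtime-first role definition might look like this. All field names and values here are assumptions based on what I just described, not agentflow's exact schema:

```typescript
// Illustrative shape of a runtime-first role definition: the config
// describes how a role runs, not just what it should generate.
// Every field name below is an assumption, not agentflow's real schema.
interface RoleConfig {
  provider: "claude" | "codex" | "opencode"; // which external system runs this phase
  model: string;                             // model used for this role
  effort: "low" | "medium" | "high";         // effort level for the phase
  sandbox: "workspace-write" | "read-only";  // write intent, translated per provider
  prompt: string;                            // prompt file that governs the role
}

const implementer: RoleConfig = {
  provider: "claude",
  model: "example-model-id", // placeholder, not a real model name
  effort: "medium",
  sandbox: "workspace-write",
  prompt: "prompts/implementer.md",
};
```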

That shift matters because it made me see something that was not fully clear before: separating roles well is not enough. You also need the runtime that makes that separation viable.

Real execution was what showed where the gap actually was

The clearest finding of this stage was simple and brutal: the agents could not write files autonomously.

Not because the design was missing: the sandbox field already existed and the intent was correct. The problem was more uncomfortable. The Claude adapter was not translating that intent into the real CLI flags, so the system ran, but it kept getting blocked asking for permission on every write.

That was one of those moments that forces you to land the idea in reality. Because that is when you understand that a system running and a system working are not the same thing.

The fix was direct, but the lesson mattered more than the fix. For Claude Code I had to translate workspace-write into --dangerously-skip-permissions and read-only into --permission-mode plan. Codex already handled that side more cleanly with --sandbox workspace-write. OpenCode, on the other hand, still has a more structural limitation because its CLI does not expose an equivalent flag.
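The translation itself is small enough to sketch. The flags below are the ones mentioned above; the function and type names are illustrative, not agentflow's actual adapter API:

```typescript
// Sketch: translating an abstract sandbox intent into provider-specific
// CLI flags. The flags are the real ones discussed in the text; the
// function and type names are illustrative.
type Sandbox = "workspace-write" | "read-only";

function sandboxFlags(provider: string, sandbox: Sandbox): string[] {
  switch (provider) {
    case "claude":
      // Claude Code has no direct sandbox flag, so the intent is mapped
      // onto its permission flags instead.
      return sandbox === "workspace-write"
        ? ["--dangerously-skip-permissions"]
        : ["--permission-mode", "plan"];
    case "codex":
      // Codex exposes the intent directly.
      return ["--sandbox", sandbox];
    default:
      // e.g. OpenCode: no equivalent flag is exposed, so nothing to pass.
      return [];
  }
}
```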

You do not discover that problem by refining prompts. You discover it by running the system.

It also became clear that orchestration is not the same as delegation

Another thing this stage made very clear was that the orchestrator I had in mind was still not closing the last mile well enough.

In theory, agentflow run already existed and already had sequencing logic. But in practice, when Claude Code, Codex, or OpenCode were participating in a real session, that command was not enough. The bootstrap skills were too shallow. They mostly delegated and stopped there. They did not provide enough context to decide when to stop, which steps to run, how to handle the review loop, or when to ask for human approval.

That was when something that now feels self-evident became obvious: there is no single correct mode of orchestration.

CLI mode makes sense for automation, CI, and deterministic execution. But an interactive session needs something else: an agent with enough judgment to classify the task, present a plan, wait for approval before implementing, and decide how to move forward based on context. Forcing one mechanism to serve both created more friction than it removed.
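The decision between the two modes can be made explicit rather than implicit. A minimal sketch, with illustrative names and an assumed context shape:

```typescript
// Sketch: orchestration mode as a first-class decision rather than a
// property of a single command. Type and function names are illustrative.
type OrchestrationMode = "cli" | "interactive";

interface RunContext {
  isTTY: boolean;          // attached to a real terminal?
  ci: boolean;             // running under CI?
  requireApproval: boolean; // does this flow need human sign-off mid-run?
}

function pickMode(ctx: RunContext): OrchestrationMode {
  // Deterministic execution (CI, or no terminal) gets the pipeline-style
  // CLI mode; anything needing human approval mid-flow goes interactive.
  if (ctx.ci || !ctx.isTTY) return "cli";
  return ctx.requireApproval ? "interactive" : "cli";
}
```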

Not every task deserves the full pipeline

Another improvement that feels genuinely important in this stage was finally grounding the classifier.

In part 1, the intuition was already there: not every task should cost the same. But it was still more of a thesis than a real system capability.

Now there is a role that classifies complexity as small, medium, or large, and that changes the flow. A small change does not need to go through the full ceremony. A larger task does justify the complete pipeline, a review loop, and possible model adjustments. And if a project comes from an older configuration and does not yet have that role, the system does not break: it falls back to a heuristic path and keeps running.
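The fallback behavior is the part worth sketching: prefer the dedicated role when it exists, otherwise degrade to a heuristic instead of breaking. The signals and thresholds below are purely illustrative, not agentflow's actual heuristic:

```typescript
// Sketch: complexity classification with a heuristic fallback for older
// configurations that do not define a classifier role. The heuristic's
// signals and thresholds are illustrative assumptions.
type Complexity = "small" | "medium" | "large";

interface Task {
  filesTouched: number;
  description: string;
}

function heuristicClassify(task: Task): Complexity {
  // Toy heuristic: few files and no scary keywords means small.
  if (task.filesTouched <= 2 && !/migrat|refactor/i.test(task.description)) {
    return "small";
  }
  return task.filesTouched > 10 ? "large" : "medium";
}

function classify(
  task: Task,
  classifierRole?: (t: Task) => Complexity,
): Complexity {
  // Prefer the dedicated classifier role; fall back to the heuristic so
  // older configurations keep running instead of erroring out.
  return classifierRole ? classifierRole(task) : heuristicClassify(task);
}
```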

I like this because it moves optimization to the right place. Not after spending time and tokens, but before.

Put more simply: it makes no sense to treat a small bug like a complex migration.

Providers are not interchangeable

This stage also helped me let go of a simplification that is tempting in the abstract: thinking that all AI providers are more or less the same.

They are not.

When I talk about an AI provider, I mean the external system that executes a specific task inside the flow. It can be Claude, Codex, or something else. It is basically the service I delegate work to in one phase of the process.

And once I pushed this into real execution, it became clear that these providers do not behave in the same way. Permissions change, integration styles change, process handling changes, and even the way they expect to be run changes.

In some cases, it is also not enough to launch a command and wait for a response. Some tools need to run inside a real interactive terminal, as if they were opened directly in the console. That is usually called a TTY, but in plain language it means this: the tool needs a real terminal to work properly.

That is what pushed me toward different execution strategies depending on the provider. For some cases, a pipe-based execution worked fine. For Codex, I ended up needing a real PTY with node-pty, because its interface can fail or hang if it does not run in a genuinely interactive terminal.
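The selection logic itself is simple; the hard part was discovering it was needed. In a sketch like this, "pipe" would map to child_process.spawn and "pty" to node-pty's spawn. The provider-to-strategy table below is an illustrative assumption:

```typescript
// Sketch: per-provider execution strategies. Only the selection logic is
// shown; the actual launch would use child_process.spawn for "pipe" and
// node-pty for "pty". The table entries are illustrative assumptions.
type ExecStrategy = "pipe" | "pty";

const STRATEGY: Record<string, ExecStrategy> = {
  claude: "pipe",   // behaves well over plain stdin/stdout pipes
  codex: "pty",     // can fail or hang without a real interactive terminal
  opencode: "pipe",
};

function strategyFor(provider: string): ExecStrategy {
  // Default to the simpler pipe path for unknown providers.
  return STRATEGY[provider] ?? "pipe";
}
```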

It sounds like a minor detail, but it is not. Because working with agents is not only about working with text. It is also about processes, permissions, terminals, and real errors. And if that is not designed well, the whole system feels fragile even if the core idea is strong.

Several useful improvements were not flashy, but they were necessary

There were also less visible improvements that mattered more than they seem.

One was stopping the dependency on a rigid testRunner field in the config. That kind of field ages badly. You change the project, change the runner, or change the stack, and you end up carrying stale instructions. It felt much better to let the tester detect that from the project itself when the field is not defined.
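Detection instead of configuration can be sketched as a lookup over the project's own manifest. The example below only inspects a package.json-shaped object; the real detection could also look at lockfiles or config files, and the runner list is an illustrative assumption:

```typescript
// Sketch: detecting the test runner from the project itself instead of
// carrying a stale testRunner field. Names and the runner list are
// illustrative assumptions, not agentflow's real detection logic.
interface Manifest {
  scripts?: Record<string, string>;
  devDependencies?: Record<string, string>;
}

function detectTestRunner(
  manifest: Manifest,
  configured?: string,
): string | undefined {
  if (configured) return configured; // an explicit field still wins
  const deps = manifest.devDependencies ?? {};
  for (const runner of ["vitest", "jest", "mocha"]) {
    if (runner in deps) return runner;
  }
  // Last resort: trust the npm test script if one exists.
  return manifest.scripts?.test ? "npm test" : undefined;
}
```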

These are not flashy changes, but they are the kind that make a tool stop feeling rigid.

Not everything is closed yet, but these are the right problems to have

I do not want to describe this stage as if everything were already perfect, because that would not be true.

There are still open gaps. The test suite passes, but it still does not cover the runtime-first contract deeply enough, especially around real adapter execution, agent run, and the classifier. The documentation is already much more aligned with the current runtime, but OpenCode still has a real sandbox limitation that does not depend only on agentflow.

But honestly, those already feel like healthy problems.

Because I am no longer debating whether the idea makes sense. I am no longer in the phase of defending the thesis. I am now in the phase of closing concrete gaps: compatibility, documentation, robustness, and execution consistency.

And I strongly prefer being there.

What part 2 really left me with

If part 1 was about why a multi-agent system made more sense than one giant conversation, part 2 is about something else: what happens when that idea leaves the page and meets reality.

That was where the real gaps showed up: permissions, effective orchestration, complexity management, providers that do not behave the same, processes, fragile defaults, and traces.

The original idea did not collapse. If anything, it came out stronger.

But now I see it more completely: a multi-agent architecture is not ready just because it looks good in the design. It is ready when it can actually run without breaking on basic things.

Closing

The first version taught me how to separate responsibilities.

This second stage forced me to build the runtime that makes that separation viable.

And real execution ended up teaching me the most important thing: between "this runs" and "this works the way it should", there is a large distance. That distance is not closed with more theory. It is closed by running the system, observing where it fails, and correcting it with concrete changes.

That, for me, is what this second part is really about.
