There is a heated debate currently circulating in the software engineering world. On one side, purists argue that you must read every single line of code an AI generates, treating it with extreme suspicion. On the other, "accelerationists" trust the output blindly, shipping code they've never truly inspected because "it just works."
I recently read a piece comparing AI code generation to compiler abstraction—arguing that just as we don't read the Assembly code a C compiler produces, we shouldn't need to read the code an LLM produces. The premise is provocative, and it deserves a serious response before we dismiss it.
The answer isn't about blind trust, nor is it about paralyzing micromanagement. It is about understanding autonomy—and recognizing that we are in a transitional moment that demands a specific kind of skill.
In the automotive industry, there is a massive difference between Level 5 autonomy (the car has no steering wheel; you sleep in the back) and Level 3 autonomy (the car drives itself on the highway, but you must remain in the driver's seat, ready to take control).
Too many developers are treating AI like Level 5. They want to "fire and forget." But to build truly complex, robust systems today, we must operate at Level 3. We need to stop being just "coders" and start being Orchestrators.
The Compiler Argument: A Steelman and a Response
Let's take the compiler analogy seriously. The strongest version of this argument goes like this:
"Every layer of abstraction requires an act of faith. No one reads the Assembly output of their C compiler. No one audits the Linux kernel before deploying. Few developers read the source code of the libraries they import. Trust is already delegated across countless layers. Why should LLM-generated code be different?"
This is a valid point. We already operate in a world of delegated trust.
However, there are crucial differences between a compiler and an LLM:
- Determinism: A compiler produces the same output for the same input, every time. An LLM does not.
- Formal verification: Compilers are tested against rigorous specifications. LLMs are probabilistic black boxes.
- Failure modes: A compiler fails loudly (syntax error, type mismatch). An LLM fails silently—it produces plausible-looking code that may be subtly wrong.
So the skeptics have a point, right? Not entirely.
The System, Not the Model
The mistake is focusing on the LLM in isolation. Modern AI Agents don't just generate code—they verify it. An Agent that:
- Writes code (non-deterministic)
- Compiles it (deterministic)
- Runs tests (deterministic)
- Iterates on failures (feedback loop)
...is a system with deterministic checkpoints. The generative process is stochastic, but the output converges toward something verified.
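To make that concrete, here is a minimal sketch of such a loop in Python. The `generate_code` function is a hypothetical placeholder for whatever model or Agent API you use; the compile and test steps are the real, deterministic checkpoints.

```python
import subprocess
import py_compile
from pathlib import Path


def generate_code(prompt: str, feedback: str = "") -> str:
    """Hypothetical stand-in for your LLM or Agent call; swap in the real API."""
    raise NotImplementedError


def agent_loop(prompt: str, test_cmd: list[str], max_iterations: int = 5) -> Path | None:
    """Stochastic generation wrapped in deterministic checkpoints."""
    candidate = Path("candidate.py")
    feedback = ""

    for _ in range(max_iterations):
        candidate.write_text(generate_code(prompt, feedback))

        # Checkpoint 1 (deterministic): does it even compile?
        try:
            py_compile.compile(str(candidate), doraise=True)
        except py_compile.PyCompileError as err:
            feedback = f"Compilation failed: {err}"
            continue

        # Checkpoint 2 (deterministic): do the tests pass?
        result = subprocess.run(test_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return candidate  # verified output
        feedback = f"Tests failed:\n{result.stdout}\n{result.stderr}"

    return None  # no convergence: the human takes the wheel
```

The generation step may be a coin flip, but the gate it has to pass through is not.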
And let's be honest: humans aren't deterministic either. A pilot flying a commercial aircraft is subject to fatigue, bias, distraction, and error. Yet we trust pilots. Why? Not because they're infallible, but because they operate within a system of checklists, instruments, and co-pilots that catches mistakes.
Trust doesn't require determinism. It requires statistical reliability and recovery mechanisms.
The Agent + tooling system is starting to provide exactly that.
The Illusion of the "Magic Button"
That said, we're not at Level 5 yet.
If you ask an AI Agent to "build me a real-time object detection system in optimized C++ on a Raspberry Pi," and then you walk away, you will fail. The AI might hallucinate a library, use an inefficient memory structure, or create a "black box" that works once but breaks under load.
This is where the "read every line" crowd has a valid point: The devil is in the details.
However, reading 1,000 lines of generated code to find that devil is inefficient. We become the bottleneck. The solution is not to read more syntax, but to change how we verify. We need to shift from reading code to verifying behavior and controlling the flow.
The Orchestrator's Workflow: A Case for Modularity
In my work with Computer Vision and LLMs, I have developed a workflow that moves away from line-by-line review and towards behavioral verification and modular checkpoints.
Here is what the "Level 3" workflow looks like in practice:
1. The Architect Phase (Brainstorming)
Before a single line of code is written, I use a high-level model (like Claude or Gemini) as a sounding board. We discuss the architecture. We validate the tech stack.
- Human role: Define the goal, constraints, and scope.
- AI role: Validate feasibility, suggest libraries, and outline the structure.
2. The Agent Phase (Modular Execution)
This is where most people go wrong. They write a prompt like "build me X" and wait for a miracle. That's not orchestration—that's wishful thinking.
The core principle: I don't ask the Agent to build the castle. I ask it to cut the stones.
Breaking Down the Problem
Every complex project can be decomposed into units that are:
- Independently testable: I can verify this piece works without the rest
- Small enough to debug: If it fails, I know where to look
- Clear in scope: The Agent knows exactly what "done" means
For a computer vision pipeline, this might look like:
| Step | Task | Verification |
|---|---|---|
| 1 | Write the training script in PyTorch | Model trains, loss decreases, no crashes |
| 2 | Export weights to ONNX format | File exports, loads correctly, outputs match |
| 3 | Write C++ inference loop | Compiles, runs, produces valid detections |
| 4 | Optimize for target hardware | Meets FPS threshold, memory within bounds |
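To make the first row's verification concrete, here is the kind of smoke test I would ask for, sketched with a tiny stand-in model rather than the real YOLOv8 pipeline: take one batch, run a handful of optimizer steps, and assert the loss actually drops.

```python
import torch
import torch.nn as nn


def test_loss_decreases():
    """Smoke test: a few optimizer steps on one batch should reduce the loss."""
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))  # stand-in model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    criterion = nn.CrossEntropyLoss()

    inputs = torch.randn(8, 16)
    targets = torch.randint(0, 4, (8,))

    def step() -> float:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        return loss.item()

    initial_loss = step()
    final_loss = initial_loss
    for _ in range(20):
        final_loss = step()

    assert final_loss < initial_loss, "Loss did not decrease: training is broken"
```

Ten lines of test buy you more certainty than an hour of reading generated training code.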
The Art of the Prompt
Each task needs a prompt that is specific but not micromanaging. There's a balance:
Too vague (Agent will make bad assumptions):
"Write the training code"
Too rigid (You're doing the Agent's job):
"Write a training script using PyTorch with Adam optimizer, learning rate 0.001, batch size 32, using CrossEntropyLoss..."
Just right (Clear goal, room for expertise):
"Write a PyTorch training script for a YOLOv8 object detection model. Use the dataset in /data/coco format. Include logging for loss metrics and save checkpoints every 10 epochs. Prioritize clean, readable code."
The Agent should have enough context to make intelligent decisions, but clear enough constraints to stay on track.
The Iteration Loop
Here's what actually happens in practice. It's never a single prompt:
- First attempt: Agent produces something. Often 70-80% right.
- I run it: Does it compile? Does it crash? What's the output?
- Feedback: "The model trains but loss plateaus after epoch 5. Add learning rate scheduling and data augmentation."
- Second attempt: Agent adjusts. Now it's 90% right.
- Edge cases: "What happens with an empty batch? Add error handling."
- Polish: "Add docstrings and type hints."
This dialogue is the actual work. The code is a byproduct.
The "Git" Gate
This is non-negotiable: I do not let the Agent proceed to step N+1 until step N is rigorously tested and committed.
When a module reaches a stable state:
- I test it thoroughly—happy path, edge cases, failure modes
- If it works, I commit with a clear message
- That commit is a save point
- We don't touch that module again unless a regression occurs
Why is this critical? Because Agents have no memory of what worked. If you let them refactor freely across the whole codebase, they will break things that were working. The Git gate creates islands of stability in a sea of iteration.
When the Agent Gets Stuck
Sometimes the Agent goes in circles. It fixes one thing, breaks another, fixes that, breaks the first thing again. This is the signal to grab the steering wheel.
Options:
- Reframe the problem: "Stop. Let's take a different approach. Instead of X, try Y."
- Provide documentation: "Here's the library docs. Use this specific function."
- Reduce scope: "Forget optimization for now. Just make it work correctly first."
- Split further: The task was still too big. Break it down more.
The Agent is not a genius that's temporarily confused. It's a powerful tool that needs direction. When it spins, you steer.
What "Verify & Test" Actually Means
"Testing" is not just "it runs without crashing." For each module, I define:
- Functional correctness: Does it do what it's supposed to?
- Edge cases: Empty inputs, malformed data, boundary conditions
- Performance: Is it fast enough? Memory usage acceptable?
- Integration: Does it play nice with the modules before and after it?
For the training script:
- Does the loss actually decrease? (not just "no errors")
- Can I load a checkpoint and resume?
- Does it handle a corrupted image in the dataset gracefully?
For the ONNX export:
- Does the exported model produce the same outputs as the PyTorch model?
- Within what numerical tolerance?
For the C++ inference:
- Does it detect objects correctly on test images?
- What's the latency? What's the memory footprint?
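The ONNX question above is the easiest one to make mechanical. Here is a minimal sketch of the parity check, assuming a model whose forward pass returns a single tensor and an already-exported model.onnx; the real YOLOv8 export has a more complex output, but the principle is the same.

```python
import numpy as np
import onnxruntime as ort
import torch


def check_onnx_parity(model: torch.nn.Module, onnx_path: str,
                      input_shape=(1, 3, 640, 640), atol: float = 1e-4) -> None:
    """Compare PyTorch and ONNX Runtime outputs on the same random input."""
    model.eval()
    dummy = torch.randn(*input_shape)

    with torch.no_grad():
        torch_out = model(dummy).numpy()  # assumes a single-tensor output

    session = ort.InferenceSession(onnx_path)
    input_name = session.get_inputs()[0].name
    onnx_out = session.run(None, {input_name: dummy.numpy()})[0]

    # The interesting question is not "does it match" but "within what tolerance".
    max_diff = np.max(np.abs(torch_out - onnx_out))
    assert np.allclose(torch_out, onnx_out, atol=atol), (
        f"Outputs diverge: max abs difference {max_diff:.6f} exceeds atol={atol}"
    )
    print(f"ONNX parity OK, max abs difference {max_diff:.6f}")
```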
This is where my experience matters—not in writing the code, but in knowing what questions to ask.
3. The "Steering Wheel" Intervention (A Real Case Study)
Recently, I tasked an Agent with deploying a custom object detection model to an embedded system (Raspberry Pi).
The Agent did exactly what I asked. It wrote a Python implementation. It worked. But it ran at 2 FPS.
- A "Level 5" user (blind trust) would have stopped there, perhaps assuming the hardware wasn't powerful enough.
- A "Level 0" user (skeptic) would have rewritten the whole thing by hand.
As a "Level 3" pilot, I grabbed the steering wheel. I knew the architecture needed a shift. I found a specific C++ optimization library tailored for ARM processors. I didn't write the code myself; I gave the documentation to the Agent and said: "Refactor the inference loop using this specific library and handle the memory pointers carefully."
The result? We jumped from 2 FPS to 20 FPS.
The Agent did the heavy lifting (the syntax, the boilerplate, the compilation), but I provided the direction.
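A side note: "it ran at 2 FPS" and "we jumped to 20 FPS" are not impressions; they come from measuring. This is a minimal sketch of the benchmark I wrap around whatever inference callable the Agent produces (`infer` and `frames` are placeholders here, not part of any specific library):

```python
import time
from statistics import mean


def measure_fps(infer, frames, warmup: int = 5) -> float:
    """Time an inference callable over real frames and report average FPS."""
    for frame in frames[:warmup]:  # warm-up: caches, lazy allocations, model load
        infer(frame)

    timings = []
    for frame in frames[warmup:]:
        start = time.perf_counter()
        infer(frame)
        timings.append(time.perf_counter() - start)

    fps = 1.0 / mean(timings)
    print(f"Average latency: {mean(timings) * 1000:.1f} ms -> {fps:.1f} FPS")
    return fps
```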
From Vertical to Horizontal: A Clarification
This shift in workflow signals a deeper change in what it means to be an engineer.
For decades, society has pushed us toward Vertical Specialization. You are the expert on this specific screw in this specific engine. This creates "tunnel vision." When the world changes, the specialist often fails to adapt because they cannot see the context.
AI commoditizes vertical depth. It knows the syntax of Rust, C++, and Python better than I ever will. It knows the API documentation by heart.
This liberates us to embrace Horizontal Vision.
But let me be precise: AI commoditizes the execution of vertical expertise—writing the code, remembering the syntax, recalling the API. It does not yet commoditize the understanding.
In my Raspberry Pi case, I knew that 2 FPS was unacceptable. I knew that ARM-specific optimizations existed. I knew where to look. That knowledge is still vertical—but it's strategic vertical knowledge, not syntactic.
The "Level 3" Developer is a connector of dots:
- We connect the hardware constraints to the software architecture.
- We connect the business needs to the technical implementation.
- We connect the ethical implications to the algorithmic design.
This requires depth—but depth of judgment, not memorization.
A Personal Confession
I want to say something that might sound strange coming from someone writing about developer expertise.
I've been doing this for 40 years. I had a VIC-20 and was programming on it at 8 years old. I've spent 25 years as a professional. And after all that time, I don't feel like I've internalized some deep, surgical precision about software.
I have intuitions. I have a sense of how things work. I have computational thinking and a handful of good patterns. But I am not a scalpel. I make mistakes. I miss things. I forget.
The value I bring is not infallibility—it's pattern recognition. I know when something "smells wrong." I know when to stop and say: "Wait, this doesn't convince me. Let's dig deeper." That's it. That's what 40 years gave me. A nose, not a scalpel.
I say this because I've seen too many people in this industry call themselves "great developers" while leaving behind codebases that tell a different story: rushed code, no comments, maybe OOP, maybe not, spaghetti architecture held together by hope and caffeine.
And now, with AI, these same people have become instant geniuses.
They prompt an Agent, get a working output, and feel like architects. But the truth is: AI amplifies what you already are. If you had no discipline before, AI gives you faster chaos. If you had no architectural sense before, AI gives you more code to be confused by.
The humility to say "I don't fully understand this, let me verify" is more valuable than the confidence to say "ship it, it works."
The Steering Wheel Must Remain (For Now)
Ultimately, the argument shouldn't be about whether we read the code or run the tests. It should be about Responsibility.
An Agent has no concept of consequences. It doesn't care if a memory leak crashes a medical device, or if a bias in the dataset hurts a user. It doesn't care about the why, only the how.
That is why the steering wheel must exist—today.
We can use the autopilot on the highway of boilerplate code. We can let the Agent navigate the traffic of syntax errors. But when the road gets winding—when we deal with architecture, safety, and complex optimization—we must be the ones driving.
A Note on the Future
I want to be honest: I don't believe Level 3 is the permanent state of our profession.
Agents are improving rapidly. Their ability to self-verify, to catch their own mistakes, to ask clarifying questions, to reason about architecture—all of this is advancing. I estimate that within five years, the need for human "steering" will diminish dramatically. Perhaps it will disappear entirely for most tasks.
The Level 3 Developer is not the destination. It is the bridge.
Those who learn to operate at Level 3 now will develop something valuable: the intuition to know when systems are trustworthy and when they're not. That intuition will matter even when—especially when—we decide to let go of the wheel entirely.
Don't let the technology control you. Use it to become the architect you were always meant to be. And stay alert—because the road ahead is changing faster than any of us expected.