DEV Community: Rafael Tedesco

How to Build a Supervisor Agent Architecture Without Frameworks

Rafael Tedesco — Sat, 23 May 2026 02:31:03 +0000

A few days ago, I wrote about building an agentic pipeline from scratch in pure Python. The idea was intentionally simple: receive a task, invoke tools, and generate a response.

That architecture works surprisingly well for linear workflows. But real-world AI systems become complicated much faster than most tutorials suggest.

Eventually, one agent is no longer enough. A single reasoning loop starts accumulating too many responsibilities. The prompt grows uncontrollably, tool definitions pile up, execution logic becomes tangled, and debugging turns into a nightmare. What started as a clean “AI agent” slowly becomes a monolith trying to do everything at once.

This is where the Supervisor Pattern becomes useful.

Instead of relying on one giant agent, we introduce a central orchestrator responsible for coordinating specialized executors. Some executors may be tools, others may be reusable workflows, and others may be autonomous agents focused on a specific domain.

Conceptually, this is much closer to how modern AI systems operate internally. Systems like GitHub Copilot, Claude, and many enterprise AI platforms are not simply “one prompt talking to one model.” A significant part of the engineering complexity comes from orchestration: deciding what should execute, when it should execute, and how results should be combined.

In this article, we will build a simplified supervisor-based multi-agent architecture entirely in pure Python without relying on orchestration frameworks.

From Single Agents to Execution Runtimes

Most beginner agent tutorials follow roughly the same structure. A user sends a task, the agent reasons about it, invokes tools when necessary, and eventually produces an answer.

At small scale, this works well enough.

But imagine a more realistic request:

Research vector databases,
generate implementation examples,
analyze project files,
and review the generated solution.

A single agent now has to:

reason about research,
inspect files,
generate code,
review output,
manage tools,
maintain context,
and orchestrate execution order.

The problem is not necessarily model capability. The problem is architecture.

At some point, the “agent” stops being just a reasoning unit and accidentally becomes a workflow engine.

The Supervisor Pattern solves this by separating orchestration from execution.

Instead of one overloaded agent, we create specialized executors coordinated by a supervisor.

The architecture looks like this:

                    Supervisor
                         |
        +----------------+----------------+
        |                |                |
     Tools            Skills           Agents
        |                |                |
   File Search      Ticket Workflow   Coding Agent
   Web Search       Log Analysis      Research Agent

The important idea here is that the supervisor does not care whether it is invoking a tool, a workflow, or another agent. Everything follows the same execution contract.

Creating a Common Execution Interface

One of the simplest but most important architectural decisions is standardizing how execution works.

Instead of creating separate orchestration logic for tools, agents, and workflows, we can define a shared interface:

from abc import ABC, abstractmethod

class Executable(ABC):

    @abstractmethod
    async def execute(self, task: str, context: dict):
        pass

This abstraction becomes surprisingly powerful.

A search tool can implement it. A coding agent can implement it. A reusable workflow can implement it. To the supervisor, everything becomes just another executable component.

That greatly simplifies orchestration.

Building Specialized Executors

Now let’s create a few executors with different responsibilities.

We will start with a simple search tool:

class SearchTool(Executable):

    async def execute(self, task, context):

        return f"""
        Searching documentation for:
        {task}
        """

Then a file analysis tool:

class FileAnalysisTool(Executable):

    async def execute(self, task, context):

        return f"""
        Analyzing project files for:
        {task}
        """

These tools are intentionally small and focused. They represent atomic capabilities.

Now we can create specialized agents.

A research agent:

class ResearchAgent(Executable):

    async def execute(self, task, context):

        return f"""
        Research findings for:
        {task}
        """

A coding agent:

class CodingAgent(Executable):

    async def execute(self, task, context):

        return f"""
        Generated implementation for:
        {task}
        """

And finally, a review agent:

class ReviewAgent(Executable):

    async def execute(self, task, context):

        return f"""
        Code review completed for:
        {task}
        """

This separation of concerns is one of the biggest advantages of supervisor architectures. Each executor remains isolated and focused, which makes the overall system easier to scale and debug.

Dynamic Registration

Hardcoding dependencies quickly becomes painful as the number of executors grows.

Instead, we can create a registry capable of dynamically storing and discovering executors at runtime.

class Registry:

    def __init__(self):

        self.executables = {}

    def register(
        self,
        name,
        description,
        executable_type,
        instance
    ):

        self.executables[name] = {
            "description": description,
            "type": executable_type,
            "instance": instance
        }

    def get(self, name):

        return self.executables[name]

    def list(self):

        return self.executables

Now we can register everything dynamically:

registry = Registry()

registry.register(
    "search_tool",
    "Searches technical documentation",
    "tool",
    SearchTool()
)

registry.register(
    "file_analysis_tool",
    "Analyzes project files",
    "tool",
    FileAnalysisTool()
)

registry.register(
    "research_agent",
    "Performs research tasks",
    "agent",
    ResearchAgent()
)

registry.register(
    "coding_agent",
    "Generates code",
    "agent",
    CodingAgent()
)

registry.register(
    "review_agent",
    "Reviews implementations",
    "agent",
    ReviewAgent()
)

At this point, the system starts feeling much more like a runtime instead of a simple chatbot wrapper.

Building the Supervisor

The supervisor is the heart of the architecture.

Its responsibility is not necessarily to solve the task directly, but rather to decide which executors should participate in solving it.

We will start with a very simple planner:

class Supervisor:

    def __init__(self, registry):

        self.registry = registry

    async def plan(self, task):

        selected = []

        lower_task = task.lower()

        if "research" in lower_task:
            selected.append("research_agent")

        if "implement" in lower_task:
            selected.append("coding_agent")

        if "review" in lower_task:
            selected.append("review_agent")

        if "search" in lower_task:
            selected.append("search_tool")

        return selected

This planner is intentionally primitive. In production systems, this planning phase is often delegated to an LLM that returns structured execution plans.

But even this simplified version already demonstrates the architecture.

The supervisor receives a goal and decides dynamically which executors should participate.

Parallel Execution

Now comes the most interesting part.

Instead of executing everything sequentially, the supervisor can orchestrate independent tasks concurrently.

import asyncio

class Supervisor:

    def __init__(self, registry):

        self.registry = registry

    async def plan(self, task):

        selected = []

        lower_task = task.lower()

        if "research" in lower_task:
            selected.append("research_agent")

        if "implement" in lower_task:
            selected.append("coding_agent")

        if "review" in lower_task:
            selected.append("review_agent")

        if "search" in lower_task:
            selected.append("search_tool")

        return selected

    async def execute(self, task, context):

        selected = await self.plan(task)

        executions = []

        for name in selected:

            executable = self.registry.get(name)["instance"]

            executions.append(
                executable.execute(task, context)
            )

        results = await asyncio.gather(*executions)

        return results

This changes the nature of the system completely.

We are no longer building a linear tool-calling loop. We are building an orchestration runtime capable of coordinating distributed execution.

That distinction matters a lot.

Running the System

Now we can execute the entire pipeline:

async def main():

    supervisor = Supervisor(registry)

    result = await supervisor.execute(
        """
        Research vector databases,
        search implementation examples,
        implement a prototype,
        and review the generated code
        """,
        {}
    )

    for item in result:
        print(item)

asyncio.run(main())

The interesting part is not the mock outputs themselves. The interesting part is the orchestration model emerging underneath.

The supervisor analyzes the task, dynamically selects executors, parallelizes execution, and aggregates results back into a unified workflow.

That is much closer to how modern AI systems actually operate internally.

The Architectural Shift

At this point, the system has evolved far beyond a simple “AI chatbot.”

The supervisor is acting simultaneously as:

planner,
router,
scheduler,
orchestrator.

This is also why many production AI systems are significantly more complicated than “send prompt, receive response.”

A large portion of the engineering complexity comes from:

orchestration,
execution management,
concurrency,
state propagation,
retries,
failure isolation,
observability.

The model itself is only one piece of the system.

Moving Toward LLM-Based Planning

Our planner currently uses simple rule matching:

if "review" in task:
    selected.append("review_agent")

But modern systems usually replace this with an LLM planner capable of generating structured execution plans.

Something like:

prompt = f"""
Task:
{task}

Available executors:
- research_agent
- coding_agent
- review_agent
- search_tool

Return the executors required.
"""

The LLM might return:

{
  "executors": [
    "research_agent",
    "coding_agent",
    "review_agent"
  ],
  "parallel": true
}

At that point, the runtime becomes significantly more autonomous.

The supervisor is no longer following hardcoded execution paths. It is dynamically constructing execution graphs at runtime.

Production Realities

This is where things become genuinely difficult.

Once agents can invoke tools, workflows, and even other agents,
the runtime itself becomes the primary engineering challenge.

You suddenly need to think about:

recursion protection,
concurrency limits,
cancellation,
retries,
structured outputs,
tracing,
execution graphs,
timeout management.

For example, agents invoking agents can accidentally create infinite loops:

Supervisor -> ResearchAgent
ResearchAgent -> Supervisor
Supervisor -> ResearchAgent

You quickly realize that the difficult part of AI systems is often not model invocation.

The difficult part is building reliable orchestration around the model.

Final Thoughts

Once you understand the Supervisor Pattern, you stop thinking about AI agents as isolated chatbots.

You start thinking in terms of execution runtimes, orchestration graphs, distributed reasoning, and autonomous workflows.

That shift in perspective changes everything.

And interestingly, none of this requires a framework.

Underneath most orchestration libraries, the core execution model is still surprisingly simple:

await executable.execute(task, context)

Everything else is architecture layered on top.

Demystifying AI Agents: Building an Agentic Pipeline From Scratch in Pure Python

Rafael Tedesco — Thu, 21 May 2026 02:58:20 +0000

Most AI demos look impressive until you ask a simple question: What is actually happening under the hood?

Frameworks like LangChain, CrewAI, and Microsoft AutoGen make it incredibly easy to spin up an “AI agent” in a few lines of code. But abstractions come with a cost.

Many developers can build agents using frameworks without fully understanding the runtime architecture behind them. But what happens when something goes wrong and, because of the layers of abstraction, you don’t know exactly how to fix it? Or when you can’t even use those libraries and frameworks in your workplace and need to build an agentic application from scratch…

At their core, most agent frameworks are built around surprisingly simple primitives:

Prompt orchestration
Stateful memory
Tool execution
Control loops
Structured outputs

This week I was talking with a friend who wanted to understand how AI agents actually work under the hood. During that conversation, I realized something: most tutorials make AI agents feel far more mysterious than they really are.

Frameworks are great for moving fast, but they also hide many of the core mechanics behind layers of abstractions. You import a library, initialize an “agent,” attach a tool, and suddenly everything looks autonomous and intelligent. But underneath those abstractions, most agent systems are built on a surprisingly small set of concepts:

Prompts
Memory
Tool execution
Structured outputs
Control loops

So I decided to write the article I wish I had found when I first started exploring agentic systems. No heavy frameworks. No orchestration libraries. No hidden runtime magic. Just the core ideas, built step-by-step from scratch in pure Python.

In this article, we will strip away the abstractions and build a production-inspired agentic pipeline entirely from scratch using:

Pure Python
The standard library only
Native HTTP requests
No SDKs
No orchestration frameworks

By the end, you will understand the core mechanics behind modern AI agents and why most frameworks are essentially layered convenience abstractions over a deterministic execution loop.

What Is an Agentic Pipeline?

A standard LLM interaction is usually a single-shot transaction:

User Prompt ──> Model Response

The model receives context once and generates a static response. An agent, however, behaves differently. Instead of generating a single response, it operates inside a continuous execution cycle:

       ┌───────────────────────────────────────┐
       │                                       │
       ▼                                       │
[ THINK ] ───> (Decision) ───> [ ACT ] ───> [ OBSERVE ]
                               (Tool Call)   (Tool Result)

Think

The model evaluates the user objective, available tools, prior observations, and current memory state. It then decides what to do next.

Act

The agent executes an action. This could be calling a function, querying a database, searching the web, reading files, or returning a final answer.

Observe

The system captures the result of the action and feeds it back into the context window. The cycle repeats until the objective is complete.

A Helpful Mental Model

Think of an agent like a developer debugging a production issue:

Observe error logs
        │
        ▼
Form a hypothesis
        │
        ▼
  Run a command
        │
        ▼
 Inspect output
        │
        ▼
    Repeat

That iterative feedback loop is exactly how agentic systems operate.

Project Structure

We will organize the codebase into small, focused modules.

agentic-pipeline/
├── config.json       # Runtime configuration
├── llm_client.py     # Low-level HTTP client
├── memory.py         # Context/state manager
├── agent.py          # Agent orchestration engine
└── main.py           # Runtime execution loop

This separation mirrors how production systems are commonly structured.

Step 1 — Configuration Management

Avoid hardcoding runtime variables directly in code. For this demo we’re going to Create a config.json file just for demonstration purposes:

{
  "llm": {
    "provider": "openai",
    "model": "gpt-4o",
    "api_key": "sk-your-api-key",
    "temperature": 0.2,
    "max_tokens": 1024
  }
}

⚠️ Note: In production systems, credentials should come from environment variables or a secrets manager rather than static configuration files.

Step 2 — Building the Infrastructure Layer

Most SDKs hide the reality that every LLM interaction is just an HTTP request. Underneath the abstraction, the process is straightforward:

Serialize payload ──> Send HTTPS POST request ──> Receive JSON response ──> Parse output

Let’s implement that manually in llm_client.py.

import json
import urllib.request
import urllib.error
from typing import Dict, List

class LLMClient:
    def __init__(self, config: Dict):
        self.config = config["llm"]
        self.api_key = self.config["api_key"]

    def chat_completion(
        self,
        messages: List[Dict],
        temperature: float = None
    ) -> str:
        payload = {
            "model": self.config["model"],
            "messages": messages,
            "temperature": temperature or self.config.get("temperature", 0.2),
            "max_tokens": self.config.get("max_tokens", 1024)
        }

        data = json.dumps(payload).encode("utf-8")

        req = urllib.request.Request(
     "https://api.openai.com/v1/chat/completions",
            data=data,
            method="POST"
        )

        req.add_header("Content-Type", "application/json")
        req.add_header("Authorization", f"Bearer {self.api_key}")

        try:
            with urllib.request.urlopen(req) as response:
                result = json.loads(response.read().decode())
                return result["choices"][0]["message"]["content"].strip()
        except urllib.error.HTTPError as e:
            error_body = e.read().decode()
            raise Exception(f"LLM API error: {e.code} - {error_body}")

To understand what LLMClient is doing here, it helps to think of it like an old-school telegraph operator. This layer has no concept of reasoning, planning, or executing tools. It doesn't even manage memory. Its only job is to package up a stack of text, send it down the wire to the model, and hand you back the raw response. It moves the messages back and forth reliably without needing to understand a single word written inside them.

Step 3 — Managing Agent Memory

LLMs are stateless. They do not remember previous interactions unless the entire history is resent with every request. As the execution loop progresses, the context window continuously grows. We therefore need a lightweight memory manager in memory.py.

from typing import List, Dict

class AgentMemory:
    def __init__(self, max_messages: int = 20):
        self.messages: List[Dict] = []
        self.max_messages = max_messages

    def add(self, role: str, content: str):
        self.messages.append({
            "role": role,
            "content": content
        })

        if len(self.messages) > self.max_messages:
            # Preserve system prompt
            system_prompt = self.messages[0]

            # Slide conversation window
            active_history = self.messages[1:]
            self.messages = (
                [system_prompt] + 
                active_history[-(self.max_messages - 1):]
            )

    def get_messages(self) -> List[Dict]:
        return self.messages.copy()

    def clear(self):
        self.messages.clear()

If the LLM client is our telegraph operator, you can picture this memory manager like a detective's notebook. As the agent investigates a task, every tiny detail gets written down: the original user request, internal reasoning, tool choices, and the clues discovered along the way. Because the notebook can't hold infinite pages, the detective eventually has to archive old details while keeping the core investigation context front and center. That sliding window logic is exactly how we keep the context manageable.

Step 4 — Building the Agent Engine

This is where the orchestration logic lives. The agent must understand available tools, decide when to use them, parse structured outputs, execute functions, and feed observations back into memory. Let's write agent.py:

from llm_client import LLMClient
from memory import AgentMemory
from typing import Dict, Callable
import json

class Agent:
    def __init__(self, system_prompt: str, config_path: str = "config.json"):
        with open(config_path) as f:
            self.config = json.load(f)
        self.llm = LLMClient(self.config)
        self.memory = AgentMemory()
        self.system_prompt = system_prompt
        self.tools: Dict[str, dict] = {}

        self.memory.add("system", system_prompt)

    def register_tool(self, name: str, func: Callable, description: str):
        self.tools[name] = {
            "func": func,
            "description": description
        }

    def _get_tool_descriptions(self) -> str:
        if not self.tools:
            return "No tools available."
        return "\n".join([
            f"- {name}: {info['description']}"
            for name, info in self.tools.items()
        ])

    def think(self, user_input: str) -> str:
        self.memory.add("user", user_input)
        messages = self.memory.get_messages()
        tool_info = self._get_tool_descriptions()

        if self.tools:
            messages = messages.copy()
            enhanced_content = (
                f"{user_input}\n\n"
                f"AVAILABLE TOOLS:\n"
                f"{tool_info}\n\n"
                f"If you need a tool, respond ONLY with JSON:\n"
                f'{{"tool":"tool_name","args":{{}}}}\n\n'
                f"If the task is complete, respond naturally and include 'FINAL ANSWER'."
            )
            messages[-1]["content"] = enhanced_content

        response = self.llm.chat_completion(messages)
        self.memory.add("assistant", response)
        return response

    def act(self, response: str):
        if "{" in response and "}" in response:
            try:
                start = response.find("{")
                end = response.rfind("}") + 1
                tool_json = json.loads(response[start:end])

                tool_name = tool_json.get("tool")
                args = tool_json.get("args", {})

                if tool_name in self.tools:
                    result = self.tools[tool_name].get(“func”)(**args)
                    self.memory.add(
                        "system", 
                        f"Observation from '{tool_name}': {result}"
                    )
                    return result
            except Exception as e:
                error_msg = f"Tool execution failed: {str(e)}"
                self.memory.add("system", error_msg)
                return error_msg
        return None

This structural handoff brings up one of the most misunderstood parts of modern AI agents: the model does not execute your Python functions directly.

Instead, you are providing plain text descriptions of your local code inside the prompt. When the model reads these descriptions and decides it needs help, it simply formats its text output into a raw JSON block specifying a tool name and parameters. Your host application then catches that JSON, reads it, runs the native Python code locally, and passes the results back into the text history. The LLM itself remains entirely isolated, your local application serves as the actual execution environment.

Step 5 — The Runtime Control Loop

Without a runtime loop, the agent cannot perform multi-step reasoning. The host application must continuously drive execution forward. Let's look at main.py:

from agent import Agent
import time

def web_search(query: str) -> str:
    print(f"🔍 Searching index for: '{query}'")
    time.sleep(1)
    if "agentic ai" in query.lower():
        return (
            "Found: Modern agentic systems are moving away from rigid chains "
            "toward lightweight control loops and modular tools."
        )
    return (
        "Found: Building agents from scratch reveals implementation details "
        "often hidden by frameworks."
    )

if __name__ == "__main__":
    system_prompt = (
        "You are an autonomous operations assistant. "
        "Reason step-by-step. "
        "Use tools when necessary. "
        "When the task is fully complete, include the phrase FINAL ANSWER."
    )

    agent = Agent(system_prompt)
    agent.register_tool(
        name="search",
        func=web_search,
        description="Queries an index database. Input schema: {'query': str}"
    )

    task = "Research trends in agentic AI and explain why building from scratch is valuable."
    print(f"🎯 Objective: {task}")

    max_steps = 5
    for step in range(max_steps):
        print(f"\n[Cycle {step + 1}]")
        prompt = task if step == 0 else "Analyze previous observations and continue."

        response = agent.think(prompt)
        print(f"\n🤖 Agent:\n{response}")

        tool_output = agent.act(response)
        if tool_output:
            print(f"\n🛠 Observation:\n{tool_output}")

        if "final answer" in response.lower():
            print("\n✅ Objective completed.")
            break

Tracing the Runtime Execution

Here is a look at what happens internally during execution over two separate cycles:

Cycle 1

Think: The model receives the task, tool descriptions, and the initial system memory state. It realizes it lacks direct information about current trends.

Act: The model emits structured JSON:

{
  "tool": "search",
  "args": {
    "query": "latest trends in agentic AI"
  }
}

The runtime parses this block and executes the local Python web_search function.

Observe: The tool output gets appended back into memory. The model now has additional context to continue reasoning.

Cycle 2

The model reviews the original objective, prior observations, and tool outputs. It synthesizes a complete response and emits:


plaintext
FINAL ANSWER

The control loop detects this completion keyword and exits gracefully.

What You Actually Built

Underneath all the abstractions, you implemented a fully working pipeline:

Stateful memory
Tool registration
Structured tool calling
Runtime orchestration
Multi-step execution
Context management
Deterministic control flow

That is the foundation of nearly every modern agent framework.

Production Considerations

This implementation is intentionally minimal. Real production systems typically add:

Domain	Operational Mechanics
Resilience & Tracking	Retry policies, Token accounting, Observability & tracing
Data & Run Management	Parallel tool execution, Sandboxed runtimes, Rate limiting
Architecture Scaling	Distributed orchestration, Long-term memory persistence layers
Security & Safety	Guardrails and validation, Human approval checkpoints

Frameworks become valuable once these operational concerns grow large enough. But understanding the core loop first changes how you design AI systems.

Final Thoughts

AI agents can appear magical when hidden behind high-level abstractions. But once you strip away the layers, most systems reduce to a small set of deterministic building blocks: prompts, memory, tools, parsing, and loops.

Understanding those primitives gives you far more architectural control than blindly composing frameworks. Before introducing another dependency into your stack, it is worth asking:

“Do I actually need a framework here, or do I just need a well-designed control loop?”

If you can answer that question confidently, you already understand more about agentic systems than most developers using them today.

Modernizing Legacy Systems Using Agent Harnesses TDD and the Seam Model

Rafael Tedesco — Fri, 08 May 2026 23:08:16 +0000

Over the past few months, I’ve been investing a lot of time building agentic development workflows for real production environments.

Not only prompts.

Actual operational environments around agents.

Things like skills, execution tooling, validation layers, testing flows, memory handling, Git integrations, and constrained execution paths.

One thing became very clear very quickly.

Using agents in legacy or mission critical systems without a proper harness can become dangerous surprisingly fast.

Especially in financial systems.

Even with specification driven development (SDD), detailed tasks, and explicit instructions, I noticed a recurring problem.

The agent would correctly implement the requested functionality, but at the same time introduce large unintended changes across the codebase.

Not because the model was “bad”.

But because the environment still gave it too much freedom.

A small business change could suddenly trigger a massive refactor in tightly coupled parts of the application.

The functionality worked.

But reviewing the pull request became painful.

Risk analysis became harder.

The blast radius became unpredictable.

And in highly sensitive systems, this matters a lot.

The Shift That Changed Everything

To address this, I started combining a few ideas together:

TDD
Harness Engineering
The Seam Model from Michael Feathers
Constrained execution environments for agents
This changed the workflow completely.

Instead of letting the agent freely reshape large parts of the codebase, I started designing the environment to naturally constrain behavior.

The agent now operates through a harness I built around it.

This harness provides structured skills and controlled capabilities such as:

Reading specific files
Analyzing code diffs
Running tests incrementally
Validating architectural constraints
Checking impacted dependencies
Generating isolated implementations
Blocking risky operations

One of the biggest improvements came from applying the Seam Model mindset.

“A seam is a place where you can alter behavior in your program without editing in that place.”
— Michael Feathers, Working Effectively with Legacy Code

Instead of modifying deeply coupled code directly, the agent identifies stable seams where behavior can be isolated safely.

Then new functionality gets introduced incrementally behind those seams.

This dramatically reduces unintended side effects.

Critique and Validation Skills

Another important part of the harness is the critique and validation layer.

The agent is not only responsible for generating code.

It also needs to review its own changes against explicit acceptance criteria and architectural constraints.

I created specialized skills focused on critique workflows, where the agent analyzes the generated diff and verifies things like:

Did the implementation fully satisfy the acceptance criteria?
Did the agent modify unrelated modules?
Did it introduce unnecessary refactors?
Did it violate architectural boundaries?
Did it expand the blast radius beyond the intended scope?
This changes the workflow significantly.

Instead of treating code generation as the final step, generation becomes only one phase inside a larger controlled execution pipeline.

In practice, this dramatically improves reviewability and reduces the risk of unintended modifications in legacy or mission critical systems.

Practical Example

Imagine a legacy financial reconciliation service.

A new business rule needs to be introduced into the settlement calculation flow.

Without constraints, the agent might attempt to “improve” the architecture while implementing the feature.

Suddenly:

Shared abstractions get rewritten
Core flows get reorganized
Multiple services are refactored together
Dozens of unrelated files change

Technically impressive.

Operationally dangerous.

With the harnessed approach, the flow becomes very different.

The agent:

Identifies stable seams in the codebase
Creates isolated extension points
Implements behavior incrementally
Runs targeted tests after every step
Validates architectural boundaries
Restricts modifications outside approved scopes
Critiques its own generated diff against acceptance criteria

The final result is much smaller, easier to review, safer to deploy, and significantly more predictable.

The Most Interesting Part

What surprised me most is that the value was not only personal productivity.

The biggest impact came after I shared these agents, skills, and harness environments with the engineering teams I lead.

Now the entire team benefits from the same operational guardrails.

Developers can leverage the toolkit to:

Reduce risky refactors
Improve reviewability
Increase delivery confidence
Work more safely in legacy systems
Move faster without increasing instability

This starts creating organizational leverage, not just individual acceleration.

And honestly, this is where I believe a huge part of software engineering is heading.

The conversation is moving far beyond prompt engineering.

The real challenge is designing reliable operational environments where agents can safely participate in software delivery pipelines.

Especially in systems where reliability matters more than raw speed.

Final Thoughts

I don’t think agents replace engineering discipline.

Actually, I think they amplify the importance of it.

The better the engineering foundations, the more powerful these systems become.

TDD becomes more important.
Architectural boundaries become more important.
Observability becomes more important.
Validation becomes more important.
Harness design becomes more important.

The model is only one part of the system.

The environment around it is what determines whether the outcome is production ready or operational chaos.