Dr. B

I Built a Multi-Agent Coding System From Scratch in Python (No Frameworks)

Most multi-agent AI tutorials hand you LangChain, AutoGen, or CrewAI and say “here you go.” You wire a few abstractions together, get something running, and never really understand what’s happening under the hood.
I wanted to understand what’s actually happening under the hood. So I built one from scratch.
This is the story of multi-agent-coder, a system where a Planner decides at runtime which AI agent to call next, and a team of specialized agents (Architect, Engineer, Critic, TestRunner, Refactorer) collaborates to turn a plain-English request into working, tested Python code.

No LangChain. No AutoGen. Pure Python.


The Core Idea

The central insight is simple: one LLM trying to do everything is worse than multiple LLMs each doing one thing well.
A single prompt asking an AI to “plan the architecture, write the code, review it for bugs, and refactor it” produces mediocre results across all four. But if you give each job to a separate agent with a focused system prompt and its own memory, the output quality goes up dramatically.
The challenge: who decides which agent runs next?

That’s where the Planner comes in.


Architecture Overview

*A flowchart of the architecture*

The Planner receives the full current state (user request + each agent’s memory) as a JSON blob and responds with a JSON decision:

```json
{
  "next_agent": "Engineer",
  "message": "Implement the file structure from the Architect's plan",
  "reason": "Architecture is complete, time to write code"
}
```

This is the architectural heart of the system. Instead of hardcoding a sequence, the Planner reasons about what needs to happen next. It can send work back to the Engineer after the Critic finds a bug. It can skip the Refactorer if the code is already clean. It decides when to stop.
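
To make the routing concrete, here is a minimal sketch of a Planner-driven dispatch loop. Everything in it (`run_loop`, `act`, the `FINISH` sentinel, the `build_state` helper) is an illustrative assumption, not the repo's exact API:

```python
import json

MAX_STEPS = 20  # safety cap so a confused Planner can't loop forever

def run_loop(user_request, planner, agents, memory):
    """Dispatch loop with no hardcoded agent sequence.

    Assumptions: `planner` and each agent expose act(message) -> str,
    `agents` maps agent names to agent objects, and build_state
    assembles the state blob shown in the Memory Management section.
    """
    for _ in range(MAX_STEPS):
        state = build_state(user_request, memory)
        # JSON retry handling omitted here; see "What I Learned" below.
        decision = json.loads(planner.act(json.dumps(state)))

        if decision["next_agent"] == "FINISH":  # the Planner decides when to stop
            return

        agent = agents[decision["next_agent"]]  # dynamic routing, every loop
        output = agent.act(decision["message"])
        memory.set_agent(decision["next_agent"], output)
```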


The Agents

Architect

Takes the user’s request and produces a concrete plan: file structure, module breakdown, and step-by-step engineering approach. It writes the blueprint.

Engineer

Reads the plan (and any Critic feedback) and writes actual Python files. It outputs code in fenced blocks labeled with filenames, which the controller parses and writes to disk automatically.
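
The parsing step can be as simple as a regex over the Engineer's output. A sketch, assuming a `filename=` label convention on the fences (the actual repo may label blocks differently):

```python
import re
from pathlib import Path

# Matches blocks of the form:
#   ```python filename=app/main.py
#   ...code...
#   ```
BLOCK_RE = re.compile(r"```\w*\s+filename=(\S+)\n(.*?)```", re.DOTALL)

def write_files(engineer_output: str, workspace: Path) -> list[Path]:
    """Extract labeled code blocks and write each file into the workspace."""
    written = []
    for filename, code in BLOCK_RE.findall(engineer_output):
        path = workspace / filename
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(code)
        written.append(path)
    return written
```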

Critic

Reviews the generated code against the original plan. It checks for correctness, edge cases, and consistency. It doesn't fix anything itself; it gives structured feedback for the Engineer to act on.

TestRunner

Runs pytest on the workspace and reports results.
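
A sketch of how that can work: shell out to pytest with `subprocess` and return the raw output, so failures come from a real test run rather than the model's self-assessment. The function shape and the timeout value are illustrative choices:

```python
import subprocess

def run_tests(workspace: str) -> tuple[bool, str]:
    """Run pytest in the workspace and capture its real output."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=workspace,
        capture_output=True,
        text=True,
        timeout=120,  # raises TimeoutExpired rather than stalling the loop
    )
    # Feed stdout + stderr back to the Planner/Engineer verbatim.
    return result.returncode == 0, result.stdout + result.stderr
```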

Refactorer

Once the code passes tests and gets Critic approval, the Refactorer makes a final pass for code quality: naming, structure, clarity, removing redundancy.
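
Structurally, all five agents can share the same minimal shape: a name, a focused system prompt, and one `act` method. A sketch (the class layout and the `call_llm` wrapper are assumptions, not the repo's actual code):

```python
class Agent:
    """One specialist: a name plus a focused system prompt.

    `call_llm(system_prompt, message) -> str` is an assumed wrapper
    around whatever chat-completion client you use.
    """

    def __init__(self, name: str, system_prompt: str, call_llm):
        self.name = name
        self.system_prompt = system_prompt
        self.call_llm = call_llm

    def act(self, message: str) -> str:
        # Each agent sees only its own prompt plus the Planner's
        # instruction: one job, done well.
        return self.call_llm(self.system_prompt, message)
```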


Memory Management

Each agent has its own memory namespace. The MemoryManager tracks:

  • Agent memory - each agent’s last output (plan, code, review, etc.)
  • Loop memory - shared state like last test output and last review
  • Project memory - file list and overall project context

Every time the Planner makes a decision, it sees the full current state. This is what allows it to reason across multiple loops: it knows the Critic flagged a bug last round, and it knows the Engineer already tried to fix it once.

```python
state = {
    "user_request": user_request,
    "project_memory": memory.get_project(),
    "loop_memory": memory.get_loop(),
    "architect_memory": memory.get_agent("Architect"),
    "engineer_memory": memory.get_agent("Engineer"),
    "critic_memory": memory.get_agent("Critic"),
    "refactorer_memory": memory.get_agent("Refactorer"),
}
```
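
For completeness, here's an illustrative `MemoryManager` with the three namespaces and the accessors used above. It's a sketch of the idea; the repo's actual class may differ in details:

```python
class MemoryManager:
    """Three namespaces: per-agent memory, loop memory, project memory."""

    def __init__(self):
        self.agents: dict[str, str] = {}    # each agent's last output
        self.loop: dict[str, str] = {}      # e.g. last test output, last review
        self.project: dict = {"files": []}  # file list, overall context

    def set_agent(self, name: str, output: str) -> None:
        self.agents[name] = output

    def get_agent(self, name: str) -> str:
        return self.agents.get(name, "")

    def get_loop(self) -> dict:
        return self.loop

    def get_project(self) -> dict:
        return self.project
```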

Why “From Scratch”?

Frameworks like LangChain are powerful, but they hide the orchestration logic behind layers of abstraction. When something breaks, debugging is painful. When you want to customize, you’re fighting the framework.
Building from scratch means:

  • Every line is intentional - you understand why it’s there
  • The orchestration logic is transparent - controller.py is ~170 lines and readable top to bottom
  • It’s easy to extend - adding a new agent is just writing a new class and teaching the Planner about it
  • It’s a great learning tool - if you’re teaching agentic AI patterns, this is the codebase you want

What I Learned

1. The Planner is the hardest part.
Getting the Planner's system prompt right took the most iteration. It needs to reliably return valid JSON, reason about state correctly, and know when to stop. Handling json.JSONDecodeError and retrying is essential (see the sketch after this list).
2. Shared memory is both the strength and the risk.
Each loop adds useful context, but if memory grows unbounded, you hit token limits fast. Thoughtful memory summarization is a real engineering challenge.
3. The Critic loop is where quality emerges.
The Engineer → Critic → Engineer loop is where the magic happens. A single Engineer pass produces okay code. Three loops produce something genuinely good. This mirrors how human code review works.
4. TestRunner grounds the whole system.
Without real test execution, agents can convince themselves the code works when it doesn’t. Plugging pytest into the loop and feeding actual failure output back to the Engineer is what makes the system reliable.
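
Here is the retry sketch referenced in point 1: ask for a decision, and on malformed JSON, feed the error back and ask again. The `call_llm` wrapper and the retry policy are assumptions for illustration:

```python
import json

def planner_decide(call_llm, system_prompt: str, state: dict,
                   max_retries: int = 3) -> dict:
    """Get a routing decision from the Planner, retrying on bad JSON.

    `call_llm(system_prompt, message) -> str` is an assumed wrapper
    around the chat API.
    """
    message = json.dumps(state)
    for _ in range(max_retries):
        raw = call_llm(system_prompt, message)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Tell the model what went wrong and ask again.
            message = ("Your previous reply was not valid JSON. "
                       "Respond with a single JSON object only.\n"
                       + json.dumps(state))
    raise RuntimeError("Planner failed to return valid JSON")
```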


What’s Next

  • Benchmarking against HumanEval to get quantitative results
  • Smarter memory summarization to handle longer projects
  • A web UI to visualize the agent loop in real time
  • Potential IEEE paper on the Planner-driven dynamic routing approach

Try It

The full code is on GitHub: github.com/zosob/multi-agent-coder
It’s intentionally minimal and transparent. Fork it, add your own agent, break things, understand them. That’s the point.
If you have questions, thoughts, or want to collaborate on the benchmarking work, please drop a comment or open an issue. Be kind!


Built with pure Python and curiosity. No frameworks harmed in the making of this system.
