Dipesh Ray

Posted on May 18

Multi-Agent AI developer

#ai #webdev #programming #productivity

I Built a Multi-Agent AI Boardroom That Ships Software on Its Own — For Free

How I designed an autonomous LLM pipeline with a CEO, security officer, and QA team that produces real, browser-runnable projects up to 5 times a day — running entirely on GitHub Models.

The Idea

What if an AI system could not just write code, but manage the entire software engineering process itself?

Not a single model generating code. A hierarchy of specialised agents — each with a defined role, authority, and scope — collaborating like a real engineering organisation.

That's what autonomous-brain is.

It's a self-improving AI software-engineering pipeline. A boardroom of LLMs that autonomously designs, builds, security-reviews, and publishes brand-new browser-runnable projects — several times a day. Each one more complex than the last.

Live dashboard: dipeshrayg.github.io/autonomous-brain

Cost to run: $0 — powered entirely by GitHub Models.

The Architecture: An AI Boardroom

The key design decision was role separation. Most AI coding tools use one model doing everything. That produces mediocre generalist output. I wanted specialist agents that could critique each other.

Here's how the hierarchy works:

Strategic Layer (Long-Horizon Decisions)

CEO — gpt-4o, fires every 6 hours

Reviews the trajectory of recent projects. Issues strict directives: "stop building visualisers, explore simulation systems." The pipeline must obey.

CSO (Chief Security Officer) — gpt-4o, fires every 12 hours

Audits security posture across recent output. Issues directives like "all projects must sanitise user input" or "avoid eval()." These flow into every subsequent build.

Execution Layer (Per-Project)

VP Engineering — fires every 15 minutes

Decides whether a new project should be dispatched. Acts as a watchdog, enforcing cadence and complexity targets.

Architect Candidates — gpt-4o-mini + Phi-3.5-MoE in parallel

Two models independently propose project designs. Competition produces better ideas than a single proposal.

Chief Architect / Judge — gpt-4o

Reads both proposals, synthesises the strongest elements, and produces the final design spec. No rubber-stamping — it genuinely chooses.

Engineers — gpt-4o, one LLM call per file

Implement the spec. Each file is a separate call, keeping context focused and output quality high.

Code Reviewers — gpt-4o-mini + Phi-3.5-MoE in parallel

Two independent reviewers critique the output simultaneously. Results are merged.

Security Officer — gpt-4o

A hard gate. If the project has critical or high severity findings, it does not publish. No exceptions.

Fixer / Polisher — gpt-4o-mini

Applies reviewer feedback and runs a final polish pass before QA.

QA — Playwright + Chromium

Mechanical headless-browser verification. The project must load and pass basic interaction checks or it gets flagged.

Why Multi-Agent Instead of One Big Prompt?

Single-model approaches have a ceiling. When you ask one model to design, implement, review, and secure a project in one pass, context pollution degrades quality at every stage.

Multi-agent separation solves this:

Fresh context per role — the Security Officer reads the finished code, not the design discussion. It sees what a real attacker would see.
Adversarial review — two Architect Candidates competing produces better designs than one model agreeing with itself.
Hierarchy enforces consistency — CEO directives propagate down. The system doesn't just build random things. It has a direction.
Specialised models where appropriate — gpt-4o-mini is fast and cheap for parallel review passes. gpt-4o is reserved for judgement calls.

Self-Improvement: How the Complexity Grows

Each project is assigned a complexity score. The system tracks this over time.

The CEO reviews recent complexity trends. If projects are getting simpler or stagnating, it issues a directive forcing the next Architect to push harder.

Currently:

28 projects shipped
Peak complexity: 43 (open-ended scale)
Average complexity: 21.5

The system genuinely trends upward. Early projects were simple visualisers. Recent ones include multi-agent simulations, healthcare dashboards, and adaptive AI strategy games.

Running for Free: GitHub Models

Every model in this pipeline runs on GitHub Models — a free tier that gives access to gpt-4o, gpt-4o-mini, and open models like Phi-3.5-MoE via a standard OpenAI-compatible API.

No credit card. No rate limit issues at this scale. The only cost is the GitHub Actions runner time, which is also free within limits.

This means the entire system — from CEO strategic review to Playwright QA — costs $0 to run.

What It Has Built So Far

In 28 runs, the system has shipped projects across:

Mathematics — differential equation visualisers, fractal explorers
Healthcare — simulation dashboards, resource allocation tools
Environmental Science — climate data explorers
Arts — generative art engines, emergent pattern systems
Cybersecurity — visual cryptography tools, cipher simulators
Bioinformatics — sequence analysis tools
Game Design — adaptive AI strategy games
History — interactive timelines

Every single one is a browser-runnable project with a one-click demo. No setup. No dependencies. Open the link, it runs.

What I Learned

1. Role design is the hardest part.

Deciding what each agent knows, when it fires, and what authority it has took more iteration than any code. Get the roles wrong and agents either duplicate work or conflict.

2. Hard gates matter.

The Security Officer's hard veto was the best decision I made. Without it, the system published insecure projects that looked fine on the surface. The gate changed the architecture of subsequent projects — engineers started writing more defensively because they knew the gate existed.

3. Parallelism is underused in AI pipelines.

Running two Architect Candidates and two Code Reviewers in parallel added almost no latency (async calls) but meaningfully improved output quality. The Judge step pays for itself.

4. Complexity targets need a mechanism, not just a prompt.

Telling a model "make it more complex" doesn't work. Giving the CEO a tracked metric and authority to issue directives based on it does.

What's Next

Memory across projects — the CEO currently reviews recent output but doesn't have a long-term memory of what patterns have been overused globally
Engineer specialisation — specialist engineer agents for frontend, backend, and security rather than generalist engineers per file
Contributor mode — allowing external prompts to influence CEO directives
Open source the engine — autonomous-brain-engine (the Python orchestrator) will be open-sourced once I've cleaned the API key handling

Try It

Dashboard (all 28 projects, live demos): dipeshrayg.github.io/autonomous-brain

GitHub: github.com/dipeshrayg/autonomous-brain

If you build something inspired by this or have questions about the architecture, I'm @dipeshray on Dev.to — happy to discuss.

I'm a Computing Systems student in London working on autonomous AI systems and applied cryptography. This is part of my ongoing work on multi-agent architectures.

Tags: ai machinelearning opensource python beginners