Fully local. No cloud APIs during execution. TDD-enforced. 35 tests passing.
To be clear: Claude (a cloud API) handled the planning, docs, initial architecture, and correction rule design. The experiment was strictly about whether a local LLM could survive the autonomous execution loop without phoning home. The coding agent loop (Brain → Coder → Tester) ran 100% locally, with no API calls during execution.
The Actual Setup
Most AI coding agent posts you see rely on GPT-4o or Claude via API. The model lives in a data center, and you're paying per token. That's fine — but it means your code, your architecture decisions, and your project context are all leaving your machine.
I wanted something different: a multi-agent system that runs entirely on my MacBook Pro M5 Max 128GB. It autonomously writes code, runs tests in a Docker sandbox, and only commits when tests pass. No internet required once it's running.
This is the story of ForgeFlow — what I built, what broke, and what the data showed.
Hardware Context
The M5 Max 128GB is unusual hardware for this kind of work. Most local LLM setups top out at 32GB or 64GB unified memory, which forces you to choose between model quality and running multiple models simultaneously. At 128GB, that constraint disappears.
What I ran simultaneously:
| Model | Size (Q4_K_M) | Role |
|---|---|---|
| Qwen3-Coder-Next | ~45GB | Brain + Coder |
| gemma4:26b | ~17GB | QA |
| nomic-embed-text | ~0.3GB | RAG embeddings |
Total: ~62GB loaded, ~66GB headroom for OS + KV cache. Both models stay warm in memory with keep_alive: 24h — no reload latency between cycles.
This isn't a flex. It's context: the architectural decisions I made (same model for Brain and Coder, both models always loaded) are only feasible at this memory tier. At 64GB, you'd need to make different tradeoffs.
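Keeping both models warm comes down to a single Ollama API parameter. Here's a minimal sketch of the preload step, assuming the default Ollama endpoint on localhost:11434; the model tags in the loop are illustrative, use whatever `ollama list` shows on your machine:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def preload(model: str) -> None:
    """Empty-prompt request: Ollama loads the model and keeps it resident."""
    requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": "",          # load only, generate nothing
        "keep_alive": "24h",   # pin the weights in memory between cycles
        "stream": False,
    }, timeout=600)

# Illustrative tags for the Brain/Coder and QA models from the table above.
for tag in ("qwen3-coder-next", "gemma4:26b"):
    preload(tag)
```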
The Architecture: What ForgeFlow Actually Does
ForgeFlow is an n8n workflow that runs every 10 minutes, autonomously picks the next coding task, writes tests first, writes code second, and only commits if all tests pass.
The full loop:
Schedule Trigger (10 min)
→ Load Context (working memory + results log + project rules)
→ Brain (Qwen3-Coder-Next): pick next task from PRD
→ Localization: RAG search for relevant existing code
→ Coder RED (same model): write a failing test
→ Verify RED: pytest must FAIL — if it passes, the test is wrong
→ Coder GREEN: write minimum code to pass the test
→ Phase 0 Gate: py_compile + ruff (deterministic, no LLM)
→ QA (gemma4:26b): run full test suite in Docker sandbox
→ Gate Decision: COMMIT / RETRY / DEADLOCK / ESCALATE
→ Commit & Update (on pass)
Three design principles drove every decision:
1. pytest exit code is the only truth. I don't care if the LLM thinks the code is "clean." If the pytest exit code isn't 0, the code is garbage.
2. The LLM proposes, n8n disposes. No model has write access to the filesystem or git. n8n is the only actor that applies files, runs git commands, and updates state.
3. Deterministic gates before LLM gates. py_compile and ruff run in under 0.5 seconds. If they catch the error, there's no reason to spend 30 seconds calling gemma4.
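To make the third principle concrete, here's a sketch of what a Phase 0 gate looks like as plain subprocess calls. It's an illustrative reconstruction, not the exact n8n node; the function and variable names are mine:

```python
import subprocess
import sys

def phase0_gate(paths: list[str]) -> tuple[bool, str]:
    """Deterministic pre-check: syntax via py_compile, then lint via ruff.
    Returns (passed, detail) so the workflow can log why it rejected."""
    compiled = subprocess.run(
        [sys.executable, "-m", "py_compile", *paths],
        capture_output=True, text=True,
    )
    if compiled.returncode != 0:
        return False, compiled.stderr

    linted = subprocess.run(
        ["ruff", "check", *paths],
        capture_output=True, text=True,
    )
    return linted.returncode == 0, linted.stdout + linted.stderr

ok, detail = phase0_gate(["app/routes/todo.py"])
if not ok:
    print("Phase 0 reject:", detail)  # no reason to spend 30 seconds on a QA model call
```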
The Memory System
One of the underrated problems in autonomous coding agents is state management across cycles. The agent can't remember what it did last cycle unless you explicitly store it.
ForgeFlow keeps track of state across six memory layers:
| Layer | Storage | Scope |
|---|---|---|
| Git history | .git | Permanent |
| Code summaries | ChromaDB (RAG) | Project lifetime |
| results.tsv | TSV file | Session |
| AGENTS.md | Markdown | Cross-session |
| Working memory | JSON file | Current loop |
| Failure patterns | AGENTS.md auto-update | Generalized |
Each layer operates at a different time scale. Git is permanent. Working memory resets every cycle. AGENTS.md accumulates lessons across sessions — when the same failure type occurs 3+ times, a rule gets written: "always include from app.database import get_db — the model consistently forgets this."
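The promotion from "recurring failure" to "written rule" is mechanical. A rough sketch, assuming results.tsv has status and failure_type columns (the column names and threshold here are illustrative):

```python
import csv
from collections import Counter

THRESHOLD = 3  # the same failure type seen 3+ times earns a rule

def recurring_failures(results_tsv: str = "results.tsv") -> list[str]:
    """Failure types frequent enough to deserve a rule in AGENTS.md."""
    with open(results_tsv, newline="") as f:
        rows = csv.DictReader(f, delimiter="\t")
        counts = Counter(r["failure_type"] for r in rows if r["status"] == "FAIL")
    return [ftype for ftype, n in counts.items() if n >= THRESHOLD]

def append_rule(rule_text: str, agents_md: str = "AGENTS.md") -> None:
    """Append a lesson so every future cycle's prompt carries it."""
    with open(agents_md, "a") as f:
        f.write(f"- {rule_text}\n")
```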
TDD Enforcement: Red-Green-Refactor as a System Constraint
The TDD loop isn't a suggestion — it's mechanically enforced by the workflow:
RED phase: Coder writes a test. n8n runs pytest. If it passes, the test is rejected — it's testing something that already works, which means it's the wrong test.
GREEN phase: Coder writes minimum code to pass the test. n8n applies the files, runs the full test suite (not just the new test), checks for regressions.
Commit: Only happens if exit code is 0 across the entire test suite.
Enforcing this mechanically means the model can't shortcut. It can't write "good enough" code and hope the reviewer misses it. The test either passes or it doesn't.
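The enforcement itself is nothing more than exit-code checks around pytest. A sketch of the two gates; in the real workflow these commands run inside the Docker sandbox, which this sketch omits:

```python
import subprocess

def run_pytest(*args: str) -> int:
    """Run pytest and return its exit code: the only truth the gate accepts."""
    return subprocess.run(["pytest", "-q", *args]).returncode

def verify_red(new_test_path: str) -> bool:
    """RED gate: the new test must FAIL against the current code."""
    return run_pytest(new_test_path) != 0  # a test that already passes is the wrong test

def verify_green() -> bool:
    """GREEN gate: the FULL suite must pass, so regressions block the commit."""
    return run_pytest() == 0
```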
Failure Handling: Bounded Repair
Blind retries are a token-burn trap. Instead, ForgeFlow fingerprints every failure:
failure_signature = SHA256(failure_type + file_path + first_50_chars_of_stderr)[:12]
If I see the same SHA256 signature three times, the agent hits a DEADLOCK and walks away. It's better to skip a task than to let a model hallucinate in a loop.
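In code, the fingerprint and the deadlock counter look roughly like this (a sketch matching the formula above):

```python
import hashlib
from collections import Counter

_seen = Counter()

def failure_signature(failure_type: str, file_path: str, stderr: str) -> str:
    """Stable 12-character fingerprint of a failure."""
    raw = failure_type + file_path + stderr[:50]
    return hashlib.sha256(raw.encode()).hexdigest()[:12]

def next_action(sig: str) -> str:
    """Third strike on the same signature means DEADLOCK, otherwise retry."""
    _seen[sig] += 1
    return "DEADLOCK" if _seen[sig] >= 3 else "RETRY"
```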
Failure classification:
| Type | Description |
|---|---|
| patch | Code logic error, syntax error |
| environment | Import error, missing module |
| localization | Wrong file referenced |
| deadlock | Same signature 3× |
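Classification is keyword matching on stderr, roughly like this. The patterns below are illustrative rather than the exact rule set, and deadlock isn't classified here because it comes from the signature counter, not the output:

```python
def classify_failure(stderr: str) -> str:
    """Map raw pytest/compile output onto the failure types above."""
    s = stderr.lower()
    if "modulenotfounderror" in s or "importerror" in s:
        return "environment"
    if "filenotfounderror" in s or "no such file" in s:
        return "localization"
    return "patch"  # default: code logic or syntax error
```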
The Data: What Actually Happened
I ran ForgeFlow on a Todo REST API (FastAPI + SQLAlchemy + pytest) — 12 tasks, classic CRUD.
Overall:
| Metric | Value |
|---|---|
| Total attempts | 164 |
| PASS (committed) | 11 (6.7%) |
| FAIL (discarded) | 116 (70.7%) |
| DEADLOCK (skipped) | 37 (22.6%) |
| Manual interventions | 3 |
| Final test count | 35 passing |
The 6.7% raw PASS rate sounds bad. But that number is misleading — it includes the early cycles before deterministic corrections were added.
The real signal is in the pass rate as the system "learned" (via manual rules):
| Corrections active | PASS rate |
|---|---|
| 0–5 corrections | 5.6% |
| 6 corrections | 0% |
| 7–10 corrections | 40.0% |
| 11–13 corrections | 62.5% |
Each "correction" is a deterministic rule applied before the LLM output reaches the filesystem. Examples:
- `from app.db.session import` → auto-rewrite to `from app.database import`
- `@router.post(...)` without `status_code=201` → auto-insert
- File not in `target_files` → reject with error message
As corrections accumulated, PASS rate went from 5.6% to 62.5%. The corrections are essentially a hand-built knowledge base of the model's systematic errors. It turns out those errors are highly consistent and predictable.
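Mechanically, each correction is a rewrite (or an outright reject) applied to the model's output before n8n touches the filesystem. A sketch covering the three examples above; the patterns are illustrative and narrower than the real rule set:

```python
import re

def apply_corrections(code: str, path: str, target_files: set[str]) -> str:
    """Deterministic fixes for the model's known systematic errors."""
    if path not in target_files:
        raise ValueError(f"{path} is not in target_files; rejecting write")

    # Known-bad import path the model keeps producing -> the real one
    code = code.replace("from app.db.session import", "from app.database import")

    # POST routes declared without an explicit status code -> insert 201
    code = re.sub(r'@router\.post\((".*?")\)',
                  r'@router.post(\1, status_code=201)', code)
    return code
```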
Failure type distribution:
| Type | Count | % |
|---|---|---|
| patch | 99 | 64.7% |
| environment | 54 | 35.3% |
At 35.3%, my environment failure rate is nearly triple what standard benchmarks report (~13%). That's the "quantization tax" you pay for running Q4 models locally. The deterministic corrections target exactly these failure types.
Hardest tasks:
| Task | Attempts | Primary failure |
|---|---|---|
| TASK-002 (DB model) | 41 | PytestDeprecationWarning + ImportError |
| TASK-006 (GET list) | 26 | ImportError conftest |
| TASK-012 (integration) | 22 | Regression (previous code overwritten) |
TASK-002 taking 41 attempts is the starkest number. Most failures were the same PytestDeprecationWarning signature — the model couldn't fix a pytest configuration issue that required understanding the test infrastructure, not just the code under test. Eventually, a manual intervention resolved it.
What Broke (Honestly)
Three manual interventions were required:
- TASK-008 (PUT endpoint): The Coder kept generating tests with wrong status codes. Added correction #13 (PUT 201→200 auto-fix) after diagnosing the pattern.
- TASK-011 (filtering): The Coder overwrote `routes/todo.py` while working on filtering, destroying previously committed code. The `target_files` violation detection wasn't blocking writes, only logging them.
- TASK-012 (integration test): DEADLOCK 3 times. The model couldn't figure out that `test_integration.py` needed to use the existing `client` fixture from `conftest.py` rather than creating its own TestClient.
All three were fixed in the session after they occurred by adding deterministic corrections. The system learned — just not automatically.
What This Is (and Isn't)
This is:
- A proof that a fully local multi-agent TDD loop is viable on consumer hardware
- Evidence that deterministic corrections significantly outperform raw LLM retry for systematic errors
- A framework for thinking about autonomous coding at the task level
This isn't:
- A "set it and forget it" system. It's a force multiplier that still requires a human to untangle the logic when the model hits a wall.
- A system that works without oversight (3 interventions in 12 tasks is not zero)
- Generalizable beyond the hardware tier that makes it feasible
The 62.5% PASS rate in the final correction set is meaningful. But the 3 required manual interventions mean the system isn't yet fully autonomous.
What's Next
Second project: A more complex backend (20+ tasks, non-trivial dependencies) to validate that the correction set generalizes and the dependency resolution logic holds under pressure. The goal is a two-project dataset for a proper write-up.
Phase 0.5 Gate: I'm looking at implementing AST-based checks — inspired by the Khati et al. (2026) paper — to kill hallucinations before they even hit the Docker sandbox. The goal is to catch app.routes.todo.get_todo_by_id (doesn't exist) before it reaches pytest.
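The check itself doesn't need much beyond Python's ast module plus an import probe. Here's a rough sketch of the idea (my own approximation, not the paper's method): flag from X import Y statements where Y doesn't actually exist in module X.

```python
import ast
import importlib

def hallucinated_imports(source: str) -> list[str]:
    """Names the generated code imports that don't exist in the target module,
    e.g. a nonexistent app.routes.todo.get_todo_by_id."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            try:
                module = importlib.import_module(node.module)
            except ImportError:
                continue  # missing module is an environment failure, handled elsewhere
            for alias in node.names:
                if alias.name != "*" and not hasattr(module, alias.name):
                    missing.append(f"{node.module}.{alias.name}")
    return missing
```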
Automatic correction learning: Right now, corrections are written manually after pattern identification. The next step is having n8n automatically identify recurring failure signatures and propose corrections for human approval.
Hardware Note for Other M-Series Users
If you're on M2/M3/M4 Pro (36–48GB), the same architecture works with tradeoffs:
- Run one model at a time (swap between Brain/Coder and QA)
- Use smaller QA model (gemma4:9b instead of 26b)
- Expect higher latency per cycle (~15–20 min instead of 10)
The fundamental approach — deterministic orchestration + LLM proposal + test-as-truth — doesn't require 128GB. It just runs faster and with better models at that tier.
About
I'm Joseph YEO, a solo builder from Seoul, Korea. I run multiple projects in parallel using AI agents — local AI automation (ForgeFlow), supply chain security (DevRadar Guard), and a few things currently under wraps.
What I'm really interested in is how autonomous these agents can actually become before I have to step in as the human. ForgeFlow is one experiment. There will be more.
Follow along:
- 𝕏: @josephyeo_dev
- GitHub: joseph-yeo
- Site: projectjoseph.dev
Built over ~7 sessions, May 2026. All models run locally via Ollama 0.23.0 on macOS. No cloud APIs were used during autonomous execution.
This post was drafted with Claude and edited by me. I use AI tools to write, just like I use them to code. That's kind of the whole point.