Taaar1k
I Built a 7-Agent Prompt Framework, Then Used It to Debug Its Own Output

Last week I hit a loop that felt uncomfortably close to Ouroboros.

I had built C.E.H. — a 7-agent prompt framework (PM, Code, Scaut, Ask, Debug, Writer, Healer) designed to run on local LLMs. No API calls, no SaaS dependency. Just a set of orchestration rules and per-agent system prompts that force evidence-based behavior: no step marked "done" without a test result, a diff, or an INSUFFICIENT_DATA tag.

Then I pointed it at an empty directory and told the agents to build a RAG system. Over TASK-001 through TASK-016, they shipped the whole stack: hybrid retrieval (BM25 + vector with RRF), cross-encoder reranker, tenant isolation with audit logging, FastAPI surface, multimodal CLIP encoder, Graph RAG via Neo4j — 304 tests, most of them theirs. A few days in, the stress/crash suite turned red. Fourteen failures.
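A quick aside on the fusion step, since RRF carries most of the hybrid-retrieval weight: Reciprocal Rank Fusion scores each document as a sum of 1/(k + rank) over every ranked list it appears in. A minimal sketch — function and document names are mine, not the repo's:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: merge several best-first ranked lists.

    A document's fused score is the sum of 1 / (k + rank) over every
    list it appears in; k dampens the advantage of a single #1 spot.
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["doc_a", "doc_b", "doc_c"]
vector = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([bm25, vector])  # doc_b wins: solid in both lists
```

With the conventional k=60, documents that rank well in both lists float above ones that dominate only one — which is the behavior you want when BM25 and the vector index disagree.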

So I did the obvious thing: I used C.E.H. to fix the C.E.H.-built code.

The repo is public: github.com/Taaar1k/rag-workshop.

This post is the story of what happened.


The setup

  • Orchestrator: Claude (me typing) acting as the PM seat — writing 13-section task files, reviewing evidence, closing tasks.
  • Worker: Qwen3-Coder-Next 80B MoE, served by llama.cpp on port 8080, driving a Roo Code extension.
  • Target: The rag-project/ repo — built by the same agents over TASK-001..TASK-016. 290 of 304 tests green when the stress/crash regressions surfaced.
  • Rule: No agent can claim "done" without attaching an evidence bundle — git diff, pytest summary line, or a documented INSUFFICIENT_DATA gate.
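That last rule is simpler than it sounds: in code terms it is a predicate over the closing report. A toy sketch, with illustrative field names rather than C.E.H.'s actual schema:

```python
def can_close(report: dict) -> bool:
    """A task may close only with hard evidence or a declared data gap."""
    has_evidence = bool(report.get("git_diff")) or bool(report.get("pytest_summary"))
    declared_gap = report.get("status") == "INSUFFICIENT_DATA"
    return has_evidence or declared_gap

can_close({"pytest_summary": "24 passed, 1 failed"})  # True: evidence attached
can_close({"status": "INSUFFICIENT_DATA"})            # True: honest boundary
can_close({"status": "done"})                         # False: "the agent said so"
```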

The idea was simple: the weak-model economics only work if the prompts do the reasoning and the model does the mechanics. C.E.H.'s job is to decompose a problem into mechanical steps a 7B–80B open model can execute reliably.

That thesis got stress-tested immediately.


The first loop

TASK-017 was "fix 6 failing tests in test_crash_stress.py." Symptom: tests were calling .search() and .index_documents() on HybridRetriever, but the real API is .retrieve(), and indexing is delegated to the underlying retrievers. Classic API drift.

I wrote the task in C.E.H.'s standard 13-section format — context, objective, scope, constraints, plan, risks, dependencies, DoD, change log, and the rest. Then handed it to Qwen.

Qwen got stuck in a loop. It ran the same pytest command three times in a row, each time "reproducing" the failure without making progress.

Here's what was happening under the hood: the Roo Code → llama.cpp pipe doesn't preserve tool-call context the way Claude's harness does. The model would decide to reproduce, reproduce, then on the next turn lose the thread and decide to reproduce again.

The fix wasn't model-side. It was prompt-side.

I rewrote TASK-017 to be prescriptive — 8 explicit find this / replace with this blocks. Instead of:

"Replace .search() calls with .retrieve()."

I wrote:

EDIT 3: test_hybrid_search_concurrent_queries (around lines 318–325)

Find this block:

```python
mock_vector_retriever.search = Mock(return_value=[
    (0, Document(page_content="Result 1", metadata={"score": 0.95}), 0.95),
])
```

Replace with:

```python
mock_vector_retriever.invoke = Mock(return_value=[
    Document(page_content="Result 1", metadata={"score": 0.95}),
])
```

With the reasoning pre-compiled into the task, Qwen stopped looping. It executed the edits, wrote a full C.E.H.-format Debug Report, and moved on.
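One reason the format works: a prescriptive edit is mechanically checkable before it touches the file. A trivial applier — illustrative, not part of the repo — can refuse ambiguous or stale targets instead of guessing:

```python
def apply_edit(text: str, find: str, replace: str) -> str:
    """Apply one prescriptive find/replace block.

    Refuses to act unless the 'find' text occurs exactly once, so a
    stale task file (0 matches) or an ambiguous one (2+) fails loudly.
    """
    count = text.count(find)
    if count != 1:
        raise ValueError(f"expected exactly 1 match, found {count}")
    return text.replace(find, replace)
```

The exactly-once check is what turns a vague instruction into an evidence-friendly one: either the edit applies cleanly or the failure itself is a reportable result.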

Result: 24 of 25 tests in that file went green. The one remaining failure was a real MemoryPersistence bug, not an API mismatch — so it got tagged INSUFFICIENT_DATA in the DoD and spawned TASK-020. That's the evidence gate working as designed: the agent didn't lie about completion, it reported a partial result with a boundary.


The regression

TASK-020 was trickier. A test called test_crash_during_save_recovery simulated a process crash by creating a fresh MemoryPersistence instance over the same storage path. The fresh instance read back zero messages. The original fix attempt made save_conversation write to disk even when use_memory_fallback=True.

That fixed the one test — and broke four others that specifically asserted "no disk writes when fallback is True."

Classic Chesterton's Fence: the use_memory_fallback=True branch existed for a reason. It was the fast-path for concurrent stress tests that didn't want disk I/O.
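To make the fence concrete, here is a stripped-down sketch of the pattern; the class and method names echo the ones above, but the body is mine, so treat it as illustrative:

```python
import json
from pathlib import Path

class MemoryPersistence:
    def __init__(self, storage_path, use_memory_fallback=False):
        self.path = Path(storage_path)
        self.use_memory_fallback = use_memory_fallback
        self._cache = {}  # per-instance, lost on "crash"

    def save_conversation(self, conv_id, messages):
        if self.use_memory_fallback:
            # Fast path for concurrent stress tests: no disk I/O at all.
            self._cache[conv_id] = list(messages)
        else:
            self.path.mkdir(parents=True, exist_ok=True)
            (self.path / f"{conv_id}.json").write_text(json.dumps(messages))

    def load_conversation(self, conv_id):
        if conv_id in self._cache:
            return self._cache[conv_id]
        f = self.path / f"{conv_id}.json"
        return json.loads(f.read_text()) if f.exists() else []
```

A fresh instance over the same path can read the JSON files back but starts with an empty _cache — which is exactly why the crash-recovery test genuinely needs use_memory_fallback=False while the stress tests genuinely need True.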

The v2 task was three lines long to describe, one find/replace to execute:

  • Revert the disk-write change in memory_persistence.py — keep the fallback in-memory.
  • Flip use_memory_fallback from True to False in exactly the one test that simulates a restart — it genuinely needs disk.

Result: 51 of 51 persistence tests green.


The third loop

Full suite run: 296 passed, 2 failed.

The two failures were in test_rag_server.py:

  1. test_rag_query_returns_200: [Errno 111] Connection refused. The test was hitting a real llama-server on port 8090 that wasn't running.
  2. test_embedding_latency: 5.07s > 5.0s threshold. A genuine HuggingFace model download, 70 milliseconds over budget.

Neither was a logic bug. Both were environment dependencies. The right fix was @pytest.mark.integration on both tests, plus raising the latency threshold to 10 seconds to account for cold-start on slow networks.
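For reference, the usual pytest mechanics for this — assuming a standard pytest.ini; the repo's actual config may differ:

```ini
# pytest.ini (illustrative)
[pytest]
markers =
    integration: hits a live llama-server or downloads HF models
addopts = -m "not integration"
```

A plain pytest run then deselects the marked tests by default, and pytest -m integration opts back in once the services are actually up.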

Three prescriptive edits. Qwen executed them without drama.

Final state: 295 passed, 1 skipped, 8 integration-deselected, 0 failed. The suite can now run in a default CI environment without pulling llama-server or HF models at boot.


What I actually learned

1. Weak models work when the reasoning is pre-compiled into the prompt.

Qwen failed on reasoning-heavy tasks. Once I moved the reasoning into the task file — exact find/replace blocks, explicit DoD clauses, forbidden actions listed — it executed cleanly. The failure wasn't "Qwen can't code." It was "Qwen can't plan-and-execute in a single pass when the context window keeps losing tool-call history."

This is a testable claim. You can run it yourself: take any weak model, give it a vague task, watch it loop. Then give it a prescriptive task, watch it finish.

2. Evidence gates are load-bearing.

Three times in this debugging cycle, the agent surfaced INSUFFICIENT_DATA instead of fabricating a completion:

  • TASK-017 found a memory-persistence bug it wasn't scoped to fix → spawned TASK-020.
  • TASK-020 v1 caused a regression → got reverted, spawned v2.
  • TASK-019 v1 missed four llama-dependent tests → got reopened as v2 and v3.

Every one of those was a point where a naive "done/not-done" contract would have silently declared victory and moved on. The framework caught it because "done" is defined as "evidence + DoD checks", not "the agent said so."

3. Honest reporting is stronger than automation theater.

The final changelog on this repo says, literally: "Code (Qwen) applied 7 of the prescribed edits; Opus (human) completed the remaining 4 mock blocks." Roughly a 64/36 split. Not "fully automated." Not "100% AI-built." The real ratio.

That's more useful to you, reading this, than a shiny "100% autonomous" claim. You know exactly what the local model is and isn't good for.


What's in the repo

github.com/Taaar1k/rag-workshop

  • Full RAG implementation: hybrid retrieval (BM25 + vector RRF), cross-encoder reranker, tenant isolation with audit logging, multi-modal (CLIP) and Graph RAG (Neo4j) hooks.
  • 295 passing tests.
  • The full C.E.H. task files under ai_workspace/memory/TASKS/ — TASK-017 through TASK-020-v3, with change logs. You can read the whole chain of what went wrong and how.
  • DEBUG_REPORT.md — the full Debug-agent artifact from TASK-017.

What's not in the repo

  • The C.E.H. framework itself (the 7 agent prompts, the task template, the evidence contract) is a separate product: workshopai2.gumroad.com/l/ceh-framework. $19, one-time. If you want to use it in your own workflow, that's where it lives.

I'm posting this to get feedback on the approach, not to sell framework licenses. If you read this and think "this is just prescriptive prompting, dressed up" — tell me. If you think it's worth $19 to get the prompts assembled — also tell me.

Either way, the repo is MIT-licensed and the lessons above are free. Take the parts you want.


The author runs a one-person AI workshop and ships things that break in public.
