<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Taaar1k</title>
    <description>The latest articles on DEV Community by Taaar1k (@taaar1k).</description>
    <link>https://dev.to/taaar1k</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3884732%2Fa69c3382-0a2b-4ef3-b91f-19c7fc355e56.png</url>
      <title>DEV Community: Taaar1k</title>
      <link>https://dev.to/taaar1k</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/taaar1k"/>
    <language>en</language>
    <item>
      <title>I Built a 7-Agent Prompt Framework, Then Used It to Debug Its Own Output</title>
      <dc:creator>Taaar1k</dc:creator>
      <pubDate>Fri, 17 Apr 2026 16:06:42 +0000</pubDate>
      <link>https://dev.to/taaar1k/i-built-a-7-agent-prompt-framework-then-used-it-to-debug-its-own-output-4b3c</link>
      <guid>https://dev.to/taaar1k/i-built-a-7-agent-prompt-framework-then-used-it-to-debug-its-own-output-4b3c</guid>
      <description>&lt;p&gt;Last week I hit a loop that felt uncomfortably close to Ouroboros.&lt;/p&gt;

&lt;p&gt;I had built &lt;strong&gt;C.E.H.&lt;/strong&gt; — a 7-agent prompt framework (PM, Code, Scaut, Ask, Debug, Writer, Healer) designed to run on local LLMs. No API calls, no SaaS dependency. Just a set of orchestration rules and per-agent system prompts that force evidence-based behavior: no step marked "done" without a test result, a diff, or an &lt;code&gt;INSUFFICIENT_DATA&lt;/code&gt; tag.&lt;/p&gt;

&lt;p&gt;Then I pointed it at an empty directory and told the agents to build a RAG system. Over TASK-001 through TASK-016, they shipped the whole stack: hybrid retrieval (BM25 + vector with RRF), cross-encoder reranker, tenant isolation with audit logging, FastAPI surface, multimodal CLIP encoder, Graph RAG via Neo4j — 304 tests, most of them theirs. A few days in, the stress/crash suite turned red. Fourteen failures.&lt;/p&gt;
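&lt;p&gt;(For reference: reciprocal rank fusion merges the BM25 and vector ranked lists by scoring each document as the sum of &lt;code&gt;1/(k + rank)&lt;/code&gt; across lists. A minimal sketch of the idea, not the repo's implementation; &lt;code&gt;k=60&lt;/code&gt; is the commonly used default.)&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;def rrf_merge(ranked_lists, k=60):
    # ranked_lists: one ranked list of doc ids per retriever, best first.
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
&lt;/code&gt;&lt;/pre&gt;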

&lt;p&gt;So I did the obvious thing: &lt;strong&gt;I used C.E.H. to fix the C.E.H.-built code.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The repo is public: &lt;a href="https://github.com/Taaar1k/rag-workshop" rel="noopener noreferrer"&gt;github.com/Taaar1k/rag-workshop&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This post is the story of what happened.&lt;/p&gt;




&lt;h2&gt;The setup&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Orchestrator&lt;/strong&gt;: Claude (me typing) acting as the PM seat — writing 13-section task files, reviewing evidence, closing tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Worker&lt;/strong&gt;: Qwen3-Coder-Next 80B MoE, served by llama.cpp on port 8080, driving a Roo Code extension.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target&lt;/strong&gt;: The &lt;code&gt;rag-project/&lt;/code&gt; repo — built by the same agents over TASK-001..TASK-016. 290 of 304 tests green when the stress/crash regressions surfaced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rule&lt;/strong&gt;: No agent can claim "done" without attaching an evidence bundle — &lt;code&gt;git diff&lt;/code&gt;, pytest summary line, or a documented &lt;code&gt;INSUFFICIENT_DATA&lt;/code&gt; gate.&lt;/li&gt;
&lt;/ul&gt;
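&lt;p&gt;The "done" rule can be pictured as a guard roughly like this (a hypothetical sketch with invented field names, not C.E.H.'s actual contract):&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class Evidence:
    diff: str = ""                # git diff of the change
    pytest_summary: str = ""      # e.g. "25 passed in 3.1s"
    insufficient_data: str = ""   # reason text when the agent cannot proceed

def can_close(evidence):
    # A task may close only with a diff plus test output,
    # or an explicit INSUFFICIENT_DATA boundary report.
    if evidence.insufficient_data:
        return True
    return bool(evidence.diff and evidence.pytest_summary)
&lt;/code&gt;&lt;/pre&gt;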

&lt;p&gt;The idea was simple: the weak-model economics only work if the prompts do the reasoning and the model does the mechanics. C.E.H.'s job is to decompose a problem into mechanical steps a 7B–80B open model can execute reliably.&lt;/p&gt;

&lt;p&gt;That thesis got stress-tested immediately.&lt;/p&gt;




&lt;h2&gt;The first loop&lt;/h2&gt;

&lt;p&gt;TASK-017 was "fix 6 failing tests in &lt;code&gt;test_crash_stress.py&lt;/code&gt;." Symptom: tests were calling &lt;code&gt;.search()&lt;/code&gt; and &lt;code&gt;.index_documents()&lt;/code&gt; on &lt;code&gt;HybridRetriever&lt;/code&gt;, but the real API is &lt;code&gt;.retrieve()&lt;/code&gt;, with indexing delegated to the underlying retrievers. Classic API drift.&lt;/p&gt;

&lt;p&gt;I wrote the task in C.E.H.'s standard 13-section format: context, objective, scope, constraints, plan, risks, dependencies, DoD, change log, and the rest. Then I handed it to Qwen.&lt;/p&gt;

&lt;p&gt;Qwen got stuck in a loop. It ran the same pytest command three times in a row, each time "reproducing" the failure without making progress.&lt;/p&gt;

&lt;p&gt;Here's what was happening under the hood: the Roo Code → llama.cpp pipe doesn't preserve tool-call context the way Claude's harness does. The model would decide to reproduce, reproduce, then on the next turn lose the thread and decide to reproduce again.&lt;/p&gt;

&lt;p&gt;The fix wasn't model-side. It was prompt-side.&lt;/p&gt;

&lt;p&gt;I rewrote TASK-017 as &lt;strong&gt;prescriptive&lt;/strong&gt; — 8 explicit &lt;code&gt;find this / replace with this&lt;/code&gt; blocks. Instead of:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Replace &lt;code&gt;.search()&lt;/code&gt; calls with &lt;code&gt;.retrieve()&lt;/code&gt;."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I wrote:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;EDIT 3&lt;/strong&gt; — &lt;code&gt;test_hybrid_search_concurrent_queries&lt;/code&gt; (around line 318–325)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Find this block:&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mock_vector_retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Mock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;return_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Result 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Replace with:&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mock_vector_retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invoke&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Mock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;return_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Result 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;

&lt;p&gt;With the reasoning pre-compiled into the task, Qwen stopped looping. It executed the edits, wrote a full C.E.H.-format Debug Report, and moved on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: 24 of 25 tests in that file went green. The one remaining failure was a real &lt;code&gt;MemoryPersistence&lt;/code&gt; bug, not an API mismatch — so it got tagged &lt;code&gt;INSUFFICIENT_DATA&lt;/code&gt; in the DoD and spawned TASK-020. That's the evidence gate working as designed: the agent didn't lie about completion, it reported a partial result with a boundary.&lt;/p&gt;




&lt;h2&gt;The regression&lt;/h2&gt;

&lt;p&gt;TASK-020 was trickier. A test called &lt;code&gt;test_crash_during_save_recovery&lt;/code&gt; simulated a process crash by creating a fresh &lt;code&gt;MemoryPersistence&lt;/code&gt; instance over the same storage path. The fresh instance read back zero messages. The original fix attempt made &lt;code&gt;save_conversation&lt;/code&gt; write to disk even when &lt;code&gt;use_memory_fallback=True&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That fixed the one test — and broke four others that specifically asserted "no disk writes when fallback is True."&lt;/p&gt;

&lt;p&gt;Classic Chesterton's Fence: the &lt;code&gt;use_memory_fallback=True&lt;/code&gt; branch existed for a reason. It was the fast-path for concurrent stress tests that didn't want disk I/O.&lt;/p&gt;

&lt;p&gt;The v2 task took three lines to describe and one find/replace to execute:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Revert the disk-write change in &lt;code&gt;memory_persistence.py&lt;/code&gt; — keep the fallback in-memory.&lt;/li&gt;
&lt;li&gt;Flip &lt;code&gt;use_memory_fallback=True&lt;/code&gt; → &lt;code&gt;False&lt;/code&gt; in exactly the one test that simulates a restart — it genuinely needs disk.&lt;/li&gt;
&lt;/ul&gt;
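&lt;p&gt;The shape of the fix is easier to see in code. A hypothetical sketch of the two branches (invented names; the repo's real &lt;code&gt;MemoryPersistence&lt;/code&gt; is more involved):&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import os

class MemoryPersistence:
    def __init__(self, path, use_memory_fallback=False):
        self.path = path
        self.use_memory_fallback = use_memory_fallback
        self._memory = {}

    def save_conversation(self, conv_id, messages):
        if self.use_memory_fallback:
            # Fast path for concurrent stress tests: no disk I/O.
            self._memory[conv_id] = list(messages)
            return
        with open(os.path.join(self.path, conv_id + ".json"), "w") as f:
            json.dump(messages, f)

    def load_conversation(self, conv_id):
        if self.use_memory_fallback:
            return self._memory.get(conv_id, [])
        try:
            with open(os.path.join(self.path, conv_id + ".json")) as f:
                return json.load(f)
        except FileNotFoundError:
            return []
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The restart-simulation test genuinely needs the disk branch; the concurrency stress tests need the in-memory one. That is why the v2 fix is a single flag flip in a single test.&lt;/p&gt;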

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: 51 of 51 persistence tests green.&lt;/p&gt;




&lt;h2&gt;The third loop&lt;/h2&gt;

&lt;p&gt;Full suite run: 296 passed, 2 failed.&lt;/p&gt;

&lt;p&gt;The two failures were in &lt;code&gt;test_rag_server.py&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;test_rag_query_returns_200&lt;/code&gt; — &lt;code&gt;[Errno 111] Connection refused&lt;/code&gt;. The test was hitting a real llama-server on port 8090 that wasn't running.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;test_embedding_latency&lt;/code&gt; — &lt;code&gt;5.07s &amp;gt; 5.0s&lt;/code&gt; threshold. A genuine HuggingFace model download, 70 milliseconds over budget.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Neither was a logic bug. Both were environment dependencies. The right fix was &lt;code&gt;@pytest.mark.integration&lt;/code&gt; on both tests, plus raising the latency threshold to 10 seconds to account for cold-start on slow networks.&lt;/p&gt;

&lt;p&gt;Three prescriptive edits. Qwen executed them without drama.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final state&lt;/strong&gt;: 295 passed, 1 skipped, 8 integration-deselected, 0 failed. The suite can now run in a default CI environment without pulling llama-server or HF models at boot.&lt;/p&gt;
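&lt;p&gt;For anyone reproducing this, the marker pattern is standard pytest (a generic sketch; the repo's actual conftest may differ):&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;# conftest.py-style marker registration, plus one marked test.
import pytest

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "integration: needs a live llama-server or a model download"
    )

@pytest.mark.integration
def test_rag_query_returns_200():
    ...  # hits the real server, so it only runs when explicitly selected

# Deselect in CI with:  pytest -m "not integration"
&lt;/code&gt;&lt;/pre&gt;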




&lt;h2&gt;What I actually learned&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Weak models work when the reasoning is pre-compiled into the prompt.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Qwen failed on reasoning-heavy tasks. Once I moved the reasoning into the task file — exact find/replace blocks, explicit DoD clauses, forbidden actions listed — it executed cleanly. The failure wasn't "Qwen can't code." It was "Qwen can't plan-and-execute in a single pass when the context window keeps losing tool-call history."&lt;/p&gt;

&lt;p&gt;This is a testable claim. You can run it yourself: take any weak model, give it a vague task, watch it loop. Then give it a prescriptive task, watch it finish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Evidence gates are load-bearing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Three times in this debugging cycle, the agent surfaced &lt;code&gt;INSUFFICIENT_DATA&lt;/code&gt; instead of fabricating a completion:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TASK-017 found a memory-persistence bug it wasn't scoped to fix → spawned TASK-020.&lt;/li&gt;
&lt;li&gt;TASK-020 v1 caused a regression → got reverted, spawned v2.&lt;/li&gt;
&lt;li&gt;TASK-019 v1 missed four llama-dependent tests → got reopened as v2 and v3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every one of those was a point where a naive "done/not-done" contract would have silently declared victory and moved on. The framework caught it because "done" is defined as "evidence + DoD checks", not "the agent said so."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Honest reporting is stronger than automation theater.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The final changelog on this repo says, literally: &lt;em&gt;"Code (Qwen) applied 7 of the prescribed edits; Opus (human) completed the remaining 4 mock blocks."&lt;/em&gt; 80/20 split. Not "fully automated." Not "100% AI-built." The real ratio.&lt;/p&gt;

&lt;p&gt;That's more useful to you, reading this, than a shiny "100% autonomous" claim. You know exactly what the local model is and isn't good for.&lt;/p&gt;




&lt;h2&gt;What's in the repo&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Taaar1k/rag-workshop" rel="noopener noreferrer"&gt;github.com/Taaar1k/rag-workshop&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full RAG implementation: hybrid retrieval (BM25 + vector RRF), cross-encoder reranker, tenant isolation with audit logging, multi-modal (CLIP) and Graph RAG (Neo4j) hooks.&lt;/li&gt;
&lt;li&gt;295 passing tests.&lt;/li&gt;
&lt;li&gt;The full C.E.H. task files under &lt;code&gt;ai_workspace/memory/TASKS/&lt;/code&gt; — TASK-017 through TASK-020-v3, with change logs. You can read the whole chain of what went wrong and how.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DEBUG_REPORT.md&lt;/code&gt; — the full Debug-agent artifact from TASK-017.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What's not in the repo&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The C.E.H. framework itself (the 7 agent prompts, the task template, the evidence contract) is a separate product: &lt;a href="https://workshopai2.gumroad.com/l/ceh-framework" rel="noopener noreferrer"&gt;workshopai2.gumroad.com/l/ceh-framework&lt;/a&gt;. $19, one-time. If you want to use it in your own workflow, that's where it lives.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm posting this to get feedback on the approach, not to sell framework licenses. If you read this and think &lt;em&gt;"this is just prescriptive prompting, dressed up"&lt;/em&gt; — tell me. If you think it's worth $19 to get the prompts assembled — also tell me.&lt;/p&gt;

&lt;p&gt;Either way, the repo is MIT-licensed and the lessons above are free. Take the parts you want.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The author runs a one-person AI workshop and ships things that break in public.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>rag</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
