As it happens, the last year for me was just big projects: Orome.ai, Relayn.sh. Multiple elements, shifting requirements, ever-growing codebases. Even having multiple layers of documentation did not help. I kept losing context, agents kept losing context, and at some point I started wondering if any of us actually knew what we were doing.
(Spoiler: the agents did not.)
This is about what I built to manage that, and why most of it exists specifically because agents will misbehave the moment you give them an opportunity.
## History: tools, sprints, and learning the hard way
About a year ago I started to take AI-assisted programming seriously. The boost from Cursor and Claude was immediate. Windsurf appeared and then disappeared. I tried to always use "the best tool right now," and I lost an unreasonable amount of time re-calibrating to new tools, re-explaining my codebase, rebuilding configuration files.
Lesson one: stop chasing the best tool. Pick one. Get good at it. (I stayed with Cursor. All of this is about what I do with Cursor.)
I started running long sessions. Non-stop coding sessions pushing twenty hours. Multiple tabs open, many agents running in parallel on different parts of the codebase. My take-out budget did not appreciate this. The codebase did. Sort of.
The next obvious upgrade: MCP memory. I hooked it in as an additional layer on top of local documentation. The agents used it maybe 30% of the time, when they felt like it. Great feature in theory. Agents kept forgetting it existed. I kept reminding them in CLAUDE.md. We were not making progress together on this one.
## The CLI leap and the scale problem
Then Claude Code appeared. Then Cursor shipped its own CLI agent: much more responsive, much easier to control than IDE tabs. I moved most of my work there.
The CLI unlocked something real: I could spin up a dedicated remote dev machine with whatever memory I needed and keep all sessions live at the same time. No more fighting my laptop. Finally, no limit.
Then came the actual limit: context switching.
Twenty agents running in parallel sounds like a dream. In practice it means twenty things competing for your attention, and human context switching is exactly as bad as everyone says. I went from 20 agents to 8, then to 4. Four is manageable. Twenty means you are the bottleneck, which you are.
## Duck is born: the prompt-as-document idea
I read an article on Hacker News. I cannot find it anymore to reference properly, which is exactly the kind of thing that happens when you are running too many things at once.
The idea: one markdown document serves as both the prompt for the agent and the communication channel between sessions. The agent reads it, writes its findings back into it, and the next session picks up exactly where it left off. No shared memory needed. No transcript parsing. The document is the memory.
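The shape is easy to sketch. Here is what such a prompt-document might look like; the section names and the task are illustrative, not the exact format from the article:

```markdown
# Task: investigate flaky websocket reconnects   <!-- hypothetical task -->

## Context
Links, constraints, and notes the agent needs before starting.

## Session Report
<!-- the agent writes its findings here; the next session reads them -->

## Open Questions
<!-- carried forward between sessions instead of a shared memory service -->
```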
That aligned with what I was already thinking about. I started building around it.
I call it Duck, as in Rubber Duck. You talk to it, it does not talk back in the way you expect, but having to explain the problem clearly enough for it to understand actually helps. (The bad joke budget is limited. I will spend it carefully.)
## The last upgrade: a system, not a prompt
The Hacker News idea was the seed. What grew from it was something more structured.
I needed something that survived connectivity issues. Not everyone works with fiber optic under every chair; some of us run agents on a development machine in another region and the SSH session drops at 2 AM. I needed tmux. One session per project, one window per agent. I needed a research-to-implementation pipeline that could pause, resume, and not lose its mind between restarts.
The current system is called projects_control. It follows a two-directory model:
| Directory | Contents | Who works here |
|---|---|---|
| projects_control/&lt;project&gt;/ | Prompts, context, system config, session history | Human and orchestration scripts |
| /home/ubuntu/projects/&lt;project&gt;/ | Actual codebase | Agents |
These two directories never mix. Agents never work inside projects_control. Control artifacts never live in the codebase. This sounds obvious until you watch an agent helpfully create a planning document inside your src/ folder.
The system registers projects in a SQLite registry. Each project has a context.md (project architecture, credential references, current state) and a system.md (infrastructure config, MCP service list). Every prompt run syncs the agent's MCP config from system.md before it starts. Per-project MCP: the agent only sees the services that project actually needs.
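The sync step can be sketched in a few lines of shell. This is illustrative only: the file names follow the article, but the "MCP services:" line format in system.md and the JSON shape of the config are assumptions.

```shell
# Sketch of per-project MCP sync: read the service list from system.md
# and emit a config containing only those services. Format is assumed.
set -euo pipefail
DIR=$(mktemp -d)

cat > "$DIR/system.md" <<'EOF'
# System: demo
MCP services: github, postgres
EOF

# Extract the service list (drop the label, drop the commas).
services=$(sed -n 's/^MCP services: *//p' "$DIR/system.md" | tr -d ',')

# Write an MCP config with exactly those services and nothing else.
{
  echo '{ "mcpServers": {'
  sep=''
  for s in $services; do
    printf '%s  "%s": {}' "$sep" "$s"
    sep=$',\n'
  done
  printf '\n} }\n'
} > "$DIR/mcp.json"

cat "$DIR/mcp.json"
```

The point of the design is the subtraction: the agent never sees a global MCP config, only what system.md declares for that project.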
Prompt types and their critics:
| Type | What the actor does | Critics |
|---|---|---|
| research | Investigate, produce findings | completeness, sourcing, conclusion_validity, actionability |
| plan | Produce implementation tasklist | architecture, contracts, backward_compatibility, dependencies, actionability, notation |
| implement | Write code | plan_compliance, correctness, code_quality |
Three types, three separate actors, three sets of critics. Research agents do not implement. Plan agents do not implement. Implement agents do not plan. In theory.
## The critic loop
The runner for each prompt type is a bash script. Not an agent. Not an orchestrator with opinions. A bash script.
Here is what it does, in order:
| Step | What runs | Model | Output |
|---|---|---|---|
| Prompt gate | Checks if prompt is ready: clear task, bounded scope, proper structure | composer-1.5 | GATE_VERDICT: pass/fail |
| Actor | Reads prompt, writes Session Report | sonnet-4.6-thinking | Findings / plan / code changes |
| length_gate | Checks Session Report is not empty or suspiciously short | composer-1.5 | VERDICT_length_gate: ok/fix |
| Critics (sequential) | Type-specific evaluation | composer-1.5 | VERDICT_<name>: ok/fix |
| inject_feedback | Appends Critics Feedback section to the prompt file | bash | Prompt file updated in place |
| Repeat | Up to MAX_ITERATIONS=6 | -- | Final pass or cap notification |
The actor writes its output into the ## Session Report section of the prompt file. Critics read the same file and write verdict lines. inject_feedback appends their feedback as a new numbered section. The actor runs again, sees the feedback, improves. Up to six times. After six the runner sends a failure notification and stops.
The prompt file accumulates everything: actor output, critic feedback, iteration history. After six iterations it is a complete record of what happened. No need to dig through logs.
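The loop itself is simple enough to sketch. This is a toy stand-in, not the real runner: the actor and critic are stubbed out as shell functions (the critic is rigged to pass on iteration 2), and the verdict strings follow the table above.

```shell
# Toy sketch of the critic loop: actor writes, critic judges,
# feedback is injected, repeat up to a cap. Stubs, not real model calls.
set -euo pipefail
PROMPT=$(mktemp)
MAX_ITERATIONS=6

run_actor() {    # stand-in for the actor model call
  printf '## Session Report (iteration %s)\nFindings...\n' "$1" >> "$PROMPT"
}
run_critic() {   # stand-in critic: rigged to pass on iteration 2
  if [ "$1" -ge 2 ]; then echo "VERDICT_completeness: ok"
  else echo "VERDICT_completeness: fix"; fi
}
inject_feedback() {   # appends feedback plus the no-implement reminder
  printf '## Critics Feedback %s\nUpdate Session Report only, do not implement.\n' \
    "$1" >> "$PROMPT"
}

passed=0
for i in $(seq 1 "$MAX_ITERATIONS"); do
  run_actor "$i"
  verdict=$(run_critic "$i")
  echo "$verdict" >> "$PROMPT"
  case "$verdict" in
    *": ok") passed=1; echo "PASS after $i iteration(s)"; break ;;
    *)       inject_feedback "$i" ;;
  esac
done
[ "$passed" -eq 1 ] || echo "FAIL: hit MAX_ITERATIONS=$MAX_ITERATIONS"
```

Everything appends to one file, so the accumulation described above falls out for free: the prompt file ends the run containing every report, verdict, and feedback round in order.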
Each session is fully specified by two files: a context file (shared across all prompts in a project) and a prompt file (specific to one task). "The prompt file works as both an execution instruction and a final document." That phrase came out of documentation I wrote for someone else. It is accurate.
## Guardrails: do not trust agents to not implement things
Here is the part I am not proud of. I had to build seven separate layers of protection to stop research and plan agents from implementing code.
Seven.
The previous system had one: a --mode plan CLI flag that was supposed to restrict the agent to read-only. The agent respected it sometimes.
| Guardrail | What it does | Failure mode prevented |
|---|---|---|
| --mode plan CLI flag | Sets read-only file access | Basic rogue writes |
| Worktree for research/plan | Agent works in a temp git branch, discarded on exit | Rogue edits contaminating main branch |
| No-implement instruction in actor prompt | Explicitly tells actor: do not edit codebase files | Agent creating files during "analysis" |
| Framing note in all plan critics | Critics phrase feedback as doc corrections, not step-execution commands | Critics accidentally triggering implementation |
| inject_feedback reminder | After every feedback round: "update Session Report only, do not implement" | Actors implementing in response to critic feedback |
| Prompt gate | Before actor starts: checks task is clear, scoped, and structured | Vague prompts leading to improvised scope expansion |
| check_readonly_guard downgrade | Worktree active: skip the old guard. Overlap case: log warning only | Brittle revert behaviour causing partial write issues |
The worktree approach is the honest solution. If the agent implements code in the worktree during a research session, the worktree is discarded on exit. No git checkout, no partial revert, no contaminated history. Containment instead of cleanup.
The no-implement instruction had to appear in three places: the actor prompt, the critic framing note, and the inject_feedback reminder. One place was not enough. Two was also not enough. I am not complaining. I am documenting. Or am I? ;) Buy me a glass of good red and I will tell you a sad little story about how a routine database backup ended up with one famous model shrinking the database by 95% because it was too large to fit in the proxy timeout window.
We had to explicitly tell research agents not to implement code. Repeatedly. This is the correct level of trust to extend to an agent on a large project.
## The prompt file as artifact
The most useful thing in this system is not the critic loop or the guardrails. It is the prompt file itself.
Agents have no memory between sessions. The prompt file does. This is not a limitation. It is a design choice.
Building a complete history across many sessions without requiring any of them to have memory of each other: that is the goal. The prompt file accumulates actor output and critic feedback across every iteration. The human reads the same file to understand what happened. Context is pre-packed at design time: context.md plus the prompt file give the agent everything it needs to run without any external state.
No memory services required. No transcript parsing. The document is the memory.
## What is actually hard
The human prompt.
If you want the system to work, every prompt has to be precise. You are planning structure, thinking about external systems, anticipating edge cases, security, optimisation. You are articulating an entire implementation in a couple of minutes. Do that for several tasks in quick succession and your brain starts to burn. Sometimes after ten minutes of back-to-back planning I cannot think straight.
But that is the job now. The system took away the busy work. What is left is the hard thinking. Token usage dropped by at least 75%. Sessions are compact. Even on a bad day nothing sprawls, because the discipline is in the structure, not in my attention span. The trade-off is that the work that remains is dense.
When a task outgrows what one prompt can carry, I bring in Shotgun -- a tool built by friends of mine -- and get a second pair of cyber-eyes for big refactorings or for weaving a new feature through legacy code. Between the critic loop and an external review tool, the system scales further than I expected.
## Conclusion
I stopped trusting my coding agents. So I built a system to not trust them.
Agents are good at the task you give them, when you give them exactly that task and nothing else, run a critic to check they did it right, and remove every possible way for them to expand scope on their own. That is not a limitation. That is the entire design.
Bash scripts, prompt files, a SQLite registry. Not glamorous. It works because it does not pretend agents are collaborators. They are executors. The system around them does the collaborating.
Four agents. Clear prompts. Dumb runners. Critics that push back. One human who does the hard thinking.
That is the stack.