TL;DR: Deep research the feature, write the documentation first, go YOLO, work backwards... Then magic. ✩₊˚.⋆☾⋆⁺₊✧
In my last post, I outlined how I was using Readme-Driven Development with LLMs. In this post, I will describe how I implemented a 50-page RFC over the course of a single weekend.
My steps are:
Step 1: Design the feature documentation with an online thinking model
Step 2: Export a description-only "coding prompt"
Step 3: Paste to an Agent in YOLO mode (`--dangerously-skip-permissions`)
Step 4: Force the Agent to "Work Backwards"
Step 1: Design the feature documentation with an online thinking model
Open a new chat with an LLM that can search the web or do "deep research". Discuss what the feature should achieve. Do not let the online LLM write code. Instead, create the user documentation for the feature you will write (e.g., README.md or a blog page). I start with an open-ended question to research the feature; that primes the model. Your exit criterion is that you like the documentation or promotional material enough to want to write the code.
To exit this step, have it create a "documentation artefact" in markdown (e.g. the README.md or blog post). Save that to disk so that you can point the coding agent at it.
If you don't want to pay for a subscription to an expensive model, you can install Dive AI Desktop and use pay-as-you-go models that are much better value. There is a video on setting up Dive AI to do web research with Mistral.
Step 2: Export a description-only "coding prompt"
Next, tell the online model to "create a description only coding prompt (do not write the code!)". Do not accept the first answer. The more effort you put into perfecting both the markdown feature documentation and the coding prompt, the better.
If the coding prompt is too long, then the artefact is too big! Start a fresh chat and create something smaller. This is Augmented Intelligence ticket grooming in action!
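For illustration, a groomed coding prompt might look like the sketch below. It is a hypothetical reconstruction (in the spirit of the `{}` bug discussed later), not one of my actual prompts:

```text
Description-only coding prompt (do not write the code!)

Feature: make `{}` always compile as the RFC 8927 "empty" form.
Read first: README.md, section "Empty Schema Semantics".
Constraints:
- Update README.md and AGENTS.md before touching any logic.
- Add a regression test, but do not run it until the plan is approved.
- Log at INFO whenever an empty schema is compiled.
Out of scope: performance work and unrelated refactoring.
```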
Step 3: Paste to an Agent in YOLO mode (`--dangerously-skip-permissions`)
Now paste in the groomed coding prompt and the documentation, and let it run. I always use a git branch so that I can let the agent go flat out. Cursor background agents, Copilot agents, OpenHands, Codex, and Claude Code are becoming more accurate with each update.
I only restrict `git commit` and `git push`. I first ask it to create a GitHub issue using the `gh` CLI, then tell it to make a branch and a PR.
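Concretely, the preamble I ask the agent for looks something like this sketch (the issue title and branch name are made up; the `gh` and `git` subcommands are real):

```bash
# Create a tracking issue, a working branch, and a draft PR up front
gh issue create --title "Empty schema {} should match any instance" \
  --body "Tracking issue for the JTD empty-form fix"
git switch -c fix/empty-schema-semantics
git commit --allow-empty -m "chore: start empty-schema-semantics fix"
git push -u origin fix/empty-schema-semantics
gh pr create --draft --fill
```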
Step 4: Force the Agent to "Work Backwards"
The models love to dive into code, break it all, get distracted, forget to update the documentation, hit compaction, and leave you with a mess. Do not let them be a caffeine-fuelled flying squirrel!
The primary tool I am using now prints out a Todos list. The order is usually the opposite of the safe way to do things!
Here is an edited version of a real Todo list for fixing a bug with the JTD "Match Any" `{}` schema:
⏺ Update Todos
⎿ ☐ Remove all compatibility mode handling
☐ Make `{}` always compile as strict
☐ Update Test_X to expect failures for `{}`
☐ Add regression test Test_Y
☐ Add INFO log warning when `{}` is compiled
☐ Update README.md with Empty Schema Semantics section
☐ Update AGENTS.md with guidance
That list is in a perilous order. Logically, it is this:
- Delete logic - so broken code, invalid old tests!
- Change logic - so more broken code, more invalid old tests!
- Change old tests - focusing on the old, not the new!
- Add one test - finally working on the new feature!
- Change the README.md and AGENTS.md - invalid docs used in steps 1-4!
If the agent's context compacts, things go sideways, you get distracted, and you end up with a bag of broken code.
So I set it to "plan mode", or else interrupt it immediately, and force it to reorder the Todo list:
- Change the README.md and AGENTS.md first
- Add one test (insist the test is not run yet!)
- Change one test (insist the test is not run yet!)
- Add/Change logic (cross-check the plan with a different model!)
- Now run the tests
- Delete things last
That is a safe order where things are far less likely to be blown off course. I used to struggle with any feature that went beyond a single compaction; that is now much less of an issue.
Todos Are All You Need?
I am not actually a big fan of the built-in `Todos` list of the two big AI labs. The models really struggle with any changes to the plan. Kimi K2 Turbo appears to be more capable of pivoting. I have a few tricks for that, but I will save them for another post.
Does This Work For Real Code?
This past weekend, I decided to write an RFC 8927 JSON Type Definition (JTD) validator based on the experimental JDK `java.util.json` parser. The PDF of the spec is 51 pages, and there is a ~4000-line compatibility test suite.
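To give a flavour of the prototype parser, here is a minimal sketch. I am assuming the `Json.parse(String)` entry point and the `JsonObject::members()` accessor from the OpenJDK sandbox prototype; being experimental, these names may change:

```java
// Experimental JDK prototype API; not in any GA release, so names may change.
import java.util.json.Json;
import java.util.json.JsonObject;
import java.util.json.JsonValue;

public class ParseDemo {
    public static void main(String[] args) {
        // Parse an RFC 8927 "type" form schema with the prototype parser.
        JsonValue v = Json.parse("{\"type\": \"string\"}");
        // The value types form a sealed hierarchy, so instanceof patterns work.
        if (v instanceof JsonObject obj) {
            System.out.println("keys: " + obj.members().keySet());
        }
    }
}
```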
We wrote 509 unit tests and have the full compatibility test suite running. Yet we still had bugs. We found them by writing a jqwik property test that generates 1000 random JTD schemas, plus corresponding JSON documents to validate; it uncovered several bugs. Codex also automatically reviewed the PRs and flagged some very subtle issues, which turned out to be real bugs. It took about a dozen PRs over the weekend to get the job done properly, to a professional level.
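The property test was along the lines of the sketch below. `JtdSchema`, `genSchema`, and `instanceFor` are placeholders standing in for the project's real validator and generators:

```java
import net.jqwik.api.*;

class JtdValidatorProperties {

    record Pair(String schemaJson, String instanceJson) {}

    // Property: an instance generated to satisfy a random schema must validate.
    // JtdSchema stands in for the project's real compile/validate API.
    @Property(tries = 1000)
    boolean matchingInstancesValidate(@ForAll("schemaWithInstance") Pair p) {
        return JtdSchema.compile(p.schemaJson())
                        .validate(p.instanceJson())
                        .isEmpty(); // an empty error list means "valid"
    }

    @Provide
    Arbitrary<Pair> schemaWithInstance() {
        // The interesting (and here elided) part: recursively build a random
        // JTD schema, then build JSON that should satisfy it.
        return genSchema().map(s -> new Pair(s, instanceFor(s)));
    }

    private Arbitrary<String> genSchema() { return Arbitraries.just("{}"); } // elided
    private String instanceFor(String schema) { return "null"; }             // elided
}
```

A nice side effect is that jqwik shrinks any failing schema/instance pair towards a minimal counterexample, which makes the bugs it uncovers easy to reproduce.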
End Notes
Using a single model family is a Bad Idea (tm). For online research, I alternate between full-fat ChatGPT Desktop, Claude Desktop, and Dive Desktop to utilise GPT5-High, Opus 4.1, and Kimi K2 Turbo respectively.
For Agents, I have used all the models and many services. Microsoft kindly allows me to use full-fat Copilot with Agents for open-source projects for free ❤️ I have a Cursor sub to use their background agents. I use Codex, Claude Code, and Gemini CLI locally, and Codex in Codespaces. There are also background agents from OpenHands and others. The actual model seems less important than writing the documentation first and writing tight prompts.
I am currently using an open-weight model at $3 per million tokens, pay-as-you-go, for the heavy lifting. However, I cross-check its plans with GPT5 and Sonnet 4.
Whenever things get complicated, I always ask a model from a different family to review every change on every bug hunt. That has reduced rework to almost zero. 💫
If you are a veteran, you may enjoy the YT channel Vibe Coding With Steve and Gene. My journey over the past year has been very similar to theirs.
End.
Top comments (5)
Great post! Your setup sounds a lot like mine, actually! 😃 If you haven't looked at Verdent yet, you should check it out!
While it is still very new, I really love the planning mode in this solution. I used to consistently track a local TODO.md file in every repo to work from. Combined with Deck (Verdent's orchestration app), it handles multi-agent flows and is three times faster than Copilot, with higher quality results!
When you put that together with your brilliant setup, I think your weekend projects will need even more space!
I did try Verdent on my SpecKit project. I had used SoftGen.AI to make a prototype of a screen. I asked Verdent to clone the NextJS prototype and build a calendar prototype. It is "wow": the screen is working great and Verdent is working nicely. Thanks!
Thanks for that, Verdent looks great.
I just set up GitHub SpecKit, yet some YT videos suggest that it is very easy to go "full-on waterfall" when doing all the product specs before coding. The video by Rob Shocks entitled "Spec Driven Development is Slowing You Down: Here’s a Better Way" says that he bailed out of full-on SpecKit to do more of what I outlined, to "just get things done".
I suspect (know!) that, as always, "it depends" on the context of what we are doing. With the RFC, I had a good idea of how to go about it. Yet I am also doing something a bit more "out of my lane": a UI for people who are not as geeky as me. So I am using SpecKit to force me to ask potential users and properly "research" features as "specs" before planning, whereas I am likely to follow Rob's video and go a bit "straight to build" on the bits I can hold in my mind.
Thanks again. I will try to squeeze in a look at Verdent next week, with a view to cutting over if it suits the little UI/mobile project that I am just kicking off.
Interesting read!
Very interesting read! It was nice meeting you last weekend (at the café in Wimbledon) and having this discussion in person. It would be great if you had any availability to connect and talk more about the other tools you're using and share ideas.