I have spent 88 sessions building a software engineering tool using AI — and I used the tool itself for every session. Here is what I learned.
The tool is called Wrought. It is a structured engineering process for AI coding assistants. Think of it as an engineering runbook that your AI assistant actually follows: pipelines for bug investigation, design analysis, implementation, and code review, all producing documentation that builds your project's institutional memory.
The unusual part is not the product. It is the method. Every feature, every bug fix, every architectural decision in Wrought was built using Wrought's own process. Dogfooding at its most literal. The tool that enforces design-before-code was designed before it was coded. The skill that generates findings trackers was tracked in a findings tracker. The code review system reviewed itself.
88 sessions. 201 commits. 87 days from first commit to this post. Here is what the numbers say about building with AI — and what they leave out.
## The Numbers
Before the lessons, the raw data. These are not estimates; they are counted from the Git history and file system.
| Metric | Count |
|---|---|
| Sessions (Claude Code conversations) | 88 |
| Git commits | 201 |
| Findings trackers created | 47 |
| Design documents | 60 |
| Blueprints | 55 |
| Research reports | 33 |
| Code reviews (up-to-5-agent parallel review) | 23 |
| Investigation reports | 16 |
| RCA reports | 20 |
| Implementation prompts | 75 |
| Plans | 16 |
| Lines of Python (source) | ~2,300 |
| Lines of Python (tests) | ~3,100 |
| Skills (structured AI workflows) | 15+ |
| Files changed since first commit | 1,016 |
| Lines inserted | 193,000+ |
| Calendar days (Jan 26 to Apr 23) | 87 |
A few things jump out immediately. There are more lines of test code than production code. There is a design document for roughly every three commits. And 193,000 lines of insertions for a ~2,300-line Python CLI means the overwhelming majority of the project is documentation, methodology artifacts, and process records — not source code.
That ratio — 84 to 1 — is the story.
## Three Things That Worked
### 1. Cross-Session Memory Via Structured Artifacts
This is the single most valuable pattern I discovered.
AI coding assistants have a fundamental problem: they forget everything between sessions. Claude Code has auto-memory and CLAUDE.md, which help, but they are lossy. They capture vibes and preferences, not the state of a six-step pipeline with four open findings across three trackers.
The pattern that solved this is what I call the Findings Tracker. It is a markdown file — nothing fancy — that tracks every significant piece of work through a structured lifecycle:
Open -> Investigating/Designing -> Blueprint Ready -> Planned -> Implementing -> Resolved -> Verified
Each tracker has a dependency map, resolution tasks with checkboxes, lifecycle timestamps, and links to every artifact produced along the way. When a new session starts, the AI reads the tracker and knows exactly where work was interrupted, what has been tried, and what comes next.
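A minimal tracker skeleton might look like this (the finding, tasks, and paths here are invented for illustration, not taken from a real Wrought tracker):

```markdown
# Findings Tracker: Example Feature

Stage: Implementing
Updated: 2026-03-01 16:00

## F1: Example finding (severity: High)
- Stage: Blueprint Ready
- Depends on: F2
- Artifacts: docs/design/..., docs/blueprints/...
- Resolution tasks:
  - [x] Research existing behavior
  - [x] Design analysis
  - [ ] Implement fix
  - [ ] Verify
```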
Here is a real example. The "Context Compaction Resilience" tracker (docs/findings/2026-03-01_1600_context_compaction_resilience_FINDINGS_TRACKER.md) tracked a problem where Claude Code's auto-compaction would destroy in-flight state during long sessions. It spawned 5 sub-findings across a 5-layer defense architecture:
- F1: No compact instructions in CLAUDE.md (solved: added a section the compactor reads)
- F2: Context percentage data was siloed in the display (solved: bridged to a file)
- F3: No last-chance backup before compaction (solved: PreCompact hook)
- F4: No automated context threshold alerts (solved: Stop hook at 70% warn, 80% block)
- F5: Context calculation was inaccurate (solved: fixed the math)
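The F4 alert logic, for example, reduces to a tiny threshold check that a hook can run. This is a sketch under assumptions — the bridge-file location, JSON shape, and return values are mine, not Wrought's actual implementation:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical bridge file where the display layer (F2) records context usage.
CONTEXT_FILE = Path(tempfile.gettempdir()) / "claude_context_pct.json"

def check_context(warn_at: float = 70.0, block_at: float = 80.0) -> str:
    """Return 'ok', 'warn', or 'block' from the last recorded context percentage."""
    if not CONTEXT_FILE.exists():
        return "ok"  # no data yet; stay out of the way
    pct = json.loads(CONTEXT_FILE.read_text())["context_pct"]
    if pct >= block_at:
        return "block"  # halt before auto-compaction destroys in-flight state
    if pct >= warn_at:
        return "warn"   # nudge the user to checkpoint the findings tracker
    return "ok"
```

A Stop hook can map "warn" to a printed reminder and "block" to a nonzero exit code.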
This work spanned 4 sessions. Without the tracker, each session would have started from scratch, rediscovering what had been tried. With it, every session picked up exactly where the last one left off.
I have 47 of these trackers. They are the project's institutional memory. Not AI-generated summaries — structured records with dependency maps, resolution tasks, and lifecycle stages.
### 2. Design-First Pipeline (Even When Code Is Cheap)
The most counterintuitive thing about building with AI: the faster code generation gets, the more design matters.
When generating code costs effectively nothing, the temptation is to skip analysis and start implementing. In the first few sessions, that is exactly what happened. And it produced mediocre results. The AI would generate code that worked but was architecturally questionable, or that solved the wrong problem, or that solved the right problem in a way that made the next feature harder to build.
The pipeline that emerged — and that Wrought now enforces — is:
/research -> /design -> /blueprint -> /wrought-implement -> /forge-review
Every feature starts with research (what exists, what are the constraints). Then a design analysis that evaluates multiple options with a structured tradeoff matrix. Then a blueprint with exact file specifications and acceptance criteria. Only then does implementation begin.
Here is what this looks like in practice. When I needed to set up the development environment (Session 1), the /design step evaluated 4 options:
- Option A: Single `.venv` + uv dependency groups (scored 97/105)
- Option B: Multiple virtual environments (scored 68/105)
- Option C: Docker-only development (scored 51/105)
- Option D: System Python + pip (scored 41/105)
Each option was scored across 7 weighted criteria. The analysis took maybe 10 minutes. The design document (docs/design/2026-02-11_1848_dev_environment_strategy.md) is still the reference I consult when questions about the dev setup arise.
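The tradeoff-matrix arithmetic is simple enough to sketch. The criteria names and weights below are invented for illustration (the real document scored 7 criteria to a 105-point maximum):

```python
def score_option(scores: dict[str, int], weights: dict[str, int]) -> int:
    """Weighted sum: each criterion is scored 1-5, then multiplied by its weight."""
    return sum(weights[c] * scores[c] for c in weights)

# Illustrative weights (they sum to 21, so a perfect option scores 21 * 5 = 105).
weights = {
    "simplicity": 5, "tool_support": 4, "reproducibility": 3,
    "onboarding": 3, "ci_parity": 3, "isolation": 2, "disk_usage": 1,
}

# Hypothetical scores for an Option-A-style single-venv setup.
option_a = {"simplicity": 5, "tool_support": 5, "reproducibility": 5,
            "onboarding": 4, "ci_parity": 5, "isolation": 4, "disk_usage": 5}
```

The point is not the arithmetic but the record: the weights and scores survive in the design document, so the decision can be re-derived later.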
Compare that to the alternative: asking AI to "set up a dev environment" and getting whatever the model's default recommendation happens to be that day. That might work once. It does not produce decisions you can explain or revisit 6 months later.
60 design documents later, the pattern has proven itself. Design analysis is cheap with AI assistance. Rework from skipping it is expensive.
### 3. Self-Referential Testing (The Tool Reviews Itself)
The most powerful quality mechanism was not unit tests (though there are 324 of those). It was using the tool on itself.
Wrought's /forge-review skill runs up to 5 parallel AI subagents. Four run on every review, each specialized in a different dimension of code quality: algorithmic complexity, data structure selection, paradigm consistency, and computational efficiency. A fifth — the flow integrator — spawns conditionally when the diff touches navigation-surface files (routes, nav items, wizards, redirects). When this skill was built, it was immediately used to review its own codebase.
The results were humbling. The review found:
- `cli.py` had grown to 936 lines with 6+ separate concerns (module cohesion violation)
- `update_index` was doing O(n) linear scans for upsert operations
- A module-level constant (`DOCS_DIRS`) was a mutable list — a classic Python footgun
- A marker template was duplicated between `cmd_init` and `cmd_upgrade`
All four findings were tracked, designed, blueprinted, planned, implemented, and verified through the pipeline. The code review system found debt in the codebase, and the pipeline system fixed it. Self-referential quality assurance.
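The O(n) upsert finding is a common shape of debt. A sketch of the general fix (not Wrought's actual code): key the index by path so an upsert is a single dictionary assignment instead of a list scan.

```python
# Before: each upsert scans the whole index list looking for a matching path.
def upsert_linear(index: list[dict], entry: dict) -> None:
    for i, existing in enumerate(index):
        if existing["path"] == entry["path"]:
            index[i] = entry  # replace the stale record in place
            return
    index.append(entry)       # not found: append

# After: a dict keyed by path makes every upsert an O(1) assignment.
def upsert_keyed(index: dict[str, dict], entry: dict) -> None:
    index[entry["path"]] = entry
```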
The review system had a blind spot. None of the four subagents checked flow integrity — how a change to a route, a nav item, or a wizard step affects paths through the product. That gap went unexamined until a post-ship navigation bug in a frontend project exposed it. The fix followed the same pipeline — finding, design, blueprint, implementation, review — and in Session 86 a fifth reviewer, flow integrator, was added. The review system found a gap in its own rubric, and the pipeline built the subagent to close it.
The same pattern applied throughout development. The workflow enforcement engine that prevents skipping pipeline steps? It was built after the AI skipped a pipeline step in Session 41. The context compaction defense system? Built after auto-compaction destroyed an in-flight session. Every failure became a finding, every finding became a fix, and every fix was tested by continuing to use the tool.
This is not just dogfooding. It is a continuous quality feedback loop where the product's own methodology catches and corrects its own defects.
## Three Things That Did Not Work
### 1. Building for 72 Sessions Without a Single User
This is the hardest thing to write, because it is the most important lesson.
At Session 72, a competitive landscape analysis revealed that the market had shifted significantly during those first months of heads-down building. An open-source project in the same space had accumulated 50,000+ GitHub stars and 119 community-contributed skills. Anthropic had shipped native features (Agent Teams, Tasks, code review) that overlapped with planned Wrought capabilities. The market window had compressed from an estimated 12-18 months to 6-9 months.
And Wrought had zero users. Zero revenue. Zero external validation.
The go-to-market findings tracker (docs/findings/2026-03-23_1430_wrought_go_to_market_strategy_FINDINGS_TRACKER.md) logged this as F1, severity: Critical. It is the most important finding in the project.
The numbers told the story. By Session 72, I had 31 findings trackers, 47 design documents, 43 blueprints — all focused inward. A sophisticated methodology producing sophisticated artifacts about a tool that no one outside the project had touched.
The fix came late but came clearly: stop perfecting internals, start external validation. Consulting as a bridge to revenue. Content marketing (including this post) as a bridge to users. Plugin distribution as a bridge to developers. All three should have started by Session 20, not Session 72.
That was the nadir. Sixteen sessions later — Session 88 — something finally broke the other way.
A colleague who had been reviewing Wrought's material with a domain practitioner he'd worked with for years sent me the call recap. The practitioner had identified — unprompted, and repeatedly — a specific pain pattern: operators running internal workflow steps by hand, one at a time, with no systematic capture of what worked. The kind of repetitive, structured, correction-loop work that Wrought's pipelines are designed for. A follow-up conversation was offered. The conversation has not yet happened.
This is not a closed user. It is not a purchase, not a contract, not a proof of product-market fit. It is one practitioner, in one vertical, in one conversation, identifying one pain that the tool might address. It is, however, the first external technical signal this project produced in 88 sessions — sixty more sessions than it should have taken to receive.
The lesson still stands. The arithmetic of earlier distribution would have compounded that signal many times over.
### 2. Over-Engineering the Internals
The pipeline skip enforcement saga is instructive.
In Session 41, the AI agent skipped the /plan step after /blueprint. A legitimate process violation. The response? A 3-layer defense-in-depth architecture: skill language hardening, CLAUDE.md rule tightening, and a code-enforced stage gate.
This was implemented, verified, and shipped. In Session 46, the agent added editorial commentary suggesting a skip might be acceptable. So the rules were further hardened with an explicit "no commentary" clause.
In Session 62, the agent used EnterPlanMode directly instead of the reactive pipeline. So a new rule (Rule 8) was added to CLAUDE.md, creating what was by then an 8-rule governance framework just for pipeline adherence.
Three sessions of engineering, three findings tracked, three design documents — all to prevent an AI from occasionally suggesting a shortcut. The enforcement worked, but the effort was disproportionate. A simpler rule with a simpler enforcement mechanism would have captured 90% of the value at 20% of the cost.
When your tool is good at tracking and fixing things, everything looks like something to track and fix. Not everything deserves the full pipeline.
### 3. Underestimating Distribution
The product thesis — that AI coding tools need structured discipline, not just more features — is, I believe, correct. The execution thesis — that building a great tool is sufficient for distribution — was wrong.
Having 15+ skills, 324 tests, a workflow enforcement engine, an up-to-5-agent parallel code review system, and a self-documenting methodology means nothing if the tool is not where developers look for tools. It is not on any marketplace. It is not a plugin. It has no content presence. The website exists but has no organic traffic.
The Claude Code ecosystem has a plugin format. Anthropic has a marketplace (or will soon). Dev.to, Reddit, and Hacker News have active communities discussing exactly the problems Wrought addresses. None of these channels had been touched before Session 70.
That shifted late in the game. By Session 75 the plugin had shipped — MIT-licensed, 42 files, installable with one command. By Session 79 a first blog post had landed. By Session 88 a three-week publishing cadence was running, and this post is its first entry. The direction is right. The timing was late by sixty-plus sessions.
Distribution is not a post-launch activity. It is a pre-launch requirement. If I were starting over, the first 10 sessions would include publishing content and engaging with the community — even before the tool was ready. Feedback from real developers would have shaped the product better than 60 design documents written in isolation.
## The Category Question (Layer 4)
By April 2026, the AI coding tool landscape has taken shape. Five layers are visible; four are competed.
- Layer 1 — IDE and editor integrations (Cursor, Windsurf, Copilot inside VS Code).
- Layer 2 — chat and client UIs (Ona's new two-panel interface, Claude Code's terminal, ChatGPT Codex).
- Layer 3 — agent runtimes and infrastructure (Ona Cloud with warm pools and Project Veto sandboxing; Anthropic's Managed Agents, shipped April 2026; LangChain's Deep Agents; Anthropic's native Code Review; Nous Research's Hermes Agent).
- Layer 4 — methodology, governance, process.
- Layer 5 — tools and skills (skill marketplaces, community-contributed agents, plugin packs).
Layer 4 is empty.
There is no major player in the methodology layer. The closest public benchmark is Ona's six-criteria auto-approve-low-risk-PR mechanism — a 74% reduction in time-to-first-approval and a 3× increase in deployment rate. That result is a strong validation of process discipline as a lever, but it is rule-based automation, not methodology enforcement. It proves that a simple rubric can compound into measurable operational gains. It does not build the process layer.
Wrought is a bet on Layer 4. The 47 findings trackers, 60 design documents, 55 blueprints, and 23 forge-reviews are not product artifacts. They are methodology artifacts. The 84-to-1 ratio between lines of documentation and lines of source code is not a sign of over-engineering. It is what the methodology layer looks like when implemented.
The bet is that when everyone has AI coding assistants and the L1–L3 differentiators flatten, the variable that remains is whether teams have a process their assistant follows. That is Layer 4. Wrought claims it.
## The Methodology: A Brief Overview
For those interested in the process itself, here is how Wrought's pipeline works. You can adopt this approach with or without the tool.
Every significant task starts with a Finding. A finding is a gap, defect, or drift — something that needs attention. It gets logged in a Findings Tracker with severity, type, and a proposed resolution path.
Two pipelines handle two types of work:
- Reactive (something is broken): Incident -> Investigate -> RCA/Bugfix -> Implement Fix -> Code Review
- Proactive (something needs building): Research -> Design -> Blueprint -> Implement -> Code Review
Design before code, always. The /design step produces a structured analysis with multiple options, weighted criteria, and a recommendation. It takes 10-30 minutes with AI assistance and saves hours of rework.
Blueprints specify acceptance criteria. Before implementation begins, there is a document listing exact file changes, acceptance criteria, and test expectations. The AI implements against this spec, not against a vague prompt.
Structured artifacts are cross-session memory. Every step produces a dated, typed document in a known location. When a new session starts, the AI can read these artifacts and resume exactly where work stopped.
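The dated, typed naming convention visible throughout this post can be generated by a small helper. This is a hypothetical sketch — the subdirectory mapping and suffix handling are my assumptions, not Wrought's code:

```python
from datetime import datetime
from pathlib import Path

# Assumed mapping from artifact type to docs/ subdirectory.
SUBDIRS = {"design": "design", "finding": "findings", "blueprint": "blueprints"}

def artifact_path(doc_type: str, slug: str, when: datetime,
                  root: str = "docs") -> Path:
    """Build a dated, typed artifact path, e.g.
    docs/design/2026-02-11_1848_dev_environment_strategy.md"""
    stamp = when.strftime("%Y-%m-%d_%H%M")
    return Path(root) / SUBDIRS[doc_type] / f"{stamp}_{slug}.md"
```

The predictable location is the whole point: a fresh session can glob the directory and reconstruct state without being told where to look.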
Code review closes the loop. After implementation, up to 5 specialized AI subagents review the changes: 4 run on every review — algorithmic complexity, data structure selection, paradigm consistency, and computational efficiency; a 5th (flow integrator) spawns when the diff touches navigation-surface files such as routes, nav items, wizards, and redirects.
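The conditional fifth reviewer amounts to a predicate over the diff's file paths. The path patterns below are my assumptions about what counts as a navigation surface, not Wrought's actual rules:

```python
import re

CORE_REVIEWERS = ["algorithmic_complexity", "data_structures",
                  "paradigm_consistency", "computational_efficiency"]

# Hypothetical patterns marking a file as part of the navigation surface.
NAV_SURFACE = re.compile(r"(routes?|nav|wizard|redirect)", re.IGNORECASE)

def reviewers_for(diff_paths: list[str]) -> list[str]:
    """The four core reviewers always run; flow_integrator joins only
    when the diff touches a navigation-surface file."""
    reviewers = list(CORE_REVIEWERS)
    if any(NAV_SURFACE.search(p) for p in diff_paths):
        reviewers.append("flow_integrator")
    return reviewers
```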
A typical feature takes three sessions and 3-6 hours: finding and research, then design and blueprint, then implementation and review. Each step produces a dated artifact. The methodology works. The artifacts prove it.
## What Happens Next
Three things are now happening in parallel.
Consulting. The methodology that built Wrought is now offered as a service: production bug fixes with root cause analysis, feature architecture and implementation, codebase reviews, and Claude Code workflow setup. The pipeline works on any codebase, not just Wrought.
Plugin distribution. Wrought's 15+ skills, review agents, and hooks are packaged as an open-source Claude Code plugin — 42 files, MIT licensed, built in Session 75. Public marketplace distribution is scheduled to coincide with the V2.0 launch.
Content. This post is the first of a three-week publishing calendar that will land four long-form pieces: this retrospective, a benchmark response to Ona's 74%/3× auto-approve results (around 2026-04-25), a deep-dive on cross-session memory patterns (2026-04-29), and a piece on why design-first matters even more when AI makes code cheap (2026-05-08). Beyond that: structured incident response workflows, solo-founder retrospectives, and how the 88-session methodology maps onto team settings.
Wrought V1.0 is a local CLI tool. V2.0 will be an MCP server — a hosted service that any AI coding assistant can connect to over HTTP. The architecture is designed, the skills are built, and the distribution strategy is now in motion.
If the methodology interests you, there are two paths:
Try the approach. The Findings Tracker pattern and design-first pipeline work with any AI coding tool. Start by creating a markdown file that tracks your significant tasks through structured stages. You do not need Wrought to benefit from the process.
Follow the build. I will be publishing weekly at fluxforge.ai/blog and cross-posting to Dev.to and LinkedIn. The next piece covers cross-session memory in detail — the specific patterns that make AI assistants useful across long-running projects.
The 88 sessions taught me that AI does not replace engineering discipline. It amplifies whatever discipline you bring. Bring structure, get structured results. Bring chaos, get faster chaos.
Wrought is the structure I built. Now it is time to find out if anyone else needs it.
Originally published at fluxforge.ai/blog. Find me on LinkedIn or follow the build at fluxforge.ai.