Earlier this week I posted about Cursor 3 on LinkedIn and why it does not solve the problem I am working on. I wanted to write up a deeper piece on what the problem actually is and what I am doing instead.
This is that post.
The Five Levels (and Where Most People Are Stuck)
In January, Dan Shapiro published a framework I have not been able to stop thinking about: five levels of AI-assisted software development, from “AI as a smarter autocomplete” to what he calls the dark factory, borrowing from manufacturing. FANUC ran a factory in the 1980s where robots built robots for weeks at a time with no humans present. Lights off. No oversight. Just machines doing work.
Shapiro maps this to software:
Level 0: You write all the code.
Level 1: You delegate discrete tasks to AI.
Level 2: You pair program in real-time alongside AI. (This is where 90% of self-described AI-native developers are.)
Level 3: AI writes code, you review diffs and approve PRs.
Level 4: You write detailed specs, AI builds, you validate outcomes.
Level 5: Specs in, software out. Humans define what and why. The how is fully autonomous.
Level 5 is the dark factory. It sounds futuristic. It is not. StrongDM has been operating at Level 5 with three engineers since mid-2025. OpenAI built a million-line product in five months with three engineers and no manually written code. Spotify is merging 650 AI-generated pull requests per month. Engineers there have not written a single line of code since December.
I have been working my way toward Level 4 and 5 for the last several months. This is what that looks like in practice.
Why the IDE Is the Wrong Center of Gravity
Most AI coding tools, including the ones I used to use, are organized around the IDE. You open the editor, you prompt, you review. The agent is a very good collaborator inside your coding session.
The dark factory does not work that way.
At Level 4 and 5, you are not in the code. You are above it. BCG Platinion, in their analysis of organizations running dark factory programs, identified two competencies that separate teams who make it work from teams who do not:
Harness engineering: designing, building, and refining the factory itself. Choosing agents, configuring pipelines, managing orchestration.
Intent thinking: translating business needs into precise, testable descriptions of desired outcomes. The quality of everything the factory produces is determined entirely by the quality of the specifications going in.
Both of these happen above the IDE. Neither requires touching code. The bottleneck at Levels 4 and 5 is not the coding environment. It is the coordination layer above it, where intent becomes specs, specs become tasks, tasks get assigned to agents, and outcomes get validated against what was actually intended.
That is the problem I am building toward.
The Methodology
Here is the stack I actually use.
Strategy Document. Before anything else, I write a high-level strategy document for the product or feature set. What problem does this solve? For whom? What does success look like in six months? What does it explicitly not do? This is where decisions get made while they are still cheap.
Phases with Defined Value. The strategy breaks into phases. Each phase has a specific, articulable value proposition. Not “add feature X” but “after this phase, users can do Y, which they could not do before.” If I cannot state the user value clearly, the phase is not ready to build.
Sprints. Phases break into sprints. Roughly one- to two-week blocks that deliver something coherent. Not just tasks completed. Something you can point to.
Tasks. Sprints break into individual tasks. Small enough to be completed in a single focused agent session, specific enough to be unambiguous. Every task traces back to the sprint it belongs to, the phase that sprint serves, and the strategy driving the phase.
Per-Task Specs. This is where the methodology diverges from what most people do. For each task, I write a detailed specification. Not just what to build. What does this accomplish? How does it connect to the sprint goal? How does it serve the phase? What are the acceptance criteria? What are the edge cases? What should this explicitly not touch?
The spec is a document. It is reviewable, changeable, referenced throughout the build.
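The hierarchy above, strategy down to per-task spec, can be sketched as a small data model. This is illustrative only; the names and fields are my own, not from any particular tool:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    title: str
    spec: str                      # the per-task specification document
    acceptance_criteria: list[str] = field(default_factory=list)

@dataclass
class Sprint:
    goal: str                      # something you can point to when it ships
    tasks: list[Task] = field(default_factory=list)

@dataclass
class Phase:
    value_proposition: str         # "after this phase, users can do Y"
    sprints: list[Sprint] = field(default_factory=list)

@dataclass
class Strategy:
    problem: str
    success_in_six_months: str
    explicit_non_goals: list[str] = field(default_factory=list)
    phases: list[Phase] = field(default_factory=list)

def lineage(strategy: Strategy, task: Task) -> list[str]:
    """Trace a task back up through its sprint, phase, and strategy."""
    for phase in strategy.phases:
        for sprint in phase.sprints:
            if task in sprint.tasks:
                return [strategy.problem, phase.value_proposition,
                        sprint.goal, task.title]
    raise LookupError("task not found in strategy tree")
```

The point of `lineage` is the traceability requirement: every task should be able to answer "why does this exist?" all the way up the tree.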
Adversarial Spec Review. Before any code gets written, I run the spec through adversarial review cycles. A separate agent, specifically prompted to find problems rather than validate, reads the spec against the task, sprint, phase, and strategy and asks hard questions.
This is the part most people skip. It is the most valuable part.
The adversarial reviewer is not trying to be helpful in the conventional sense. It is looking for ambiguity, missing edge cases, unstated assumptions, scope creep, and misalignment with higher-level goals. The spec goes back and forth until it is tight. Typically two to four rounds.
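The control flow of that back-and-forth is simple. Here is a minimal sketch, assuming the reviewer and author agents are plain callables (stand-ins for real agent calls, not any real API): the reviewer returns a list of objections, and an empty list means the spec is tight.

```python
from typing import Callable

def refine_spec(spec: str,
                reviewer: Callable[[str], list[str]],
                author: Callable[[str, list[str]], str],
                max_rounds: int = 4) -> tuple[str, int]:
    """Run adversarial review rounds until the reviewer has no objections."""
    for round_num in range(1, max_rounds + 1):
        objections = reviewer(spec)
        if not objections:
            return spec, round_num        # spec is tight
        spec = author(spec, objections)   # revise against the objections
    return spec, max_rounds               # cap reached; a human should look
```

The `max_rounds` cap matters: an adversarial reviewer prompted to find problems will always find something eventually, so the loop needs a termination condition other than perfection.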
Build. Only now does a coding agent touch the spec. Because the spec is clear, the agent executes with high fidelity. It knows exactly what to build, why it matters, and what it should not touch.
Adversarial Code Review. Same idea applied to the output. A review agent reads the code against the spec. Not to validate style. Did this implementation accomplish what the spec required? Did it introduce anything the spec prohibited? Does it create issues at the sprint or phase level?
This runs until the reviewer is satisfied.
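The code-review gate has the same loop shape as the spec review, with one difference: the reviewer reads the implementation against the spec, and the builder, not the spec author, gets the findings. Again a sketch with stubbed agent calls, showing only the control flow:

```python
def review_gate(spec, implementation, reviewer, builder, max_rounds=5):
    """Loop until the reviewer finds no spec violations in the code."""
    for _ in range(max_rounds):
        findings = reviewer(spec, implementation)
        if not findings:
            return implementation, True          # reviewer satisfied
        implementation = builder(spec, implementation, findings)
    return implementation, False                 # escalate to a human
```

Returning a pass/fail flag rather than raising makes the escalation path explicit: a failed gate is routine output, not an exception.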
Staging and UAT. The build goes to staging. User acceptance testing happens here, sometimes manual, sometimes run by a QA agent against structured test cases. If something is off, the coding agent gets another round with specific, scoped feedback.
Production and Documentation. When it ships, the coding agent generates a summary of what was built, what changed, and how to use the new features. That summary lives in the task record. It is the handoff document for everything that builds on it.
What This Actually Changes
Most people use AI coding tools as a faster keyboard. You type less, the agent types more, you still own the entire judgment layer.
What I have described is different. The agents own significant judgment, especially in the review cycles. They are not autocompleting my thoughts. They are checking my work, finding blind spots, and pushing back on ambiguity before it becomes a bug.
The result is that I ship features with higher confidence, fewer rework cycles, and a paper trail that makes debugging tractable. When something goes wrong, I can trace back through the spec, the review comments, and the implementation decisions to find exactly where the error was introduced.
That traceability is not just useful for debugging. It is useful for the next build. The system learns from its own history in a structured way.
StrongDM benchmarks their dark factory by compute cost: if you have not spent at least one thousand dollars on tokens per engineer per day, your factory has room to improve. I am not at that level yet. I am somewhere between Level 4 and Level 5, which is a useful place to be. Far enough along to have validated the methodology. Close enough to the edge to still be finding the limits.
The Tool the Methodology Requires
When I started building this way, I ran into a problem: none of the tools I used were designed for it.
Linear and Jira are built for humans managing humans. GitHub Issues is code-centric, with no spec management and no way to give agents structured context. Notion is a docs tool, not an agent-aware coordination layer.
What the dark factory methodology actually needs is a place where strategy documents live next to sprint plans, specs live next to tasks, agents can pull context without manual handoff, and every level of the hierarchy is connected. The IDE is one slot in that stack. It needs to be able to read from a control plane that understands where the work fits.
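One concrete way to read "agents can pull context without manual handoff": the control plane assembles the full chain of context for a task into a single preamble for the agent. A hypothetical sketch (the function and section names are mine, not from any shipping tool):

```python
def build_context(strategy_doc: str, phase_value: str,
                  sprint_goal: str, task_spec: str) -> str:
    """Assemble the strategy-to-task chain into one agent-readable preamble."""
    sections = [
        ("Strategy", strategy_doc),
        ("Phase value", phase_value),
        ("Sprint goal", sprint_goal),
        ("Task spec", task_spec),
    ]
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections)
```

The ordering is deliberate: the agent reads the why before the what, so scope questions resolve against the strategy instead of against guesswork.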
I am building that control plane. It is the tool I needed and could not find. I will share more about it as it gets closer to ready.
Where This Is Headed
The dark factory is not the right metaphor for most people yet. Most teams are at Level 2. Cursor 3 is a genuinely good Level 2 tool, and that is a real and valuable thing.
But the trajectory is clear. Every few months, the ceiling on what agents can do reliably moves up. The teams that will be ready for Level 5 are the ones investing now in the two competencies that matter: harness engineering and intent thinking.
The spec is the system. The code is just what comes out the other end.


