DEV Community

Peter Williams

Discovery-First Orchestration

TL;DR

If you read the recent Latent.Space piece on killing code review, you walked away with a clear argument: the bottleneck has moved from writing code to reading it. AI writes too fast for humans to review.

His solution? Move the human checkpoint upstream. Review intent, not implementation.

"The most valuable human judgment is exercised before the first line of code is generated, not after."

It's a solid insight. But the bottleneck is mislabeled.

Code review isn't about reading. It's about decision making.

Reviewing intent is too. And no matter when you do it, decision making takes context.

Moving the decision to the front of the line (spec writing/reviewing) requires just as much human effort to collect that context.

The language changed from C++ to English. The bottleneck remains just as tight. You can't leverage the speed that AI coding enables. You can't achieve the confidence of human code review.

The joy of vibe coding is stumbling into something that just works. But just as often, the AI will call it quits and nothing works at all.

The confidence that code reviews create comes from understanding what should be, and seeing that it is.

Trust is layered. This is the Swiss-cheese model: no single gate catches everything. You stack imperfect filters until the holes don’t align. Compare multiple options, then use deterministic guardrails, acceptance criteria, permission systems, and adversarial verification to filter for the winner.

This is where attribution gets tricky. The image is custom. The rules came from here. And the inspiration trickles all the way through from StrongDM.

It's hard to summarize all three of those together without creating a strawman, so I'll explicitly build one to keep moving.

"With enough implementations and tests and checks the code will do what we want it to." - No one in particular

Again this is close, but falls short. Specs, tests, reviews, and guardrails can ensure things function, but they don't give accuracy, just precision. A human user is going to decide what should be eventually. No amount of preparing and checking is going to confirm you're on the right mark until then.

Stop thinking of yourself as a developer. Or an architect. Or even a product owner.

Make yourself the first user. A superuser. One with the power to rewrite the whole codebase from scratch on a moment's notice. Harness that power, and don't stop until you're sure that what "should be", "is".

That's what Discovery-First Orchestration (DFO) is all about.


The Problem with Specs

You write a spec. It feels productive. You're documenting requirements, thinking like a customer, making decisions, planning the architecture.

Except you're not. You're guessing.

I'm sure you've read these before as tweets, but they apply here.

  1. All models are wrong, but some are useful.
  2. Learn by doing.
  3. Pave the cow paths.

Spanish class didn't make you fluent. Buying a workout plan didn't give you abs. Writing a 200-page business plan didn't make you rich.

In software, you can't fully understand a system as a concept.

  • System design is high-level accuracy without precision
  • Integrating a feature reveals a dependency you didn't expect
  • Feedback reveals a use case you never imagined

This isn't a failure of planning. It's the nature of complex systems. Building, like most everything in life, is a conversation with reality. You can't have that conversation in your head.

And here's the thing we all ignore: nobody goes back to fix the spec after the code is built. You ship the code, the spec rots, and every future developer curses your name when they try to figure out your "intent".

It is directionally correct to review intent, not code. But moving upstream is a mistake. Specific intent isn't correct intent. Intent before development doesn't match the truth of reality after it. Correct intent is learned from doing. You can't build a perfect model.

DFO doesn't try to make the best model. It just asks, "which parts are useful?" and keeps them.


The DFO Loop

DFO is a structured loop to reverse engineer perfectly satisfactory specifications. No explicit writing or review necessary. Here's how it works:

Prompt → Gen → Extract → Compare → Decide → Compile ↺
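The cycle above can be sketched as a single driver function. Everything here is hypothetical scaffolding: each step (generate, extract, compare, decide, compile) is a stand-in for whatever agent or tooling you wire in.

```python
# A minimal sketch of one DFO cycle. Every callable is a hypothetical
# stand-in: swap in real agents, parsers, and human decision hooks.

def dfo_cycle(prompt, spec, generate, extract, compare, decide, compile_spec, n=4):
    """Run one Prompt -> Gen -> Extract -> Compare -> Decide -> Compile cycle."""
    implementations = [generate(prompt, spec) for _ in range(n)]  # parallel in practice
    facts = [extract(impl) for impl in implementations]           # one Fact set per implementation
    recommendations = compare(facts)                              # Lock / Probable / Exploratory / Rejected
    decisions = decide(recommendations)                           # the human-in-the-loop step
    return compile_spec(spec, decisions)                          # the new spec seeds the next cycle
```

Each cycle returns a richer spec, which feeds straight back in as input to the next round of generation.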

1. Prompt

You start with a prompt. Not a spec. Simple. You define as much as you want to, but you don't have to define everything. "A task management app." "An API for my CRM." That's it.

A spec implies you've already decided what matters. A prompt includes opinions and directions, but leaves room for discovery.

But here's the important part: you choose your own adventure.

  • Want to write a detailed prompt with all your requirements? Go ahead.
  • Want to write a one-liner and see what happens? Also fine.
  • Want to write a full spec upfront but still benefit from the discovery process? Sure, I guess.

All valid. All allowed.

DFO doesn't prescribe maximum human effort. It drops the minimum through the floor, but doesn't take agency away. If you want to be deeply involved in every step of every cycle, you can be. If you want to prompt once and 'accept all', you do you. Software in the new era doesn't have to be a black box.

2. Generate

Your prompt is a 'one-shot', but you can (and should) run a few in parallel.

Four implementations is a practical starting point, but four isn't magic. You might scale up or down depending on complexity. You can even run with a single implementation; it just makes the process slower.

The sliding scale isn't linear, but you can think of it this way: more parallel tries mean less saying "no, go try again" without real progress for that cycle. Give yourself more chances to get things right.

Note: The prompt shouldn't tell one agent to "design four different versions", nor should it leave differentiation out entirely. Both paths will hurt the process. Instead, ask individual agents to choose an opinionated perspective as part of the development. This will allow for different interpretations, but not force divergence for divergence's sake.

One version might optimize for "performance". Another might prioritize "easy onboarding". One might design your CRM for developers as the primary user; another for PMOs; another for sales teams. The opinions don't have to be right to be useful.

And nothing says you have to use the same model for all of the implementations (or any other parts I'll get to). Pit models against each other and let them compete for selection. Get diversity without explicitly asking for it.
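One way to get that opinionated divergence without forcing it: spawn separate agents, each seeded with its own perspective. A sketch, assuming a `run_agent` callable you'd supply yourself (the perspective strings are illustrative, not prescribed):

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative perspectives; pick ones that fit your domain.
PERSPECTIVES = ["performance", "easy onboarding", "developer-first", "sales-team-first"]

def generate_all(prompt, run_agent, perspectives=PERSPECTIVES):
    """Run one agent per perspective, each with the same base prompt."""
    def task(perspective):
        # Each agent gets its own opinionated stance, rather than one
        # agent being asked to produce four different designs.
        return run_agent(f"{prompt}\nTake an opinionated stance: optimize for {perspective}.")
    with ThreadPoolExecutor(max_workers=len(perspectives)) as pool:
        return list(pool.map(task, perspectives))
```

Swapping `run_agent` per perspective would also let you pit different models against each other, as suggested above.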

3. Extract Facts

This is where it gets interesting. Traditionally we go toward code review here. But who wants to look at 4x the code when you already can't keep up with AI gen?

StrongDM coined a similar idea as Gene Transfusion: letting agents reproduce code from one implementation to another. In that analogy, I'm calling for genetic sequencing. Turn the exemplary code into the building blocks of a spec which can be used to deterministically recreate the code.

Keep the AI working (with some parsing help). Extract what I'm calling Facts from each implementation.

A Fact isn’t a summary or opinion. It’s a property the code demonstrably enforces: authentication method, database schema, state model, routing structure, API patterns, etc., and how it does so.

These aren't requirements you're writing down. They're patterns you're pulling out of code that already exists. You extract directly, without any additional analysis layer. The facts are what the code does, not what someone thought it should do.
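A Fact could be as simple as a small record: the enforced property plus the evidence for it. A hypothetical shape (the field names here are mine; the article only defines the concept):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """A property the code demonstrably enforces, with its evidence."""
    category: str   # e.g. "auth", "schema", "routing"
    claim: str      # what the code enforces, e.g. "sessions use JWT"
    evidence: str   # where it is enforced, e.g. "middleware/auth.py:12"
    source: str     # which implementation it was extracted from

fact = Fact("auth", "sessions use JWT", "middleware/auth.py:12", "impl-2")
```

Keeping Facts immutable and evidence-backed is what lets later cycles compare them mechanically instead of re-reading code.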

4. Compare and Recommend

But our goal is to find the right should be. Not just what is in a single attempt, or a set of them.

Now you compare the facts across implementations. You look for convergence. You assess confidence in this choice or that one.

Wait, isn't that still code review? Yes, but you don't do it. The AI does this too.

Some facts are strong and easy to identify: 3/4 use JWT for auth. Some are weak: only one got working file uploads, or everybody picked a slightly different user table schema.

Based on this comparison, the system recommends what to do with each Fact:

  • Lock: High confidence, likely from high convergence. If you agree, then this becomes deterministic for future cycles
  • Probable: Some signal but not definite. Include guidance, but stay open to different ideas
  • Exploratory: High divergence and/or low confidence. Let generation run free and discover, just as it does for anything omitted from the original prompt
  • Rejected: Confidence against any/all patterns. Treated the same as Exploratory, but with filtering against certain paths

"Recommends" is the key word. The AI isn't deciding truth; it's surfacing patterns. A human user still drives confidence upward.
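Scoring convergence can start as simple counting. A sketch with illustrative thresholds (the 3/4 cutoff mirrors the JWT example above; Rejected is left out because it comes from judgment against a pattern, not from counting):

```python
from collections import Counter

def recommend(claims, n_impls):
    """Map cross-implementation agreement on one fact category to a status.

    claims: one claim string per implementation that produced one.
    """
    if not claims:
        # Nothing extracted at all: leave the area open for discovery.
        return None, "Exploratory"
    top, count = Counter(claims).most_common(1)[0]
    ratio = count / n_impls
    if ratio >= 0.75:
        return top, "Lock"        # high convergence, e.g. 3/4 agree
    if ratio >= 0.5:
        return top, "Probable"    # some signal, but stay open
    return top, "Exploratory"     # high divergence: let generation run free
```

So `recommend(["jwt", "jwt", "jwt", "session"], 4)` surfaces JWT as a Lock candidate, while four different user table schemas fall through to Exploratory.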

5. Decision Time

Now the human needs to enter the loop.

You test the code, play with the UI, get a feel for what you like or don't.

Heck, go deep with acceptance criteria and rigorous tests if you want. But don't try to "think like a user". Be one. Evaluate the options in front of you and determine what you will or won't accept.

You can read the facts and recommendations. Ask for clarity or context. Give your own suggestions. Make final decisions. But you don't need to.

Tell the AI about your user experience, and let it interpret your feedback toward the Facts.

The human's job isn't to think of everything, understand the code, or manage the Facts. AI does all of this well enough, and way faster than you. You bring what AI can't: taste.

Use your user experience to evaluate the AI-generated code, Facts, and recommendations however you choose.

AI checking is about what works and looks good, not what is right. You can feel what's wrong after just a few seconds as a user.

This is similar to the distinction between recognition and recall. Implemented code and extracted Facts give you the easy path to right. The process, like your UX, should feel borderline effortless.

6. Compile Specs

After review, the Facts compile into a 'spec'. Similar to, but different from, the spec you would have written upfront. Grounded in reality. Based on design and functionality that proved better than its cohort.

Spec by survival of the fittest.

This spec is also more detailed. It has to be, because the volume of Facts will greatly exceed anything a human would sit down and write. It becomes a store of every little building block of intent, with evidence to back up each one.

But the spec isn't an output. It's a chance to restart.

Cycle after cycle, specs get more defined. Constraints lock in. Code generation converges.

You aren't writing a spec and hoping to be detailed enough for the AI to get it right.

You are making the AI get the correct things right by brute force, and then locking down the magic words that got you there.
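Compiling decided Facts back into a spec is mostly bookkeeping: locked claims become hard constraints, probable ones become guidance, rejected ones become filters. A sketch (the structure and names are mine, not prescribed by DFO):

```python
def compile_spec(prev_spec, decisions):
    """Fold (claim, status) pairs from the human pass into the running spec."""
    spec = {
        "constraints": list(prev_spec.get("constraints", [])),  # locks accumulate
        "guidance": [],
        "avoid": [],
    }
    for claim, status in decisions:
        if status == "Lock":
            spec["constraints"].append(claim)  # deterministic for future cycles
        elif status == "Probable":
            spec["guidance"].append(claim)     # included, but open to change
        elif status == "Rejected":
            spec["avoid"].append(claim)        # filter against this path
        # Exploratory claims stay out of the spec on purpose:
        # generation remains free to rediscover or replace them.
    return spec
```

Only Lock accumulates across cycles here; Probable and Rejected are re-derived each round. That matches "constraints lock in" while keeping everything else revisable.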


Selective Determinism

Eventually, you reach a point where the locked specs drive every implementation to feel the same. The important Facts have converged. Others might not be there, and that's fine. You find something surprising: Most aspects don't matter as much as you thought.

You started by trying to reverse engineer specs from code. Now you are able to reverse engineer code from Facts. The loop is complete. The specs deterministically satisfy all known user needs.

When writing specs you might have spent hours debating table schema, API endpoints, button colors, and countless other choices. Now you will see which were important, and naturally push those leverage points while avoiding the time sinks.

Things you wouldn't have thought to specify might turn out to be critical. You'll find those early! And they'll come with a set of options to compare.

But even the unimportant aspects need to be decided, right? Yes, just not by you. Delegate those and let them fit the structure built by the important aspects. This helps you avoid setting constraints that only serve to hamstring discovery.

That's Selective Determinism: you lock what matters, and you stop caring about what doesn't.

Think of it like kosher food. It's available for everyone, but it's specific and optimized where it needs to be. Most of the details don't matter to most people. The ingredients that do matter are locked down.

You didn't waste time researching or detailing every last bit.

You did find the right specs, but the real reward was the ones you learned to ignore along the way.


What's Next

There are technical pieces we haven't touched: the implementation for extraction, the data structures behind "facts" and "confidence", benchmarks of selective determinism, how and when to scale DFO or know whether to apply it at all.

This article focuses on the core loop. Implementation details are engineering problems that follow from this initial prompt. They're important scaffolding, but none change the core insights.

The Facts of DFO were locked in long ago, but never correctly assembled.


What's Old Is New

DFO combines eight pillars, each partially shared with other paradigms:

The other paradigms, for reference: SDD, Agile, TDD, MDD, GP, and A/B testing. The pillars:

  • Generative Multiplicity
  • Comparative Evaluation
  • Outcome-Derived Truth
  • Selective Stabilization
  • Iterative Regeneration
  • Human Taste Steering
  • Cheap Exploration
  • Non-Prescriptive Start

Creating A System Compiler

A classic compiler pipeline looks roughly like this:

Source Code
   ↓
Lexing
   ↓
Parsing
   ↓
Abstract Syntax Tree (AST)
   ↓
Optimization / Transformation
   ↓
Code Generation
   ↓
Executable

Compilers convert messy human input into structured truth, then regenerate a system from that structure.

That's exactly what DFO does, but at the level of systems instead of syntax.

None of the pillars are new. And neither are the insights.

I'm only trying to reframe the Facts we already know through an AI looking glass.

  1. All models are wrong, but some are useful.
  2. Learn by doing.
  3. Pave the cow paths.

TL;DR

No model (or spec, or user story) will ever be perfect. The best way to learn is by doing. Use(rs) will show you what the code needs to do.

Make yourself the first user. A superuser. One with the power to rewrite the whole codebase from scratch on a moment's notice. Harness that power, and don't stop until you're sure that what should be, is.

-

Spec by survival of the fittest.

-

Most aspects don't matter as much as you thought.

-

You are making the AI get the correct things right by brute force, and then locking down the magic words that got you there.

Don't review intent. Don't write specs or user behaviors. Reverse engineer them.

The best way to know what to build, is to build it and find out.
