The agent generates 40 lines of code. You read the diff. You approve the change.
What just happened there? You didn't write code. You didn't design a flow. You didn't even click through a UI. You judged. And that judgment, that single moment of evaluation, is the most important skill in software development right now.
Software development is shifting from execution to evaluation. Agents write. Humans judge.
This shift has a name. And it's not new. Don Norman described it in 1986.
The Two Gulfs
In The Design of Everyday Things, Norman introduced two fundamental gaps that exist between a user and any system they interact with:
The Gulf of Execution is the gap between a user's goal and the actions the system offers to achieve it. How do I trigger this action? Where is the button? What's the right command?
The Gulf of Evaluation is the gap between what the system did and the user's ability to tell whether it worked. Did that do what I expected? Is the system in the right state? Was that correct?
For decades, most of the hard work in UX and developer tooling has been about closing the Gulf of Execution. Better affordances. Clearer navigation. Autocomplete. Syntax highlighting. Documentation. All of it aimed at the same question: how do I do this?
Agents made that question much easier to answer.
The Gulf of Execution Shrank. The Gulf of Evaluation Exploded.
When you describe what you want to an agent ("add input validation to this form"), the agent doesn't make you figure out how to do it. It just does it. It navigates the codebase, writes the code, runs the linter, and presents you with a diff.
The Gulf of Execution shrank dramatically. It collapsed into a single interface: the prompt.
But it didn't disappear. Prompting has its own Gulf of Execution. The same request that seems simple ("add input validation") is actually ambiguous: client-side or server-side? Which library? What error messages? The user still has to know enough to instruct the system well. The Gulf of Execution compressed into prompting. The Gulf of Evaluation became the bottleneck.
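To make that ambiguity concrete, here are two defensible readings of that prompt, sketched in TypeScript. The form fields, the regex, and the error copy are all assumptions; nothing in the prompt pins them down, which is exactly the point.

```typescript
// Reading 1: client-side, fail-fast, terse error.
function validateEmailClient(email: string): string | null {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email) ? null : "Invalid email";
}

// Reading 2: server-side, accumulate every error, user-facing copy.
interface FormInput {
  email: string;
  age: string;
}

function validateFormServer(input: FormInput): string[] {
  const errors: string[] = [];
  if (!/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(input.email)) {
    errors.push("Please enter a valid email address.");
  }
  const age = Number(input.age);
  if (!Number.isInteger(age) || age < 1 || age > 120) {
    errors.push("Age must be a whole number between 1 and 120.");
  }
  return errors;
}
```

An agent will pick one of these (or a third) without asking. Evaluating which one you got, and whether it's the one you needed, is the new work.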
That distinction, between shrinking and compressing, matters. The complexity didn't vanish. It moved.
Before agents, the vast majority of time went to execution. Evaluation happened naturally as a byproduct of doing. Now that execution is instant, the bottleneck shifts entirely to the evaluation side. The volume of output that needs review grows faster than the time available to review it. That's not an incremental change. It's a qualitative shift in what the job actually is.
Now the hard question isn't how do I write this? It's:
- Is this implementation correct?
- Does it match our architecture?
- Are there edge cases the agent missed?
- Is this the right approach, or just an approach?
- Should I approve this?
That last question used to be trivial. Now it carries the entire weight of the interaction.
Human-in-the-Loop
Human-in-the-loop (HITL) is not a new idea. It's a well-established principle in ML systems: keep a human involved at some point in the automated decision cycle to ensure accuracy, safety, and accountability.
The classic HITL question is binary: is a human in the loop, yes or no?
In the age of agentic tools, that question is already answered: yes, obviously. The human reviews the output before it ships. But that answer is no longer enough.
The new question is: at which point in the loop should the human be?
The answer depends on three variables: risk level, reversibility, and domain expertise required. Put together, they give you a practical heuristic (sketched in code after the list):
- Low risk + reversible (a copy change, a CSS tweak): auto-approve, review after the fact if needed.
- High risk + irreversible (a schema migration, a payment flow): pre-review with strict gating before the agent proceeds.
- Ambiguous expertise (does this touch design, architecture, or both?): loop in reviewers from both domains.
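Here's a minimal sketch of that heuristic in TypeScript. The type names, the two-level risk scale, and the gate labels are assumptions for illustration, not a prescribed policy:

```typescript
type Risk = "low" | "high";
type Gate = "auto-approve" | "pre-review" | "multi-domain-review";

interface Change {
  risk: Risk;
  reversible: boolean;
  domains: string[]; // e.g. ["design", "architecture"]
}

function reviewGate(change: Change): Gate {
  // Ambiguous expertise: the change spans domains, so loop in both reviewers.
  if (change.domains.length > 1) return "multi-domain-review";
  // High risk or irreversible: strict gating before the agent proceeds.
  if (change.risk === "high" || !change.reversible) return "pre-review";
  // Low risk and reversible: approve now, review after the fact if needed.
  return "auto-approve";
}

// A schema migration: high risk, hard to reverse.
console.log(reviewGate({ risk: "high", reversible: false, domains: ["backend"] })); // "pre-review"
// A copy change: low risk, trivially reversible.
console.log(reviewGate({ risk: "low", reversible: true, domains: ["content"] })); // "auto-approve"
```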
In other words, deciding where the human sits in the loop is itself a Gulf of Evaluation problem. There's no button that tells you if you got it right.
Cursor as a Case Study
Cursor's workflow makes this concrete. At every stage of the development cycle, someone decided who evaluates. That decision is design:
- Prompt: human, or agent running on prior instructions.
- Generation: agent, with or without active supervision.
- Diff Review: human, or delegated to the agent for a first pass.
- Keep/Reject: human, or auto-approved based on predefined rules.
- Tests: agent, with human review if they fail.
- Commit: human, or automated.
- PR: automated review by BugBot, or human review depending on risk.
- Deploy: human, or automated depending on the environment.
The human isn't removed from the loop. They're repositioned within it, and the position is a choice. Someone has to decide where the checkpoints are, what triggers human review, and what gets auto-approved. That's not a technical decision. It's a design decision with technical consequences.
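One way to make that design decision legible is to write it down. The sketch below is hypothetical configuration, not Cursor's actual API; the stage names and escalation conditions are assumptions chosen to mirror the list above:

```typescript
type Evaluator = "human" | "agent" | "auto";

interface Checkpoint {
  evaluator: Evaluator;
  escalateToHumanIf?: string; // condition that pulls the human back into the loop
}

// Hypothetical checkpoint map: who evaluates, at which stage.
const pipeline: Record<string, Checkpoint> = {
  diffReview: { evaluator: "agent", escalateToHumanIf: "diff touches schema or payment code" },
  keepReject: { evaluator: "auto", escalateToHumanIf: "change exceeds 200 lines" },
  tests:      { evaluator: "agent", escalateToHumanIf: "any test fails" },
  commit:     { evaluator: "auto" },
  deploy:     { evaluator: "human" }, // production stays gated on a person
};
```

The point isn't this particular shape. It's that the checkpoints exist somewhere, explicitly or not, and writing them down makes them reviewable.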
The same pattern appears wherever agents generate output. In v0 or Google Stitch, a designer accepts or rejects a generated component. The loop is the same. The evaluation problem is the same. What varies is how explicit the checkpoints are and who owns them.
There's a risk that comes with all of this: evaluation fatigue. When agents generate output faster than humans can meaningfully review it, the temptation is to approve without reading. The approval becomes automatic. That's the failure mode this entire shift is trying to prevent. Designing the Gulf of Evaluation well means designing against that tendency, not just assuming that having a human in the loop is enough.
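Here's a hedged sketch of what designing against that tendency could look like: refuse to record an approval that arrives faster than the diff could plausibly have been read. The per-line reading floor is an invented number and a crude proxy, but it turns "meaningful review" into something the tooling can at least check:

```typescript
interface ReviewEvent {
  linesChanged: number;
  secondsSpent: number;
}

const MIN_SECONDS_PER_LINE = 0.5; // assumed reading floor, tune per team

function acceptApproval(review: ReviewEvent): boolean {
  const floor = review.linesChanged * MIN_SECONDS_PER_LINE;
  if (review.secondsSpent < floor) {
    // Too fast to have been read: bounce it back for a second look.
    console.warn(
      `Approval after ${review.secondsSpent}s for ${review.linesChanged} lines; expected at least ${floor}s.`
    );
    return false;
  }
  return true;
}
```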
The Work That Remains
Agents can execute. They can verify. They can catch bugs, propose fixes, and review their own output. What they can't do is develop judgment.
Judgment isn't a skill you learn in a course. It's built through years of shipping things that didn't work, noticing what separates good from great, and developing an intuition for what's right before you can articulate why. Taste works the same way. It accumulates slowly, through exposure to excellent work, through mistakes, through the friction of real constraints.
That's not a temporary limitation of current models. It's a structural difference between pattern recognition at scale and the kind of contextual understanding that comes from doing the work over time.
Agents generate outputs. They do not generate standards.
The system can propose answers. Only you can decide whether they're acceptable. That judgment is yours, and it's the one thing the agent cannot generate.
Further Reading
- The Design of Everyday Things: Don Norman (1988). The source of both gulfs, which Norman first described in 1986. Still the clearest articulation of how humans interact with systems.
- Agent Mode: Cursor official documentation. Practical breakdown of how agents work and where human judgment fits in an agentic workflow.
- Design in Tech Report 2026: From UX to AX: John Maeda. Broader context on the shift from execution to evaluation in the AI era. Worth reading for the historical arc from 2015 to today.