The outreach had been running for 10 minutes when the notification landed: an email had gone out to a contact my whole setup had been quietly avoiding for 6 months. My bad. Too late.
I reread the spec. It said: "follow up with active prospects." That was it. No exclusion list, no status filters, no constraint on who not to touch. Claude executed exactly what I wrote, without hesitation, without leaving anything out.
The spec is the bottleneck, not the model, not the prompt.
I Left One Line Out. Claude Did the Rest.
The agent had access to the full contact list. All of them: active prospects, dormant leads, a distributor we'd quietly stopped working with 6 months earlier, and 2 or 3 people from a partnership that had ended in a way nobody wanted to revisit via automated email. The spec said none of that. It said "active prospects," which in my head had a meaning built from months of context, 12 informal decisions, and at least 1 uncomfortable conversation I had completely forgotten to document anywhere.
Claude didn't know any of that. Claude knew what the spec said.
So it followed up. With everyone who wasn't explicitly excluded. Professionally, promptly, in exactly the tone I'd asked for. The contact list had no "do not email" flag because I'd never thought to add one. The spec had no exclusion rules because I assumed those were obvious. The agent ran on what existed, not on what I intended.
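A minimal sketch of that gap, in Python. Everything here is hypothetical: the `Contact` fields, the `do_not_email` flag (the one my list never had), and the example addresses are illustration only, not the actual pipeline. The point is how little separates the loose reading of "active prospects" from the one I had in my head.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    email: str
    status: str                 # hypothetical values: "active", "dormant"
    do_not_email: bool = False  # hypothetical flag; the original list had none


def select_loose(contacts):
    """What "follow up with active prospects" permits when nothing is excluded."""
    return [c for c in contacts if c.status == "active"]


def select_strict(contacts, excluded_emails):
    """Same rule, plus the exclusions that only lived in someone's head."""
    return [
        c
        for c in contacts
        if c.status == "active"
        and not c.do_not_email
        and c.email not in excluded_emails
    ]


contacts = [
    Contact("prospect@example.com", "active"),
    Contact("old-distributor@example.com", "active"),  # status never updated
    Contact("ex-partner@example.com", "active", do_not_email=True),
    Contact("lead@example.com", "dormant"),
]

loose = select_loose(contacts)
strict = select_strict(contacts, excluded_emails={"old-distributor@example.com"})
```

With this sample data, `loose` contains three contacts and `strict` contains one. Both functions are "correct." Only one of them matches the intent, and the difference is entirely in what got written down.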
This is what changes with AI execution speed: a human dev handed the same spec might have asked a clarifying question, or made a judgment call, or at least hesitated on the edge case. An agent doesn't hesitate. It delivers. (In C, this is called undefined behavior. In AI agents, it's called "the spec I wrote.")
Claude didn't make a mistake. I left a door open.
(And yes, after this I reread every other spec I'd written for that pipeline. Found 4 more missing constraints in the first pass.)
GIGO Was Always the Rule.
GIGO: Garbage In, Garbage Out. The earliest known printed use dates to 1957, in a newspaper article about US Army mathematicians working with early computers. The lesson was already clear: feeding a computer incorrect data produced incorrect output with perfect consistency. The machine executed correctly. The problem was always upstream.
These weren't units with judgment. They were IBM 709s reading punch cards. Ambiguity wasn't an option. Input was binary, output was binary, and the gap in between was always the human.
For more than 60 years, GIGO described a data quality problem. Bad training data in, bad model predictions out. That framing still applies. What changed is which layer the garbage enters.
With AI agents, garbage enters at the spec level, not the data level. The input isn't a corrupted CSV file. It's an incomplete specification: a rule written too loosely, a scope defined too broadly, an exclusion list that lives only in someone's head. The agent doesn't absorb ambiguity. It executes through it, at whatever speed you gave it, against whatever data it has access to.
I noticed this week that my IDE auto-saves every 2 seconds. I've had this setting on and off at least 5 times over the years, and I genuinely cannot remember which way I actually wanted it. The setting feels important enough to exist as a setting but not important enough to write down the reason. Anyway, point is we're bad at documenting the small decisions.
Google Named the New Bottleneck Last Month
In May 2026, a Google whitepaper from Osmani, Saboo, and Kartakis landed on Kaggle and went relatively quiet for a few weeks before the dev community caught up. The main argument: the shift we're living through isn't a new language or a new framework. It's a move from writing code to expressing intent, and most failures that show up as "AI problems" are actually intention problems upstream.
By their numbers, roughly 85% of professional developers now use AI coding agents regularly, with around 41% of all new code being AI-generated. At that scale, a vague spec runs at generator speed against live customer data. Code review catches none of it. And the math compounds fast: if even a fraction of those generated codebases are running on incomplete specs, the surface area for incidents like my outreach one stops being edge-case territory. It becomes the default.
The paper's central framing: "Most agent failures, examined honestly, are configuration failures." Not capability gaps in the model. Configuration failures in the harness. Osmani wrote separately on his blog after the whitepaper that when an agent does something unexpected, his first move is to check the harness, not question the model. Usually it's a missing tool, a rule he wrote too loosely, or a guardrail he forgot to add. That matches exactly what happened to my outreach pipeline.
The earlier Bloomberg piece ended up diagnosing the wrong disease: they saw the symptoms correctly (productivity panic, buggy AI outputs, scope creep) and blamed the tools. The whitepaper is more precise. Capability failures and configuration failures are different problems that point to different fixes. One fix is a better model. The other is writing the spec you should have written before you ran anything.
Your To-Do App Spec Breaks Here.

Try this. Write a spec for a to-do app that causes a real production incident. Add a task, complete a task, delete a task, list tasks. There's nowhere for the spec to go badly wrong in a way that damages something real. The worst case is an item marked complete when it shouldn't be.
Now write a spec for an outreach automation touching partner contacts and active accounts. Who gets contacted, under what conditions, excluded when, from which list, based on which status field, with what override for accounts on hold, and what about the 2 partners currently in a contract renegotiation you've been handling personally for 3 weeks? Every one of those missing constraints is a door the agent will walk through.
It's basically the difference between the tutorial zone and the first real dungeon. Same mechanics, completely different consequence set. Most vibe-coded specs are written for the tutorial zone and deployed in the dungeon.
Most builders apply to-do-app spec discipline to production tools with real-world effects. "Follow up with active prospects" leaves open what "active" means exactly, which contacts are on a voluntary pause, and which ones are being handled personally by someone who will not be happy to see an automated email land. "Sync the product catalog with the partner feed" says nothing about references currently under contract review. "Process incoming orders" gives the agent zero guidance on what to do with an order already handled by the previous run.
Every missing constraint is something the agent will interpret in whatever way the data makes possible, at whatever speed you gave it. At partner feed scale, at outreach scale, at any scale where the tool touches real accounts or real money, leaving gaps in the spec isn't an oversight you'll catch in QA. It's a production incident waiting for the right combination of data and timing.
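Take the "process incoming orders" example. A rough sketch of what writing the missing rule down could look like, assuming orders are dicts with an `id` key and that the set of already-handled IDs is persisted somewhere between runs (both are assumptions for illustration, as are the function and parameter names):

```python
def process_orders(orders, already_processed, handle):
    """Handle each order at most once, even across repeated runs.

    orders            -- iterable of dicts with an "id" key (assumed shape)
    already_processed -- set of order IDs handled by earlier runs; updated in place
    handle            -- callable that performs the real side effect
    """
    newly_processed = []
    for order in orders:
        order_id = order["id"]
        if order_id in already_processed:
            # The rule the loose spec never stated: never touch an order
            # that a previous run (or an earlier item in this run) handled.
            continue
        handle(order)
        already_processed.add(order_id)
        newly_processed.append(order_id)
    return newly_processed


processed_ids = {"A1"}  # left over from the previous run
handled = []
result = process_orders(
    [{"id": "A1"}, {"id": "B2"}, {"id": "B2"}],
    processed_ids,
    handled.append,
)
```

Here only `B2` is handled, and only once. Without the `already_processed` check, the same input triggers three side effects instead of one, and nothing in "process incoming orders" says that's wrong.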
The agent builds both with the same confidence. Domain complexity doesn't transfer automatically into spec complexity. You have to write it.
What I Built to Catch the Gaps.
After the outreach incident, I put a spec refinement layer between the builder and the agent on any new automation. Nothing complicated: a dialogue guided by Claude that runs before the first line of code. It doesn't write the spec. It asks questions you have to answer.
Questions like: who or what should this agent not touch, under any condition? What implicit business rules govern this domain that you've never written down anywhere? What counts as "done," and what counts as "never start"? What happens at the edges: an expired account, a contact in a dispute, a product flagged by legal, an order already processed by the previous run?
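To be clear, the real layer is a conversation, not a script. But the gate behind it can be sketched as a plain checklist: the agent doesn't run until every question has a written answer. This is a simplified stand-in under that assumption; the keys and function names are made up for illustration.

```python
SPEC_QUESTIONS = {
    "never_touch": "Who or what should this agent not touch, under any condition?",
    "implicit_rules": "What implicit business rules have never been written down?",
    "done_vs_never_start": 'What counts as "done," and what counts as "never start"?',
    "edge_cases": "What happens at the edges (expired account, dispute, legal flag, reprocessed order)?",
}


def unanswered(answers):
    """Return the keys of questions that have no written answer yet."""
    return [key for key in SPEC_QUESTIONS if not answers.get(key, "").strip()]


def ready_to_run(answers):
    """The agent gets access only when nothing is left unanswered."""
    return not unanswered(answers)


answers = {
    "never_touch": "Partners currently in contract renegotiation",
    "implicit_rules": "   ",  # whitespace only: counts as unanswered
}
gaps = unanswered(answers)
```

With these answers, `gaps` lists three open questions and `ready_to_run(answers)` is `False`. A checklist like this can't judge whether an answer is any good. It only makes "I never thought about it" impossible to skip silently.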
The first time I used it on a new distribution automation, the initial input was "sync the product catalog with the partner feed." First question the tool generated: "Are there product categories that should be excluded from this sync?" I sat there for about 5 seconds. Then I opened a doc from 4 months ago and found 12 product references flagged as under contract review. I had never once thought to put them in the spec.
That silence after the question is its own diagnostic. Hesitation over 2 seconds = missing constraint. Same instinct as hitting "merge to main" when you've skipped the test run. Your gut knows before your brain does.
The value isn't the output document. It's the 10 minutes of discomfort before the agent runs. I think this only catches what you're already half-aware of at some level. If a constraint has never crossed your mind at all, no question will surface it. That's the honest ceiling.
Specifying Is the Skill. Prompting Is a Subset.
Prompt engineering is the skill of getting better output from a model in a given conversation. Specifying is the skill of defining what the agent is allowed to do, who it can touch, what data it can act on, and where it stops. These are different disciplines operating at different levels of the stack.
A clean prompt gets you a better answer within whatever scope the agent already has. A careful spec defines what the scope is. The builder who writes clean prompts against a loose spec ships fast and breaks things in production. The builder who specifies carefully ships slower up front and recovers from fewer incidents later. With AI compressing delivery timelines from weeks to hours, "later" now means the same afternoon.
If you want to operate at the project level and not just the conversation level, the full prompt contracts framework covers exactly this: what the agent can generate, what it can modify, what requires explicit human sign-off. It's what I moved to after enough incidents like the email one.
The approach I lay out in Vibe Coding, For Real covers the earlier stage: before any code generation starts, define the boundary conditions of the domain, not just the feature list. What the system should never do and which business rules have never been written down because everyone assumed they were obvious.
An agent has no intuition about what you meant. Only about what you wrote.
3 Questions Before Claude Gets Access.
The incident email would not have been prevented by a better model or a better prompt. It would have been prevented by 10 extra minutes of spec work before the agent ran.
3 questions I now run before any agent gets access to live data:
Who or what should this agent not touch, under any condition? Not "who is excluded from this batch" but who is permanently off-limits regardless of status, regardless of list membership, regardless of what the data shows. These exclusions need to be explicit in the spec, not implied by context.
What implicit business rules have you never written down anywhere? Think: the partnership you quietly paused in November, the product references sitting in a legal review doc nobody opens, the 2 or 3 contacts you personally told "I'll handle this one" and never flagged anywhere in the system. Every one of these is a constraint the agent doesn't know about unless you write it.
What is the actual risk level of this project? Not what you hope the risk is, but what happens if the agent runs on everything it has access to. A miscategorized task in a to-do app costs you 30 seconds. An email to the wrong partner costs you a relationship and a morning. A modified order in a live ecommerce feed costs you a billing incident and possibly a customer. The level of detail in the spec has to match the level of risk in the domain, not just the feature count.
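One way to make "detail has to match risk" concrete is a tiered checklist: the higher the risk, the more sections the spec must fill in before anything runs. The tiers and section names below are hypothetical, picked to mirror the three questions above. Treat it as a sketch, not a standard.

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1     # e.g. a to-do app
    MEDIUM = 2  # e.g. internal tooling with limited side effects
    HIGH = 3    # e.g. outreach, live orders, anything touching real accounts


# Hypothetical mapping from risk tier to the spec sections that must be present.
REQUIRED_SECTIONS = {
    Risk.LOW: {"scope"},
    Risk.MEDIUM: {"scope", "exclusions"},
    Risk.HIGH: {"scope", "exclusions", "implicit_rules", "human_sign_off"},
}


def missing_sections(spec, risk):
    """Sections the spec still has to fill in for the given risk tier."""
    return sorted(s for s in REQUIRED_SECTIONS[risk] if not spec.get(s))


spec = {"scope": "follow up with active prospects"}
low_gaps = missing_sections(spec, Risk.LOW)
high_gaps = missing_sections(spec, Risk.HIGH)
```

The one-line spec from the incident passes at `Risk.LOW` and comes back with three missing sections at `Risk.HIGH`. Same spec, different domain, different verdict.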
The incident email cost me one morning and left me with one professional relationship to repair. The agent ran exactly as specified. I just forgot to write the constraint.
That's how it works now. The AI executes what you write, not what you think.
gonna have to learn to be explicit now...
Sources
- Osmani, A., Saboo, S., Kartakis, S. (2026, May). The New SDLC With Vibe Coding. Google / Kaggle. https://www.kaggle.com/whitepaper-the-new-SDLC-with-vibe-coding
- Osmani, A. (2026, June). The New Software Lifecycle. https://addyosmani.com/blog/new-sdlc-vibe-coding/
- Eveilleau, P. (2026). Vibe Coding, For Real. Amazon Kindle. https://a.co/d/0d7xiVlX