Today Anthropic shipped Managed Agents — and inside it, a feature called Outcomes.
Outcomes is small in scope and large in implication. The idea: when you dispatch an agent, you also define what success looks like. A separate grader evaluates the agent's output against those criteria and decides whether the work is done or needs another pass.
Most of the coverage focused on the self-correction loop. The deeper story is what Outcomes assumes — and what it quietly exposes.
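To make the shape of that loop concrete, here is a rough sketch in TypeScript. Everything in it is an illustrative assumption on our part: the Outcome fields, the function signatures, and the retry behavior are stand-ins for the idea, not Anthropic's actual Managed Agents API.

```typescript
// Illustrative sketch of the dispatch -> grade -> retry loop described above.
// The Outcome shape and the agent/grader signatures are hypothetical stand-ins,
// not Anthropic's actual Managed Agents API.

interface Outcome {
  objective: string;         // the change the work is supposed to produce
  successCriteria: string[]; // observable, testable statements the grader checks
  maxAttempts: number;       // passes allowed before a human steps in
}

interface Verdict {
  pass: boolean;
  feedback: string; // what failed and why, fed into the next pass
}

type Agent = (task: string, outcome: Outcome) => Promise<string>;
type Grader = (output: string, outcome: Outcome) => Promise<Verdict>;

async function runWithOutcome(
  task: string,
  outcome: Outcome,
  agent: Agent,
  grader: Grader,
): Promise<string> {
  for (let attempt = 1; attempt <= outcome.maxAttempts; attempt++) {
    const output = await agent(task, outcome);     // the agent does the work
    const verdict = await grader(output, outcome); // a separate model evaluates it
    if (verdict.pass) return output;               // criteria met: done
    task = `${task}\n\nGrader feedback: ${verdict.feedback}`; // otherwise, another pass
  }
  throw new Error(`Outcome not met after ${outcome.maxAttempts} attempts`);
}
```

The interesting part of that sketch isn't the loop. It's the `successCriteria` field, and what has to be true for a grader to use it.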
1. Outcomes need testable success criteria
A grader can't grade a vibe.
For Outcomes to do anything useful, success has to be expressible in language that a separate model can evaluate without re-doing the work. That means specific, observable, decomposable. "The form submits and shows a confirmation." "No more than one network request per keystroke." "Email arrives within 30 seconds and includes the order number."
This is what verification has always looked like in good engineering. The difference is the audience. Until now, success criteria were a human courtesy — nice when the PM wrote them, fine when they didn't, because the developer would figure it out in review. With Outcomes, success criteria become the contract the agent is held to. They stop being decoration and start being load-bearing.
A vague Outcome doesn't fail loudly. It quietly accepts wrong work.
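Concretely, here is the same piece of work described two ways, reusing the criteria above. Both objects are illustrative, not a real schema; the point is what a grader can and can't act on.

```typescript
// Illustrative only: the same piece of work, described two ways.

const vagueOutcome = {
  objective: "Improve the checkout flow",
  successCriteria: ["Checkout feels smoother"], // not observable, not decomposable
};

const testableOutcome = {
  objective: "Reduce failed checkout submissions",
  successCriteria: [
    "The form submits and shows a confirmation",
    "No more than one network request per keystroke",
    "Email arrives within 30 seconds and includes the order number",
  ],
};
```

A grader can check every line of the second. The first, it can only agree with.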
2. Most teams don't have them. They have tickets and Figmas.
Walk into the average product org and ask to see the success criteria for the next five tickets in the sprint. You'll get one of three answers:
- "It's in the Figma." (It isn't.)
- "The ACs are at the bottom of the Linear ticket." (Three bullets, all phrased as features.)
- "Talk to Sara, she knows what we're trying to do." (Sara is on PTO.)
The intent of the work lives somewhere — in someone's head, in a Slack thread, in a comment on a design file. Almost never in a structured, retrievable, testable form.
This was tolerable when humans did the building, because humans are excellent at filling in gaps. They ask follow-up questions. They reread the ticket three weeks later and figure it out. They notice when something feels wrong even if no one wrote down what right looks like.
Agents don't do any of that. They execute against what they're given. Outcomes makes this concrete: the team that ships the clearer success criteria gets the better agent run. The team that doesn't ships nothing — or worse, ships something convincing but wrong.
The bottleneck used to be writing code. The bottleneck is now knowing what good looks like, in writing, before the work starts.
3. IntentSpec is what an Outcome looks like before it's machine-readable
This is the part most teams underestimate.
An Outcome — the JSON object you hand to the grader — isn't where the work happens. It's the output of the work. The work happens upstream, when someone sits with a real piece of user friction and decides:
- What is the actual objective? (Not the feature. The change in user behavior.)
- What outcomes would prove this worked? (Observable, decomposable, testable.)
- What edge cases must hold? (The boring failure modes that ship bugs.)
- What constraints can't be violated? (Invariants that don't show up in a happy path.)
- What evidence grounds these decisions? (Tickets, interviews, telemetry — not vibes.)
That's a spec. Specifically, it's an agent-ready spec. When the spec is good, exporting it to an Outcome is a translation step. When the spec is missing, no amount of grader sophistication can rescue the run.
We've been calling this artifact an IntentSpec. The name doesn't matter. What matters is that the artifact exists, persists, and stays anchored to the evidence that justified it.
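If you want to picture the artifact, here is a minimal sketch, assuming the shape described in the list above. The field names are ours, and the Outcome it exports to is the same illustrative one from earlier, not a real schema.

```typescript
// A minimal sketch of the artifact, assuming the shape described above.
// "IntentSpec" is our name for it; the fields mirror the questions a good spec answers.

interface IntentSpec {
  objective: string;     // the change in user behavior, not the feature
  outcomes: string[];    // observable, testable statements that would prove it worked
  edgeCases: string[];   // the boring failure modes that ship bugs
  constraints: string[]; // invariants that must hold even off the happy path
  evidence: string[];    // tickets, interviews, telemetry that justify the above
}

// When the spec is good, exporting it to an Outcome is a translation step.
// (The Outcome shape here matches the earlier illustrative sketch, not a real API.)
function toOutcome(spec: IntentSpec) {
  return {
    objective: spec.objective,
    successCriteria: [...spec.outcomes, ...spec.edgeCases, ...spec.constraints],
    maxAttempts: 3, // arbitrary; a product decision, not a property of the spec
  };
}
```

Notice that the translation function is boring. All of the judgment lives in filling the fields, which is exactly the point.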
What this means for your team
Outcomes makes the spec the load-bearing artifact in agentic delivery. That's a one-sentence reframe with three uncomfortable consequences.
The cost of vague tickets just went up. Before, vague tickets cost a clarification meeting. Now they cost an agent run that completes successfully and produces the wrong thing.
Spec quality becomes a measurable input. When the grader rejects the work, you don't have a model problem — you have a specification problem. That's a much faster feedback loop than waiting for QA to find the bug.
Specs need to live somewhere persistent. Not in a Figma comment. Not at the bottom of a Linear ticket. Not in Sara's head. Somewhere the agent — and the next agent, and the next teammate — can read tomorrow and know what done means.
Anthropic just made specs the contract. The teams that already write them well are about to look extremely smart. The teams that don't are about to find out, expensively, what an agent does with ambiguity.
Originally published on pathmode.io. Pathmode is the intent engineering platform — we help product teams write specs that agents can actually execute against.