Anup Karanjkar

Posted on • Originally published at wowhow.cloud

The Spec-Density Score: Agent Spec Quality 2026

TL;DR — The WOWHOW Spec-Density Score is a 0–100 rubric that grades an AI agent spec across six dimensions: constraints, acceptance criteria, examples, failure modes, tool scope, and rollback. Specs below 50 reliably break in production within the first week. Score your spec before writing a single line of agent code.

A spec that scores below 50 on the WOWHOW Spec-Density Score will produce an agent that fails in production within the first week — not because the model is bad, but because the spec gave it nothing solid to stand on. After analyzing dozens of agent build cycles, a clear pattern emerges: the gap between specs that work and specs that don't isn't about length or clever prompting. It's about density — how much load-bearing information is packed into each dimension. The Spec-Density Score is a WOWHOW framework: a 0–100 rubric across six dimensions that lets you audit a spec before a single line of agent code is written. This post walks through the scoring table, explains why each dimension predicts failure, and shows a worked example on a real-world agent spec draft.

Why Agent Specs Fail

Agents differ from traditional software in one critical way: they make decisions at runtime that you cannot fully anticipate at write time. A function either returns the right value or it throws. An agent can misinterpret an ambiguous instruction and silently do the wrong thing for 200 rows of data before anyone notices.

That asymmetry is what makes spec quality so consequential. When you write a function spec, ambiguity surfaces at compile time or in the first unit test. When you write an agent spec, ambiguity surfaces at 2am when the agent has consumed 40,000 tokens and is confidently doing the wrong thing.

The failure modes cluster into six buckets, which is exactly what the Spec-Density Score measures:

  • Constraints — what the agent must never do

  • Acceptance criteria — what "done" looks like

  • Examples — concrete input/output pairs

  • Failure modes — explicit enumeration of known bad paths

  • Tool scope — exactly which tools the agent may call and when

  • Rollback — how to undo what the agent did

Each dimension is scored 0–17 (except Rollback, which is weighted at 15), giving a maximum of 100 points. A score under 50 is a red flag. Under 35, stop and rewrite before building.

The Spec-Density Score: Scoring Table

The table below is the complete WOWHOW Spec-Density rubric. Each dimension has three bands: 0 (missing or useless), partial (exists but incomplete), and full (ship-ready). The weights reflect empirical importance, not symmetry — Failure Modes and Rollback are the two most commonly skipped dimensions and the two that cause the most expensive production incidents.

| Dimension | Weight | 0 points (Missing/Useless) | Partial (half weight) | Full points (Ship-Ready) |
|---|---|---|---|---|
| **1. Constraints** | 17 | No constraints listed, or only "be accurate" / "be safe" platitudes | At least one hard constraint, but no distinction between hard limits and soft preferences | Hard constraints (NEVER do X) and soft constraints (prefer Y) are explicitly separated; each constraint has a reason |
| **2. Acceptance Criteria** | 17 | No success definition, or "the task is complete when it looks right" | Some criteria exist but are subjective ("output should be clean") or missing edge cases | Criteria are machine-checkable: specific field values, status codes, file paths, record counts, or observable side effects |
| **3. Examples** | 17 | No input/output examples provided | One example exists but it is the happy path only; no edge cases or boundary inputs | At least 3 examples: happy path, one edge case, one near-failure case. Each example has input, expected output, and why |
| **4. Failure Modes** | 17 | No failure modes listed; spec assumes success | One or two failures named ("API might be down") but no recovery path or detection heuristic | At least 4 failure modes enumerated. Each has: detection condition, agent behavior on detect, escalation path if unrecoverable |
| **5. Tool Scope** | 17 | No tool list; agent infers what tools to use | Tools are named but no per-tool constraints ("use the search tool" with no rate limit, no forbidden queries, no auth context) | Every tool is listed with: allowed operations, forbidden operations, rate/cost guard, and auth/secret context. Unlisted tools are explicitly off-limits |
| **6. Rollback** | 15 | No rollback path; agent actions are irreversible by design or oversight | Rollback is mentioned ("can be undone") but no concrete steps or pre-condition checks | Rollback is a named procedure: pre-action snapshot, rollback trigger condition, exact rollback steps, and verification that rollback succeeded |

Partial scores use half the dimension weight, rounded down. A partial Constraints dimension therefore scores 8 (17 / 2 = 8.5, rounded down), not 0 or 17, and a partial Rollback scores 7 (15 / 2 = 7.5, rounded down).

How to Calculate Your Score

Read your spec once per dimension. Assign 0, partial, or full. Sum the scores. That's it. The math is deliberately simple because the hard work is the reading, not the arithmetic.
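The arithmetic can be sketched in a few lines of Python. This is a minimal illustration rather than an official WOWHOW tool: the weights and the half-weight rounding come from the rubric above, while the names `WEIGHTS`, `dimension_points`, and `spec_density_score` and the band labels `"missing"`, `"partial"`, `"full"` are assumptions made for this example.

```python
# Dimension weights from the rubric table (they sum to 100).
WEIGHTS = {
    "constraints": 17,
    "acceptance_criteria": 17,
    "examples": 17,
    "failure_modes": 17,
    "tool_scope": 17,
    "rollback": 15,
}


def dimension_points(weight: int, band: str) -> int:
    """Convert a band label (hypothetical names) into points for one dimension."""
    if band == "full":
        return weight
    if band == "partial":
        return weight // 2  # half the weight, rounded down
    if band == "missing":
        return 0
    raise ValueError(f"unknown band: {band!r}")


def spec_density_score(bands: dict[str, str]) -> int:
    """Sum the points across all six dimensions; every dimension must be rated."""
    unrated = set(WEIGHTS) - set(bands)
    if unrated:
        raise ValueError(f"unrated dimensions: {sorted(unrated)}")
    return sum(dimension_points(WEIGHTS[d], bands[d]) for d in WEIGHTS)
```

Note that a spec rated partial on every dimension totals 47 under these rules (five dimensions at 8 plus Rollback at 7), which lands just under the 50-point red-flag line.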

Score interpretation:

  • 85–100: Ship-ready. This spec will carry a production agent.

  • 65–84: Build-ready with known gaps. Acceptable for a staging agent or a low-stakes automation. Fix gaps before production.

  • 50–64: Draft quality. The agent will encounter at least one unhandled failure in the first 48 hours. Rewrite the lowest-scoring dimensions before building.

  • 35–49: Prototype only. Use this spec to generate a skeleton, then throw it away and rewrite from scratch with what you learned.

  • 0–34: Do not build. This spec will produce an agent that destroys time, money, or data. Stop here.
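For teams that want to wire these bands into a checklist or CI comment, the lookup could look roughly like this. The thresholds are the ones listed above; the function name `interpret` and the returned labels are assumptions for illustration.

```python
def interpret(score: int) -> str:
    """Map a 0-100 Spec-Density Score to its interpretation band."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 85:
        return "ship-ready"
    if score >= 65:
        return "build-ready with known gaps"
    if score >= 50:
        return "draft quality"
    if score >= 35:
        return "prototype only"
    return "do not build"
```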

Worked Example: A File-Organizing Agent Spec

Here is a real spec draft (condensed for this post) from a file-organizing agent task — the kind of thing an autonomous coding assistant might tackle when given "clean up this repository."

The Original Draft Spec

"The agent should scan the repo, identify misplaced files, and move them to the correct folders according to the project conventions. It should also rename files that don't follow the naming convention. When done, it should report what changed."

Score this against the rubric:

| Dimension | Weight | Assessment | Score |
|---|---|---|---|
| Constraints | 17 | Zero constraints. Nothing says "never delete," "never touch node_modules," "never move files with open git changes." | 0 |
| Acceptance Criteria | 17 | "Report what changed" is too vague. No definition of "correct folders" or "naming convention." | 0 |
| Examples | 17 | No examples whatsoever. | 0 |
| Failure Modes | 17 | No failure modes. What if the target folder doesn't exist? What if two files would resolve to the same name after rename? | 0 |
| Tool Scope | 17 | No tools specified. The agent will infer access to filesystem read/write, git, possibly shell exec. | 0 |
| Rollback | 15 | No rollback. Once files move, they move. | 0 |

Total: 0/100. This is a three-sentence task description, not a spec. An agent built from this will happily rename your README.md to readme.md, move your .env somewhere "logical," and skip reporting when the run crashes halfway through.

The Rewritten Spec

Here is the same agent spec rewritten using the Spec-Density framework:

Constraints (hard): Never delete any file. Never touch files inside node_modules/, .git/, or any directory whose name starts with a dot. Never move a file that has unstaged git changes (check via git status --short before each move). Never rename a file if the target name already exists. Reason: the agent cannot know whether a "misplaced" file is load-bearing in its current location.

Constraints (soft): Prefer minimal changes. If a file is within one directory level of its "correct" location, flag it for human review rather than moving it automatically.
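Hard constraints like these are easiest to trust when the harness around the agent enforces them too. The sketch below shows one possible pre-move guard under stated assumptions: paths are repo-relative with forward slashes, and the `git status --short` output is passed in as text. The function names are hypothetical, and the parser is deliberately simplified (it does not handle quoted paths or rename lines).

```python
from pathlib import PurePosixPath


def in_forbidden_directory(rel_path: str) -> bool:
    """True if any parent directory is node_modules or starts with a dot (.git included)."""
    parents = PurePosixPath(rel_path).parts[:-1]  # drop the file name itself
    return any(p == "node_modules" or p.startswith(".") for p in parents)


def paths_with_unstaged_changes(status_output: str) -> set[str]:
    """Collect paths whose working-tree column in `git status --short` is not blank.

    Each line has the form "XY path", where Y is the working-tree status.
    Untracked files ("??") are treated as dirty too, a conservative choice
    made for this sketch.
    """
    dirty = set()
    for line in status_output.splitlines():
        if len(line) < 4:
            continue
        worktree_status, path = line[1], line[3:]
        if worktree_status != " ":
            dirty.add(path)
    return dirty
```

A file is eligible for a move only if `in_forbidden_directory` is false and its path is not in the dirty set; the "target name already exists" rule would be a separate existence check at move time.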

Acceptance Criteria: After the run, git diff --stat HEAD shows only file renames and moves, zero content changes. A reorganization-report.md exists at repo root containing: files moved (source → destination), files renamed (old name → new name), files skipped (with reason), and files flagged for human review. All items in src/ follow the kebab-case.ts naming pattern. All items in tests/ end in .test.ts.
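"Machine-checkable" means a script can verify the criterion without human judgment. As an illustration, the two naming criteria could be checked roughly as follows. The regex and the function name `naming_violations` are assumptions; the sketch checks file names only (not directory names) and accepts `.tsx` as well as `.ts` because the happy-path example in this spec ends in `.tsx`.

```python
import re
from pathlib import PurePosixPath

# Lowercase words separated by single hyphens, ending in .ts or .tsx (assumed pattern).
KEBAB_TS = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*\.tsx?$")


def naming_violations(paths: list[str]) -> list[str]:
    """Return the repo-relative paths that break the src/ or tests/ naming criteria."""
    bad = []
    for p in paths:
        path = PurePosixPath(p)
        top = path.parts[0] if path.parts else ""
        if top == "src" and not KEBAB_TS.match(path.name):
            bad.append(p)
        elif top == "tests" and not path.name.endswith(".test.ts"):
            bad.append(p)
    return bad
```

An empty list means both naming criteria pass; anything else is a concrete, reportable failure.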

Examples:

  • Happy path: src/Components/UserCard.tsx → src/components/user-card.tsx. Expected output: move confirmed in report, git shows rename, no content diff.

  • Edge case: src/utils/helpers.ts already correctly named. Expected output: no action taken, file not listed in report.

  • Near-failure: two files would rename to the same target, e.g., UserCard.tsx and user-card.tsx both in scope. Expected output: both flagged for human review, neither moved, conflict logged with both source paths.
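The three examples above map onto a small planning step that runs before any file is touched. The sketch below is one way to express it, with several simplifications that are assumptions of this example: it renames file names only (directory case changes such as Components to components are left out), its kebab-case conversion does not split acronyms, and the names `to_kebab` and `plan_renames` are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath


def to_kebab(stem: str) -> str:
    """Convert a PascalCase, camelCase, or snake_case stem to kebab-case."""
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1-\2", stem)  # split at lower/digit -> upper
    s = re.sub(r"[_\s]+", "-", s)
    return s.lower()


def plan_renames(paths: list[str]) -> tuple[dict[str, str], dict[str, list[str]]]:
    """Return (moves, conflicts) without touching the filesystem.

    moves maps source -> target for safe renames; conflicts maps a target
    to every source that would land on it, so neither file gets moved.
    """
    by_target = defaultdict(list)
    for p in paths:
        path = PurePosixPath(p)
        target = path.with_name(to_kebab(path.stem) + path.suffix)
        by_target[str(target)].append(p)
    moves = {
        srcs[0]: target
        for target, srcs in by_target.items()
        if len(srcs) == 1 and srcs[0] != target  # already-correct names are no-ops
    }
    conflicts = {t: srcs for t, srcs in by_target.items() if len(srcs) > 1}
    return moves, conflicts
```

Files that are already correctly named appear in neither result, which matches the edge-case example; conflicting sources are reported together so both paths can be logged for human review.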

Failure Modes:

  • Target directory does not exist: create it only if the spec explicitly maps to that path; otherwise flag for review.

  • Git is dirty (uncommitted changes in the file being considered): skip that file, log it as "skipped — uncommitted changes".

  • Name conflict after rename: flag both files, move neither.

  • File is binary (image, woff, pdf): skip unless explicitly in scope for this run.

  • Agent token budget exhausted mid-run: write a partial report immediately, mark it as "INCOMPLETE — resumed run needed", exit cleanly.

Tool Scope: Filesystem read (any path outside forbidden directories). Filesystem write — move and rename only, no create or delete. git status --short read-only. Report writer to reorganization-report.md. Shell exec is off-limits (no npm install, no git commit, no arbitrary commands). The agent does not have permission to push, commit, or stage changes.

Rollback: Before the first file move, create a snapshot file at .reorganization-snapshot.json listing every planned move with source and destination. Rollback trigger: the agent or a human runs node rollback-reorg.js which reads the snapshot and reverses each move in reverse order. Rollback verification: git diff HEAD returns empty after rollback. The snapshot file is deleted only after a human confirms the reorg is final.
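To make the procedure concrete, here is a rough Python equivalent of what a script like rollback-reorg.js might do. The spec does not define the snapshot layout, so the JSON shape (a list of objects with `source` and `destination` keys) and the function names are assumptions for illustration.

```python
import json
from pathlib import Path

SNAPSHOT_NAME = ".reorganization-snapshot.json"


def write_snapshot(root: Path, moves: list[tuple[str, str]]) -> Path:
    """Record every planned move (repo-relative paths) before the first one runs."""
    snapshot = root / SNAPSHOT_NAME
    entries = [{"source": s, "destination": d} for s, d in moves]
    snapshot.write_text(json.dumps(entries, indent=2))
    return snapshot


def rollback(root: Path) -> list[str]:
    """Reverse recorded moves in reverse order; return the restored source paths."""
    entries = json.loads((root / SNAPSHOT_NAME).read_text())
    restored = []
    for entry in reversed(entries):
        src = root / entry["source"]
        dst = root / entry["destination"]
        if not dst.exists():
            continue  # this move never happened, e.g. the run stopped early
        if src.exists():
            raise FileExistsError(f"cannot restore {src}: path is occupied")
        src.parent.mkdir(parents=True, exist_ok=True)
        dst.rename(src)
        restored.append(entry["source"])
    return restored
```

The sketch leaves the snapshot file in place after rollback, in line with the rule that it is removed only after a human confirms. The verification step (an empty `git diff HEAD`) would run separately.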

Score this rewrite:

| Dimension | Weight | Assessment | Score |
|---|---|---|---|
| Constraints | 17 | Hard and soft constraints explicitly separated, each with a stated reason. | 17 |
| Acceptance Criteria | 17 | Machine-checkable: git diff output, report file existence, naming pattern, file extension pattern. | 17 |
| Examples | 17 | Three examples: happy path, no-op edge case, conflict near-failure. Each has input, expected output, reason. | 17 |
| Failure Modes | 17 | Five failures enumerated. Each has detection condition, agent response, and escalation or exit path. | 17 |
| Tool Scope | 17 | Every tool named. Forbidden operations explicit. Shell exec specifically prohibited. | 17 |
| Rollback | 15 | Named procedure. Snapshot before first action. Rollback script. Verification step. Cleanup condition. | 15 |

Total: 100/100. That does not mean the agent will never fail. It means the spec gives the agent everything it needs to handle failure gracefully instead of silently.

The Dimensions That Kill Agents Most Often

Failure Modes: The Most Skipped Dimension

Specs written by engineers who know the system well tend to skip failure modes because the engineer mentally simulates the happy path and stops there. The agent has no such mental model. It will encounter the failure mode the engineer "obviously" assumed could never happen, and it will have no instruction for what to do next. So it hallucinates a recovery path, which is worse than doing nothing.

The minimum useful failure mode entry has three parts: the detection condition ("when the API returns 429"), the agent behavior ("wait 60 seconds and retry once"), and the escalation path ("if the second attempt also fails, write the failed items to a retry queue and exit with status code 2"). Anything less is a placeholder, not a failure mode.
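The three-part entry above translates almost line for line into code. In the sketch below, `send` stands in for whatever API call the agent makes and returns an HTTP status code; it, the function name, and the injectable `sleep` are assumptions made so the example stays self-contained. One design choice to note: the sketch keeps processing the remaining items and reports status 2 at the end, whereas a spec could equally require an immediate exit.

```python
import time
from typing import Callable


def process_with_429_policy(
    items: list[str],
    send: Callable[[str], int],
    retry_queue: list[str],
    wait_seconds: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Apply the 429 failure-mode entry; return exit status 0, or 2 if anything was escalated."""
    exit_code = 0
    for item in items:
        status = send(item)
        if status != 429:          # detection condition not met
            continue               # other statuses belong to other failure-mode entries
        sleep(wait_seconds)        # agent behavior: wait, then retry exactly once
        if send(item) >= 400:      # second attempt also failed
            retry_queue.append(item)  # escalation path: queue the item for a later run
            exit_code = 2
    return exit_code
```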

Tool Scope: The Dimension That Creates Security Incidents

Agents with undefined tool scope will call the most powerful tool available when a lower-powered one would suffice. An agent allowed to "use the database tool" with no further constraints will write DELETE queries if it decides that's the cleanest way to solve the problem. Not out of malice — because you told it to solve the problem and it has access to a tool that can do it.

Tool scope entries need four fields: allowed operations (read-only? specific write types?), forbidden operations (never DELETE, never DROP, never shell exec), rate or cost guard (maximum API calls per run, maximum rows returned), and auth context (which credential does this tool use, and does the agent have permission to use it for this specific task or just generally?). A tool that is not listed is not available. That sentence should appear verbatim in every agent spec.
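One way to keep those four fields honest is to encode them as data and route every tool call through a single check. The following is a minimal sketch, not a real agent framework API: `ToolScope`, `ToolPolicy`, `authorize`, and the example credential name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolScope:
    allowed: frozenset[str]      # allowed operations
    forbidden: frozenset[str]    # forbidden operations, listed explicitly
    max_calls: int               # rate/cost guard for one run
    credential: str              # auth context this tool runs under
    calls_made: int = 0


class ToolPolicy:
    def __init__(self, scopes: dict[str, ToolScope]) -> None:
        self.scopes = scopes

    def authorize(self, tool: str, operation: str) -> tuple[bool, str]:
        """Return (allowed, reason); count the call only when it is allowed."""
        scope = self.scopes.get(tool)
        if scope is None:
            return False, "A tool that is not listed is not available."
        if operation in scope.forbidden:
            return False, f"{operation} is forbidden for {tool}"
        if operation not in scope.allowed:
            return False, f"{operation} is not an allowed operation for {tool}"
        if scope.calls_made >= scope.max_calls:
            return False, f"call budget for {tool} is exhausted"
        scope.calls_made += 1
        return True, f"ok (credential: {scope.credential})"
```

Denying by default when a tool is missing from the mapping is what turns the "not listed, not available" sentence from prose into behavior.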

Rollback: The Dimension That Determines Whether Mistakes Are Recoverable

Most agent specs treat rollback as an afterthought — "we can undo it if needed." But "we can undo it" is not a rollback plan. A rollback plan names: the pre-action state capture (snapshot, backup, transaction log), the trigger condition that initiates rollback (human command? automated detection of bad state?), the exact steps to reverse the agent's actions, and a verification test that confirms the system is back to pre-run state.

The classic failure here is building a spec for an agent that sends emails, posts to Slack, or calls an external webhook — and not noting that those actions are irreversible. If your rollback dimension says "N/A — actions are irreversible," that is a full-score entry. It means you thought about it. It does not mean the spec is bad. What kills you is an agent that sends 400 emails before you notice the bug, and you never wrote down that emails were permanent.

Common Scoring Traps

Trap 1: Mistaking length for density

A spec can be 3,000 words and score 15 on the Spec-Density rubric. Word count is not density. A spec that spends 800 words explaining the business context and 20 words on constraints scores 0 on constraints regardless of total length. The rubric measures what is present, not how much text surrounds it.

Trap 2: Accepting "see the code" as a failure mode

Engineers sometimes write "for error handling, see the existing error handler." That is not a failure mode in the Spec-Density sense. A failure mode is a condition the agent might encounter, not a code pattern in the surrounding infrastructure. The agent cannot read your error handler. It needs explicit instruction.

Trap 3: Scoring partial when the dimension is actually missing

The partial band exists for dimensions that are started but not finished. If a spec says "the agent should handle errors gracefully," that is not a partial score on Failure Modes — it is 0, because no failure mode is actually specified. Partial means: at least one concrete entry exists, but not enough entries to cover the known failure space. "Handle errors gracefully" is an aspiration, not an entry.

When to Score: The Spec Review Gate

The Spec-Density Score works best as a gate at a specific point in the agent development workflow: after the spec is drafted but before any code is written. Running the score at this point costs 15 minutes and potentially saves 15 hours of debugging a half-built agent.

Three useful insertion points for teams:

  • Pre-build gate: Any agent spec must score 65+ before the first implementation session begins. Below 65, the spec author rewrites the failing dimensions and re-scores.

  • Pre-production gate: Any agent going to production must score 80+. The gap between 65 and 80 is usually Rollback and edge-case Failure Modes — the dimensions that matter when the agent is running unattended.

  • Post-incident review: After any agent incident, score the spec that produced the failing agent. The dimension with the lowest score is almost always the root cause category. This is not blame assignment — it is a systematic way to identify which spec dimension your team habitually underweights.
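The two gates and the post-incident lookup are simple enough to automate. In this sketch the thresholds (65 and 80) and the weights come from the text above; the names `GATES`, `passes_gate`, and `weakest_dimensions` are assumptions. Because Rollback is weighted at 15 rather than 17, the weakest dimension is compared by fraction of weight earned rather than by raw points, which is an interpretation made for this example.

```python
WEIGHTS = {
    "constraints": 17,
    "acceptance_criteria": 17,
    "examples": 17,
    "failure_modes": 17,
    "tool_scope": 17,
    "rollback": 15,
}

GATES = {"pre-build": 65, "pre-production": 80}


def passes_gate(score: int, gate: str) -> bool:
    """True if the total score meets the threshold for the named gate."""
    return score >= GATES[gate]


def weakest_dimensions(points: dict[str, int]) -> list[str]:
    """Return the dimension(s) that earned the smallest share of their weight."""
    ratios = {d: points[d] / WEIGHTS[d] for d in WEIGHTS}
    lowest = min(ratios.values())
    return [d for d in WEIGHTS if ratios[d] == lowest]
```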

Using the Score With AI-Assisted Spec Writing

If you use an LLM to help draft agent specs, the Spec-Density Score doubles as a prompt structure. Instead of asking "write me a spec for X," ask for each dimension explicitly: "List at least 4 failure modes for this agent, including a detection condition, agent response, and escalation path for each." Then score the output. Models that produce impressive-sounding but score-0 specs on Failure Modes will tell you exactly where to push back.

The score also flags exposure to prompt injection. A Constraints dimension that scores 0 means the agent has no hard limits, so a crafted input can redirect it arbitrarily. A spec that scores 17 on Constraints has explicit NEVER instructions that the agent can treat as inviolable, which makes injection harder to execute silently.

The Score Does Not Replace Judgment

A 100-point spec is not automatically a good spec. It is a complete spec. The score measures structural completeness — the presence of load-bearing information in each dimension. It does not measure whether the constraints are the right constraints, whether the examples cover the actual edge cases, or whether the rollback procedure is technically sound.

Think of it as a pre-flight checklist, not a quality guarantee. A pilot who completes every checklist item correctly is still responsible for whether the destination is correct. The Spec-Density Score tells you the plane has fuel, not that you should make the trip.

What it eliminates is the class of failures that come from forgetting to think about a dimension entirely — which, based on the agent builds that fail most visibly, is the majority of production incidents.

Before your next agent build: score the spec. If any dimension scores 0, stop and fix it. That fifteen-minute audit is one of the highest-ROI investments in an agent project, and it costs nothing but attention. You can browse WOWHOW's developer tools for automation and productivity tools that pair with agent workflows, or explore the full product catalog for starter kits that include pre-scored spec templates. If you want access to the downloadable Spec-Density scoring worksheet, it's available through WOWHOW Pro Vault.

People Also Ask

What is the Spec-Density Score for AI agents?

The Spec-Density Score is a WOWHOW 0–100 rubric for grading an agent spec before any code is written. It scores six dimensions: constraints, acceptance criteria, examples, failure modes, tool scope, and rollback. Each dimension is weighted at 17 points (rollback at 15). Specs below 50 are not build-ready.

What score does an agent spec need before going to production?

The WOWHOW Spec-Density framework sets 80 as the minimum threshold for a production agent. Scores between 65 and 79 are acceptable for staging or low-stakes automation. Anything below 65 is draft quality and will encounter at least one unhandled failure in the first 48 hours of a real run.

Why do agent specs fail even when the model is capable?

Agent failures almost never come from the model. They come from specs that leave runtime decisions unresolved. An agent with no failure modes listed will hallucinate a recovery path when it hits an error. An agent with no tool scope will call the most powerful tool available, which creates security and data-integrity incidents. The spec is the primary failure surface, not the model.

How is the Failure Modes dimension different from error handling in code?

Code error handling is about exceptions the runtime surfaces. Failure modes in the Spec-Density framework are about conditions the agent encounters at runtime that require a decision: what to do when an API returns 429, when a target directory is missing, or when the token budget runs out mid-run. Each entry needs a detection condition, an agent response, and an escalation path — not a generic catch block.

Can a 100-point spec still produce a failing agent?

Yes. The Spec-Density Score measures structural completeness, not correctness. A spec can score 100 because it has entries in all six dimensions, but still have constraints that are wrong for the specific task, examples that miss the actual edge cases, or a rollback procedure that is technically unsound. Think of it as a pre-flight checklist, not a guarantee the destination is right.
