Your agent failed again, and your hand found the model dropdown before you'd finished reading the transcript. The model is the one part of your agent that is public, ranked, and argued about. Everything else is private, unglamorous, and yours. So you upgrade the layer you can see and leave the one actually failing untouched.
You have been debugging the layer easiest to talk about, not the one quietly costing you trust.
The reflex has a sophisticated form too. The builders who would never just click the dropdown reach instead for a thicker harness, a bigger context window, the framework everyone is posting about. It is the same move: spend on the part you can see so you do not have to diagnose the part you cannot.
The Seven Layers
I debug through seven layers now, in a fixed order, starting from the one whose damage reaches furthest. One question per layer, and the failure signature you see when that layer is starved.
| Layer | The question | Starved-layer signature |
|---|---|---|
| 7. Purpose | What job was the agent hired to do? | Wraps the right answer in the wrong task. Good code, wrong repo. |
| 6. Harness | What can the agent reach and what stops it? | Implements the tool, never reaches the goal. Polite loop. |
| 5. Memory | What survives between turns? | Re-learns Monday's correction on Thursday. Goldfish with good vocabulary. |
| 4. Verification | What checks whether the output is right? | Ships clean format, wrong substance. Green test suite, zero semantic coverage. |
| 3. Skills | What reusable procedures does the agent carry? | Good work repeated unreliably. Each call is a ground-up re-derivation. |
| 2. Data | What reality does the agent reason from? | Confident wrong answers from a world that moved. Stale facts quoted with certainty. |
| 1. Orchestration | Who decides what happens next? | Wanders, loops, or stalls. No structural feedback. |
Debug outside-in, design inside-out
You build from the top down: name the job (Purpose), bind the harness, package the skills. But you debug from the bottom up, starting at Data, the facts everything else reasons from, because the lower a starved layer sits, the further its fault has already spread.
Design outside-in. Debug inside-out. The dropdown reflex fails because it debugs at the design end of the stack — the top.
The data trap
A research agent that quotes prices, versions, and policies usually breaks at data. The retrieval is stale, the checks above it are sound, and the failure is a confident answer drawn from a world that moved. Reach for a bigger model and it delivers the out-of-date figure with more poise. The starved floor is lower than anyone was looking.
Find where the agent's facts enter: the retrieval call, the assumptions written into the system prompt, the document set. Then check one claim against its source. When the agent quotes a price, a version, a policy — can you name when that fact was last refreshed? Does it still match the world?
The verification trap
An agent that turns out clean, well-formed prose or code usually breaks one floor up, at verification. The easiest row to mark green by pointing at a passing test suite. Look at what those tests check. Most confirm the output is the right shape and almost none test whether it is right. That is a format check wearing a verification badge.
The trap is worse for an agent that rewrites its own behaviour, which can pass every check while drifting from the behaviour those checks were meant to protect. A green test suite can sit on a starved verification layer, and it is sometimes the most expensive way to stay blind.
How to run the audit
Take the seven questions in debug order, bottom to top. For each one, write the evidence in your repo that answers it: the file path, the asserted comparison, the memory store's last write. None of it needs to be a file — on a raw SDK loop, memory is whatever you persist between calls. The form does not matter.
Where you catch yourself writing a sentence about how you sort of handle that layer, instead of pointing at where it lives, you have found a starved row.
Most of the seven clear in a few minutes — the rows you already trust. Running all of them is what earns you the right to drop to the one or two that bite instead of guessing. The starved row is the one you have been compensating for by hand without naming.
Which layer is in fashion turns over — memory last year, context engineering now — but the audit is what tells you which one is yours. Before you upgrade the model, before you rebuild the harness, run the seven. The row that comes back red is the one already costing you trust, and naming it stops the wasted motion: you quit arguing with the model, quit rewriting prompts that were never the problem, quit adding memory to a layer whose facts were stale to begin with.
This is a shortened version of the full article at The Durability Curve, which includes the full framework, worked examples, and a free one-page audit printout.
Top comments (0)