DEV Community: Tisha

AIE WF Submission

Tisha — Thu, 09 Jul 2026 07:18:14 +0000

Why Agents Forget

Tisha — Sun, 07 Jun 2026 11:53:10 +0000

Your coding agent is better than it was a year ago, and it still forgets.

The assistants added a memory layer in the meantime. GitHub Copilot now carries your conventions across sessions, and ChatGPT keeps a running profile of what you tell it. But those features remember preferences, not work: they hold "use camelCase," not "we ruled out Redis for the cache last sprint because of the eviction policy." The model underneath still starts every call from nothing.

You see it the first time you leave an agent running across a day. It boots up, runs the tests, watches them fail, and works out from scratch, again, that the database has to be seeded first. Yesterday's run already learned that. Today's run has no way to know it.

From first principles, the reason is not mysterious. A language model is a pure function: its output depends only on its weights and the text in front of it. Nothing carries from one call to the next, because there is nowhere for it to go. The weights freeze after training, and the context window clears when the call ends. Everything we call memory is a workaround for that one fact: we store text outside the model and feed it back in next time. The 2020 RAG paper named the underlying problem and filed "updating world knowledge" under open. Six years on, it still is.

This was a footnote when agents answered one question and stopped. It is the main event now. We give agents work that runs for hours across many sessions, and an agent that carries nothing forward cannot improve. It can only get faster at starting over.

So the question this post takes seriously: how does an agent learn from what it already did? Every claim links to a primary source, because the field is loud and the details are where the truth is.

The north star

Where are we actually trying to go? Separate the two things people call memory, because they are not the same operation.

	Persistence	Learning
What changes	the input (the prompt)	the function (the weights)
How	store facts, read them back	update the model from experience
Example	Copilot recalls your style	a model sharper at the task than last month
Where we are	shipping today	open frontier

Almost everything shipping today is the left column. Copilot recalls that you prefer camelCase. ChatGPT recalls your kid's name. Real and useful, and not the same as a model that is more capable in month six than in month one.

The north star is the right column: an agent that detects it keeps making the same mistake and stops, that internalizes the structure of your codebase and the failure modes of your stack. Anthropic framed memory in a recent talk as the primitive they believe turns agents into systems that improve from their own experience.

Are we there? No, and it is worth being exact about why. What we call learning today changes the text we feed the model, not the model. The weights are identical before and after. The competence is fixed; only the context improves. Most of the confusion in this space comes from not naming which column a system is in.

Where we are

How does an agent remember today? Reduce it to the two operations that matter and the rest is detail. On the way in, retrieval: what to pull into the context window. On the way out, writing: what to persist. Memory is the engineering around those two decisions, plus an offline pass that cleans up between runs.

Start with the constraint that forces the whole design. The context window is the only channel for new information at inference, and it is finite. The MemGPT paper frames this as an operating system managing scarce memory: the window is RAM, the external store is disk, the system pages between them. That is the one analogy worth keeping, because it is the architecture, not decoration.

What are we storing, exactly? CoALA names four kinds, and each maps to a different store and a different update rhythm.

Memory tier	Cognitive type	Infrastructure	Updated
In-context	Working	Context window (RAM)	Every model turn
Session log	Episodic	Temporal graph / append-only log	Live, during the session
Knowledge base	Semantic	Vector DB / document store	Read-heavy, slow writes
Runbooks & rules	Procedural	File system (markdown / wiki)	Offline consolidation

"Seed the database before the tests" is procedural. "This service runs on Postgres 14" is semantic. "The March migration broke staging" is episodic.

How does the right memory reach the window? The base case is RAG: split knowledge into the parametric memory in the weights and a non-parametric store you can search and update without retraining. In practice that store is a vector index. Embed everything, and at query time pull the passages nearest the query in vector space.

The trap is treating nearest as most useful. Similarity is not relevance, and one store cannot serve two different questions. A real retrieval step routes across stores in parallel, to stay inside a latency budget:

import asyncio

# graph, vectors, conflicts, reconcile, and assemble stand in for your own stores and helpers
async def build_context(query, session_id):
    # episodic and semantic live in different stores; fetch concurrently
    episodic, semantic = await asyncio.gather(
        graph.recent_events(session_id, limit=5),            # what just happened, in order
        vectors.search(query, filter={"status": "active"})  # facts, minus the retired ones
    )
    if conflicts(semantic, episodic):     # cheap guard against the staleness failure below
        semantic = reconcile(semantic, episodic)
    return assemble(working=query, episodic=episodic, semantic=semantic)

Episodic state from a temporal store, semantic facts from a vector store, a metadata filter so retired facts never surface. Zep productizes exactly this pairing of a temporal graph with semantic search.

Where does memory come from? Hand-written notes cover what never changes and nothing else. The interesting systems generate it. Copilot proposes memories from your sessions for you to approve. A-MEM links each new note to related ones and lets a new write revise the old ones, so the store reorganizes as it grows. And memory has to forget, or it fills with noise; MemoryBank applies an Ebbinghaus forgetting curve so entries decay with age and strengthen with use.

When does the cleanup run? Increasingly offline. Generative Agents introduced reflection in 2023, synthesizing raw observations into higher-level conclusions on a schedule. Letta turned it into sleep-time compute, doing the thinking while no request is waiting, which cut the query-time compute needed for the same accuracy by roughly a factor of five on their benchmarks. Anthropic ships a version it calls dreaming that mines recent sessions and curates shared memory between runs; the specifics come from a talk, so treat them as preliminary. The principle is constant: move the expensive remembering off the critical path.

The problems

So why is this not solved? Each layer has a failure mode the demos skip. There are six worth knowing.

Failure	Why it happens	The fix
Staleness	similarity has no clock	recency decay + a temporal store
Bloat	save everything, bury the signal	importance scoring, active forgetting
Evaluation	quality was self-reported	your own eval, from day one
Cost	every call re-bills the whole context (prefill)	retrieve less, cache, curate harder
Context rot	attention thins as the window grows	minimal high-signal context
Fleet conflicts	many agents, one store	optimistic concurrency + versioning

Three of these deserve more than a row.

Staleness is the one that bites in production. The agent learned Postgres 13. You moved to 14. The old note is still the nearest match, so the agent retrieves it with confidence. Vector similarity has no concept of time. Generative Agents already showed the fix: score memories not by similarity alone but by similarity weighted by recency and importance. Reduced to one line, the recency-aware score is

score(m) = similarity(query, m) · e^(−λ · Δt)

where Δt is time since the memory was last verified and λ is a decay constant you tune per fact type: high for infra versions, near zero for a person's name.

Add the metadata filter from the router and a temporal store that records when each fact was true, and staleness goes from silent to detectable. The exponential-decay form is the same idea MemoryBank borrowed from the forgetting curve.
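A minimal sketch of that score, assuming similarities are already computed and ages are measured in days. The notes, the similarity values, and the decay constant are all made up for illustration.

```python
import math

def recency_score(similarity: float, age_days: float, decay_rate: float) -> float:
    """similarity * e^(-decay_rate * age_days), the one-line score above."""
    return similarity * math.exp(-decay_rate * age_days)

# Hypothetical notes: (text, similarity to the query, days since last verified)
notes = [
    ("service runs on Postgres 13", 0.92, 400),
    ("service runs on Postgres 14", 0.90, 10),
]

INFRA_DECAY = 0.01  # assumed per-day decay for infra facts; tune per fact type

by_similarity = max(notes, key=lambda n: n[1])[0]
by_recency = max(notes, key=lambda n: recency_score(n[1], n[2], INFRA_DECAY))[0]
```

On raw similarity the stale Postgres 13 note wins by a hair. With the decay applied, its score collapses and the Postgres 14 note takes the top slot.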

Context rot is the subtle one. It is not a metaphor; it is Anthropic's documented term: as the number of tokens in the context window increases, the model's ability to accurately recall from that context decreases. So stuffing in more retrieved memory past a point makes answers worse, not better. More tokens is not more intelligence. Selection is the whole job.

Cost is the one that quietly kills long-running agents. Every call re-bills the entire context as input tokens, and processing those input tokens, the prefill, is most of what you pay for on a long history. Prompt caching softens the repeated prefill but does not erase it. An agent leaning on a bloated memory pays that tax on every turn, and past a point the workflow is not slow, it is financially non-viable. The discipline is the same one context rot demands: feed the model the smallest set of high-signal tokens, not everything you retrieved.
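To see why the tax compounds, here is a back-of-envelope sketch. It assumes no prompt caching and a fixed number of new tokens added to the history each turn; the 500-token figure is hypothetical.

```python
def total_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Input tokens billed across a run when every call resends the full
    history: turn k sends k * tokens_per_turn tokens."""
    return sum(k * tokens_per_turn for k in range(1, turns + 1))

# Hypothetical agent that adds 500 tokens of history per turn
ten_turns = total_input_tokens(10, 500)       # 27,500
hundred_turns = total_input_tokens(100, 500)  # 2,525,000
```

Ten times the turns costs about ninety-two times the input tokens. The history grows linearly, so the bill grows quadratically.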

Under all six sits the real one. None of this changes the model. We are refining the text we hand a fixed function, because changing the function means updating weights from live experience, and nobody has shown how to do that safely, cheaply, and without the model drifting. Every technique above is a way to avoid that wall.

How we get there

So how do we close the gap? Three horizons.

Horizon	What it looks like	Status
Now	hybrid retrieval, timestamps, consolidation, your own evals	available today
Medium	memory as infrastructure: permissions, versioning, audit	emerging
Long	the model learns into its own weights at runtime	open frontier

If you're building one today:

Retrieve with vectors and keywords, so exact names and terms aren't lost to fuzzy similarity.
Timestamp every fact and decay it at retrieval, so a stale note can't outrank the current one.
Add an offline pass between sessions, so the agent stops relearning the same lessons every run.
Write your own eval first, because a leaderboard score is not your workload.
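For the first item, one common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion. The list above does not name a fusion method, so treat this as one option among several. The document IDs are invented, and k=60 is a conventional default, not a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs.
    Each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical results for the query "why did we rule out Redis"
vector_hits = ["notes_on_caching", "redis_eviction_decision", "cache_warmup"]
keyword_hits = ["redis_eviction_decision", "redis_runbook"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

The note that both retrievers agree on rises to the top, even though neither ranking alone was enough to trust.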

Medium term, memory becomes infrastructure. Permissions, so an agent can read the runbook but not corrupt it. Versioning and audit logs, so you can see what it stored, when, and why, and roll it back when it is wrong. Memory stops being a text blob and starts being a system with history.
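A toy sketch of what versioning plus an audit trail could look like, using optimistic concurrency: a write must name the version it read, and a stale writer is rejected instead of silently overwriting. The class and its method names are hypothetical, not any product's API.

```python
class VersionConflict(Exception):
    """Raised when a writer's expected version is stale."""

class VersionedMemory:
    """In-memory store where every write must name the version it read."""

    def __init__(self) -> None:
        self._current: dict[str, tuple[int, str]] = {}      # key -> (version, value)
        self.history: list[tuple[str, int, str, str]] = []  # audit log of accepted writes

    def read(self, key: str) -> tuple[int, str]:
        return self._current.get(key, (0, ""))

    def write(self, key: str, value: str, expected_version: int, author: str) -> int:
        version, _ = self.read(key)
        if version != expected_version:
            raise VersionConflict(f"{key}: expected v{expected_version}, found v{version}")
        new_version = version + 1
        self._current[key] = (new_version, value)
        self.history.append((key, new_version, value, author))
        return new_version

store = VersionedMemory()
seen_version, _ = store.read("db_version")   # both agents read version 0
store.write("db_version", "Postgres 14", expected_version=seen_version, author="agent-a")
try:
    store.write("db_version", "Postgres 13", expected_version=seen_version, author="agent-b")
    conflict = False
except VersionConflict:
    conflict = True                           # agent-b must re-read before writing
```

The second agent's write fails loudly, and the history list records who stored what, which is what you need in order to roll back.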

Long term is the open frontier. To close the loop, the agent has to learn into the model, not into a file beside it. Google's Titans is an early move: a neural memory that updates as it runs, using a surprise signal to decide what to commit, attention serving as short-term memory and the module as long-term. But writing experience back into weights without the model drifting or degrading is unsolved, and anyone claiming otherwise is selling something.

That is the map. Persistence is largely solved. Consolidation is arriving. Learning into the weights is the frontier, and it is where "self-learning agent" either earns the phrase or stays a slogan.

The point

We did not make agents forgetful on purpose. It falls out of how the models work, and nearly everything we have built compensates from the outside. It works well now. A modern agent on a real memory system can answer as if it knows you.

Knowing you is not the same as improving. The day an agent stops needing the note, because the lesson is actually in the model, is the day this stops being a workaround and becomes memory. We are not there. Now you know the exact shape of the gap, and why closing it is the whole project.

Sources: MemGPT · RAG · CoALA · Generative Agents · Titans · Zep · A-MEM · MemoryBank · Sleep-time Compute · Anthropic: context engineering · Anthropic: Memory and dreaming (talk; preliminary) · OpenAI: ChatGPT memory · GitHub Copilot Memory

Your Agent Failed in Prod. Good Luck Reproducing It.

Tisha — Wed, 03 Jun 2026 10:16:43 +0000

9:04 a.m.

A ticket lands. A customer ran your agent yesterday, it called the wrong tool, deleted the wrong record, and now there is a screenshot in your inbox with a red box drawn around the damage. You have the user ID. You have the timestamp. You copy the exact prompt out of the logs, paste it into the same model, with the same system prompt, and hit run.

It works perfectly.

You run it again. It works again. You run it ten more times. The agent behaves like a model employee every single time, and the one run that mattered, the one that cost a customer their data, is nowhere. You cannot make it happen again, which means you cannot debug it, which means you cannot promise it will not happen to the next customer.

This is the reproducibility problem, and if you are shipping anything built on a large language model, it is already your problem. This post is about why it happens, why some of it is actually a feature you do not want to remove, and what you can do to get back the one thing you need: the ability to replay a run exactly as it happened.

What "reproducible" even means here

Most teams use the word to mean two different things and then argue past each other. Pull them apart and the whole topic gets clearer.

The first meaning is bitwise determinism: the same input always produces the identical output, token for token. This is what you assume you have with ordinary software and what you almost never have with an LLM.

The second meaning is replayability: given a run that already happened, you can reconstruct exactly what occurred, the inputs, the sampled outputs, the tool calls, the intermediate state, well enough to debug it. You do not need the model to be deterministic. You need the run to be recorded.

The trap is chasing the first when you actually need the second. Teams spend weeks trying to force their model into bitwise determinism, fail, and conclude the system is unknowable. It is not. You were aiming at the wrong layer.

Temperature zero will not save you

The first thing everyone tries is setting temperature to zero. The reasoning is clean. Temperature controls randomness in sampling. Set it to zero and the model must pick the single most probable next token every time, which is greedy decoding, which should be deterministic. One input, one output, forever.

In theory, yes. In practice, run the same prompt twice at temperature zero and sooner or later the outputs diverge. It often starts with one word, the sentence takes a slightly different turn, and the rest drifts away from there. The reason is the distinction that fixes most of the confusion in this whole area, and it comes from Sara Zan's write up on the topic: sampling determinism is not the same thing as system determinism.

A quick piece of vocabulary, because it shows up everywhere from here on. Before the model emits a token, it produces a raw score for every candidate token in its vocabulary. Those scores are called logits. Picking the token with the single highest logit is an operation called argmax, literally "the argument that gives the maximum." Greedy decoding is just argmax at every step.

So temperature zero makes the selection rule deterministic. Always take the argmax. But it does nothing to guarantee that the logits you are taking the argmax over are identical from one run to the next. If two candidate tokens have logits that are almost tied, a difference in the last few bits is enough to swap which one wins, and once one token changes, every token after it is generated from a different prefix, so the divergence compounds.
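A tiny illustration of how little it takes. The vocabulary and the logits are made up; the only point is that the two runs differ in the last digits of a near tie.

```python
def argmax(scores: list[float]) -> int:
    """Index of the largest score, which is greedy decoding's selection rule."""
    return max(range(len(scores)), key=lambda i: scores[i])

vocab = ["staging", "production", "dev"]   # toy vocabulary
run_a = [4.2000001, 4.2000000, 1.3]        # hypothetical logits, near tie
run_b = [4.2000000, 4.2000001, 1.3]        # same logits, last digits swapped

token_a = vocab[argmax(run_a)]
token_b = vocab[argmax(run_b)]
```

The selection rule was deterministic both times. The inputs to it were not identical, and that was enough to emit a different token.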

So the question becomes: why would the logits ever differ between two runs of the same model on the same input?

The original sin: floating point is not associative

Here is the part that surprises people who have not stared at numerical code. With real numbers, addition is associative. (a + b) + c equals a + (b + c). With floating point numbers it does not, because every intermediate result is rounded to finite precision. The canonical demonstration, from the Thinking Machines write up by Horace He and collaborators:

(0.1 + 1e20) - 1e20  =  0
0.1 + (1e20 - 1e20)  =  0.1

Same three numbers, different grouping, different answer. This is not a bug. It is the price floating point pays for representing both enormous and tiny values with a constant number of significant figures.
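You can check this in any Python interpreter. The sketch below uses an explicit loop for the running total, because the built-in sum() on floats compensates for rounding in Python 3.12 and later and would hide the effect.

```python
# The example from the text, in IEEE 754 double precision
left = (0.1 + 1e20) - 1e20    # 0.1 is rounded away when added to 1e20
right = 0.1 + (1e20 - 1e20)   # the large terms cancel first, so 0.1 survives

def accumulate(values: list[float]) -> float:
    """Plain left-to-right running total."""
    total = 0.0
    for v in values:
        total += v
    return total

order_1 = accumulate([1e20, 0.1, -1e20])   # 0.0
order_2 = accumulate([1e20, -1e20, 0.1])   # 0.1
```

Same values, same function, different order of accumulation, different result.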

Now scale that up. A transformer forward pass, one full run of the model over the input, is millions of additions, multiplications, and reductions across matrix multiplications, normalizations, and attention. Change the order in which any of those reductions accumulate and you change the last few bits of the result. Change the last few bits of a logit and you can change which token is the argmax. That is the chain from low level arithmetic all the way up to a different sentence.

The real culprit is not the one everyone names

The common explanation stops at floating point plus concurrency. In one line: thousands of GPU threads finish in an order nobody controls, and because floating point addition is not associative, adding the same numbers in a different order gives a slightly different sum, so the output wobbles from run to run. It sounds complete. It is wrong, and the Thinking Machines analysis is the clearest debunking of it.

Here is the inconvenient fact that breaks the popular story. Run the same matrix multiplication on the same GPU on the same data a thousand times and you get bitwise identical results every single time:

import torch

# same inputs, same GPU, repeated: every product must match the first bit for bit
A = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
ref = torch.mm(A, B)
for _ in range(1000):
    assert (torch.mm(A, B) - ref).abs().max().item() == 0

Floating point is in play. Massive concurrency is in play. And yet the result is perfectly reproducible. So concurrency plus floating point cannot be the whole answer.

The true culprit is batch invariance, or rather the lack of it. Production inference servers do not run your request alone. They batch it together with whatever other requests happen to arrive at the same moment, for efficiency. The kernels, the low level GPU routines that compute your output, run reductions inside normalization, matrix multiplication, and attention whose results depend on the shape of the batch they ran in. The forward pass is deterministic for a fixed batch. But the batch is not fixed. It depends on concurrent load, on who else is hitting the server in the same millisecond, on conditions you do not control and cannot see.

So your prompt is identical, your parameters are identical, and the thing that changed is the company you were keeping inside the server. This is also why a prompt looks rock solid in local testing and turns flaky in production. The model did not get more creative. The batching conditions changed.
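Here is a toy model of that effect in plain Python. The "kernel" below is a made-up stand-in: it picks how to split a reduction based on batch size, which is an assumption for illustration, not how any particular GPU library decides.

```python
def chunked_sum(values: list[float], chunk: int) -> float:
    """Sum each chunk left to right, then add the partial sums left to right."""
    partials = []
    for start in range(0, len(values), chunk):
        partial = 0.0
        for v in values[start:start + chunk]:
            partial += v
        partials.append(partial)
    total = 0.0
    for p in partials:
        total += p
    return total

def toy_kernel(row: list[float], batch_size: int) -> float:
    """Hypothetical kernel: small batches split each row into smaller chunks
    to keep more workers busy, large batches reduce each row in one pass."""
    chunk = 2 if batch_size < 8 else 4
    return chunked_sum(row, chunk)

row = [1e20, 0.1, -1e20, 0.1]            # your request, identical both times
alone = toy_kernel(row, batch_size=1)    # quiet server -> 0.0
busy = toy_kernel(row, batch_size=64)    # crowded server -> 0.1
busy_again = toy_kernel(row, batch_size=64)
```

For a fixed batch size the result is perfectly repeatable. The request never changed; only the batch around it did.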

The Thinking Machines team showed both the scale of the problem and the fix. Running standard vLLM at temperature zero, a thousand identical prompts to Qwen3-235B-A22B-Instruct-2507 produced eighty distinct completions. With batch invariant kernels, the ones that produce the same result regardless of batch shape, the same thousand prompts produced exactly one. The cost was real but modest: one of their tests, run on Qwen-3-8B, went from twenty six seconds to forty two. Their library, batch-invariant-ops, has since been picked up by SGLang. The three operations that have to be made batch invariant are RMSNorm (a normalization step), matrix multiplication, and attention.

The lesson: true bitwise reproducibility is achievable, but only by controlling the entire inference stack down to the kernels. Almost no one calling a hosted API has that control.

Mixture of experts adds another door

In one line: a mixture of experts model is one large network split into many smaller specialist subnetworks, with a router that sends each token to only a few of them instead of running the whole model every time. Many frontier models are built this way, and the architecture is a second independent source of the same problem. If that routing were per token and independent, it would be deterministic. It is not, and the reason is a number called the capacity factor.

Each expert can only process so many tokens in a given batch. That ceiling is set by the capacity factor: a multiplier on each expert's even share of the batch, which fixes how many tokens one expert will accept before it is full. When too many tokens in a batch all want the same expert, the ones over the limit cannot all be served. The overflow tokens get bumped to their second choice expert, or dropped from that layer entirely. So whether your token reaches its first choice expert depends on how many other tokens in the same batch were competing for it.

That is the same trap as batch invariance, wearing a different costume. The routing decision for your token is not a function of your token alone. It is a function of the whole batch your token landed in. As Vincent Schmalbach lays out, this makes a mixture of experts model deterministic at the batch level and nondeterministic at the level of a single sequence. Send the same prompt twice, get it batched with different neighbors, and the capacity math resolves differently, so your tokens route differently. Same root cause, a second mechanism delivering it.
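A toy router makes the mechanism concrete. Real routers use learned gating scores and several different overflow policies; this sketch assumes a fixed preference list per token and first-come, first-served capacity, purely for illustration.

```python
from typing import Optional

def route(batch: list[tuple[str, list[int]]], capacity: int) -> dict[str, Optional[int]]:
    """Toy router with a per-expert capacity.
    Tokens are served in batch order. Overflow falls through to the next
    preferred expert, or the token is dropped (None) if all are full."""
    load: dict[int, int] = {}
    assignment: dict[str, Optional[int]] = {}
    for token, preferences in batch:
        assignment[token] = None
        for expert in preferences:
            if load.get(expert, 0) < capacity:
                load[expert] = load.get(expert, 0) + 1
                assignment[token] = expert
                break
    return assignment

mine = ("my_token", [0, 1])   # prefers expert 0, then expert 1
quiet = [("other_a", [2, 0]), mine]
crowded = [("other_a", [0, 2]), ("other_b", [0, 1]), mine]

quiet_expert = route(quiet, capacity=2)["my_token"]      # expert 0
crowded_expert = route(crowded, capacity=2)["my_token"]  # expert 1
```

The token and its preferences are identical in both batches. Only the neighbors changed, and the token ran through a different expert.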

The full 360: everything that moves under you

Sampling and kernels are only the inference layer, and they are just two of the eight things that moved under you between yesterday's run and today's. In a real agent they are often among the most stable. Here are the other six, and every one of them can change the output even if the model itself were frozen solid.

The prompt is rarely fixed. Interpolate the date, the user's name, a feature flag, or a sampled few shot example, and the "same" prompt is not the same prompt.

The context is assembled at runtime. Retrieval pulls from an index that updates continuously, so yesterday's chunks are not today's chunks.

Tools return live data. A weather call, a database read, a search API each return something different every time, and the model reasoned over a world state you did not capture.

Time leaks in. "Schedule it for next Tuesday" resolves to a different date depending on when it runs.

The model version drifts. The gpt-4o or claude you called last month may be a different set of weights this month, with no version bump you controlled.

Conversation history accumulates. In a multi turn agent, earlier turns are part of the input, so if any one of them varied, every later turn inherits it.

This is the part most reproducibility discussions miss by staring only at temperature. The sampler is one knob on a machine with eight. To reproduce a run you have to pin all eight, not just the one.

Wait: we actually want some of this

Before going further, we have to be honest about something, because the obvious reaction to everything above is "fine, make it all deterministic and be done." Do not. If you could flip one switch and make your model perfectly deterministic, token for token, forever, you should not flip it. The nondeterminism that wrecks your reproducibility is the same property that makes the model good. The argument has four parts, and most teams only know the first.

Quality: greedy decoding is not safe, it is broken. The intuition is that always taking the single most probable token is the careful, conservative choice. It is not. Holtzman and colleagues showed in their 2020 nucleus sampling work that maximization based decoding, greedy and beam search, drives open ended generation into bland, repetitive, looping text that humans immediately recognize as machine written. Their conclusion was blunt: maximization is the wrong objective for open ended text generation. The fix is to sample, but only from the reliable head of the distribution, truncating the unreliable tail. That is nucleus sampling, the top-p knob, usually set around 0.95. The variation is not decoration. Switch it off and the prose collapses.

The knobs, named. When we say variation we mean three specific controls. Temperature reshapes the distribution before sampling: low values near 0.2 make it peaky and conservative, higher values near 0.8 to 1.0 flatten it and admit more surprising tokens. Top-k restricts sampling to the k most likely tokens (Fan and colleagues). Top-p, nucleus sampling, restricts it to the smallest set of tokens whose probability mass exceeds p (Holtzman and colleagues). These are the levers. Everything downstream is a consequence of how you set them.
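The three knobs fit in a few lines. This is a simplified sketch: the order of operations (temperature, then top-k, then top-p) and the renormalization are assumptions, real decoders differ in the details, and the logits are made up.

```python
import math
import random
from typing import Optional

def filtered_distribution(logits: dict[str, float], temperature: float = 1.0,
                          top_k: Optional[int] = None, top_p: float = 1.0) -> dict[str, float]:
    """Apply temperature, then top-k, then top-p, and renormalize what is left."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())                  # subtract the max for numerical stability
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]                  # keep the k most likely tokens
    kept, mass = [], 0.0
    for tok, p in ranked:                        # smallest prefix whose mass reaches top_p
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}

logits = {"the": 2.0, "a": 1.0, "this": 0.0, "zebra": -3.0}   # hypothetical scores
nucleus = filtered_distribution(logits, top_p=0.9)            # tail cut by mass
top3 = filtered_distribution(logits, top_k=3)                 # tail cut by count
cold = filtered_distribution(logits, temperature=0.2)         # peakier distribution

rng = random.Random(0)
token = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
```

With top_p at 0.9 only "the" and "a" survive, so the sampler still varies but can never pick "zebra". Lowering the temperature moves probability toward the top token without removing anything.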

Accuracy: sampling can make the model more correct, not less. This is the part that converts skeptics, because it is a number rather than a preference. Self consistency (Wang and colleagues, ICLR 2023) throws away the single greedy answer entirely. It samples many diverse reasoning paths, around forty, at temperature 0.7 with top-k 40, then takes the majority vote over the final answers. The gains are large and consistent: plus 17.9 percent on GSM8K, plus 11.0 on SVAMP, plus 12.2 on AQuA, plus 6.4 on StrategyQA, plus 3.9 on ARC challenge. The mechanism is the one that makes random forests beat a single decision tree. Diverse samples, aggregated, beat one confident guess. Determinism would have handed you exactly one path, and a worse answer.
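The aggregation step is almost embarrassingly small. A minimal sketch, assuming you have already sampled the reasoning paths and extracted each final answer as a string; the seven answers here are invented.

```python
from collections import Counter

def self_consistent_answer(final_answers: list[str]) -> str:
    """Majority vote over the final answers of independently sampled paths."""
    return Counter(final_answers).most_common(1)[0][0]

# Hypothetical final answers from seven sampled reasoning paths
sampled = ["18", "18", "26", "18", "17", "18", "26"]
answer = self_consistent_answer(sampled)
```

Three of the seven paths went wrong, and the vote still lands on "18". A single greedy path gives you no vote to take.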

Exploration: agents need to try things. Anything that searches depends on variation. Best of N sampling generates many candidate completions and keeps the best under some scorer, and coverage, the chance that at least one of N samples is correct, climbs with N only because the samples differ. Agent loops that retry a failed tool call, propose alternative plans, or branch are running the same exploration versus exploitation tradeoff that reinforcement learning has always lived on. A perfectly deterministic agent retries the identical failing action forever. Variation is what lets it escape.
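The coverage claim has a one-line form if you assume the samples are independent and each is correct with the same probability p. Both assumptions are simplifications, and the 10 percent figure is hypothetical.

```python
def coverage(p: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct,
    if each one is correct with probability p."""
    return 1.0 - (1.0 - p) ** n

one = coverage(0.10, 1)
ten = coverage(0.10, 10)
fifty = coverage(0.10, 50)
```

At a 10 percent per-sample success rate, ten samples give about 65 percent coverage and fifty give above 99 percent. If every sample is identical, the independence assumption collapses and coverage stays at whatever the first sample gave you.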

Discovery: sampling has found things humans had not. The strongest version of the argument is no longer hypothetical. DeepMind's FunSearch (Nature, 2023) paired a pretrained LLM with an automated evaluator in an evolutionary loop: sample candidate programs, keep the ones that score, mutate those, repeat. It solved the cap set problem in extremal combinatorics, a question Terence Tao had called a favorite open problem, producing the first new discovery by an LLM on a problem of that difficulty, in collaboration with Prof. Jordan Ellenberg. Its successor AlphaEvolve (2025) used an ensemble of Gemini models as mutation operators to evolve entire codebases, and the results shipped. A data center scheduling heuristic that recovers on average 0.7 percent of Google's worldwide compute and has run in production for over a year. A matrix multiplication kernel sped up 23 percent that cut Gemini's own training time by 1 percent. A procedure to multiply two four by four complex matrices in 48 scalar multiplications, the first improvement over Strassen's algorithm in that setting in 56 years. A later study with Tao and collaborators ran it across 67 problems in analysis, combinatorics, geometry, and number theory. None of that happens with the temperature pinned to zero. The diversity of the samples is the search.

The reconciliation

So which do we want, variation or determinism? Both, and the reason they do not contradict is that they live at different layers.

We want variation at generation time, because that is where quality, accuracy, exploration, and discovery come from. We want determinism at replay time, because that is where debugging, regression testing, and incident response come from.

The mistake teams make is trying to buy reproducibility by killing generation time variation. That is the wrong layer, and it costs you everything in the four sections above. You do not freeze the model. You capture what it did. The inputs, the sampled outputs, the tool calls, the retrieved context, the model version, the timestamp. Then you replay the captured run, not a fresh generation. Keep the creativity. Record the evidence.

Record and replay

The technique that resolves the whole tension is borrowed from a decades old idea in software testing: record the real interaction once, replay it forever. There are three distinct jobs it does, and it helps to keep them separate.

Post mortem debugging. When the 9:04 ticket arrives, you do not re run the model and hope. You pull the recorded run: the exact assembled prompt, the exact sampled completion, the exact tool inputs and outputs, the retrieved chunks, the model version string. Now the bad run is in front of you, frozen, and you can actually trace what happened. This is the capability you were missing in the opening story.

Concretely, the recording is one envelope per run. Capture it on the way out, so the agent writes its own black box recorder:

import hashlib

# record, now_iso, and the variables below are the agent's own helpers and state
record({
    "run_id": run_id,
    "timestamp": now_iso(),
    "model": resolved_model_version,     # not the floating alias
    "params": {"temperature": 0.7, "top_p": 0.95},
    "system_prompt_hash": hashlib.sha256(system_prompt.encode("utf-8")).hexdigest(),
    "messages": messages,                # the assembled prompt
    "retrieved_chunks": [c.id for c in chunks],
    "tool_calls": tool_calls,            # name + args, as sent
    "completion": completion,
})

When the ticket lands, that envelope is the whole crime scene, frozen:

{
  "run_id": "a3f9c1",
  "messages": [{"role": "user", "content": "clean up the inactive accounts in staging"}],
  "retrieved_chunks": ["runbook_staging_cleanup"],
  "tool_calls": [{"name": "delete_accounts", "args": {"target": "production", "filter": "status = inactive"}}],
  "completion": "Done, I cleared the inactive accounts."
}

Now look at what the envelope rules out. The user said staging. The retrieved chunk runbook_staging_cleanup is the correct runbook and it says staging. The assembled prompt is clean. And yet the tool call went to production. Nothing in the context explains the swap, and that is the whole point. The retrieval was right and the prompt was right, so the failure did not live in your data pipeline. It lived in generation. Your request was batched with sixty four others that millisecond, two candidate tokens for that argument sat almost tied, one logit crossed its neighbor, and production won where staging should have. Replay the same prompt alone and it behaves, because the batch that tipped it is gone. The envelope is what lets you say that with confidence: the inputs were perfect, so stop grepping your retriever and go read the sampler. This is the failure the first half of this post was about, caught in the act.

Capturing all of this in production is not free, and that is the honest tension. Recording every run means writing the assembled prompt, the retrieved chunks, every tool input and output, and the model version to durable storage on the hot path, which costs storage and adds a little latency to each request. Those payloads also carry whatever the user typed and whatever your retriever pulled, which in most enterprises means customer PII headed for durable storage. So a deterministic redaction pass has to scrub the envelope before it ever reaches the recorder, not after it lands. Open instrumentation standards like OpenInference, and tracing backends like Phoenix, exist to make this routine: they capture the spans of an agent run as structured telemetry and stream the payloads to a data store you can query later. The practical move is to record the full envelope for everything in production and then expire or down sample it: keep every run for a few days so the 9:04 ticket is always answerable, and keep the interesting runs, the failures and the flagged ones, forever. The same envelopes you captured in production are what you replay in CI.
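A minimal sketch of a deterministic redaction pass, run before the envelope is handed to the recorder. It scrubs only email addresses with one regular expression, which is an assumption made to keep the example short; real PII coverage needs far more than this.

```python
import re

# Simplified email pattern, for illustration only
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact(value):
    """Recursively replace email addresses in every string inside dicts and lists."""
    if isinstance(value, str):
        return EMAIL.sub("[EMAIL]", value)
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value   # numbers, booleans, None pass through unchanged

envelope = {
    "run_id": "a3f9c1",
    "messages": [{"role": "user", "content": "email jo@example.com the report"}],
    "params": {"temperature": 0.7},
}
clean = redact(envelope)   # this, not the raw envelope, goes to the recorder
```

Because the pass is a pure function of the envelope, the same input always produces the same scrubbed record, so the redacted envelope is still replayable.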

Experience reuse. A recorded run is also a cache. If the same inputs come around again, you can serve the recorded output instead of paying for another generation, which is faster and free.

Deterministic CI. CI is the automated test suite that runs every time someone pushes code, and you want it deterministic, meaning the same code always gives the same pass or fail instead of flaking at random. This is where most teams adopt record and replay first, and the motivation is brutal and practical. The Learnixo write up names the three problems precisely. Cost: every real call to a hosted model during CI burns budget, and fifty developers times ten pull requests times twenty tests is ten thousand model calls, so it adds up fast. Nondeterminism: a test that asserts the output equals an exact string fails most of the time, because the model does not return the same string twice. Latency: a real call takes two to ten seconds, so a suite with thirty of them run back to back takes one to five minutes, which kills the fast feedback loop that makes CI worth having.

Record and replay fixes all three at once. Record the real responses once, replay them on every subsequent run. Tests become free, deterministic, and fast.

The trap in the fix cycle

There is a catch that a sharp reviewer will find in about ten seconds, so let us find it first. Record and replay is a superb post mortem tool. It is a bad fix verification tool, and the reason is the same nondeterminism we have been chasing the whole way down.

Walk the loop. The 9:04 envelope tells you the agent emitted production where it meant staging. You write a fix: a tighter system prompt, a guard on the tool, a reworded instruction. Now you want to prove the fix works. But the moment you change the prompt, the input hash changes, so the recorded run no longer matches and your replay is a cache miss. A miss falls through to a live call, and a live call is back in the land of batching and logit flips, the exact thing you could not reproduce in the first place. Even with the input held byte for byte identical, regenerating rebatches your request with whoever else is on the server, so the flip you are trying to squash may simply not fire today. You cannot confirm a fix by regenerating, because a live generation is not deterministic and a fix by definition changes the input.

The way out is to stop asking record and replay to do a job it cannot, and to split testing into two layers.

Layer one, exact match replay for control flow. Freeze the captured context as a fixture and assert on structure, not prose. Given this exact prompt and these exact retrieved chunks, does the agent take the same path, call the same tool, with an argument of the right shape and the right target? This layer is deterministic and free because it never calls the model. It catches the regression that matters most here: the guard you added must make the destructive target impossible, and a frozen fixture proves it without a single live token.
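A sketch of what a layer one test can look like, assuming the recorded run is stored as a plain dict and the fix is a guard that runs between generation and tool execution. The fixture fields, the guard, and the replay_step helper are all hypothetical names for illustration.

```python
class BlockedToolCall(Exception):
    """Raised when a tool call targets an environment the user did not ask for."""


def guard(tool_call, requested_env):
    # The fix under test: refuse any target other than the requested one.
    target = tool_call["args"]["target"]
    if target != requested_env:
        raise BlockedToolCall(f"target {target!r} != requested {requested_env!r}")
    return tool_call


def replay_step(fixture):
    """Run the post-generation code path on a recorded output. No model call."""
    try:
        call = guard(fixture["model_output"], fixture["requested_env"])
    except BlockedToolCall as exc:
        return {"status": "blocked", "reason": str(exc)}
    return {"status": "executed", "tool": call["tool"], "target": call["args"]["target"]}


# Frozen fixture: the failing run, with the bad argument exactly as recorded.
FAILING_RUN = {
    "requested_env": "staging",
    "retrieved_chunks": ["runbook_staging_cleanup"],
    "model_output": {"tool": "cleanup_environment", "args": {"target": "production"}},
}
# The same context with the argument the model should have produced.
HEALTHY_RUN = {
    **FAILING_RUN,
    "model_output": {"tool": "cleanup_environment", "args": {"target": "staging"}},
}


def test_guard_blocks_recorded_failure():
    assert replay_step(FAILING_RUN)["status"] == "blocked"


def test_guard_allows_correct_target():
    assert replay_step(HEALTHY_RUN) == {
        "status": "executed",
        "tool": "cleanup_environment",
        "target": "staging",
    }
```

Both tests assert on the tool name and the target, never on prose, and neither one touches the network.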

Layer two, semantic judgement for the parts that are allowed to change. When the thing you changed is the wording of a prompt or the model version, bitwise equality is the wrong assertion, because the whole point of the change is that the text will differ. Here you run the candidate against the recorded context and score the output with an evaluator, an LLM as a judge, that asks whether the new answer means the same thing as the old one rather than whether it matches word for word: did the answer stay grounded in the chunk, did it refuse the destructive call, did it preserve the meaning of the gold response. The recorded envelope becomes the regression fixture, and the judge accepts any output that means the right thing.
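A rough sketch of the layer two harness, assuming the judge is injected as a function that answers one yes or no question at a time. The rubric wording, judge_regression, and the keyword-based stub_judge are illustrative assumptions; in practice the stub would be replaced by a call to your evaluator model.

```python
RUBRIC = [
    "Is the answer grounded in the retrieved chunk?",
    "Does the answer avoid the destructive call?",
    "Does the answer preserve the meaning of the gold response?",
]


def judge_regression(envelope, candidate_output, ask_judge):
    """Score a candidate against a recorded envelope, one rubric question at a time.

    ask_judge(question, context) -> bool is injected, so the evaluator model
    can be swapped or stubbed. Returns the list of questions that failed.
    """
    context = {
        "retrieved_chunks": envelope["retrieved_chunks"],
        "gold_response": envelope["gold_response"],
        "candidate": candidate_output,
    }
    return [q for q in RUBRIC if not ask_judge(q, context)]


def stub_judge(question, context):
    # Placeholder for a real evaluator call; only here so the sketch runs offline.
    if "destructive" in question:
        return "production" not in context["candidate"]
    return "staging" in context["candidate"]


envelope = {
    "retrieved_chunks": ["runbook_staging_cleanup: delete stale staging resources"],
    "gold_response": "I will clean up the staging environment.",
}
reworded = "Cleaning up stale resources in staging now."
wrong = "Cleaning up stale resources in production now."

print(judge_regression(envelope, reworded, stub_judge))  # no failed questions
print(judge_regression(envelope, wrong, stub_judge))     # every question fails under this stub
```

The reworded answer shares almost no tokens with the gold response and still passes, which is the behavior an exact match assertion cannot give you.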

That is the loop closed. The envelope you captured in production verifies structure deterministically and meaning semantically, and neither layer asks a nondeterministic system to repeat itself on command.

How to actually do it in your test suite

There is a small ecosystem for this in Python, and the right answer is a layered strategy rather than a single tool. But first, a warning about which layer you record at, because the obvious one is the wrong one.

The instinct is to mock the network: intercept the HTTP call to the model and replay the bytes. For a single synchronous request that works. For a real agent it breaks, and it breaks in exactly the conditions you ship in. Token streaming with stream=True turns one response into a long lived chunked transfer that network cassettes mangle. Concurrent asyncio event loops interleave several model calls over the same connection. HTTP/2 multiplexing carries multiple requests down one socket at once. Record at the socket and you are trying to freeze a river.

Record one level up instead. Mock at the framework or orchestrator boundary, the provider your agent calls through, and override the step function of the agent loop rather than the network underneath it. Call it deterministic graph state hydration: you are capturing the internal state transitions of the execution graph, the prompt that entered a node and the structured output that left it, not the raw packets in between. This is the difference any good review will probe, raw network payloads versus the internal state machine of the agent, and the agent state is the layer that actually replays cleanly. The tools below sit at different points on that spectrum, and the first thing to know about each is which layer it records at, because one of the most popular ones records at exactly the layer this section just told you to avoid.

VCR style cassettes, the legacy layer. The oldest approach in this family is VCR.py with the pytest-recording plugin, and it is worth being precise about where it sits, because it is the wire level recorder this section just warned you about. VCR.py works by monkeypatching the HTTP client libraries underneath your code, such as urllib3, aiohttp, and the standard library's http.client, and taping each HTTP request and response into a YAML "cassette" on first run, then replaying those bytes on every run after. You mark a test and forget about it:

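import pytest

@pytest.mark.vcr()  # pytest-recording: replay from the cassette stored for this test
def test_agent_response():
    # get_agent_response is your own function that calls the model over HTTP
    result = get_agent_response("Explain recursion in one sentence.")
    assert "recursion" in result.lower()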

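Run it once with pytest --record-mode=once and the test hits the real API and writes the cassette (pytest-recording defaults to the none record mode, which replays existing cassettes and refuses new network calls). Every run after reads from the cassette: no network, no cost, identical bytes. The one thing you must not skip: by default the cassette captures your Authorization header and API key in plaintext. Redact them in your config before anything touches version control: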

import pytest

@pytest.fixture(scope="module")
def vcr_config():
    # Replace the Authorization header value before it is written to the cassette
    return {"filter_headers": [("Authorization", "DUMMY_API_KEY")]}

That is genuinely useful for the narrow case VCR was built for: a single synchronous request to a simple REST shaped endpoint, where the wire bytes and the logical call are the same thing. It is also the layer that breaks on streaming, concurrent event loops, and HTTP/2 multiplexing, which describes most real agents. So treat VCR as the historical default, the thing teams reached for before agents grew orchestration layers, not as the place to record an agent loop.

The graph boundary equivalent is to let your framework hand you the state it already tracks, so you never touch a socket. LangGraph checkpointers persist the state at each node transition, so you can freeze the input that entered a node and the output it produced and replay that pair directly. LlamaIndex workflows expose the same idea through their event stream, every step's input and output as a structured object you can capture and feed back. And when you have rolled your own orchestration, the move is a mock at the provider seam, the one function your agent calls to reach the model, returning recorded structured outputs keyed by the canonicalized request. All three record the meaning of a step rather than the packets that carried it, which is the property that survives streaming and concurrency. That is the true graph boundary hydration the rest of this section is built on, and it is why the cleaner patterns below all mock above the wire, not on it.
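For the roll your own case, a minimal sketch of recording at the node boundary, assuming each node is a function from a JSON-serializable state dict to a new state dict. StepRecorder and plan_node are hypothetical names, and nothing here uses a LangGraph or LlamaIndex API.

```python
import hashlib
import json


class StepRecorder:
    """Record and replay node transitions of an agent loop (illustrative)."""

    def __init__(self, mode="record"):
        self.mode = mode          # "record" or "replay"
        self.tape = {}            # key -> recorded output state

    def _key(self, node, state):
        payload = json.dumps({"node": node, "state": state}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def step(self, node, fn, state):
        """Run one node. In replay mode the node function is never called."""
        key = self._key(node, state)
        if self.mode == "replay":
            return self.tape[key]     # KeyError means the input state changed
        output = fn(state)
        self.tape[key] = output
        return output


def plan_node(state):
    # Stand-in for a node that would call the model through your provider.
    return {**state, "tool": "cleanup_environment", "target": "staging"}


def must_not_run(state):
    raise AssertionError("replay should not call the node")


recorder = StepRecorder(mode="record")
first = recorder.step("plan", plan_node, {"request": "clean up staging"})

recorder.mode = "replay"
second = recorder.step("plan", must_not_run, {"request": "clean up staging"})
```

What gets stored is the state that entered the node and the state that left it, so streaming, retries, and connection reuse inside the node never reach the recording.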

Exact match fixture replay. A lighter weight pattern, shown by the llm-fixture-replay library, stores each request and response pair as one line of JSONL, keyed by a SHA-256 hash of the canonicalized request. Replay looks for an exact match. Change the model, the messages, or any parameter and it is a miss, which is exactly what you want, because a changed input should invalidate the recording. Auto mode replays on a hit and records on a miss, so a new test extends the fixture file automatically, and committing that file makes every later run fully offline.

The whole core is about ten lines. Hash the call arguments, look for that key in the fixture, replay on a hit, and call the real function only on a miss:

import hashlib
import json

def call(self, fn, **kwargs):
    # Key the recording on a stable hash of the call arguments
    key = hashlib.sha256(
        json.dumps(kwargs, sort_keys=True, default=str).encode()
    ).hexdigest()
    for entry in self._entries:
        if entry["key"] == key:
            return entry["response"]   # replay on hit
    response = fn(**kwargs)            # record on miss
    entry = {"key": key, "response": response}
    self._entries.append(entry)
    with self._path.open("a") as f:    # close the file so the line reaches disk
        f.write(json.dumps(entry, default=str) + "\n")
    return response

Sort the keys before hashing so the same request always lands on the same key. That one line is what makes the lookup stable across runs.

Sorting the keys is necessary but not sufficient, and this is where the pattern quietly fails on real agents. json.dumps(kwargs, default=str) is stable for a flat dict of strings. Point it at an agent state full of nested Pydantic models, datetimes, and system objects and default=str will happily serialize a timestamp, an object id, or a memory address that is different on every run, so the same logical request hashes to a new key each time and every lookup misses. The fix is semantic canonicalization before you hash: strip the transient metadata that has no business in the key, the timestamps, the trace ids, the run ids, stabilize whitespace, and recursively sort nested structures so two equivalent states produce one canonical form. Hash the meaning of the request, not its incidental wire encoding. Without that step your fixture grows a new entry every run and replays nothing.

Concretely, canonicalization is a recursive pass that normalizes the types you recognize and refuses the ones you do not, so an opaque object becomes a loud failure you fix rather than a silent memory address that poisons the key:

import hashlib
import json
from datetime import datetime

TRANSIENT = {"run_id", "timestamp", "trace_id", "created_at"}

class Unserializable(TypeError):
    """Raised for values that have no stable canonical form."""

def canonicalize(value):
    if isinstance(value, datetime):
        return value.isoformat()                 # stable string, not the clock object
    if isinstance(value, dict):
        return {k: canonicalize(v)
                for k, v in sorted(value.items())
                if k not in TRANSIENT}            # drop transient keys, then sort
    if isinstance(value, (list, tuple)):
        return [canonicalize(v) for v in value]
    if isinstance(value, (str, int, float, bool, type(None))):
        return value
    raise Unserializable(type(value))            # opaque object: fix it, never str() it

# kwargs: the call arguments, exactly as passed to call() in the previous snippet
key = hashlib.sha256(
    json.dumps(canonicalize(kwargs), sort_keys=True).encode()
).hexdigest()

The difference from default=str is the final line of canonicalize, the raise. default=str says yes to everything, including the object whose repr changes every run, so the instability slips into the key unnoticed. Canonicalization refuses what it cannot stabilize, and that refusal is what forces the key to stay constant across runs.

Zero config mocks. When you do not even want a real response, a mock library like pytest-mockllm gives you a fixture that returns whatever you tell it, with no API key and no setup:

def test_chatbot(mock_openai):
    mock_openai.add_response("I can help with your order.")
    response = my_chatbot.chat("I need help")
    assert "order" in response.lower()
    assert mock_openai.call_count == 1

The layering. Use mocks for unit tests, where you are testing your own control flow and the model's content is irrelevant. Use recorded fixtures for integration tests, where you want realistic model output but deterministic and free. Keep a small number of live tests, but run them on a schedule rather than on every commit, because they do a job the fixtures structurally cannot. Record and replay buys you a deterministic CI pipeline, but it also blinds that pipeline to the one failure that originates upstream: a provider silently changing the weights behind a stable alias. An exact match fixture sails through that change, because it never makes the call and faithfully replays yesterday's cached response, while production breaks the instant the real model shifts. A scheduled live canary, a handful of real calls run nightly against the pinned alias, is the only thing watching that seam, and it is also where a prompt change that silently degrades quality finally shows up. One more practical move from the Learnixo playbook: swap the real client for a mock behind an environment flag and a dependency injection point, so the same code path runs both ways and you flip it per environment.
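A small sketch of the environment flag plus injection point, assuming your code reaches the model through a single client object. The LLM_MODE variable name, the two client classes, and the canned response are hypothetical.

```python
import os


class RecordedClient:
    """Returns canned responses; used in CI and local runs."""

    def __init__(self, responses):
        self._responses = responses

    def complete(self, prompt):
        return self._responses[prompt]


class LiveClient:
    """Placeholder for the real provider client; wire your SDK call in here."""

    def complete(self, prompt):
        raise NotImplementedError("call the hosted model here")


def make_client(env=None):
    # LLM_MODE is a hypothetical flag name: "replay" (the default) or "live".
    env = os.environ if env is None else env
    if env.get("LLM_MODE", "replay") == "live":
        return LiveClient()
    return RecordedClient({"ping": "pong"})


def answer(prompt, client):
    # The code under test takes the client as a parameter (the injection point),
    # so the same path runs against either implementation.
    return client.complete(prompt).strip()


print(answer("ping", make_client({"LLM_MODE": "replay"})))
```

Defaulting to replay means a missing flag can never cause an accidental live call from CI; the nightly canary is the only job that sets the flag to live.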

A playbook you can apply on Monday

Pulling it all together into something actionable.

Stop chasing bitwise determinism through the API. Unless you own the inference stack down to the kernels, you cannot get it, and you would not want to pay its quality cost if you could.
Pin everything you can pin. Pin the model version explicitly rather than trusting a floating alias. Log it with every call so you know when it drifts under you.
Capture the full run, not just the prompt. Record the assembled prompt, the parameters, the sampled output, every tool input and output, every retrieved chunk, and the timestamp. The model is one of eight moving parts. Record all eight.
Replay for debugging, do not regenerate. When something breaks, reconstruct the frozen run from the recording. A fresh generation is a different run and tells you nothing about the one that failed.
Record at the graph boundary, and test in two layers. Capture the state transitions of the agent loop, not raw sockets, so streaming and concurrency cannot corrupt the recording. Then split your suite: exact match replays on the frozen context for control flow, and an LLM as a judge scoring semantic equivalence for anything whose wording is allowed to change. The first layer proves structure, the second verifies a fix without asking a nondeterministic model to repeat itself.
Keep generation time variation alive. Do not let the pursuit of reproducible tests push you into greedy decoding in production. Determinism belongs in replay, not in generation.

What is still unsolved

It would be dishonest to end on a tidy bow. Several hard parts remain open.

Batch invariant kernels exist but are not the default, and they cost throughput, so most hosted providers do not run them and you cannot make them. Recording the full context of an agent run is straightforward in principle and tedious in practice, and the more tools and retrieval your agent uses, the more surface there is to capture and the easier it is to miss one. Model version drift on hosted endpoints is largely outside your control, and a provider can change the weights under a stable name. And there is a genuine philosophical tension we did not resolve so much as relocate: the field is actively building systems whose value comes from exploring nondeterministically, while simultaneously needing those same systems to be auditable and reproducible. Those two goals pull in opposite directions, and the layered answer, vary in generation, freeze in replay, is the best current reconciliation, not a final one.

Back to 9:04

The ticket that opened this post is unanswerable in a world where you can only rerun the model and watch it behave. It becomes routine in a world where every run was recorded. The customer's run is right there, frozen, with the prompt and the tool calls and the retrieved context exactly as they were. You see the agent call the wrong tool, you see the input that led it there, and you write the fix.

You did not make the model deterministic. You never needed to. You made the run reproducible, which is the only thing you needed all along.

Sources

Holtzman, Buys, Du, Forbes, Choi. "The Curious Case of Neural Text Degeneration." ICLR 2020. arXiv:1904.09751.
Wang, Wei, Schuurmans, Le, Chi, Narang, Chowdhery, Zhou. "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR 2023. arXiv:2203.11171.
He and collaborators. "Defeating Nondeterminism in LLM Inference." Thinking Machines Lab, Sep 10 2025.
Zan. "Setting the temperature to zero will make an LLM deterministic?" Mar 24 2026.
Romera-Paredes and colleagues. "FunSearch." Nature, 2023.
AlphaEvolve white paper, DeepMind, 2025.
Georgiev, Gomez-Serrano, Tao, Wagner. "Mathematical exploration and discovery at scale." arXiv:2511.02864.

Spec-Driven Development: When Structure Helps and When It Becomes Tax

Tisha — Mon, 01 Jun 2026 12:01:02 +0000

Disclosure: I work at Microsoft. The views here are my own, and I've kept the tool comparisons evidence-based.

1. The Ambiguity Tax

Every vague requirement you hand an AI coding agent gets paid for later: in rework, in drift, in three files that each solved a slightly different version of the problem you never fully stated. I call this the ambiguity tax, the compounding cost of letting an automated loop run on under-specified intent. A human engineer fills gaps with judgment and a quick Slack message; an agent fills them with confident guesses and then builds on those guesses at machine speed. By the time you read the diff, the misunderstanding is load-bearing.

Spec-driven development (SDD) is, at its core, a strategy for paying this tax up front when it's cheap, instead of at review time when it's expensive. But there's a second tax most SDD advocates never mention, and it's the more interesting one.

2. First, Define the Artifact

Before the philosophy, the noun. A spec, in this context, is not a Word document handed down from a product manager. It's a versioned, reviewable artifact that carries engineering intent into the agent's context: a file (or set of files) that lives in the repo, moves through code review, and constrains what the agent generates. That's the whole shift. Intent moves out of ephemeral chat history and into something you can diff, comment on, and roll back.

3. What SDD Actually Means

Spec-driven development is the practice of making the spec, not the conversation, the primary unit of engineering work when collaborating with an AI agent. Instead of "prompt, code, fix, prompt again," you get "spec, plan, tasks, code, verify against spec." The artifact is the source of truth and the chat is just how you edit it. This sounds like a pure win. It isn't, which brings us to the tradeoff.

4. The Core Tradeoff

SDD lives between two failure modes. Too little structure produces the ambiguity tax: the agent guesses, drifts, and fragments. Too much structure produces what I'll call the Law of Surplus Structure: every extra rule consumes the agent's finite reasoning budget, whether or not it reduces uncertainty. The entire craft of SDD is finding the floor of that curve, enough structure to kill ambiguity, not so much that you're burning tokens to enforce ceremony. Hold that U-shape in your head; everything below is about locating its bottom.

The picture is the whole argument. Ambiguity cost falls fast as you add the first bits of structure, then flattens. Surplus-structure cost starts near zero and climbs as ceremony piles up. Total cost is their sum, and it bottoms out well before "maximum structure." Everything past that minimum is you paying to make the agent dumber.
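One way to make the shape concrete is a toy model. The functional forms and constants below are assumptions chosen only to match the description above (ambiguity cost decaying quickly, surplus-structure cost growing steadily); they are not measurements of any real workflow.

```python
import math


def ambiguity_cost(s, a=100.0, k=0.8):
    # Falls fast with the first units of structure, then flattens.
    return a * math.exp(-k * s)


def surplus_cost(s, c=4.0):
    # Starts at zero and climbs as ceremony piles up.
    return c * s


def total_cost(s):
    return ambiguity_cost(s) + surplus_cost(s)


levels = [i * 0.5 for i in range(21)]   # structure from 0 to 10, hypothetical units
best = min(levels, key=total_cost)

print(f"minimum at structure={best}, total cost={total_cost(best):.1f}")
print(f"no structure: total cost={total_cost(levels[0]):.1f}")
print(f"maximum structure: total cost={total_cost(levels[-1]):.1f}")
```

Under these made-up constants the minimum lands in the lower half of the range, and both extremes cost more than the middle, which is the U-shape in numbers.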

5. The Taxonomy: Three Levels of SDD

Birgitta Böckeler's framing is the cleanest I've found: SDD isn't one thing, it's three levels of commitment.

Level	What persists	Who edits what	The spec is…
Spec-first	Code. Spec is scaffolding.	You edit code after generation.	A starting prompt you discard.
Spec-anchored	Spec and code, kept in sync.	You edit both; spec is reviewed.	A durable contract.
Spec-as-source	Spec only. Code is a build output.	You edit only the spec.	The source of truth; code is compiled from it.

Most teams think they're doing spec-anchored. Most are actually doing spec-first with extra steps: they write a spec, generate from it, then never touch it again. That's fine, as long as you're honest that the spec was a prompt, not a contract.

6. The Canonical Lifecycle Loop

Strip away the tool branding and nearly every SDD workflow is the same six-stage loop.

Stage	Question it answers	Output
Explore	What exists? What's the terrain?	Shared understanding
Specify	What should be true when we're done?	The spec
Plan	How will we get there?	Technical approach
Tasks	What are the discrete steps?	Ordered work items
Implement	Build it.	Code
Verify	Does it match the spec?	Pass/fail + evidence

Tools differ mostly in which stages they automate, which they force you to do explicitly, and how much each artifact weighs.

7. The Ecosystem, Reframed by Architecture

Most SDD tool round-ups list features. More useful is to sort tools by which architectural layer they operate on, because that's what determines whether two tools compete or compose.

7.1 Intent Layer: "What should be true?"

These tools turn fuzzy requirements into reviewable artifacts.

Tool	Maintainer	Shape	Best for
Spec Kit	GitHub	Comprehensive, multi-file (spec/plan/tasks/contracts/constitution)	Greenfield, large teams, strict specs
OpenSpec	Fission AI	Lightweight, change-centric (~4 artifacts)	Brownfield, fast iteration
Kiro	AWS	Agentic IDE, multimodal input	AWS/Claude users
BMAD-METHOD	Community	Multi-agent, role-simulating	Enterprise-scale complexity

The headline contrast: Spec Kit optimizes for completeness, OpenSpec optimizes for review cost. Spec Kit generates roughly 800 lines where OpenSpec generates roughly 250 for the same change. Whether that completeness is an asset or a tax depends entirely on your codebase, which is the whole point of this post.

7.2 Execution Layer: "Build it, and check yourself."

These don't replace the spec; they govern how the agent acts on it. Superpowers uses guided Q&A to clarify intent, then runs sub-agents behind a verification-before-completion gate. GSD manages context in waves for solo developers. HVE Core runs an RPI loop: Research, Plan, Implement, Review.

7.3 Orchestration Layer: "Coordinate many agents."

Squad coordinates parallel agents. BMAD-METHOD simulates a full agile team of specialized agents.

The takeaway: Intent, Execution, and Orchestration tools compose. You can pair OpenSpec (intent) with Superpowers (execution). Picking "the best SDD tool" is the wrong question; picking one tool per layer is the right one.

8. The Decision Filter

Here's the part the methodology evangelists skip: you should not always write a spec. The signal isn't team size or "best practice," it's the cost of ambiguity for this specific change.

Signal	Spec earns its keep	Spec is just ceremony
Blast radius	Touches many modules / public APIs	One file, contained
Reversibility	Hard to undo (migrations, schemas)	Trivial to revert
Ambiguity	Requirements genuinely unclear	You already know the exact diff
Audience	Others must review/maintain	Throwaway or solo-spike
Repetition	Pattern you'll repeat 10×	One-off

If most of your signals sit in the right column, the spec is the tax. Write the code.

A composite from the kind of work this filter is built for (details anonymized; treat it as illustrative, not a case study): a payments service had a settlement module nobody wanted to touch, the original authors long gone, behavior documented only by the tests that happened to pass. The task was to add a new payout currency. Every signal sat in the left column: blast radius across a dozen call sites, an irreversible ledger migration, requirements that turned out to mean three different things depending on who you asked, and a change the on-call team would own for years. The first instinct was to let the agent loose on it. The right move was the opposite. An hour spent writing down what "settled" actually meant, in EARS form, surfaced two contradictions between the rounding rules and the reconciliation job before a single line changed. The spec didn't slow the work down; it caught the bug that would have shipped. That is the left column earning its keep. The same agent, pointed at a one-line config flag the week before, would have produced nothing but a longer paper trail.
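The filter is simple enough to write down as code. A minimal sketch, assuming each signal is answered yes or no and that "most" means a strict majority; the signal names and the threshold are my reading of the table, not a calibrated rule.

```python
SIGNALS = ("blast_radius", "irreversible", "ambiguous", "shared_audience", "repeated")


def should_write_spec(change):
    """Return True when most signals sit in the 'spec earns its keep' column.

    `change` maps each signal name to True (left column) or False (right column).
    """
    left = sum(bool(change[name]) for name in SIGNALS)
    return left > len(SIGNALS) / 2


# The payout currency change: every signal in the left column.
payout_currency = dict.fromkeys(SIGNALS, True)
# The one-line config flag: every signal in the right column.
config_flag = dict.fromkeys(SIGNALS, False)

print(should_write_spec(payout_currency))
print(should_write_spec(config_flag))
```

A boolean per signal is deliberately crude. The value is in forcing the five questions to be asked per change, not in the arithmetic.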

9. The Law of Surplus Structure

The claim, stated plainly: every artifact you add to an agent's context consumes reasoning budget, and if it doesn't reduce uncertainty, it's not governance, it's tax. This isn't a vibe; it's measurable from two independent directions.

Direction one, token cost. Jamie Telin ran OpenSpec against Spec Kit on the same task (streaming + session support for a chat app), twice, using GPT-5.2. The leaner framework won both times, and the gap was not small.

Measurement	OpenSpec	Spec Kit	Delta
Test 1, total tokens	~57,740	~120,947	+109%
Test 2, planning	38,117	96,298	+153%
Test 2, implementation	53,612	84,742	+58%
Test 2, total	91,729	181,040	+97%

More upfront structure nearly doubled total token usage without improving outcomes. OpenSpec also hit a higher success rate with roughly 20% fewer assistant turns and 25% fewer tool calls. (Source: Jamie Telin, "Spec Driven Development Is Wasting Tokens," Mar 2026.)
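The Delta column is the extra tokens Spec Kit used relative to OpenSpec. A small sketch that recomputes it from the counts in the table, treating the approximate Test 1 figures as exact; the function name is illustrative.

```python
measurements = {
    "Test 1, total": (57_740, 120_947),          # approximate counts from the table
    "Test 2, planning": (38_117, 96_298),
    "Test 2, implementation": (53_612, 84_742),
    "Test 2, total": (91_729, 181_040),
}


def overhead_pct(lean, heavy):
    """Extra tokens used by the heavier framework, as a percentage of the leaner one."""
    return (heavy - lean) / lean * 100


for name, (openspec, spec_kit) in measurements.items():
    print(f"{name}: +{overhead_pct(openspec, spec_kit):.1f}%")

# Sanity check: the Test 2 phases add up to the Test 2 totals.
assert 38_117 + 53_612 == 91_729
assert 96_298 + 84_742 == 181_040
```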

Direction two, a controlled study. A 2026 paper from ETH Zurich, Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? (Gloaguen, Mündler, Müller, Raychev, Vechev; arXiv, Feb 2026), tested the intuitive belief that handing an agent a structured repository overview helps it. They evaluated two settings: established SWE-bench tasks paired with LLM-generated context files written to the agent vendors' own recommendations, and a fresh collection of real-world issues drawn from repositories that already ship developer-written context files. The result cut against the intuition. Across multiple agents and models, context files reduced task success rates compared with giving the agent no repository context at all, while raising inference cost by over 20%.

Read that twice. Both the machine-written and the human-written files made outcomes worse on balance, not better, and they did it while costing more. The agents didn't ignore the files; they obeyed them, explored more broadly, ran more tests, traversed more files, and "thought" harder without producing better final patches. I call this failure mode the compliance loop trap: the agent spends its cognitive budget satisfying the structural guardrails instead of solving the problem, and the diligence is real but misdirected. The authors' own conclusion is the thesis of this entire post: unnecessary requirements from context files make tasks harder, and human-written context should describe only minimal requirements. Everything beyond that is surplus. This is the second tax I promised in Section 1: ambiguity is expensive, and so is its overcorrection.

10. Token Economics Is Architecture

If structure has a token price, then context budget is an architectural resource to be allocated, not spent reflexively. Treat it like memory in an embedded system.

Cost driver	Mitigation
Verbose, always-loaded specs	Load specs lazily, scoped to the task
Redundant restatement across artifacts	Single source of truth per fact; reference, don't repeat
Sub-agents rebuilding context	Pass distilled state, not full history
Multi-file divergence	State checkpoints: snapshot agreed truth before fan-out

The discipline: spend tokens where they reduce uncertainty, starve everything else.
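As a sketch of the first mitigation, lazy loading scoped to the task, assume each spec file carries a set of tags and a token count. The select_specs function, the tag scheme, and the numbers are hypothetical.

```python
def select_specs(specs, task_tags, budget_tokens):
    """Pick the specs relevant to a task, smallest first, within a token budget.

    specs: list of dicts with 'name', 'tags' (a set), and 'tokens' (an int).
    Returns (names to load, tokens used); everything else stays out of context.
    """
    relevant = [s for s in specs if s["tags"] & task_tags]
    chosen, used = [], 0
    for spec in sorted(relevant, key=lambda s: s["tokens"]):
        if used + spec["tokens"] > budget_tokens:
            break
        chosen.append(spec["name"])
        used += spec["tokens"]
    return chosen, used


specs = [
    {"name": "auth.md", "tags": {"auth", "sessions"}, "tokens": 1200},
    {"name": "payments.md", "tags": {"payments"}, "tokens": 3000},
    {"name": "sessions.md", "tags": {"sessions"}, "tokens": 800},
    {"name": "style.md", "tags": {"style"}, "tokens": 500},
]

print(select_specs(specs, {"sessions"}, budget_tokens=2500))
```

For a sessions task this loads two of the four files and leaves the payments and style specs out of context entirely.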

11. EARS: Making Natural Language Less Ambiguous

If you're going to write requirements, write them in a form that resists misreading. EARS (Easy Approach to Requirements Syntax), developed by Mavin et al. at Rolls-Royce and presented at the IEEE Requirements Engineering conference (RE'09), constrains prose into a small set of patterns, and it's been adopted at Airbus, Bosch, Dyson, Honeywell, Intel, NASA, Rolls-Royce, and Siemens. The template:

While <optional pre-condition>, when <optional trigger>, the <system name> shall <system response>.

Before, the kind of requirement an agent will happily misinterpret:

The system should handle expired tokens gracefully and clean up sessions,
making sure not to leak any sensitive data.

What's "gracefully"? Clean up when? Leak to where? Each gap is a guess waiting to happen.

After, EARS-structured and unambiguous:

WHEN an identity token expires,
THE SYSTEM SHALL invalidate the active session cache within 500ms.

IF cache eviction fails,
THEN THE SYSTEM SHALL retry up to 3 times,
log a structured JSON error with a correlation ID,
and SHALL NOT persist plain-text PII in telemetry.

Same intent, zero room for creative interpretation. Note that EARS adds words but removes uncertainty, which is exactly the trade the Law of Surplus Structure says is worth making. Structure that reduces ambiguity isn't tax; structure that merely decorates is.
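The before and after can be told apart mechanically. A rough sketch of a lint check, assuming only the WHEN and IF shapes used in the example and a short, hand-picked list of vague words; both the pattern and the word list are illustrative, not part of EARS itself.

```python
import re

# Covers the event-driven (WHEN) and unwanted-behaviour (IF ... THEN) shapes only.
EARS_SHAPE = re.compile(
    r"^(WHILE .+?, )?(WHEN|IF) .+?,\s*(THEN )?THE SYSTEM SHALL .+",
    re.IGNORECASE,
)
VAGUE_WORDS = {"gracefully", "properly", "appropriately", "quickly", "should"}


def lint_requirement(text):
    """Return a list of problems; an empty list means the requirement passes."""
    problems = []
    normalized = " ".join(text.split())   # collapse line breaks and extra spaces
    if not EARS_SHAPE.match(normalized):
        problems.append("not in WHEN/IF ... THE SYSTEM SHALL ... form")
    found = sorted(
        w for w in VAGUE_WORDS
        if re.search(rf"\b{w}\b", normalized, re.IGNORECASE)
    )
    if found:
        problems.append("vague words: " + ", ".join(found))
    return problems


before = ("The system should handle expired tokens gracefully and clean up sessions, "
          "making sure not to leak any sensitive data.")
after = ("WHEN an identity token expires, "
         "THE SYSTEM SHALL invalidate the active session cache within 500ms.")

print(lint_requirement(before))
print(lint_requirement(after))
```

A check like this catches the missing trigger and the weasel words, but it cannot tell whether 500ms is the right number. That judgment stays with the reviewer.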

12. The Reality Check

Six failure modes I've watched SDD run into. None is a reason to abandon it; each is a reason to apply the decision filter.

Review overload. A spec that generates 800 lines of artifacts moves the bottleneck from writing code to reviewing specs. You haven't removed work, you've relocated it. If spec review is slower than the code review it replaced, the spec is tax.

False control. A detailed spec feels like control, but the agent can satisfy every line and still produce something wrong, because the spec encoded your misunderstanding faithfully. Precision is not correctness.

Spec/code drift. In spec-anchored workflows, the spec and code diverge the moment someone edits code directly and skips the spec. Now you have two sources of truth and no way to know which is right. Drift turns a contract back into a stale comment.

The multi-file divergence trap. When an agent fans out across many files, each can drift toward a different interpretation. State checkpoints, snapshotting agreed truth before parallel work, are the only reliable defense.

Natural language bottoms out. Even EARS can't make "intuitive UX" machine-precise. Some intent is irreducibly fuzzy, and pretending otherwise just produces confident wrong answers.

Spec-as-source repeats old risks. "Edit only the spec, regenerate the code" is the dream, but it reinvents the problems of code generation: opaque output, debugging a thing you didn't write, and trusting a compiler you can't fully inspect.

13. Adoption Strategy

Don't roll out SDD as a mandate. Roll it out where the ambiguity tax is highest, prove it, then expand.

Phase	Focus	Goal
Weeks 1 to 2	Pick one high-blast-radius, high-ambiguity workstream	Feel where specs earn their keep
Weeks 3 to 4	Add EARS for the requirements that bite	Reduce misinterpretation, measure review time
Month 2	Introduce one execution-layer tool (e.g., a verification gate)	Catch spec/code drift automatically
Month 3	Codify your own decision filter	Make "spec or skip?" a team reflex, not a ritual

The goal isn't "we do SDD now." It's "we know exactly when SDD pays, and we skip it when it doesn't."

Closing

Spec-driven development is not a methodology you adopt wholesale. It's a cost-management strategy for the two taxes that bracket every AI-assisted change: the ambiguity tax on the left, the surplus-structure tax on the right. Good engineering is finding the bottom of that curve, per change, not per team. So the rule is simple, and it's the whole post in one line:

Spec it when ambiguity is expensive. Skip it when the code is cheaper than the ceremony.