<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tisha</title>
    <description>The latest articles on DEV Community by Tisha (@tisha).</description>
    <link>https://dev.to/tisha</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3960087%2F461d1521-802a-4dcb-b11e-7f2a7d88b7e6.png</url>
      <title>DEV Community: Tisha</title>
      <link>https://dev.to/tisha</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tisha"/>
    <language>en</language>
    <item>
      <title>Your Agent Failed in Prod. Good Luck Reproducing It.</title>
      <dc:creator>Tisha</dc:creator>
      <pubDate>Wed, 03 Jun 2026 10:16:43 +0000</pubDate>
      <link>https://dev.to/tisha/your-agent-failed-in-prod-good-luck-reproducing-it-56ci</link>
      <guid>https://dev.to/tisha/your-agent-failed-in-prod-good-luck-reproducing-it-56ci</guid>
      <description>&lt;h2&gt;
  
  
  9:04 a.m.
&lt;/h2&gt;

&lt;p&gt;A ticket lands. A customer ran your agent yesterday, it called the wrong tool, deleted the wrong record, and now there is a screenshot in your inbox with a red box drawn around the damage. You have the user ID. You have the timestamp. You copy the exact prompt out of the logs, paste it into the same model, with the same system prompt, and hit run.&lt;/p&gt;

&lt;p&gt;It works perfectly.&lt;/p&gt;

&lt;p&gt;You run it again. It works again. You run it ten more times. The agent behaves like a model employee every single time, and the one run that mattered, the one that cost a customer their data, is nowhere. You cannot make it happen again, which means you cannot debug it, which means you cannot promise it will not happen to the next customer.&lt;/p&gt;

&lt;p&gt;This is the reproducibility problem, and if you are shipping anything built on a large language model, it is already your problem. This post is about why it happens, why some of it is actually a feature you do not want to remove, and what you can do to get back the one thing you need: the ability to replay a run exactly as it happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  What "reproducible" even means here
&lt;/h2&gt;

&lt;p&gt;Most teams use the word to mean two different things and then argue past each other. Pull them apart and the whole topic gets clearer.&lt;/p&gt;

&lt;p&gt;The first meaning is &lt;strong&gt;bitwise determinism&lt;/strong&gt;: the same input always produces the identical output, token for token. This is what you assume you have with ordinary software and what you almost never have with an LLM.&lt;/p&gt;

&lt;p&gt;The second meaning is &lt;strong&gt;replayability&lt;/strong&gt;: given a run that already happened, you can reconstruct exactly what occurred, the inputs, the sampled outputs, the tool calls, the intermediate state, well enough to debug it. You do not need the model to be deterministic. You need the run to be recorded.&lt;/p&gt;

&lt;p&gt;The trap is chasing the first when you actually need the second. Teams spend weeks trying to force their model into bitwise determinism, fail, and conclude the system is unknowable. It is not. You were aiming at the wrong layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Temperature zero will not save you
&lt;/h2&gt;

&lt;p&gt;The first thing everyone tries is setting temperature to zero. The reasoning is clean. Temperature controls randomness in sampling. Set it to zero and the model must pick the single most probable next token every time, which is greedy decoding, which should be deterministic. One input, one output, forever.&lt;/p&gt;

&lt;p&gt;In theory, yes. In practice, run the same prompt twice at temperature zero and sooner or later the outputs diverge. It often starts with one word, the sentence takes a slightly different turn, and the rest drifts away from there. The reason is the distinction that fixes most of the confusion in this whole area, and it comes from Sara Zan's write up on the topic: &lt;strong&gt;sampling determinism is not the same thing as system determinism.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A quick piece of vocabulary, because it shows up everywhere from here on. Before the model emits a token, it produces a raw score for every candidate token in its vocabulary. Those scores are called &lt;strong&gt;logits&lt;/strong&gt;. Picking the token with the single highest logit is an operation called &lt;strong&gt;argmax&lt;/strong&gt;, literally "the argument that gives the maximum." Greedy decoding is just argmax at every step.&lt;/p&gt;

&lt;p&gt;So temperature zero makes the &lt;em&gt;selection rule&lt;/em&gt; deterministic. Always take the argmax. But it does nothing to guarantee that the logits you are taking the argmax over are identical from one run to the next. If two candidate tokens have logits that are almost tied, a difference in the last few bits is enough to swap which one wins, and once one token changes, every token after it is generated from a different prefix, so the divergence compounds.&lt;/p&gt;

&lt;p&gt;So the question becomes: why would the logits ever differ between two runs of the same model on the same input?&lt;/p&gt;




&lt;h2&gt;
  
  
  The original sin: floating point is not associative
&lt;/h2&gt;

&lt;p&gt;Here is the part that surprises people who have not stared at numerical code. With real numbers, addition is associative. &lt;code&gt;(a + b) + c&lt;/code&gt; equals &lt;code&gt;a + (b + c)&lt;/code&gt;. With floating point numbers it does not, because every intermediate result is rounded to finite precision. The canonical demonstration, from the Thinking Machines write up by Horace He and collaborators:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(0.1 + 1e20) - 1e20  =  0
0.1 + (1e20 - 1e20)  =  0.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same three numbers, different grouping, different answer. This is not a bug. It is the price floating point pays for representing both enormous and tiny values with a constant number of significant figures.&lt;/p&gt;

&lt;p&gt;Now scale that up. A transformer forward pass, one full run of the model over the input, is millions of additions, multiplications, and reductions across matrix multiplications, normalizations, and attention. Change the order in which any of those reductions accumulate and you change the last few bits of the result. Change the last few bits of a logit and you can change which token is the argmax. That is the chain from low level arithmetic all the way up to a different sentence.&lt;/p&gt;




&lt;h2&gt;
  
  
  The real culprit is not the one everyone names
&lt;/h2&gt;

&lt;p&gt;The common explanation stops at floating point plus concurrency. In one line: thousands of GPU threads finish in an order nobody controls, and because floating point addition is not associative, adding the same numbers in a different order gives a slightly different sum, so the output wobbles from run to run. It sounds complete. It is wrong, and the Thinking Machines analysis is the clearest debunking of it.&lt;/p&gt;

&lt;p&gt;Here is the inconvenient fact that breaks the popular story. Run the same matrix multiplication on the same GPU on the same data a thousand times and you get bitwise identical results every single time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;assert &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Floating point is in play. Massive concurrency is in play. And yet the result is perfectly reproducible. So concurrency plus floating point cannot be the whole answer.&lt;/p&gt;

&lt;p&gt;The true culprit is &lt;strong&gt;batch invariance&lt;/strong&gt;, or rather the lack of it. Production inference servers do not run your request alone. They batch it together with whatever other requests happen to arrive at the same moment, for efficiency. The kernels, the low level GPU routines that compute your output, run reductions inside normalization, matrix multiplication, and attention whose results depend on the &lt;em&gt;shape of the batch&lt;/em&gt; they ran in. The forward pass is deterministic for a fixed batch. But the batch is not fixed. It depends on concurrent load, on who else is hitting the server in the same millisecond, on conditions you do not control and cannot see.&lt;/p&gt;

&lt;p&gt;So your prompt is identical, your parameters are identical, and the thing that changed is the company you were keeping inside the server. This is also why a prompt looks rock solid in local testing and turns flaky in production. The model did not get more creative. The batching conditions changed.&lt;/p&gt;

&lt;p&gt;The Thinking Machines team showed both the scale of the problem and the fix. Running standard vLLM, a thousand identical prompts to Qwen-3-8B produced eighty distinct completions. With batch invariant kernels, the ones that produce the same result regardless of batch shape, the same thousand prompts produced exactly one. The cost was real but modest, one of their tests went from twenty six seconds to forty two. Their library, batch-invariant-ops, has since been picked up by SGLang. The three operations that have to be made batch invariant are RMSNorm (a normalization step), matrix multiplication, and attention.&lt;/p&gt;

&lt;p&gt;The lesson: true bitwise reproducibility is achievable, but only by controlling the entire inference stack down to the kernels. Almost no one calling a hosted API has that control.&lt;/p&gt;




&lt;h2&gt;
  
  
  Mixture of experts adds another door
&lt;/h2&gt;

&lt;p&gt;In one line: a mixture of experts model is one large network split into many smaller specialist subnetworks, with a router that sends each token to only a few of them instead of running the whole model every time. Many frontier models are built this way, and the architecture is a second independent source of the same problem. If that routing were per token and independent, it would be deterministic. It is not, and the reason is a number called the capacity factor.&lt;/p&gt;

&lt;p&gt;Each expert can only process so many tokens in a given batch. That ceiling is the capacity factor: a threshold on how many tokens one expert will accept before it is full. When too many tokens in a batch all want the same expert, the ones over the limit cannot all be served. The overflow tokens get bumped to their second choice expert, or dropped from that layer entirely. So whether your token reaches its first choice expert depends on how many other tokens in the same batch were competing for it.&lt;/p&gt;

&lt;p&gt;That is the same trap as batch invariance, wearing a different costume. The routing decision for your token is not a function of your token alone. It is a function of the whole batch your token landed in. As Vincent Schmalbach lays out, this makes a mixture of experts model deterministic at the batch level and nondeterministic at the level of a single sequence. Send the same prompt twice, get it batched with different neighbors, and the capacity math resolves differently, so your tokens route differently. Same root cause, a second mechanism delivering it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The full 360: everything that moves under you
&lt;/h2&gt;

&lt;p&gt;Sampling and kernels are only the inference layer, and they are just two of about eight things that moved under you between yesterday's run and today's. In a real agent they are often among the most stable. Here are the other six, and every one of them can change the output even if the model itself were frozen solid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The prompt is rarely fixed.&lt;/strong&gt; Interpolate the date, the user's name, a feature flag, or a sampled few shot example, and the "same" prompt is not the same prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The context is assembled at runtime.&lt;/strong&gt; Retrieval pulls from an index that updates continuously, so yesterday's chunks are not today's chunks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools return live data.&lt;/strong&gt; A weather call, a database read, a search API each return something different every time, and the model reasoned over a world state you did not capture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time leaks in.&lt;/strong&gt; "Schedule it for next Tuesday" resolves to a different date depending on when it runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The model version drifts.&lt;/strong&gt; The &lt;code&gt;gpt-4o&lt;/code&gt; or &lt;code&gt;claude&lt;/code&gt; you called last month may be a different set of weights this month, with no version bump you controlled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conversation history accumulates.&lt;/strong&gt; In a multi turn agent, earlier turns are part of the input, so if any one of them varied, every later turn inherits it.&lt;/p&gt;

&lt;p&gt;This is the part most reproducibility discussions miss by staring only at temperature. The sampler is one knob on a machine with eight. To reproduce a run you have to pin all eight, not just the one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wait: we actually want some of this
&lt;/h2&gt;

&lt;p&gt;Before going further, we have to be honest about something, because the obvious reaction to everything above is "fine, make it all deterministic and be done." Do not. If you could flip one switch and make your model perfectly deterministic, token for token, forever, you should not flip it. The nondeterminism that wrecks your reproducibility is the same property that makes the model good. The argument has four parts, and most teams only know the first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality: greedy decoding is not safe, it is broken.&lt;/strong&gt; The intuition is that always taking the single most probable token is the careful, conservative choice. It is not. Holtzman and colleagues showed in their 2020 nucleus sampling work that maximization based decoding, greedy and beam search, drives open ended generation into bland, repetitive, looping text that humans immediately recognize as machine written. Their conclusion was blunt: maximization is the wrong objective for open ended text generation. The fix is to sample, but only from the reliable head of the distribution, truncating the unreliable tail. That is nucleus sampling, the top-p knob, usually set around 0.95. The variation is not decoration. Switch it off and the prose collapses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The knobs, named.&lt;/strong&gt; When we say variation we mean three specific controls. Temperature reshapes the distribution before sampling: low values near 0.2 make it peaky and conservative, higher values near 0.8 to 1.0 flatten it and admit more surprising tokens. Top-k restricts sampling to the k most likely tokens (Fan and colleagues). Top-p, nucleus sampling, restricts it to the smallest set of tokens whose probability mass exceeds p (Holtzman and colleagues). These are the levers. Everything downstream is a consequence of how you set them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accuracy: sampling can make the model more correct, not less.&lt;/strong&gt; This is the part that converts skeptics, because it is a number rather than a preference. Self consistency (Wang and colleagues, ICLR 2023) throws away the single greedy answer entirely. It samples many diverse reasoning paths, around forty, at temperature 0.7 with top-k 40, then takes the majority vote over the final answers. The gains are large and consistent: plus 17.9 percent on GSM8K, plus 11.0 on SVAMP, plus 12.2 on AQuA, plus 6.4 on StrategyQA, plus 3.9 on ARC challenge. The mechanism is the one that makes random forests beat a single decision tree. Diverse samples, aggregated, beat one confident guess. Determinism would have handed you exactly one path, and a worse answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exploration: agents need to try things.&lt;/strong&gt; Anything that searches depends on variation. Best of N sampling generates many candidate completions and keeps the best under some scorer, and coverage, the chance that at least one of N samples is correct, climbs with N only because the samples differ. Agent loops that retry a failed tool call, propose alternative plans, or branch are running the same exploration versus exploitation tradeoff that reinforcement learning has always lived on. A perfectly deterministic agent retries the identical failing action forever. Variation is what lets it escape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Discovery: sampling has found things humans had not.&lt;/strong&gt; The strongest version of the argument is no longer hypothetical. DeepMind's FunSearch (Nature, 2023) paired a pretrained LLM with an automated evaluator in an evolutionary loop: sample candidate programs, keep the ones that score, mutate those, repeat. It solved the cap set problem in extremal combinatorics, a question Terence Tao had called a favorite open problem, producing the first new discovery by an LLM on a problem of that difficulty, in collaboration with Prof. Jordan Ellenberg. Its successor AlphaEvolve (2025) used an ensemble of Gemini models as mutation operators to evolve entire codebases, and the results shipped. A data center scheduling heuristic that recovers on average 0.7 percent of Google's worldwide compute and has run in production for over a year. A matrix multiplication kernel sped up 23 percent that cut Gemini's own training time by 1 percent. A procedure to multiply two four by four complex matrices in 48 scalar multiplications, the first improvement over Strassen's algorithm in that setting in 56 years. A later study with Tao and collaborators ran it across 67 problems in analysis, combinatorics, geometry, and number theory. None of that happens with the temperature pinned to zero. The diversity of the samples is the search.&lt;/p&gt;




&lt;h2&gt;
  
  
  The reconciliation
&lt;/h2&gt;

&lt;p&gt;So which do we want, variation or determinism? Both, and the reason they do not contradict is that they live at different layers.&lt;/p&gt;

&lt;p&gt;We want variation at &lt;strong&gt;generation&lt;/strong&gt; time, because that is where quality, accuracy, exploration, and discovery come from. We want determinism at &lt;strong&gt;replay&lt;/strong&gt; time, because that is where debugging, regression testing, and incident response come from.&lt;/p&gt;

&lt;p&gt;The mistake teams make is trying to buy reproducibility by killing generation time variation. That is the wrong layer, and it costs you everything in the four sections above. You do not freeze the model. You capture what it did. The inputs, the sampled outputs, the tool calls, the retrieved context, the model version, the timestamp. Then you replay the captured run, not a fresh generation. Keep the creativity. Record the evidence.&lt;/p&gt;




&lt;h2&gt;
  
  
  Record and replay
&lt;/h2&gt;

&lt;p&gt;The technique that resolves the whole tension is borrowed from a decades old idea in software testing: record the real interaction once, replay it forever. There are three distinct jobs it does, and it helps to keep them separate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post mortem debugging.&lt;/strong&gt; When the 9:04 ticket arrives, you do not re run the model and hope. You pull the recorded run: the exact assembled prompt, the exact sampled completion, the exact tool inputs and outputs, the retrieved chunks, the model version string. Now the bad run is in front of you, frozen, and you can actually trace what happened. This is the capability you were missing in the opening story.&lt;/p&gt;

&lt;p&gt;Concretely, the recording is one envelope per run. Capture it on the way out, so the agent writes its own black box recorder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;now_iso&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;resolved_model_version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# not the floating alias
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system_prompt_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# the assembled prompt
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieved_chunks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# name + args, as sent
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the ticket lands, that envelope is the whole crime scene, frozen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"run_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a3f9c1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"clean up the inactive accounts in staging"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"retrieved_chunks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"runbook_staging_cleanup"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool_calls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"delete_accounts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"production"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"status = inactive"&lt;/span&gt;&lt;span class="p"&gt;}}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"completion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Done, I cleared the inactive accounts."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now look at what the envelope rules out. The user said staging. The retrieved chunk &lt;code&gt;runbook_staging_cleanup&lt;/code&gt; is the correct runbook and it says staging. The assembled prompt is clean. And yet the tool call went to &lt;code&gt;production&lt;/code&gt;. Nothing in the context explains the swap, and that is the whole point. The retrieval was right and the prompt was right, so the failure did not live in your data pipeline. It lived in generation. Your request was batched with sixty four others that millisecond, two candidate tokens for that argument sat almost tied, one logit crossed its neighbor, and &lt;code&gt;production&lt;/code&gt; won where &lt;code&gt;staging&lt;/code&gt; should have. Replay the same prompt alone and it behaves, because the batch that tipped it is gone. The envelope is what lets you say that with confidence: the inputs were perfect, so stop grepping your retriever and go read the sampler. This is the failure the first half of this post was about, caught in the act.&lt;/p&gt;

&lt;p&gt;Capturing all of this in production is not free, and that is the honest tension. Recording every run means writing the assembled prompt, the retrieved chunks, every tool input and output, and the model version to durable storage on the hot path, which costs storage and adds a little latency to each request. Those payloads also carry whatever the user typed and whatever your retriever pulled, which in most enterprises means customer PII headed for durable storage, so a deterministic redaction pass has to scrub the envelope before it ever reaches the recorder, not after it lands. Open instrumentation standards like OpenInference, and tracing backends like Phoenix, exist to make this routine: they capture the spans of an agent run as structured telemetry and stream the payloads to a data store you can query later. The practical move is to record the full envelope for everything in production but down sample or expire it, keep every run for a few days so the 9:04 ticket is always answerable, and keep the interesting runs, the failures, the flagged ones, forever. The same envelopes you captured in production are what you replay in CI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Experience reuse.&lt;/strong&gt; A recorded run is also a cache. If the same inputs come around again, you can serve the recorded output instead of paying for another generation, which is faster and free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deterministic CI.&lt;/strong&gt; CI is the automated test suite that runs every time someone pushes code, and you want it deterministic, meaning the same code always gives the same pass or fail instead of flaking at random. This is where most teams adopt record and replay first, and the motivation is brutal and practical. The Learnixo write up names the three problems precisely. Cost: every real call to a hosted model during CI burns budget, and at fifty developers times ten pull requests times twenty tests, it adds up fast. Non determinism: a test that asserts the output equals an exact string fails most of the time, because the model does not return the same string twice. Latency: a real call takes two to ten seconds, so a suite with thirty of them takes minutes, which kills the fast feedback loop that makes CI worth having.&lt;/p&gt;

&lt;p&gt;Record and replay fixes all three at once. Record the real responses once, replay them on every subsequent run. Tests become free, deterministic, and fast.&lt;/p&gt;




&lt;h2&gt;
  
  
  The trap in the fix cycle
&lt;/h2&gt;

&lt;p&gt;There is a catch that a sharp reviewer will find in about ten seconds, so let us find it first. Record and replay is a superb post mortem tool. It is a bad fix verification tool, and the reason is the same nondeterminism we have been chasing the whole way down.&lt;/p&gt;

&lt;p&gt;Walk the loop. The 9:04 envelope tells you the agent emitted &lt;code&gt;production&lt;/code&gt; where it meant &lt;code&gt;staging&lt;/code&gt;. You write a fix: a tighter system prompt, a guard on the tool, a reworded instruction. Now you want to prove the fix works. But the moment you change the prompt, the input hash changes, so the recorded run no longer matches and your replay is a cache miss. A miss falls through to a live call, and a live call is back in the land of batching and logit flips, the exact thing you could not reproduce in the first place. Even with the input held byte for byte identical, regenerating re batches your request with whoever else is on the server, so the flip you are trying to squash may simply not fire today. You cannot confirm a fix by replaying the model, because replaying the model is not deterministic and a fix by definition changes the input.&lt;/p&gt;

&lt;p&gt;The way out is to stop asking record and replay to do a job it cannot, and to split testing into two layers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer one, exact match replay for control flow.&lt;/strong&gt; Freeze the captured context as a fixture and assert on structure, not prose. Given this exact prompt and these exact retrieved chunks, does the agent take the same path, call the same tool, with an argument of the right shape and the right &lt;code&gt;target&lt;/code&gt;? This layer is deterministic and free because it never calls the model. It catches the regression that matters most here: the guard you added must make the destructive &lt;code&gt;target&lt;/code&gt; impossible, and a frozen fixture proves it without a single live token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer two, semantic judgement for the parts that are allowed to change.&lt;/strong&gt; When the thing you changed is the wording of a prompt or the model version, bitwise equality is the wrong assertion, because the whole point of the change is that the text will differ. Here you run the candidate against the recorded context and score the output with an evaluator, an LLM as a judge, that asks whether the new answer means the same thing as the old one rather than whether it matches word for word: did the answer stay grounded in the chunk, did it refuse the destructive call, did it preserve the meaning of the gold response. The recorded envelope becomes the regression fixture, the judge accepts any output that means the right thing.&lt;/p&gt;

&lt;p&gt;That is the loop closed. The envelope you captured in production verifies structure deterministically and meaning semantically, and neither layer asks a nondeterministic system to repeat itself on command.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to actually do it in your test suite
&lt;/h2&gt;

&lt;p&gt;There is a small ecosystem for this in Python, and the right answer is a layered strategy rather than a single tool. But first, a warning about which layer you record at, because the obvious one is the wrong one.&lt;/p&gt;

&lt;p&gt;The instinct is to mock the network: intercept the HTTP call to the model and replay the bytes. For a single synchronous request that works. For a real agent it breaks, and it breaks in exactly the conditions you ship in. Token streaming with &lt;code&gt;stream=True&lt;/code&gt; turns one response into a long lived chunked transfer that network cassettes mangle. Concurrent &lt;code&gt;asyncio&lt;/code&gt; event loops interleave several model calls over the same connection. HTTP/2 multiplexing carries multiple requests down one socket at once. Record at the socket and you are trying to freeze a river.&lt;/p&gt;

&lt;p&gt;Record one level up instead. Mock at the framework or orchestrator boundary, the provider your agent calls through, and override the step function of the agent loop rather than the network underneath it. Call it deterministic graph state hydration: you are capturing the internal state transitions of the execution graph, the prompt that entered a node and the structured output that left it, not the raw packets in between. This is the difference any good review will probe, raw network payloads versus the internal state machine of the agent, and the agent state is the layer that actually replays cleanly. The tools below sit at different points on that spectrum, and the first thing to know about each is which layer it records at, because one of the most popular ones records at exactly the layer this section just told you to avoid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VCR style cassettes, the legacy layer.&lt;/strong&gt; The oldest approach in this family is VCR.py with the pytest-recording plugin, and it is worth being precise about where it sits, because it is the socket level recorder this section just warned you about. VCR.py works by monkeypatching Python's low level HTTP machinery, urllib3, aiohttp, the socket calls underneath your client, and taping the bytes that cross the wire into a YAML "cassette" on first run, then replaying those bytes on every run after. You mark a test and forget about it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.mark.vcr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_agent_response&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_agent_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain recursion in one sentence.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recursion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First run hits the real API and writes the cassette. Every run after reads from the cassette, no network, no cost, identical bytes. The one thing you must not skip: the default cassette captures your Authorization header and API key in plaintext. Redact them in your config before anything touches version control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;module&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;vcr_config&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filter_headers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DUMMY_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is genuinely useful for the narrow case VCR was built for: a single synchronous request to a simple REST shaped endpoint, where the wire bytes and the logical call are the same thing. It is also the layer that breaks on streaming, concurrent event loops, and HTTP/2 multiplexing, which describes most real agents. So treat VCR as the historical default, the thing teams reached for before agents grew orchestration layers, not as the place to record an agent loop.&lt;/p&gt;

&lt;p&gt;The graph boundary equivalent is to let your framework hand you the state it already tracks, so you never touch a socket. LangGraph checkpointers persist the state at each node transition, so you can freeze the input that entered a node and the output it produced and replay that pair directly. LlamaIndex workflows expose the same idea through their event stream, every step's input and output as a structured object you can capture and feed back. And when you have rolled your own orchestration, the move is a mock at the provider seam, the one function your agent calls to reach the model, returning recorded structured outputs keyed by the canonicalized request. All three record the meaning of a step rather than the packets that carried it, which is the property that survives streaming and concurrency. That is the true graph boundary hydration the rest of this section is built on, and it is why the cleaner patterns below all mock above the wire, not on it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exact match fixture replay.&lt;/strong&gt; A lighter weight pattern, shown by the llm-fixture-replay library, stores each request and response pair as one line of JSONL, keyed by a SHA-256 hash of the canonicalized request. Replay looks for an exact match. Change the model, the messages, or any parameter and it is a miss, which is exactly what you want, because a changed input &lt;em&gt;should&lt;/em&gt; invalidate the recording. Auto mode replays on a hit and records on a miss, so a new test extends the fixture file automatically, and committing that file makes every later run fully offline.&lt;/p&gt;

&lt;p&gt;The whole core is about ten lines. Hash the call arguments, look for that key in the fixture, replay on a hit, and call the real function only on a miss:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_entries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# replay on hit
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="c1"&gt;# record on miss
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_entries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sort the keys before hashing so the same request always lands on the same key. That one line is what makes the lookup stable across runs.&lt;/p&gt;

&lt;p&gt;Sorting the keys is necessary but not sufficient, and this is where the pattern quietly fails on real agents. &lt;code&gt;json.dumps(kwargs, default=str)&lt;/code&gt; is stable for a flat dict of strings. Point it at an agent state full of nested Pydantic models, datetimes, and system objects and &lt;code&gt;default=str&lt;/code&gt; will happily serialize a timestamp, an object id, or a memory address that is different on every run, so the same logical request hashes to a new key each time and every lookup misses. The fix is semantic canonicalization before you hash: strip the transient metadata that has no business in the key, the timestamps, the trace ids, the run ids, stabilize whitespace, and recursively sort nested structures so two equivalent states produce one canonical form. Hash the meaning of the request, not its incidental wire encoding. Without that step your fixture grows a new entry every run and replays nothing.&lt;/p&gt;

&lt;p&gt;Concretely, canonicalization is a recursive pass that normalizes the types you recognize and refuses the ones you do not, so an opaque object becomes a loud failure you fix rather than a silent memory address that poisons the key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;TRANSIENT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trace_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;canonicalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;                 &lt;span class="c1"&gt;# stable string, not the clock object
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;canonicalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TRANSIENT&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;            &lt;span class="c1"&gt;# drop transient keys, then sort
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;canonicalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;))):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Unserializable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;            &lt;span class="c1"&gt;# opaque object: fix it, never str() it
&lt;/span&gt;
&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;canonicalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference from &lt;code&gt;default=str&lt;/code&gt; is the final line. &lt;code&gt;default=str&lt;/code&gt; says yes to everything, including the object whose repr changes every run, so the instability slips into the key unnoticed. Canonicalization refuses what it cannot stabilize, and that refusal is what forces the key to stay constant across runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero config mocks.&lt;/strong&gt; When you do not even want a real response, a mock library like pytest-mockllm gives you a fixture that returns whatever you tell it, with no API key and no setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mock_openai&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;mock_openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I can help with your order.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;my_chatbot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I need help&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;mock_openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call_count&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The layering.&lt;/strong&gt; Use mocks for unit tests, where you are testing your own control flow and the model's content is irrelevant. Use recorded fixtures for integration tests, where you want realistic model output but deterministic and free. Keep a small number of live tests, but run them on a schedule rather than on every commit, because they do a job the fixtures structurally cannot. Record and replay buys you a deterministic CI pipeline, but it also blinds that pipeline to the one failure that originates upstream: a provider silently changing the weights behind a stable alias. An exact match fixture sails through that change, because it never makes the call and faithfully replays yesterday's cached response, while production breaks the instant the real model shifts. A scheduled live canary, a handful of real calls run nightly against the pinned alias, is the only thing watching that seam, and it is also where a prompt change that silently degrades quality finally shows up. One more practical move from the Learnixo playbook: swap the real client for a mock behind an environment flag and a dependency injection point, so the same code path runs both ways and you flip it per environment.&lt;/p&gt;




&lt;h2&gt;
  
  
  A playbook you can apply on Monday
&lt;/h2&gt;

&lt;p&gt;Pulling it all together into something actionable.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stop chasing bitwise determinism through the API.&lt;/strong&gt; Unless you own the inference stack down to the kernels, you cannot get it, and you would not want to pay its quality cost if you could.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pin everything you can pin.&lt;/strong&gt; Pin the model version explicitly rather than trusting a floating alias. Log it with every call so you know when it drifts under you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capture the full run, not just the prompt.&lt;/strong&gt; Record the assembled prompt, the parameters, the sampled output, every tool input and output, every retrieved chunk, and the timestamp. The model is one of eight moving parts. Record all eight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay for debugging, do not regenerate.&lt;/strong&gt; When something breaks, reconstruct the frozen run from the recording. A fresh generation is a different run and tells you nothing about the one that failed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Record at the graph boundary, and test in two layers.&lt;/strong&gt; Capture the state transitions of the agent loop, not raw sockets, so streaming and concurrency cannot corrupt the recording. Then split your suite: exact match replays on the frozen context for control flow, and an LLM as a judge scoring semantic equivalence for anything whose wording is allowed to change. The first layer proves structure, the second verifies a fix without asking a nondeterministic model to repeat itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep generation time variation alive.&lt;/strong&gt; Do not let the pursuit of reproducible tests push you into greedy decoding in production. Determinism belongs in replay, not in generation.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What is still unsolved
&lt;/h2&gt;

&lt;p&gt;It would be dishonest to end on a tidy bow. Several hard parts remain open.&lt;/p&gt;

&lt;p&gt;Batch invariant kernels exist but are not the default, and they cost throughput, so most hosted providers do not run them and you cannot make them. Recording the full context of an agent run is straightforward in principle and tedious in practice, and the more tools and retrieval your agent uses, the more surface there is to capture and the easier it is to miss one. Model version drift on hosted endpoints is largely outside your control, and a provider can change the weights under a stable name. And there is a genuine philosophical tension we did not resolve so much as relocate: the field is actively building systems whose value &lt;em&gt;comes from&lt;/em&gt; exploring nondeterministically, while simultaneously needing those same systems to be auditable and reproducible. Those two goals pull in opposite directions, and the layered answer, vary in generation, freeze in replay, is the best current reconciliation, not a final one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Back to 9:04
&lt;/h2&gt;

&lt;p&gt;The ticket that opened this post is unanswerable in a world where you can only re run the model and watch it behave. It becomes routine in a world where every run was recorded. The customer's run is right there, frozen, with the prompt and the tool calls and the retrieved context exactly as they were. You see the agent call the wrong tool, you see the input that led it there, and you write the fix.&lt;/p&gt;

&lt;p&gt;You did not make the model deterministic. You never needed to. You made the run reproducible, which is the only thing you needed all along.&lt;/p&gt;




&lt;h3&gt;
  
  
  Sources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Holtzman, Buys, Du, Forbes, Choi. "The Curious Case of Neural Text Degeneration." ICLR 2020. arXiv:1904.09751.&lt;/li&gt;
&lt;li&gt;Wang, Wei, Schuurmans, Le, Chi, Narang, Chowdhery, Zhou. "Self-Consistency Improves Chain of Thought Reasoning." ICLR 2023. arXiv:2203.11171.&lt;/li&gt;
&lt;li&gt;He and collaborators. "Defeating Nondeterminism in LLM Inference." Thinking Machines Lab, Sep 10 2025.&lt;/li&gt;
&lt;li&gt;Zan. "Setting the temperature to zero will make an LLM deterministic?" Mar 24 2026.&lt;/li&gt;
&lt;li&gt;Romera-Paredes and colleagues. "FunSearch." Nature, 2023.&lt;/li&gt;
&lt;li&gt;AlphaEvolve white paper, DeepMind, 2025.&lt;/li&gt;
&lt;li&gt;Georgiev, Gomez-Serrano, Tao, Wagner. "Mathematical exploration and discovery at scale." arXiv:2511.02864.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>testing</category>
    </item>
    <item>
      <title>Spec-Driven Development: When Structure Helps and When It Becomes Tax</title>
      <dc:creator>Tisha</dc:creator>
      <pubDate>Mon, 01 Jun 2026 12:01:02 +0000</pubDate>
      <link>https://dev.to/tisha/spec-driven-development-when-structure-helps-and-when-it-becomes-tax-1f66</link>
      <guid>https://dev.to/tisha/spec-driven-development-when-structure-helps-and-when-it-becomes-tax-1f66</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclosure:&lt;/strong&gt; I work at Microsoft. The views here are my own, and I've kept the tool comparisons evidence-based.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. The Ambiguity Tax
&lt;/h2&gt;

&lt;p&gt;Every vague requirement you hand an AI coding agent gets paid for later: in rework, in drift, in three files that each solved a slightly different version of the problem you never fully stated. I call this the &lt;strong&gt;ambiguity tax&lt;/strong&gt;, the compounding cost of letting an automated loop run on under-specified intent. A human engineer fills gaps with judgment and a quick Slack message; an agent fills them with confident guesses and then builds on those guesses at machine speed. By the time you read the diff, the misunderstanding is load-bearing.&lt;/p&gt;

&lt;p&gt;Spec-driven development (SDD) is, at its core, a strategy for paying this tax up front when it's cheap, instead of at review time when it's expensive. But there's a second tax most SDD advocates never mention, and it's the more interesting one.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. First, Define the Artifact
&lt;/h2&gt;

&lt;p&gt;Before the philosophy, the noun. A &lt;strong&gt;spec&lt;/strong&gt;, in this context, is not a Word document handed down from a product manager. It's a &lt;strong&gt;versioned, reviewable artifact that carries engineering intent into the agent's context&lt;/strong&gt;: a file (or set of files) that lives in the repo, moves through code review, and constrains what the agent generates. That's the whole shift. Intent moves out of ephemeral chat history and into something you can diff, comment on, and roll back.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. What SDD Actually Means
&lt;/h2&gt;

&lt;p&gt;Spec-driven development is the practice of making the spec, not the conversation, the primary unit of engineering work when collaborating with an AI agent. Instead of "prompt, code, fix, prompt again," you get "spec, plan, tasks, code, verify against spec." The artifact is the source of truth and the chat is just how you edit it. This sounds like a pure win. It isn't, which brings us to the tradeoff.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. The Core Tradeoff
&lt;/h2&gt;

&lt;p&gt;SDD lives between two failure modes. Too little structure produces the ambiguity tax: the agent guesses, drifts, and fragments. Too much structure produces what I'll call the &lt;strong&gt;Law of Surplus Structure&lt;/strong&gt;: every extra rule consumes the agent's finite reasoning budget, whether or not it reduces uncertainty. The entire craft of SDD is finding the floor of that curve, enough structure to kill ambiguity, not so much that you're burning tokens to enforce ceremony. Hold that U-shape in your head; everything below is about locating its bottom.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5u75eg0nb1yc9nk7hn8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5u75eg0nb1yc9nk7hn8i.png" alt="The cost of structure is U-shaped: ambiguity cost falls as you add structure, surplus-structure cost rises, and total cost bottoms out at a sweet spot in between." width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The picture is the whole argument. Ambiguity cost falls fast as you add the first bits of structure, then flattens. Surplus-structure cost starts near zero and climbs as ceremony piles up. Total cost is their sum, and it bottoms out well before "maximum structure." Everything past that minimum is you paying to make the agent dumber.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. The Taxonomy: Three Levels of SDD
&lt;/h2&gt;

&lt;p&gt;Birgitta Böckeler's framing is the cleanest I've found: SDD isn't one thing, it's three levels of commitment.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;What persists&lt;/th&gt;
&lt;th&gt;Who edits what&lt;/th&gt;
&lt;th&gt;The spec is…&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spec-first&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Code. Spec is scaffolding.&lt;/td&gt;
&lt;td&gt;You edit code after generation.&lt;/td&gt;
&lt;td&gt;A starting prompt you discard.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spec-anchored&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Spec &lt;strong&gt;and&lt;/strong&gt; code, kept in sync.&lt;/td&gt;
&lt;td&gt;You edit both; spec is reviewed.&lt;/td&gt;
&lt;td&gt;A durable contract.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spec-as-source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Spec only. Code is a build output.&lt;/td&gt;
&lt;td&gt;You edit &lt;em&gt;only&lt;/em&gt; the spec.&lt;/td&gt;
&lt;td&gt;The source of truth; code is compiled from it.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most teams think they're doing spec-anchored. Most are actually doing spec-first with extra steps: they write a spec, generate from it, then never touch it again. That's fine, as long as you're honest that the spec was a prompt, not a contract.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. The Canonical Lifecycle Loop
&lt;/h2&gt;

&lt;p&gt;Strip away the tool branding and nearly every SDD workflow is the same six-stage loop.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Question it answers&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Explore&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What exists? What's the terrain?&lt;/td&gt;
&lt;td&gt;Shared understanding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Specify&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What should be true when we're done?&lt;/td&gt;
&lt;td&gt;The spec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Plan&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How will we get there?&lt;/td&gt;
&lt;td&gt;Technical approach&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tasks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What are the discrete steps?&lt;/td&gt;
&lt;td&gt;Ordered work items&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Implement&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Build it.&lt;/td&gt;
&lt;td&gt;Code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Verify&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Does it match the spec?&lt;/td&gt;
&lt;td&gt;Pass/fail + evidence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Tools differ mostly in which stages they automate, which they force you to do explicitly, and how much each artifact weighs.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. The Ecosystem, Reframed by Architecture
&lt;/h2&gt;

&lt;p&gt;Most SDD tool round-ups list features. More useful is to sort tools by which architectural layer they operate on, because that's what determines whether two tools compete or compose.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.1 Intent Layer: "What should be true?"
&lt;/h3&gt;

&lt;p&gt;These tools turn fuzzy requirements into reviewable artifacts.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Maintainer&lt;/th&gt;
&lt;th&gt;Shape&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/github/spec-kit" rel="noopener noreferrer"&gt;&lt;strong&gt;Spec Kit&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;GitHub&lt;/td&gt;
&lt;td&gt;Comprehensive, multi-file (spec/plan/tasks/contracts/constitution)&lt;/td&gt;
&lt;td&gt;Greenfield, large teams, strict specs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/Fission-AI/OpenSpec" rel="noopener noreferrer"&gt;&lt;strong&gt;OpenSpec&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Fission AI&lt;/td&gt;
&lt;td&gt;Lightweight, change-centric (~4 artifacts)&lt;/td&gt;
&lt;td&gt;Brownfield, fast iteration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://kiro.dev" rel="noopener noreferrer"&gt;&lt;strong&gt;Kiro&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;td&gt;Agentic IDE, multimodal input&lt;/td&gt;
&lt;td&gt;AWS/Claude users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/bmad-code-org/BMAD-METHOD" rel="noopener noreferrer"&gt;&lt;strong&gt;BMAD-METHOD&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Community&lt;/td&gt;
&lt;td&gt;Multi-agent, role-simulating&lt;/td&gt;
&lt;td&gt;Enterprise-scale complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The headline contrast: Spec Kit optimizes for completeness, OpenSpec optimizes for review cost. Spec Kit generates roughly 800 lines where OpenSpec generates roughly 250 for the same change. Whether that completeness is an asset or a tax depends entirely on your codebase, which is the whole point of this post.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.2 Execution Layer: "Build it, and check yourself."
&lt;/h3&gt;

&lt;p&gt;These don't replace the spec; they govern how the agent acts on it. &lt;a href="https://github.com/obra/superpowers" rel="noopener noreferrer"&gt;&lt;strong&gt;Superpowers&lt;/strong&gt;&lt;/a&gt; uses guided Q&amp;amp;A to clarify intent, then runs sub-agents behind a verification-before-completion gate. &lt;a href="https://github.com/gsd-build/get-shit-done" rel="noopener noreferrer"&gt;&lt;strong&gt;GSD&lt;/strong&gt;&lt;/a&gt; manages context in waves for solo developers. &lt;a href="https://microsoft.github.io/hve-core/" rel="noopener noreferrer"&gt;&lt;strong&gt;HVE Core&lt;/strong&gt;&lt;/a&gt; runs an RPI loop: Research, Plan, Implement, Review.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.3 Orchestration Layer: "Coordinate many agents."
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/bradygaster/squad" rel="noopener noreferrer"&gt;&lt;strong&gt;Squad&lt;/strong&gt;&lt;/a&gt; coordinates parallel agents. &lt;a href="https://github.com/bmad-code-org/BMAD-METHOD" rel="noopener noreferrer"&gt;&lt;strong&gt;BMAD-METHOD&lt;/strong&gt;&lt;/a&gt; simulates a full agile team of specialized agents.&lt;/p&gt;

&lt;p&gt;The takeaway: Intent, Execution, and Orchestration tools compose. You can pair OpenSpec (intent) with Superpowers (execution). Picking "the best SDD tool" is the wrong question; picking one tool per layer is the right one.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. The Decision Filter
&lt;/h2&gt;

&lt;p&gt;Here's the part the methodology evangelists skip: you should not always write a spec. The signal isn't team size or "best practice," it's the cost of ambiguity for this specific change.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Spec earns its keep&lt;/th&gt;
&lt;th&gt;Spec is just ceremony&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Blast radius&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Touches many modules / public APIs&lt;/td&gt;
&lt;td&gt;One file, contained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reversibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hard to undo (migrations, schemas)&lt;/td&gt;
&lt;td&gt;Trivial to revert&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ambiguity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requirements genuinely unclear&lt;/td&gt;
&lt;td&gt;You already know the exact diff&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audience&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Others must review/maintain&lt;/td&gt;
&lt;td&gt;Throwaway or solo-spike&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Repetition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pattern you'll repeat 10×&lt;/td&gt;
&lt;td&gt;One-off&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If most of your signals sit in the right column, the spec &lt;em&gt;is&lt;/em&gt; the tax. Write the code.&lt;/p&gt;

&lt;p&gt;A composite from the kind of work this filter is built for (details anonymized; treat it as illustrative, not a case study): a payments service had a settlement module nobody wanted to touch, the original authors long gone, behavior documented only by the tests that happened to pass. The task was to add a new payout currency. Every signal sat in the left column: blast radius across a dozen call sites, an irreversible ledger migration, requirements that turned out to mean three different things depending on who you asked, and a change the on-call team would own for years. The first instinct was to let the agent loose on it. The right move was the opposite. An hour spent writing down what "settled" actually meant, in EARS form, surfaced two contradictions between the rounding rules and the reconciliation job before a single line changed. The spec didn't slow the work down; it caught the bug that would have shipped. That is the left column earning its keep. The same agent, pointed at a one-line config flag the week before, would have produced nothing but a longer paper trail.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. The Law of Surplus Structure
&lt;/h2&gt;

&lt;p&gt;The claim, stated plainly: every artifact you add to an agent's context consumes reasoning budget, and if it doesn't reduce uncertainty, it's not governance, it's tax. This isn't a vibe; it's measurable from two independent directions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direction one, token cost.&lt;/strong&gt; Jamie Telin ran OpenSpec against Spec Kit on the same task (streaming + session support for a chat app), twice, using GPT-5.2. The leaner framework won both times, and the gap was not small.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Measurement&lt;/th&gt;
&lt;th&gt;OpenSpec&lt;/th&gt;
&lt;th&gt;Spec Kit&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Test 1, total tokens&lt;/td&gt;
&lt;td&gt;~57,740&lt;/td&gt;
&lt;td&gt;~120,947&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+109%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test 2, planning&lt;/td&gt;
&lt;td&gt;38,117&lt;/td&gt;
&lt;td&gt;96,298&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+152%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test 2, implementation&lt;/td&gt;
&lt;td&gt;53,612&lt;/td&gt;
&lt;td&gt;84,742&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+58%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test 2, total&lt;/td&gt;
&lt;td&gt;91,729&lt;/td&gt;
&lt;td&gt;181,040&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+97%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;More upfront structure nearly doubled total token usage without improving outcomes. OpenSpec also hit a higher success rate with roughly 20% fewer assistant turns and 25% fewer tool calls. &lt;em&gt;(Source: Jamie Telin, "Spec Driven Development Is Wasting Tokens," Mar 2026.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direction two, a controlled study.&lt;/strong&gt; A 2026 paper from ETH Zurich, &lt;em&gt;Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?&lt;/em&gt; (Gloaguen, Mündler, Müller, Raychev, Vechev; arXiv, Feb 2026), tested the intuitive belief that handing an agent a structured repository overview helps it. They evaluated two settings: established SWE-bench tasks paired with LLM-generated context files written to the agent vendors' own recommendations, and a fresh collection of real-world issues drawn from repositories that already ship developer-written context files. The result cut against the intuition. Across multiple agents and models, context files &lt;em&gt;reduced&lt;/em&gt; task success rates compared with giving the agent no repository context at all, while raising inference cost by over 20%.&lt;/p&gt;

&lt;p&gt;Read that twice. Both the machine-written and the human-written files made outcomes worse on balance, not better, and they did it while costing more. The agents didn't ignore the files; they obeyed them, explored more broadly, ran more tests, traversed more files, and "thought" harder without producing better final patches. I call this failure mode the &lt;strong&gt;compliance loop trap&lt;/strong&gt;: the agent spends its cognitive budget satisfying the structural guardrails instead of solving the problem, and the diligence is real but misdirected. The authors' own conclusion is the thesis of this entire post: unnecessary requirements from context files make tasks harder, and human-written context should describe only minimal requirements. Everything beyond that is surplus. This is the second tax I promised in Section 1: ambiguity is expensive, and so is its overcorrection.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Token Economics Is Architecture
&lt;/h2&gt;

&lt;p&gt;If structure has a token price, then context budget is an architectural resource to be allocated, not spent reflexively. Treat it like memory in an embedded system.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cost driver&lt;/th&gt;
&lt;th&gt;Mitigation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Verbose, always-loaded specs&lt;/td&gt;
&lt;td&gt;Load specs lazily, scoped to the task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redundant restatement across artifacts&lt;/td&gt;
&lt;td&gt;Single source of truth per fact; reference, don't repeat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sub-agents rebuilding context&lt;/td&gt;
&lt;td&gt;Pass distilled state, not full history&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-file divergence&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;State checkpoints&lt;/strong&gt;: snapshot agreed truth before fan-out&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The discipline: spend tokens where they reduce uncertainty, starve everything else.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. EARS: Making Natural Language Less Ambiguous
&lt;/h2&gt;

&lt;p&gt;If you're going to write requirements, write them in a form that resists misreading. &lt;strong&gt;EARS&lt;/strong&gt; (Easy Approach to Requirements Syntax), developed by Mavin et al. at Rolls-Royce and presented at the IEEE Requirements Engineering conference (RE'09), constrains prose into a small set of patterns, and it's been adopted at Airbus, Bosch, Dyson, Honeywell, Intel, NASA, Rolls-Royce, and Siemens. The template:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;While&lt;/strong&gt; &lt;code&gt;&amp;lt;optional pre-condition&amp;gt;&lt;/code&gt;, &lt;strong&gt;when&lt;/strong&gt; &lt;code&gt;&amp;lt;optional trigger&amp;gt;&lt;/code&gt;, the &lt;code&gt;&amp;lt;system name&amp;gt;&lt;/code&gt; &lt;strong&gt;shall&lt;/strong&gt; &lt;code&gt;&amp;lt;system response&amp;gt;&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;, the kind of requirement an agent will happily misinterpret:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The system should handle expired tokens gracefully and clean up sessions,
making sure not to leak any sensitive data.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What's "gracefully"? Clean up when? Leak to where? Each gap is a guess waiting to happen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt;, EARS-structured and unambiguous:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WHEN an identity token expires,
THE SYSTEM SHALL invalidate the active session cache within 500ms.

IF cache eviction fails,
THEN THE SYSTEM SHALL retry up to 3 times,
log a structured JSON error with a correlation ID,
and SHALL NOT persist plain-text PII in telemetry.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same intent, zero room for creative interpretation. Note that EARS &lt;em&gt;adds&lt;/em&gt; words but &lt;em&gt;removes&lt;/em&gt; uncertainty, which is exactly the trade the Law of Surplus Structure says is worth making. Structure that reduces ambiguity isn't tax; structure that merely decorates is.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. The Reality Check
&lt;/h2&gt;

&lt;p&gt;Six failure modes I've watched SDD run into. None is a reason to abandon it; each is a reason to apply the decision filter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review overload.&lt;/strong&gt; A spec that generates 800 lines of artifacts moves the bottleneck from writing code to reviewing specs. You haven't removed work, you've relocated it. If spec review is slower than the code review it replaced, the spec is tax.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False control.&lt;/strong&gt; A detailed spec &lt;em&gt;feels&lt;/em&gt; like control, but the agent can satisfy every line and still produce something wrong, because the spec encoded your misunderstanding faithfully. Precision is not correctness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spec/code drift.&lt;/strong&gt; In spec-anchored workflows, the spec and code diverge the moment someone edits code directly and skips the spec. Now you have two sources of truth and no way to know which is right. Drift turns a contract back into a stale comment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The multi-file divergence trap.&lt;/strong&gt; When an agent fans out across many files, each can drift toward a different interpretation. State checkpoints, snapshotting agreed truth before parallel work, are the only reliable defense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Natural language bottoms out.&lt;/strong&gt; Even EARS can't make "intuitive UX" machine-precise. Some intent is irreducibly fuzzy, and pretending otherwise just produces confident wrong answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spec-as-source repeats old risks.&lt;/strong&gt; "Edit only the spec, regenerate the code" is the dream, but it reinvents the problems of code generation: opaque output, debugging a thing you didn't write, and trusting a compiler you can't fully inspect.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. Adoption Strategy
&lt;/h2&gt;

&lt;p&gt;Don't roll out SDD as a mandate. Roll it out where the ambiguity tax is highest, prove it, then expand.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Focus&lt;/th&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weeks 1 to 2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pick one high-blast-radius, high-ambiguity workstream&lt;/td&gt;
&lt;td&gt;Feel where specs &lt;em&gt;earn&lt;/em&gt; their keep&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weeks 3 to 4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Add EARS for the requirements that bite&lt;/td&gt;
&lt;td&gt;Reduce misinterpretation, measure review time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Month 2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Introduce one execution-layer tool (e.g., a verification gate)&lt;/td&gt;
&lt;td&gt;Catch spec/code drift automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Month 3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Codify your &lt;em&gt;own&lt;/em&gt; decision filter&lt;/td&gt;
&lt;td&gt;Make "spec or skip?" a team reflex, not a ritual&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The goal isn't "we do SDD now." It's "we know exactly when SDD pays, and we skip it when it doesn't."&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Spec-driven development is not a methodology you adopt wholesale. It's a cost-management strategy for the two taxes that bracket every AI-assisted change: the ambiguity tax on the left, the surplus-structure tax on the right. Good engineering is finding the bottom of that curve, per change, not per team. So the rule is simple, and it's the whole post in one line:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Spec it when ambiguity is expensive. Skip it when the code is cheaper than the ceremony.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Böckeler, B.&lt;/strong&gt;, &lt;em&gt;Exploring Generative AI&lt;/em&gt; (spec-driven development levels): &lt;a href="https://martinfowler.com/articles/exploring-gen-ai/sdd-3-tools.html" rel="noopener noreferrer"&gt;https://martinfowler.com/articles/exploring-gen-ai/sdd-3-tools.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gloaguen, T. et al.&lt;/strong&gt;, &lt;em&gt;Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?&lt;/em&gt; (arXiv 2602.11988): &lt;a href="https://arxiv.org/abs/2602.11988" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2602.11988&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mavin, A. et al.&lt;/strong&gt;, &lt;em&gt;Easy Approach to Requirements Syntax (EARS)&lt;/em&gt;, IEEE RE'09: &lt;a href="https://ieeexplore.ieee.org/document/5328509" rel="noopener noreferrer"&gt;https://ieeexplore.ieee.org/document/5328509&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telin, J.&lt;/strong&gt;, &lt;em&gt;Spec Driven Development Is Wasting Tokens&lt;/em&gt;: &lt;a href="https://medium.com/it-chronicles/is-your-safe-choice-burning-your-budget-1cfddf8782e4" rel="noopener noreferrer"&gt;https://medium.com/it-chronicles/is-your-safe-choice-burning-your-budget-1cfddf8782e4&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Negrisolo, V.&lt;/strong&gt;, &lt;em&gt;OpenSpec vs Spec Kit: Choosing the Right AI-Driven Development Workflow&lt;/em&gt; (the 800-vs-250-line comparison): &lt;a href="https://hashrocket.com/blog/posts/openspec-vs-spec-kit-choosing-the-right-ai-driven-development-workflow-for-your-team" rel="noopener noreferrer"&gt;https://hashrocket.com/blog/posts/openspec-vs-spec-kit-choosing-the-right-ai-driven-development-workflow-for-your-team&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Spec Kit&lt;/strong&gt;: &lt;a href="https://github.com/github/spec-kit" rel="noopener noreferrer"&gt;https://github.com/github/spec-kit&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenSpec&lt;/strong&gt;: &lt;a href="https://github.com/Fission-AI/OpenSpec" rel="noopener noreferrer"&gt;https://github.com/Fission-AI/OpenSpec&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Koul, S.&lt;/strong&gt;, &lt;em&gt;The Architect's Dilemma: Relevance of SMEs in the Age of AI Agents&lt;/em&gt; (a thematic companion on "human debt" and the judgment AI can't replace): &lt;a href="https://susheemk.substack.com/p/the-architects-dilemma-relevance" rel="noopener noreferrer"&gt;https://susheemk.substack.com/p/the-architects-dilemma-relevance&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>architecture</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
