9:04 a.m.
A ticket lands. A customer ran your agent yesterday, it called the wrong tool, deleted the wrong record, and now there is a sc...
For further actions, you may consider blocking this person and/or reporting abuse
I'd push back here - if you can't reproduce it, the log was incomplete before deploy. reproducibility is the symptom. designing the output to carry its own reasoning and tool call chain is the fix.
The framing that helped us most was exactly your replay-over-determinism split — once you stop trying to make the model deterministic and start capturing what actually happened, the whole problem collapses to one question: where do you put the recording boundary?
If you log the prompt and the final answer, you can't replay, because every interesting divergence lives in between — each sampled completion, each tool response. The capture has to sit at the I/O boundary of every non-deterministic step: the exact bytes the model sampled, and the exact bytes each tool returned, in order. Replay then feeds those recordings back instead of re-calling anything, so the serving-stack non-determinism everyone points at simply stops mattering — you're not re-running the model, you're replaying a transcript it can't deviate from.
The part that bites is hiding in your opening: a tool call that deleted the wrong record is non-deterministic AND has side effects. So replaying that run has to serve the recorded tool response and never actually re-invoke the tool — otherwise your debugging session deletes a second record. That recording boundary is the whole difference between replaying the failure and re-running it into a new one.
The canonicalization step is the bit nobody writes about and it's where half the pain lives. Stripping run_id, timestamp, trace_id, created_at before you key the fixture is the difference between a replay that matches and one that drifts on the first re-run. Agreed completely, and it's a thankless thing to maintain.
The part that nags me: the key encodes a guess about what's load-bearing. You keep what you think identifies "the same situation" and drop the rest. But a fix can change what's load-bearing. You patch the tool-selection logic and suddenly a field you'd been treating as noise is what decides the branch. The fixture still keys clean, still replays green, and it's now testing a situation that no longer exists.
So canonicalization keeps yesterday's bug reproducible. I don't think it keeps the key honest once the code under it moves.
The graph-boundary recording argument is right, and the tool-output recording is doing the heaviest lifting in agentic systems specifically. Once an agent's next action is conditioned on a tool return (chain read, DB at read-replica drift, third-party API), the captured tool bytes ARE the canonical replay world state, not just one input among many. The underlying system has moved on by replay time, so the recording is the only source of truth for the world the agent acted on. The envelope is half prompt-replay, half world-state snapshot.
One push on the fix-verification trap: even with the two-layer split working, you're testing the fix against a sample of size one. The captured 9:04 ticket. Fix passes the frozen fixture, ships, the very next ticket exposes an adjacent failure mode the first didn't surface. Recording-as-fixture solves the regression problem for the exact incident. It doesn't solve "what else does my fix break or fail to cover." Fuzz around the captured fixture (small perturbations of the prompt, the tool returns, the retrieved chunks) before declaring the fix shipped.
This is incredibly well written. The batch invariance section is the hardest part of the whole post.
We hit the exact same trap with MemBridge (our SQLite memory system for Hermes Agent). RRF five-route ranking looks like it should be deterministic — same query, same weights, same result — until you notice float32 addition isn't associative and different accumulation orders produce different rankings. Two weeks debugging what we thought was a routing strategy bug, and the root cause was arithmetic.
The record & replay framing is exactly where we landed: don't freeze generation (that kills quality, self-consistency, exploration), record at the orchestration boundary and replay the captured state. Are you doing this at the socket layer (VCR-style cassettes) or higher up at the graph state hydration layer?
The vLLM/Qwen-3-8B "80 distinct completions out of 1000 identical prompts" number is more damning than it reads. Even temperature=0 isn't a reproducibility contract on a shared serving stack.
Fixture key instability is the failure mode that quietly eats replay suites. Canonicalizing on the raw request body breaks the moment providers sneak in fields (request_id, system_fingerprint, served_model_name) that look like metadata but participate in the hash. What helps: an explicit allowlist of fields that go into the cache key (model alias, messages, tools, temperature, top_p, seed) and drop everything else before hashing.
Worth recording the response
system_fingerprintalongside the fixture too. Drift-induced replay misses become loud instead of silent.On the MoE routing point: capacity-factor batching means a fixture captured at p50 traffic might not replay if dev runs the request alone. Argues for replay-at-the-trace-layer (OpenInference spans) rather than replay-at-HTTP.
The temperature-zero trap is so real. We spent a month thinking "just set it to 0 and it's deterministic" — then watched it diverge on the exact same input because logits drifted between model versions. The replayability vs determinism framing would have saved us weeks. What's your approach to storing enough session state to actually replay a failure?
And it gets worse: providers can update weights, change batching behavior, drift logits between versions silently. Your "deterministic" outputs change without any deployment on your side. Recording at the I/O boundary is the only defense that survives provider updates.
This pain is real and the fix that stuck for us was boring: persist every input the agent saw, not just the conversation. Tool RESULTS especially, because the world changes under you and the database row that caused the failure will not exist tomorrow. Once the full run record is durable, replay in CI becomes possible and about half of our "cannot reproduce" bugs turned out to be reproducible after all. The other half were provider-side model drift, which the run record also surfaces, because the same inputs start producing different outputs with no diff on our side
What finally made these reproducible for us was capturing the full input envelope at failure time, not just the prompt but the retrieved context, tool outputs, and the model and params, and writing it straight into the eval set as a fixture. Most 'unreproducible' agent bugs are really just non-captured inputs: the run depended on a retrieval result or a tool response you did not log, so you cannot replay it. Once the failing trace becomes a stored test case, the bug is reproducible by construction and you get a regression test for free. The expensive part is plumbing the capture in before the incident, not after.
Reproducibility gets even worse when the agent interacts with a GUI. The screen state at step 12 depends on every action that came before it, and pixel-level differences in rendering can change model behavior. Running the agent against a deterministic local environment helps, but the real fix is building verify steps into the agent loop itself. Some approaches now use a separate adversary agent that independently reviews the builder agent's output without sharing context, which catches a surprising number of issues that automated tests miss.
The "where do you put the recording boundary" question is the right one. There's a second property missing: the record needs to be tamper-evident, not just complete. A complete log is enough for debugging. For compliance, a third party needs to verify the record wasn't altered after the incident. Same data, different integrity requirements.
Reproducibility is the missing layer in a lot of agent demos. If you cannot answer what context it saw, what tool calls it made, and why it chose the path it chose, debugging becomes archaeology.