DEV Community: Jiří Joneš

How to debug AI agent failures for $0 - VCR Cassette Replay explained

Jiří Joneš — Tue, 16 Jun 2026 10:28:01 +0000

The problem

Your agent failed. So you run it again with the exact same inputs.

It succeeds. Or it fails differently. You are chasing a Heisenbug.

This is the fundamental problem with debugging AI systems. LLM sampling is stochastic. Tool calls hit live APIs. External state changes between runs. Re-running is not replaying. The original execution context is gone forever.

There is also the cost trap. Every debug retry is another live LLM API call. If your agent makes ten reasoning steps and calls GPT-4o or Claude Sonnet five times per session, debugging a single logic error burns real API credits with zero guarantee you will reproduce the original failure.

The VCR Cassette pattern

To escape the cost trap and defeat non-determinism, stop re-running live models. Use the VCR Cassette pattern instead.

A cassette is a database-backed, immutable snapshot of the exact payload stream from a real agent run. When your agent executes in production, it generates a tree of spans — LLM calls, tool executions, reasoning steps. A cassette captures all raw payloads: exact inputs sent to the LLM, exact outputs returned, millisecond precision.

The pattern splits debugging into two phases: record and replay.

During replay, the system feeds the exact historical payload stream back from the database. The live LLM is never called. External tools are never executed. The result is strictly deterministic — same span data, stochastic model completely bypassed.
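The record and replay phases can be sketched in a few lines of plain Python. This is a toy illustration of the pattern, not Span Chain's code: the Cassette and Replayer classes, the in-memory list, and fake_llm are all hypothetical stand-ins.

```python
import json
from typing import Callable, Dict, List


class Replayer:
    """Serves recorded outputs in order; never touches a live model."""

    def __init__(self, payloads: List[Dict[str, str]]) -> None:
        self._payloads = payloads
        self._cursor = 0

    def __call__(self, prompt: str) -> str:
        entry = self._payloads[self._cursor]
        if entry["input"] != prompt:
            # The caller asked something the tape never saw.
            raise ValueError(f"replay diverged at step {self._cursor}")
        self._cursor += 1
        return entry["output"]


class Cassette:
    """Hypothetical in-memory cassette: an ordered list of raw payloads."""

    def __init__(self) -> None:
        self.payloads: List[Dict[str, str]] = []

    def record(self, call: Callable[[str], str], prompt: str) -> str:
        # Record phase: hit the live callable once, keep input and output.
        output = call(prompt)
        self.payloads.append({"input": prompt, "output": output})
        return output

    def replay(self) -> Replayer:
        # Replay phase: hand out a deep copy so the tape itself stays untouched.
        return Replayer(json.loads(json.dumps(self.payloads)))


live_calls = []


def fake_llm(prompt: str) -> str:
    # Stand-in for a stochastic, billable model call.
    live_calls.append(prompt)
    return f"answer-{len(live_calls)}"


tape = Cassette()
first = tape.record(fake_llm, "plan the trip")
replayed = tape.replay()("plan the trip")
```

However many times the replay runs, fake_llm is called exactly once, during recording.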

The analogy is a VCR from the 1980s. When you record a sports match, you capture the state of reality. You can rewind and replay it as many times as you want. The athletes are not playing again — you are watching the deterministic tape. Cassette replay brings this exact mechanic to AI agent debugging.

How Span Chain implements it

In Span Chain, the VCR Cassette pattern is a first-class citizen of the architecture, implemented in the Elixir/OTP backend.

Recording is handled by Cassettes.record(run_id). The backend reads all payload rows for the specified run from the Ledger, ordered by epoch_id and seq, and inserts a %Cassette{} snapshot. Span Chain uses a payload-first principle: raw, unadulterated payload maps — no truncation, no loss of nested data.
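As a rough Python sketch of that recording step (the real implementation lives in the Elixir backend), the snapshot amounts to filtering rows by run, sorting by epoch_id and seq, and copying payloads untouched. The row layout and the build_snapshot name are assumptions made for illustration.

```python
import copy


def build_snapshot(rows, run_id):
    """Return the run's payloads in (epoch_id, seq) order as an immutable tuple."""
    selected = [row for row in rows if row["run_id"] == run_id]
    selected.sort(key=lambda row: (row["epoch_id"], row["seq"]))
    # Deep copy keeps nested data intact and detaches it from the source rows.
    return tuple(copy.deepcopy(row["payload"]) for row in selected)


# Hypothetical ledger rows, deliberately out of order and mixed across runs.
rows = [
    {"run_id": "run-1", "epoch_id": 1, "seq": 0, "payload": {"step": "c"}},
    {"run_id": "run-1", "epoch_id": 0, "seq": 1,
     "payload": {"step": "b", "nested": {"k": [1, 2]}}},
    {"run_id": "run-2", "epoch_id": 0, "seq": 0, "payload": {"step": "x"}},
    {"run_id": "run-1", "epoch_id": 0, "seq": 0, "payload": {"step": "a"}},
]

snapshot = build_snapshot(rows, "run-1")
steps = [payload["step"] for payload in snapshot]
```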

The replay path uses the exact same ingestion pipeline as live traffic. Data flows through SessionGenServer (which computes the hash), into the BufferProducer queue, through the Broadway pipeline, and into Ledger.insert_batch. Because it runs under a brand new run_id, it computes a fresh SHA-256 hash chain. Once replay finishes, the system calls verify_ledger automatically. If the ingestion is clean, you get hash_valid: true — cryptographic proof that the replay is structurally sound.

Finally, the Evals.Comparator performs a structural tree-diff between the source run and the replay. It pairs spans by name and sibling position, flags span_added, span_removed, and duration_diff differences, and marks the exact deviation_point: the first divergent span in every branch.
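To make the pairing rule concrete, here is a small Python sketch of a structural tree-diff. It is not the Evals.Comparator source. The span dict layout, the path strings, and the reading of deviation_point as "the first diff on each root-to-leaf path" are assumptions.

```python
from collections import defaultdict


def keyed_children(span):
    """Key each child by (name, position among same-named siblings)."""
    counts = defaultdict(int)
    keyed = {}
    for child in span.get("children", []):
        keyed[(child["name"], counts[child["name"]])] = child
        counts[child["name"]] += 1
    return keyed


def diff_spans(a, b, path=None, ancestor_diverged=False, out=None):
    out = [] if out is None else out
    path = path or a["name"]
    diverged = ancestor_diverged
    if a["duration_ms"] != b["duration_ms"]:
        out.append({"type": "duration_diff", "path": path,
                    "val_a": a["duration_ms"], "val_b": b["duration_ms"],
                    "deviation_point": not ancestor_diverged})
        diverged = True
    kids_a, kids_b = keyed_children(a), keyed_children(b)
    for key, child in kids_a.items():
        child_path = f"{path}/{key[0]}#{key[1]}"
        if key in kids_b:
            diff_spans(child, kids_b[key], child_path, diverged, out)
        else:
            out.append({"type": "span_removed", "path": child_path,
                        "deviation_point": not diverged})
    for key in kids_b:
        if key not in kids_a:
            out.append({"type": "span_added", "path": f"{path}/{key[0]}#{key[1]}",
                        "deviation_point": not diverged})
    return out


source = {"name": "agent_run", "duration_ms": 100, "children": [
    {"name": "llm_call", "duration_ms": 60, "children": [
        {"name": "tokenize", "duration_ms": 5, "children": []}]},
    {"name": "tool_call", "duration_ms": 40, "children": []},
]}
replay = {"name": "agent_run", "duration_ms": 100, "children": [
    {"name": "llm_call", "duration_ms": 0, "children": [
        {"name": "tokenize", "duration_ms": 0, "children": []}]},
    {"name": "llm_call", "duration_ms": 0, "children": []},
]}

diff = diff_spans(source, replay)
```

In this example the nested tokenize span also differs in duration, but it is not marked as a deviation point because its parent llm_call had already diverged.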

Instrumenting your agent

Use the Span Chain Python SDK. It is intentionally dumb — just an OTLP exporter. All cryptographic sequencing happens server-side.

import ghostfactory as gf
import os

gf.init(
    endpoint="http://localhost:4000",
    api_key=os.environ["GF_API_KEY"]
)

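@gf.trace(name="agent_run")
async def agent_run(task):
    # llm stands in for your own model client.
    async with gf.span("llm_call") as span:
        result = await llm.complete(task)
        # Attach the raw input and output so the cassette captures both.
        span.set_attributes({"input": task, "output": result})
    return result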

Once spans are flushed and the run has been recorded as a cassette, trigger a replay:

curl -X POST http://localhost:4001/api/cassettes/<cassette_id>/replay \
  -H "Authorization: Bearer <your_token>"

The response is an immediate 202 Accepted with a job_id. Poll the job until its status is completed; the finished job looks like this:

{
  "status": "completed",
  "result": {
    "run_id": "replay-abc-123",
    "span_count": 42,
    "hash_valid": true,
    "diff": [
      {
        "type": "duration_diff",
        "span_name": "llm_call",
        "deviation_point": true,
        "val_a": 1200,
        "val_b": 0
      }
    ]
  }
}
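A polling loop for that job might look like the sketch below. The article does not show the job-status endpoint, so the loop takes a fetch_status callable instead of a URL. The "running" and "failed" status values are assumptions; only "completed" appears in the response above.

```python
import time


def poll_until_done(fetch_status, interval_s=0.5, timeout_s=30.0):
    """Call fetch_status() until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout_s
    while True:
        body = fetch_status()
        # "failed" is an assumed terminal status, not taken from the article.
        if body.get("status") in ("completed", "failed"):
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError("replay job did not finish in time")
        time.sleep(interval_s)


def deviation_points(body):
    """Pick out the diff entries flagged as deviation points."""
    return [d for d in body["result"]["diff"] if d.get("deviation_point")]


# Fake responses standing in for HTTP calls to the (unspecified) job endpoint.
responses = iter([
    {"status": "running"},
    {"status": "completed", "result": {
        "run_id": "replay-abc-123", "span_count": 42, "hash_valid": True,
        "diff": [{"type": "duration_diff", "span_name": "llm_call",
                  "deviation_point": True, "val_a": 1200, "val_b": 0}],
    }},
])

final = poll_until_done(lambda: next(responses), interval_s=0.0)
flagged = deviation_points(final)
```

In the sample response, val_a is 1200 and val_b is 0 because the replayed llm_call span never waits on a live model.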

The difference

Without cassette replay: every retry is a live LLM call, non-deterministic, costs money, and structural deviations between runs are invisible. You are reading plain JSON logs and guessing.

With Span Chain: replay reads from the historical cassette. No live APIs hit. Cost is $0. The Comparator gives you an explicit structural diff with the exact deviation_point. And because replay flows through the real pipeline, it generates its own SHA-256 hash chain — hash_valid: true. Your debug session leaves a tamper-evident audit trail.

Stop guessing what your agent did. Record the reality, replay it for free, prove it cryptographically.

Span Chain is MIT licensed and self-hosted. To run it, git clone the repo, set POSTGRES_PASSWORD and GF_API_KEY in .env, then run docker compose up. The repo is at github.com/ghostfactory-art/spanchain.

Note: spanchain will be on PyPI shortly. For now: pip install ./sdk/python

Why your AI agent logs are not evidence and what to do about it

Jiří Joneš — Fri, 12 Jun 2026 14:51:47 +0000

The problem

Your agent failed in production. You look at the logs. They don't give you the full picture. So you run the agent again with the exact same inputs. It succeeds. Or it fails differently. Classic.

LLM calls, time-dependent code, tool side effects, and stochastic sampling mean "same inputs, same outputs" is completely false for AI systems. You have no idea what actually happened in the first run. The original context is gone, and re-running is not replaying.

This is the problem Span Chain was built to solve.

Logs vs evidence

Logs are claims. Not evidence.

A standard log or trace is just a JSON blob. A buggy retention job can orphan a span. An attacker can rewrite it. The agent itself might hallucinate and log bad data.

If your trace data is mutable, it is not evidence. It is a claim about what happened, written after the fact. Span Chain treats every event as an immutable, cryptographically sealed record. You cannot rewrite history without breaking the chain.

What tamper-evident means in practice

Span Chain uses a SHA-256 hash chain. Every event during an agent session is appended to an immutable ledger. The hash input covers the sequence, the previous hash, the exact payload, the parent span, the run ID, and the epoch. Change one byte of an old span and the chain breaks.
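The mechanism is easy to demonstrate in Python with hashlib. This is a minimal sketch, not Span Chain's server code: the canonical-JSON serialization, the all-zero genesis hash, and the field names parent_span_id and hash are assumptions. The article only states which fields feed the hash.

```python
import copy
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first event's prev hash


def event_hash(prev_hash, event):
    """SHA-256 over seq, previous hash, payload, parent span, run ID, epoch."""
    material = json.dumps({
        "seq": event["seq"],
        "prev_hash": prev_hash,
        "payload": event["payload"],
        "parent_span_id": event["parent_span_id"],
        "run_id": event["run_id"],
        "epoch_id": event["epoch_id"],
    }, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def seal(events):
    """Append each event to a ledger, chaining every hash to the previous one."""
    prev, ledger = GENESIS, []
    for event in events:
        prev = event_hash(prev, event)
        ledger.append({**event, "hash": prev})
    return ledger


def verify(ledger):
    """Recompute the chain and report the first seq whose hash does not match."""
    prev = GENESIS
    for row in ledger:
        if event_hash(prev, row) != row["hash"]:
            return {"valid": False, "chain_broken_at_seq": row["seq"]}
        prev = row["hash"]
    return {"valid": True, "span_count": len(ledger)}


events = [
    {"seq": i, "run_id": "agent-run-001", "epoch_id": 0,
     "parent_span_id": None, "payload": {"output": f"step {i}"}}
    for i in range(3)
]
ledger = seal(events)
clean = verify(ledger)

tampered = copy.deepcopy(ledger)
tampered[1]["payload"]["output"] = "edited after the fact"
broken = verify(tampered)
```

Because each hash covers the previous one, rewriting seq 1 would also require recomputing every later hash to hide the edit.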

This is what separates Span Chain from standard LLM observability tools like LangSmith or Langfuse. Those show you what happened. Span Chain lets you prove it.

Verification is a single API call:

curl http://localhost:4001/api/runs/your-run-id/verify \
  -H "Authorization: Bearer <token>"

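An intact chain returns:

{"valid": true, "span_count": 12}

A chain that has been altered returns the sequence number where it breaks:

{"valid": false, "chain_broken_at_seq": 7}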

One changed byte anywhere in history. You know immediately.

The replay cost trap

Debugging by re-running the agent is a trap. Every retry is another live LLM call. That costs money and latency.

Span Chain solves this with VCR-style cassette replay. It reads the exact payload stream from the database and feeds it back to the system. No LLM, no API credits. Replay costs $0.

Here is how you instrument an agent with the Span Chain Python SDK:

import spanchain as gf

gf.init(
    endpoint="http://localhost:4000",
    api_key="your-api-key",
    run_id="agent-run-001",
)

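@gf.trace(name="agent_run")
async def agent_run(task):
    # llm and search stand in for your own model client and tool.
    async with gf.span("llm_call"):
        result = await llm.complete(task)
    async with gf.span("tool_call", tool_name="search"):
        # Recorded as its own span; the search output is not used further here.
        results = await search(task)
    return result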

The Span Chain SDK is intentionally dumb. It exports spans as OTLP to the backend and nothing else. All cryptographic sequencing happens server-side. The client cannot forge a clean chain even if it tries.

Model upgrades

When you swap models, your agent's behavior changes. How do you know what broke?

Span Chain lets you replay old cassettes through the new model and run a structural comparison. The comparator flags the exact span where behavior diverged. Not just "Run B was slower" but the first point where the two runs split. If the new model added a tool call or skipped a step, you see it immediately. You stop guessing.

How I got here

I kept running into the same wall: agent fails, logs tell you nothing useful, you re-run and get a different result. Existing tools were not built for this. They produce mutable data with no replay capability.

So I built Span Chain, an auditable harness for production AI agents. The backend runs on Elixir/OTP, where every agent session gets its own isolated BEAM process (~2 KB heap). A crash in one agent does not touch the others. That is how you get 1,000 concurrent agents, 10,000 spans, 571 spans/sec, and 0 corrupted chains.

Span Chain is MIT licensed and self-hosted. Clone the repo, copy .env.example to .env, set POSTGRES_PASSWORD and GF_API_KEY in it, then run docker compose up:

git clone https://github.com/ghostfactory-art/spanchain
cd spanchain
cp .env.example .env
docker compose up

The repo is at github.com/ghostfactory-art/spanchain.

Footnote: EU AI Act Article 12 requires automatic event logging and traceability for high-risk AI systems (Annex III obligations expected from December 2027, pending formal adoption of the AI Omnibus agreed in May 2026). The law does not mandate tamper-evidence, but a log that can be silently rewritten is hard to defend as traceability. Span Chain gives you evidence-grade records that stand up to scrutiny.

Note: spanchain will be on PyPI shortly. For now, install from source:
pip install ./sdk/python