Why your AI agent logs are not evidence and what to do about it

#agents #python #opensource #elixir

The problem

Your agent failed in production. You look at the logs. They don't give you the full picture. So you run the agent again with the exact same inputs. It succeeds. Or it fails differently. Classic.

LLM calls, time-dependent code, tool side effects, and stochastic sampling mean "same inputs, same outputs" does not hold for AI systems. The original context is gone, so a second run tells you what happened the second time, not the first. Re-running is not replaying.

This is the problem Span Chain was built to solve.

Logs vs evidence

Logs are claims. Not evidence.

A standard log or trace is just a JSON blob. A buggy retention job can orphan a span. An attacker can rewrite it. The agent itself might hallucinate and log bad data.

If your trace data is mutable, it is not evidence. It is a claim about what happened, written after the fact. Span Chain treats every event as an immutable, cryptographically sealed record. You cannot rewrite history without breaking the chain.

What tamper-evident means in practice

Span Chain uses a SHA-256 hash chain. Every event during an agent session is appended to an immutable ledger. The hash input covers the sequence, the previous hash, the exact payload, the parent span, the run ID, and the epoch. Change one byte of an old span and the chain breaks.
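
To make the mechanism concrete, here is a minimal Python sketch of a hash chain over the fields listed above. It is not Span Chain's actual implementation: the canonical JSON encoding, the all-zero genesis hash, and the names `Span`, `span_hash`, `append`, and `first_broken_seq` are assumptions chosen for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, replace
from typing import List, Optional

GENESIS = "0" * 64  # assumed "previous hash" for the first span in a run


@dataclass(frozen=True)
class Span:
    seq: int
    prev_hash: str
    payload: bytes
    parent_span: Optional[str]
    run_id: str
    epoch: int
    hash: str


def span_hash(seq, prev_hash, payload, parent_span, run_id, epoch) -> str:
    # Canonical JSON (sorted keys, fixed separators) so identical fields
    # always serialize to identical bytes before hashing.
    material = json.dumps(
        {
            "seq": seq,
            "prev_hash": prev_hash,
            "payload": payload.hex(),
            "parent_span": parent_span,
            "run_id": run_id,
            "epoch": epoch,
        },
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    return hashlib.sha256(material).hexdigest()


def append(chain: List[Span], payload, parent_span, run_id, epoch) -> List[Span]:
    # Each new span commits to the hash of the span before it.
    seq = len(chain)
    prev_hash = chain[-1].hash if chain else GENESIS
    digest = span_hash(seq, prev_hash, payload, parent_span, run_id, epoch)
    return chain + [Span(seq, prev_hash, payload, parent_span, run_id, epoch, digest)]


def first_broken_seq(chain: List[Span]) -> Optional[int]:
    # Recompute every hash from the start; return the first position that
    # does not match, or None if the chain is intact.
    prev_hash = GENESIS
    for index, span in enumerate(chain):
        expected = span_hash(
            span.seq, prev_hash, span.payload, span.parent_span, span.run_id, span.epoch
        )
        if span.seq != index or span.prev_hash != prev_hash or span.hash != expected:
            return index
        prev_hash = span.hash
    return None


chain: List[Span] = []
for text in (b"plan", b"search", b"answer"):
    chain = append(chain, text, None, "agent-run-001", 1)

tampered = list(chain)
tampered[1] = replace(chain[1], payload=b"Search")  # one changed byte

print(first_broken_seq(chain))     # None
print(first_broken_seq(tampered))  # 1
```

Recomputing the tampered span's own hash does not help an attacker in this sketch: the next span still commits to the old hash, so the break simply moves one position later.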

This is what separates Span Chain from standard LLM observability tools like LangSmith or Langfuse. Those show you what happened. Span Chain lets you prove it.

Verification is a single API call:

curl http://localhost:4001/api/runs/your-run-id/verify \
  -H "Authorization: Bearer <token>"

# intact chain
{"valid": true, "span_count": 12}

# tampered chain: seq of the first span that fails verification
{"valid": false, "chain_broken_at_seq": 7}

One changed byte anywhere in history. You know immediately.
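
If you want the same check in a script, for example as a CI gate, a standard-library Python sketch could look like the following. The URL path and response fields are the ones shown in the curl example; the function names are illustrative, and the sketch assumes the server answers with HTTP 200 for both intact and broken chains, which I have not confirmed here.

```python
import json
import urllib.request


def build_verify_request(base_url: str, run_id: str, token: str) -> urllib.request.Request:
    # Same request as the curl example: GET with a bearer token.
    return urllib.request.Request(
        f"{base_url}/api/runs/{run_id}/verify",
        headers={"Authorization": f"Bearer {token}"},
    )


def describe(body: str) -> str:
    # Interpret the two response shapes shown above.
    result = json.loads(body)
    if result["valid"]:
        return f"chain intact ({result['span_count']} spans)"
    return f"chain broken at seq {result['chain_broken_at_seq']}"


def verify_run(base_url: str, run_id: str, token: str) -> str:
    # Performs the network call; needs a running Span Chain backend.
    request = build_verify_request(base_url, run_id, token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return describe(response.read().decode("utf-8"))


print(describe('{"valid": true, "span_count": 12}'))
print(describe('{"valid": false, "chain_broken_at_seq": 7}'))
```

Calling `verify_run("http://localhost:4001", "your-run-id", token)` would then return one of the two summaries.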

The replay cost trap

Debugging by re-running the agent is a trap. Every retry is another live LLM call, which costs money, adds latency, and may not even reproduce the failure.

Span Chain solves this with VCR-style cassette replay. It reads the exact payload stream from the database and feeds it back to the system. No LLM, no API credits. Replay costs $0.
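
The idea behind cassette replay can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Span Chain's replay engine: the `Cassette` class, its in-memory list (standing in for the database), and the strict in-order matching are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional


class Cassette:
    """Records call payloads once, then serves them back without live calls."""

    def __init__(self, recorded: Optional[List[Dict[str, Any]]] = None):
        self.entries: List[Dict[str, Any]] = list(recorded or [])
        self._cursor = 0

    def record(self, name: str, request: Any, call: Callable[[Any], Any]) -> Any:
        # Live mode: make the real call and store the exact response.
        response = call(request)
        self.entries.append({"name": name, "request": request, "response": response})
        return response

    def replay(self, name: str, request: Any) -> Any:
        # Replay mode: return the stored response, in recorded order.
        if self._cursor >= len(self.entries):
            raise LookupError("cassette exhausted: run made more calls than were recorded")
        entry = self.entries[self._cursor]
        if entry["name"] != name or entry["request"] != request:
            raise LookupError(f"call {self._cursor} does not match the recording")
        self._cursor += 1
        return entry["response"]


calls = {"live": 0}


def fake_llm(prompt: str) -> str:
    # Stand-in for a paid model call.
    calls["live"] += 1
    return f"answer to: {prompt}"


tape = Cassette()
first = tape.record("llm_call", "summarize the ticket", fake_llm)

replayer = Cassette(tape.entries)
again = replayer.replay("llm_call", "summarize the ticket")

print(again == first, calls["live"])  # True 1
```

The replayed response is byte-for-byte the recorded one, and the live call counter never moves past the original run.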

Here is how you instrument an agent with the Span Chain Python SDK:

import spanchain as gf

gf.init(
    endpoint="http://localhost:4000",
    api_key="your-api-key",
    run_id="agent-run-001",
)

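# `llm` and `search` stand in for your own model client and tool function.
@gf.trace(name="agent_run")
async def agent_run(task):
    # Span for the model call.
    async with gf.span("llm_call"):
        result = await llm.complete(task)
    # Span for the tool call, tagged with the tool's name.
    async with gf.span("tool_call", tool_name="search"):
        search_results = await search(task)
    return result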

The Span Chain SDK is intentionally dumb. It exports spans as OTLP to the backend and nothing else. All cryptographic sequencing happens server-side. The client cannot forge a clean chain even if it tries.
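
Why server-side sequencing matters can be shown with a small Python sketch of an ingest step. It is an assumption-laden illustration, not the Elixir backend: the `ingest` function, the field whitelist, and the JSON-based hashing are hypothetical. The point is only that anything the client says about its position in the chain is discarded.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed "previous hash" for the first span in a run
CLIENT_FIELDS = ("name", "payload", "parent_span")  # assumed whitelist


def ingest(ledger, run_id, client_span):
    """Append a client span; sequence numbers and hashes are assigned here only."""
    # Keep whitelisted fields; forged seq/prev_hash/hash values are dropped.
    body = {key: client_span.get(key) for key in CLIENT_FIELDS}
    record = {
        "seq": len(ledger),
        "prev_hash": ledger[-1]["hash"] if ledger else GENESIS,
        "run_id": run_id,
        **body,
    }
    encoded = json.dumps(record, sort_keys=True).encode("utf-8")
    record["hash"] = hashlib.sha256(encoded).hexdigest()
    ledger.append(record)
    return record


ledger = []
ingest(ledger, "agent-run-001", {"name": "llm_call", "payload": "hello"})

# A client trying to claim position 0 with a made-up hash.
forged = {"name": "tool_call", "payload": "x", "seq": 0, "prev_hash": "f" * 64, "hash": "f" * 64}
stored = ingest(ledger, "agent-run-001", forged)

print(stored["seq"], stored["prev_hash"] == ledger[0]["hash"])  # 1 True
```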

Model upgrades

When you swap models, your agent's behavior changes. How do you know what broke?

Span Chain lets you replay old cassettes through the new model and run a structural comparison. The comparator flags the exact span where behavior diverged. Not just "Run B was slower" but the first point where the two runs split. If the new model added a tool call or skipped a step, you see it immediately. You stop guessing.
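
A structural comparison of this kind can be approximated in Python as below. This is a sketch, not the Span Chain comparator: it assumes each span is a dict with a `name` and an optional `tool_name`, and it compares only that shape, ignoring payload text and timing. The names `shape` and `first_divergence` are illustrative.

```python
from itertools import zip_longest


def shape(span):
    """Structural fingerprint: what kind of step it was, not what it said."""
    return (span["name"], span.get("tool_name"))


def first_divergence(baseline, candidate):
    """Return the first position where the two runs differ in shape, else None."""
    for index, (old, new) in enumerate(zip_longest(baseline, candidate)):
        old_shape = shape(old) if old is not None else None
        new_shape = shape(new) if new is not None else None
        if old_shape != new_shape:
            return {"index": index, "baseline": old_shape, "candidate": new_shape}
    return None


old_run = [
    {"name": "llm_call"},
    {"name": "tool_call", "tool_name": "search"},
    {"name": "llm_call"},
]
new_run = [
    {"name": "llm_call"},
    {"name": "tool_call", "tool_name": "search"},
    {"name": "tool_call", "tool_name": "calculator"},  # extra tool call
    {"name": "llm_call"},
]

print(first_divergence(old_run, new_run))
# {'index': 2, 'baseline': ('llm_call', None), 'candidate': ('tool_call', 'calculator')}
```

A skipped step shows up the same way: the shorter run is padded with `None`, so the divergence is reported at the first missing span.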

How I got here

I kept running into the same wall: agent fails, logs tell you nothing useful, you re-run and get a different result. Existing tools were not built for this. They produce mutable data with no replay capability.

So I built Span Chain, an auditable harness for production AI agents. The backend runs on Elixir/OTP, where every agent session gets its own isolated BEAM process (~2 KB heap). A crash in one agent does not touch the others. That isolation is how it handles 1,000 concurrent agents writing 10,000 spans at 571 spans/sec with 0 corrupted chains.

Span Chain is MIT licensed and self-hosted. Clone the repo, copy .env.example to .env, set POSTGRES_PASSWORD and GF_API_KEY in it, then start the stack:

git clone https://github.com/ghostfactory-art/spanchain
cd spanchain
cp .env.example .env
# edit .env: set POSTGRES_PASSWORD and GF_API_KEY
docker compose up

The repo is at github.com/ghostfactory-art/spanchain.

*Footnote: EU AI Act Article 12 requires automatic event logging and traceability for high-risk AI systems (Annex III obligations expected from December 2027, pending formal adoption of the AI Omnibus agreed in May 2026). The law does not mandate tamper-evidence, but a log that can be silently rewritten is hard to defend as traceability. Span Chain gives you evidence-grade records that stand up to scrutiny.