Trace vs Receipt: What AI Agent Runs Need After They Finish

Most agent observability discussions collapse two different needs into one bucket.

I want traces.

I also want receipts.

They are not the same thing.

A trace is for depth. It answers:

what happened when
which model call ran
which tool call followed
how long each step took
where latency or retries came from
which prompts, inputs, and outputs moved through the system

That is useful when debugging.

But after an agent finishes a real workflow, I often need something smaller and more operational:

What changed, and can I trust this run?

That is what I think of as a run receipt.

What belongs in a trace

A trace can be verbose because its job is to preserve detail.

For an agent run, a trace might include:

spans for model calls
spans for tool calls
raw request/response metadata
timing
retries
token counts
errors
intermediate reasoning events where available
parent/child relationships between steps

This is the right place for timeline reconstruction.

If I need to understand why a run became slow, why a model retried, or why one branch of a workflow failed, I want the trace.
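To make "timeline reconstruction" concrete, here is a minimal sketch of what I mean. The `Span` shape, its field names, and the sample values are all hypothetical, not taken from any particular tracing library; the point is only that parent/child links plus timing are enough to order a run and find the step that ate the time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    # Hypothetical span record; field names are assumptions for illustration.
    span_id: str
    parent_id: Optional[str]  # None for the root span
    kind: str                 # e.g. "run", "model", "tool"
    name: str
    start_ms: int
    end_ms: int
    retries: int = 0

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def timeline(spans: List[Span]) -> List[Span]:
    """Order spans by start time to reconstruct what happened when."""
    return sorted(spans, key=lambda s: s.start_ms)


def slowest_leaf(spans: List[Span]) -> Span:
    """Return the longest span with no children (a single model or tool call)."""
    parent_ids = {s.parent_id for s in spans if s.parent_id is not None}
    leaves = [s for s in spans if s.span_id not in parent_ids]
    return max(leaves, key=lambda s: s.duration_ms)


# Made-up sample data for one run.
spans = [
    Span("s1", None, "run", "fix-failing-test", 0, 9000),
    Span("s2", "s1", "model", "plan", 0, 1200),
    Span("s3", "s1", "tool", "shell: pnpm test", 1200, 8500, retries=2),
    Span("s4", "s1", "model", "summarize", 8500, 9000),
]

for s in timeline(spans):
    print(f"{s.start_ms:>5} ms  {s.kind:<5}  {s.name}  ({s.duration_ms} ms, retries={s.retries})")

print("slowest step:", slowest_leaf(spans).name)
```

Even in this toy version, the answer to "why was the run slow" lives in per-span detail (the retried shell call), which is exactly the kind of thing a trace should keep.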

But traces are often too noisy for day-to-day operation.

If an agent says "done," I do not always want to read every span. I want a compact artifact that tells me whether the run is safe to accept, resume, roll back, or investigate.

What belongs in a receipt

A receipt should be boring and reviewable.

For a coding agent, a receipt might include:

repo and branch
files touched
commands run
tools used
action classes: read, write, exec, network, admin
checks attempted
checks passed or failed
external state changed
approvals requested
approvals granted or denied
artifacts produced
links back to trace IDs or logs

For an MCP-heavy agent, I would also want:

server name
tool name
operation type
side-effect category
dry-run support
target resource
decision ID if a policy/gate was evaluated
result status

The receipt does not replace the trace. It points back to it.

The trace is the evidence archive.

The receipt is the operator summary.
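As a rough sketch of how those fields could hang together, assuming Python dataclasses and field names I made up for illustration (this is not a published schema), the receipt can stay a flat, serializable record that validates action classes and keeps a pointer back to the trace:

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional

# The five action classes listed above.
ACTION_CLASSES = {"read", "write", "exec", "network", "admin"}


@dataclass
class Action:
    tool: str
    action_class: str
    target: Optional[str] = None       # file path, URL, or resource name
    status: str = "ok"
    trace_id: Optional[str] = None     # pointer back into the trace
    server: Optional[str] = None       # MCP server name, if any
    decision_id: Optional[str] = None  # set when a policy/gate was evaluated

    def __post_init__(self) -> None:
        if self.action_class not in ACTION_CLASSES:
            raise ValueError(f"unknown action class: {self.action_class}")


@dataclass
class Receipt:
    run_id: str
    repo: str
    branch: str
    actions: List[Action] = field(default_factory=list)

    def non_read_actions(self) -> List[Action]:
        """Everything that could have left state behind."""
        return [a for a in self.actions if a.action_class != "read"]


# Hypothetical example values.
receipt = Receipt(
    run_id="run_123",
    repo="repo-name",
    branch="agent/fix-tests",
    actions=[
        Action(tool="read_file", action_class="read", target="src/app.ts"),
        Action(tool="shell", action_class="exec", target="pnpm test",
               status="failed", trace_id="span_abc"),
    ],
)

print(json.dumps(asdict(receipt), indent=2))
print("to review:", [a.tool for a in receipt.non_read_actions()])
```

Treating everything except `read` as worth reviewing is my own simplification here; a real side-effect taxonomy would likely be finer grained.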

Why this matters

Agents are starting to do work that leaves state behind.

They edit files, call APIs, mutate databases, send messages, update tickets, run shell commands, and write memory.

When something goes wrong, the painful question is often not:

What was the prompt?

It is:

What did the agent believe, what did it touch, and which part should I unwind?

That is where receipts become useful.

A receipt lets you compare runs without opening every trace. It gives a human enough context to decide:

accept
reject
resume
roll back
replay
escalate
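Comparing runs from receipts alone could look something like the sketch below. It assumes receipts are plain dicts with `state_changes` and `checks` lists shaped like the example at the end of this post; the helper names and the "passed" status value are hypothetical.

```python
from typing import Dict, Set


def touched(receipt: dict) -> Set[str]:
    """Targets the run reported as changed."""
    return {c["target"] for c in receipt.get("state_changes", [])}


def check_status(receipt: dict) -> Dict[str, str]:
    """Map each check name to its reported status."""
    return {c["name"]: c["status"] for c in receipt.get("checks", [])}


def compare(before: dict, after: dict) -> dict:
    """Diff two receipts without opening either trace."""
    b, a = check_status(before), check_status(after)
    return {
        "newly_touched": sorted(touched(after) - touched(before)),
        "no_longer_touched": sorted(touched(before) - touched(after)),
        "check_changes": {
            name: (b.get(name), a.get(name))
            for name in sorted(set(b) | set(a))
            if b.get(name) != a.get(name)
        },
    }


# Two made-up runs of the same task.
run_a = {
    "state_changes": [{"type": "file_write", "target": "src/app.ts"}],
    "checks": [{"name": "unit tests", "status": "failed"}],
}
run_b = {
    "state_changes": [
        {"type": "file_write", "target": "src/app.ts"},
        {"type": "file_write", "target": "src/app.test.ts"},
    ],
    "checks": [{"name": "unit tests", "status": "passed"}],
}

diff = compare(run_a, run_b)
print(diff)
```

A diff like this is usually enough to decide whether the second run is worth accepting or whether the extra touched file deserves a closer look in the trace.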

This also helps with evals.

If every failed run leaves behind structured receipt fields, you can build evals from real failures instead of relying only on hand-written test cases.
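One way this might work, as a sketch only: filter stored receipts for failed checks and turn each into a replayable case. The eval-case shape (`source_run`, `commands`, `must_pass`) is something I invented for illustration, and the receipts are assumed to look like the example at the end of this post.

```python
from typing import List, Optional


def to_eval_case(receipt: dict) -> Optional[dict]:
    """Turn a receipt with failed checks into a hypothetical eval case."""
    failed = [c["name"] for c in receipt.get("checks", []) if c["status"] == "failed"]
    if not failed:
        return None  # nothing to learn from a clean run in this sketch
    return {
        "source_run": receipt["run_id"],
        "workspace": receipt.get("workspace"),
        # Commands the agent ran, kept so the case can be replayed.
        "commands": [a["command"] for a in receipt.get("actions", []) if "command" in a],
        # The checks a future run would need to pass.
        "must_pass": failed,
    }


def build_eval_set(receipts: List[dict]) -> List[dict]:
    cases = (to_eval_case(r) for r in receipts)
    return [c for c in cases if c is not None]


# Made-up stored receipts.
receipts = [
    {
        "run_id": "run_123",
        "workspace": "repo-name",
        "actions": [{"tool": "shell", "action_class": "exec", "command": "pnpm test"}],
        "checks": [{"name": "unit tests", "status": "failed"}],
    },
    {
        "run_id": "run_124",
        "workspace": "repo-name",
        "actions": [],
        "checks": [{"name": "unit tests", "status": "passed"}],
    },
]

eval_set = build_eval_set(receipts)
print(eval_set)
```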

The design line I like

My current bias is:

traces should stay rich
receipts should stay small
receipts should use stable IDs to join back to traces
receipts should preserve action classes and side effects
receipts should record human approval outcomes
receipts should be cheap enough to keep on by default

In other words:

The trace explains the run. The receipt lets you operate it.
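The "stable IDs" point is the one that makes the split workable, so here is a minimal sketch of the join. It assumes each receipt action carries a `trace_id` and that spans are available in a dict keyed by the same ID; both the storage shape and the function names are assumptions.

```python
from typing import Dict, List, Optional, Tuple


def join_actions_to_spans(
    receipt: dict, spans_by_id: Dict[str, dict]
) -> List[Tuple[dict, Optional[dict]]]:
    """Pair each receipt action with its span, or None if the ID does not resolve."""
    return [
        (action, spans_by_id.get(action.get("trace_id")))
        for action in receipt.get("actions", [])
    ]


def unresolved_actions(receipt: dict, spans_by_id: Dict[str, dict]) -> List[dict]:
    """Actions whose trace pointer is missing or unknown."""
    return [
        a for a in receipt.get("actions", [])
        if a.get("trace_id") not in spans_by_id
    ]


# Made-up trace store and receipt.
spans_by_id = {
    "span_abc": {"name": "shell: pnpm test", "duration_ms": 7300, "retries": 2},
}
receipt = {
    "run_id": "run_123",
    "actions": [
        {"tool": "shell", "action_class": "exec", "trace_id": "span_abc"},
        {"tool": "http", "action_class": "network", "trace_id": "span_missing"},
    ],
}

for action, span in join_actions_to_spans(receipt, spans_by_id):
    detail = span["name"] if span else "no trace found"
    print(f'{action["tool"]:<6} -> {detail}')

print("unresolved:", [a["tool"] for a in unresolved_actions(receipt, spans_by_id)])
```

An unresolved pointer is itself a signal: a receipt that claims an action with no evidence behind it is one I would want to investigate before accepting.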

I am exploring this direction in Armorer, a local control plane for AI agents, and Armorer Guard, a runtime boundary layer for agent/tool decisions.

The open question I keep coming back to:

What is the minimum receipt that would actually help you trust, resume, or debug an agent run?

For me, the first useful version is probably:

{
  "run_id": "run_123",
  "agent": "claude-code",
  "workspace": "repo-name",
  "actions": [
    {
      "tool": "shell",
      "action_class": "exec",
      "command": "pnpm test",
      "status": "failed",
      "trace_id": "span_abc"
    }
  ],
  "state_changes": [
    {
      "type": "file_write",
      "target": "src/app.ts"
    }
  ],
  "checks": [
    {
      "name": "unit tests",
      "status": "failed"
    }
  ],
  "approvals": [],
  "outcome": "needs_review"
}
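To show that a receipt this small is already enough to act on, here is a sketch that reduces it to a one-line operator summary. The `summarize` function and its output format are hypothetical; the JSON is the same receipt as above.

```python
import json
from collections import Counter

RAW = """
{
  "run_id": "run_123",
  "agent": "claude-code",
  "workspace": "repo-name",
  "actions": [
    {"tool": "shell", "action_class": "exec", "command": "pnpm test",
     "status": "failed", "trace_id": "span_abc"}
  ],
  "state_changes": [{"type": "file_write", "target": "src/app.ts"}],
  "checks": [{"name": "unit tests", "status": "failed"}],
  "approvals": [],
  "outcome": "needs_review"
}
"""


def summarize(receipt: dict) -> str:
    """Collapse a receipt into one reviewable line."""
    classes = Counter(a["action_class"] for a in receipt["actions"])
    classes_str = ", ".join(f"{k}={v}" for k, v in sorted(classes.items())) or "none"
    targets = ", ".join(c["target"] for c in receipt["state_changes"]) or "none"
    failed = ", ".join(
        c["name"] for c in receipt["checks"] if c["status"] == "failed"
    ) or "none"
    return (
        f'{receipt["run_id"]} [{receipt["outcome"]}] '
        f"actions: {classes_str} | changed: {targets} | "
        f"failed checks: {failed} | approvals: {len(receipt['approvals'])}"
    )


receipt = json.loads(RAW)
summary = summarize(receipt)
print(summary)
```

If a line like that is what shows up in a terminal or a review queue after each run, the trace only needs to be opened when the line looks wrong.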