Most agent observability discussions collapse two different needs into one bucket.
I want traces.
I also want receipts.
They are not the same thing.
A trace is for depth. It answers:
- what happened when
- which model call ran
- which tool call followed
- how long each step took
- where latency or retries came from
- what prompt/input/output moved through the system
That is useful when debugging.
But after an agent finishes a real workflow, I often need something smaller and more operational:
What changed, and can I trust this run?
That is what I think of as a run receipt.
What belongs in a trace
A trace can be verbose because its job is to preserve detail.
For an agent run, a trace might include:
- spans for model calls
- spans for tool calls
- raw request/response metadata
- timing
- retries
- token counts
- errors
- intermediate reasoning events where available
- parent/child relationships between steps
This is the right place for timeline reconstruction.
If I need to understand why a run became slow, why a model retried, or why one branch of a workflow failed, I want the trace.
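To make timeline reconstruction concrete, here is a minimal Python sketch. The `Span` shape, its field names, and the sample values are assumptions for illustration, not the format of any particular tracing tool. It rebuilds the parent/child tree and finds the slowest leaf step.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical span shape; real tracing backends use their own schemas.
    span_id: str
    parent_id: Optional[str]  # None for the root span
    kind: str                 # e.g. "model_call" or "tool_call"
    name: str
    start_ms: int
    end_ms: int
    retries: int = 0

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def timeline(spans: list[Span]) -> list[tuple[int, Span]]:
    """Return (depth, span) pairs depth-first, siblings ordered by start time."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    ordered: list[tuple[int, Span]] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            ordered.append((depth, span))
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return ordered


def slowest_leaf(spans: list[Span]) -> Span:
    """Longest span with no children, i.e. where the time was actually spent."""
    parent_ids = {s.parent_id for s in spans}
    leaves = [s for s in spans if s.span_id not in parent_ids]
    return max(leaves, key=lambda s: s.duration_ms)


spans = [
    Span("s1", None, "run", "agent run", 0, 9000),
    Span("s2", "s1", "model_call", "plan", 0, 1200),
    Span("s3", "s1", "tool_call", "pnpm test", 1200, 8500, retries=2),
    Span("s4", "s1", "model_call", "summarize", 8500, 9000),
]

for depth, span in timeline(spans):
    indent = "  " * depth
    print(f"{indent}{span.name} ({span.kind}) {span.duration_ms} ms, retries={span.retries}")
```

Even in this toy version the output is a tree you have to read, which is the point: it is good for debugging one run and poor for glancing at twenty.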
But traces are often too noisy for day-to-day operation.
If an agent says "done," I do not always want to read every span. I want a compact artifact that tells me whether the run is safe to accept, resume, roll back, or investigate.
What belongs in a receipt
A receipt should be boring and reviewable.
For a coding agent, a receipt might include:
- repo and branch
- files touched
- commands run
- tools used
- action classes: read, write, exec, network, admin
- checks attempted
- checks passed or failed
- external state changed
- approvals requested
- approvals granted or denied
- artifacts produced
- links back to trace IDs or logs
For an MCP-heavy agent, I would also want:
- server name
- tool name
- operation type
- side-effect category
- dry-run support
- target resource
- decision ID if a policy/gate was evaluated
- result status
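The MCP fields above could be captured in a small record type. This is only a sketch: the class name, the field names, the `"none"`/`"local"`/`"external"` side-effect vocabulary, and the `ungated_side_effects` helper are all hypothetical and not part of MCP itself.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class McpAction:
    server: str                 # MCP server name
    tool: str                   # tool name exposed by that server
    operation: str              # e.g. "read", "create", "update", "delete"
    side_effect: str            # assumed vocabulary: "none", "local", "external"
    supports_dry_run: bool
    target: str                 # resource the call was aimed at
    decision_id: Optional[str]  # set only if a policy/gate was evaluated
    status: str                 # e.g. "ok", "failed", "denied"

    def to_record(self) -> dict:
        """Plain dict that can be embedded in a JSON receipt."""
        return asdict(self)


def ungated_side_effects(actions: list[McpAction]) -> list[McpAction]:
    """Actions that changed something but carry no policy decision ID."""
    return [
        a for a in actions
        if a.side_effect != "none" and a.decision_id is None
    ]


actions = [
    McpAction("tracker", "get_ticket", "read", "none", False, "TICKET-1", None, "ok"),
    McpAction("tracker", "update_ticket", "update", "external", True, "TICKET-1", None, "ok"),
    McpAction("db", "run_migration", "update", "external", True, "orders", "dec_42", "ok"),
]

print(json.dumps([a.to_record() for a in ungated_side_effects(actions)], indent=2))
```

A query like `ungated_side_effects` is the kind of question these fields are meant to make cheap to ask.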
The receipt does not replace the trace. It points back to it.
The trace is the evidence archive.
The receipt is the operator summary.
Why this matters
Agents are starting to do work that leaves state behind.
They edit files, call APIs, mutate databases, send messages, update tickets, run shell commands, and write memory.
When something goes wrong, the painful question is often not:
What was the prompt?
It is:
What did the agent believe, what did it touch, and which part should I unwind?
That is where receipts become useful.
A receipt lets you compare runs without opening every trace. It gives a human enough context to decide:
- accept
- reject
- resume
- roll back
- replay
- escalate
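A rough sketch of how a receipt could pre-sort runs into those buckets before a human looks at them. The rules below are my own assumptions, not a recommended policy; the `status` field on approvals and the `"interrupted"` outcome value are hypothetical; and "replay" is deliberately left to the human.

```python
def suggest_decision(receipt: dict) -> str:
    """Suggest one of: accept, reject, resume, roll back, escalate.

    The result is a hint for a human reviewer, not an automatic verdict.
    """
    approvals = receipt.get("approvals", [])
    checks = receipt.get("checks", [])
    changes = receipt.get("state_changes", [])
    failed = [c for c in checks if c.get("status") == "failed"]

    # Assumed rule: a denied approval always needs a person.
    if any(a.get("status") == "denied" for a in approvals):
        return "escalate"
    # Assumed rule: an interrupted run can be picked up again.
    if receipt.get("outcome") == "interrupted":
        return "resume"
    # Failed checks plus leftover state: there is something to unwind.
    if failed and changes:
        return "roll back"
    if failed:
        return "reject"
    # State changed but nothing verified it.
    if changes and not checks:
        return "escalate"
    return "accept"


receipt = {
    "run_id": "run_123",
    "state_changes": [{"type": "file_write", "target": "src/app.ts"}],
    "checks": [{"name": "unit tests", "status": "failed"}],
    "approvals": [],
    "outcome": "needs_review",
}

print(suggest_decision(receipt))
```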
This also helps with evals.
If every failed run leaves behind structured receipt fields, you can build evals from real failures instead of only hand-written test cases.
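As a sketch of what that could look like, the function below turns receipts with failed checks into eval case stubs. The eval case shape (`case_id`, `must_pass`, `touched`, `source_trace_ids`) is invented for illustration; a real eval harness would define its own.

```python
def eval_cases_from_receipts(receipts: list[dict]) -> list[dict]:
    """Build one eval case stub per receipt that recorded a failed check."""
    cases = []
    for receipt in receipts:
        failed = [
            check["name"]
            for check in receipt.get("checks", [])
            if check.get("status") == "failed"
        ]
        if not failed:
            continue  # only real failures become eval cases
        cases.append({
            "case_id": f"eval_{receipt['run_id']}",
            "workspace": receipt.get("workspace"),
            "must_pass": failed,
            "touched": [c["target"] for c in receipt.get("state_changes", [])],
            # Keep the join keys so the full trace can be pulled in later.
            "source_trace_ids": [
                a["trace_id"] for a in receipt.get("actions", []) if "trace_id" in a
            ],
        })
    return cases


receipts = [
    {
        "run_id": "run_123",
        "workspace": "repo-name",
        "actions": [{"tool": "shell", "status": "failed", "trace_id": "span_abc"}],
        "state_changes": [{"type": "file_write", "target": "src/app.ts"}],
        "checks": [{"name": "unit tests", "status": "failed"}],
    },
    {
        "run_id": "run_124",
        "workspace": "repo-name",
        "checks": [{"name": "unit tests", "status": "passed"}],
    },
]

for case in eval_cases_from_receipts(receipts):
    print(case)
```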
The design line I like
My current bias is:
- traces should stay rich
- receipts should stay small
- receipts should use stable IDs to join back to traces
- receipts should preserve action classes and side effects
- receipts should record human approval outcomes
- receipts should be cheap enough to keep on by default
In other words:
The trace explains the run. The receipt lets you operate it.
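One way to picture the relationship, as a sketch under stated assumptions: the receipt is derived from the trace by keeping only tool calls and a handful of fields, with the span ID kept as the join key. The span dictionaries, their keys, and the idea that an `action_class` is already recorded on each tool span are hypothetical.

```python
import json


def receipt_from_spans(run_id: str, spans: list[dict]) -> dict:
    """Keep one small action entry per tool call; drop payloads and timing."""
    actions = []
    for span in spans:
        if span["kind"] != "tool_call":
            continue
        actions.append({
            "tool": span["name"],
            "action_class": span["action_class"],
            "command": span.get("command"),
            "status": span["status"],
            "trace_id": span["span_id"],  # stable ID used to join back
        })
    return {"run_id": run_id, "actions": actions}


def span_for(action: dict, spans: list[dict]) -> dict:
    """Follow a receipt entry back to the full span it summarizes."""
    by_id = {span["span_id"]: span for span in spans}
    return by_id[action["trace_id"]]


spans = [
    {"span_id": "span_a", "kind": "model_call", "name": "plan",
     "input": "<full prompt>", "output": "<full completion>", "tokens": 1840},
    {"span_id": "span_abc", "kind": "tool_call", "name": "shell",
     "action_class": "exec", "command": "pnpm test", "status": "failed",
     "stdout": "<full test log>", "duration_ms": 7300},
]

receipt = receipt_from_spans("run_123", spans)
print(json.dumps(receipt, indent=2))
print(span_for(receipt["actions"][0], spans)["stdout"])
```

The receipt drops the prompt, the completion, and the test log, but nothing is lost: `trace_id` is enough to fetch them when an investigation needs them.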
I am exploring this direction in Armorer, a local control plane for AI agents, and Armorer Guard, a runtime boundary layer for agent/tool decisions.
The open question I keep coming back to:
What is the minimum receipt that would actually help you trust, resume, or debug an agent run?
For me, the first useful version is probably:
{
  "run_id": "run_123",
  "agent": "claude-code",
  "workspace": "repo-name",
  "actions": [
    {
      "tool": "shell",
      "action_class": "exec",
      "command": "pnpm test",
      "status": "failed",
      "trace_id": "span_abc"
    }
  ],
  "state_changes": [
    {
      "type": "file_write",
      "target": "src/app.ts"
    }
  ],
  "checks": [
    {
      "name": "unit tests",
      "status": "failed"
    }
  ],
  "approvals": [],
  "outcome": "needs_review"
}
Small enough to read.
Structured enough to automate.
Linked enough to investigate.
That feels like the missing layer between raw traces and blind trust.