Runtime receipts for AI agents: a minimal schema

#ai #agents #mcp #devops

Most agent discussions still collapse into prompts, models, or frameworks.

Those matter, but the thing I keep wanting after an agent run is much simpler:

What did this agent actually do, what surface area did it touch, and what evidence do I have if I need to review or replay it?

I think agent systems need runtime receipts.

What I mean by a receipt

A runtime receipt is not a full trace, and it is not a chat transcript.

A trace tells you where execution went.

A transcript tells you what the model saw and said.

A receipt tells you what the agent took responsibility for: which actions it performed, which decisions gated them, and what state it changed.

The minimal shape I am experimenting with looks like this:

{
  "receipt_id": "rcpt_01",
  "run_id": "run_01",
  "parent_run_id": null,
  "agent": {
    "name": "local-coding-agent",
    "version": "0.1.19"
  },
  "task": {
    "summary": "Update a local app configuration",
    "scope": "repo-local files only"
  },
  "tools": [
    {
      "name": "filesystem.write",
      "server": "local-tools",
      "action_class": "write",
      "target": "config/app.json",
      "decision": "allowed"
    }
  ],
  "checks": [
    {
      "name": "config validation",
      "status": "passed"
    }
  ],
  "state_changes": [
    {
      "type": "file_update",
      "target": "config/app.json"
    }
  ],
  "outcome": {
    "status": "completed",
    "recovery": null
  }
}

The important bit is not this exact JSON. It is the idea that every run leaves behind a compact operational artifact.
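As a minimal sketch, the shape above could be modeled with Python dataclasses so receipts are built and serialized the same way every time. The class names (`Receipt`, `ToolCall`, `Check`, `StateChange`) and the `to_json` helper are illustrative assumptions, not part of Armorer.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ToolCall:
    name: str
    server: str
    action_class: str  # "write" in the example above; other values are up to the schema
    target: str
    decision: str      # "allowed" in the example above


@dataclass
class Check:
    name: str
    status: str


@dataclass
class StateChange:
    type: str
    target: str


@dataclass
class Receipt:
    receipt_id: str
    run_id: str
    agent: dict
    task: dict
    parent_run_id: Optional[str] = None
    tools: List[ToolCall] = field(default_factory=list)
    checks: List[Check] = field(default_factory=list)
    state_changes: List[StateChange] = field(default_factory=list)
    outcome: dict = field(
        default_factory=lambda: {"status": "completed", "recovery": None}
    )

    def to_json(self) -> str:
        # sort_keys gives a stable key order, which keeps receipts easy to diff.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# Usage: rebuild the example receipt shown above.
receipt = Receipt(
    receipt_id="rcpt_01",
    run_id="run_01",
    agent={"name": "local-coding-agent", "version": "0.1.19"},
    task={"summary": "Update a local app configuration",
          "scope": "repo-local files only"},
    tools=[ToolCall("filesystem.write", "local-tools", "write",
                    "config/app.json", "allowed")],
    checks=[Check("config validation", "passed")],
    state_changes=[StateChange("file_update", "config/app.json")],
)
print(receipt.to_json())
```

Typed records are only one option. A JSON Schema or plain dictionaries would work just as well, as long as every run emits the same fields.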

Why this matters

Once agents use tools, MCP servers, browsers, terminals, queues, and background jobs, the final answer is not enough.

For production or even serious local workflows, I want to know:

Which tool calls were read-only versus state-changing?
Which checks ran, and which were skipped?
Was an action approved, blocked, retried, or escalated?
Did the agent touch local files, network, browser state, or a remote API?
Can I compare this run against the last successful run?
Can an evaluator score operational behavior, not just the final message?
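Most of these questions can be answered mechanically once receipts exist. The sketch below assumes receipts are plain dictionaries in the shape shown earlier. The values `"read"`, `"blocked"`, and `"skipped"`, and the tool names `filesystem.read` and `shell.exec`, are hypothetical; the example receipt above only shows `"write"`, `"allowed"`, and `"passed"`.

```python
def summarize(receipt: dict) -> dict:
    """Answer the review questions from a single receipt."""
    tools = receipt.get("tools", [])
    checks = receipt.get("checks", [])
    changes = receipt.get("state_changes", [])
    return {
        "read_only_calls": [t["name"] for t in tools if t["action_class"] == "read"],
        "state_changing_calls": [t["name"] for t in tools if t["action_class"] != "read"],
        "non_allowed_decisions": [
            (t["name"], t["decision"]) for t in tools if t["decision"] != "allowed"
        ],
        "skipped_checks": [c["name"] for c in checks if c["status"] == "skipped"],
        "changed_targets": sorted({c["target"] for c in changes}),
    }


def new_targets(current: dict, baseline: dict) -> list:
    """Targets changed in this run that the baseline run did not touch."""
    before = {c["target"] for c in baseline.get("state_changes", [])}
    after = {c["target"] for c in current.get("state_changes", [])}
    return sorted(after - before)


# Usage: compare a new run against the last successful one.
baseline = {
    "tools": [{"name": "filesystem.write", "action_class": "write",
               "target": "config/app.json", "decision": "allowed"}],
    "checks": [{"name": "config validation", "status": "passed"}],
    "state_changes": [{"type": "file_update", "target": "config/app.json"}],
}
current = {
    "tools": [
        {"name": "filesystem.read", "action_class": "read",
         "target": "config/app.json", "decision": "allowed"},
        {"name": "filesystem.write", "action_class": "write",
         "target": "config/app.json", "decision": "allowed"},
        {"name": "shell.exec", "action_class": "write",
         "target": "deploy.sh", "decision": "blocked"},
    ],
    "checks": [{"name": "config validation", "status": "skipped"}],
    "state_changes": [
        {"type": "file_update", "target": "config/app.json"},
        {"type": "file_update", "target": "config/secrets.json"},
    ],
}
summary = summarize(current)
print(summary)
print(new_targets(current, baseline))
```

An evaluator could score the `summary` dictionary directly, for example by penalizing skipped checks or targets outside the declared task scope.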

This becomes especially useful when you have multiple agents. The parent agent may say "done", but the receipt graph, built by linking each receipt to its parent through parent_run_id, should show which child agents ran, what each one changed, and where the system had to recover.
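One way to picture the receipt graph is a tree keyed on `parent_run_id`. This is a rough sketch under the assumption that every child receipt records its parent's `run_id`. The function names and the recovery string are made up for illustration.

```python
from collections import defaultdict


def index_children(receipts: list) -> dict:
    """Group receipts by the run that spawned them (None for root runs)."""
    children = defaultdict(list)
    for receipt in receipts:
        children[receipt.get("parent_run_id")].append(receipt)
    return children


def render(children: dict, parent_run_id=None, depth: int = 0) -> list:
    """Depth-first listing: one indented line per run."""
    lines = []
    for receipt in children.get(parent_run_id, []):
        changed = [c["target"] for c in receipt.get("state_changes", [])]
        outcome = receipt["outcome"]
        lines.append(
            f"{'  ' * depth}{receipt['run_id']} status={outcome['status']} "
            f"changed={changed} recovery={outcome['recovery']}"
        )
        lines.extend(render(children, receipt["run_id"], depth + 1))
    return lines


# Usage: one parent run with two child runs, one of which had to recover.
receipts = [
    {"run_id": "run_01", "parent_run_id": None, "state_changes": [],
     "outcome": {"status": "completed", "recovery": None}},
    {"run_id": "run_02", "parent_run_id": "run_01",
     "state_changes": [{"type": "file_update", "target": "config/app.json"}],
     "outcome": {"status": "completed", "recovery": None}},
    {"run_id": "run_03", "parent_run_id": "run_01", "state_changes": [],
     "outcome": {"status": "completed", "recovery": "retried after timeout"}},
]
tree_lines = render(index_children(receipts))
print("\n".join(tree_lines))
```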

Where this fits with traces

I do not see receipts as a replacement for OpenTelemetry or framework traces.

They sit beside them.

Use traces for timing, spans, retries, and execution shape.

Use receipts for capability surface, decisions, state changes, approvals, and review evidence.

The useful bridge between the two is a shared set of IDs:

run_id
trace_id
span_id
tool_call_id
policy_decision_id
artifact_id

That lets a dashboard move between "what happened technically?" and "what did the agent take responsibility for?"
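As a sketch of that bridge, each tool entry in a receipt could carry the IDs of the trace data it corresponds to. The `links` field, its placement on the tool entry, and the ID values below are assumptions; the schema shown earlier does not define where these IDs live.

```python
def index_tool_calls_by_span(receipts: list) -> dict:
    """Map span_id -> (receipt_id, tool name) so a trace view can jump to the receipt."""
    index = {}
    for receipt in receipts:
        for tool in receipt.get("tools", []):
            span_id = tool.get("links", {}).get("span_id")
            if span_id is not None:
                index[span_id] = (receipt["receipt_id"], tool["name"])
    return index


# Usage: a receipt whose tool call is linked to a trace span.
linked_receipt = {
    "receipt_id": "rcpt_01",
    "run_id": "run_01",
    "tools": [
        {
            "name": "filesystem.write",
            "decision": "allowed",
            "links": {  # hypothetical field holding the correlation IDs
                "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
                "span_id": "00f067aa0ba902b7",
                "tool_call_id": "call_01",
                "policy_decision_id": "pd_01",
            },
        }
    ],
}
span_index = index_tool_calls_by_span([linked_receipt])
print(span_index["00f067aa0ba902b7"])
```

The lookup works in the other direction too: starting from a receipt, the stored `trace_id` and `span_id` are enough to open the matching span in whatever tracing backend is in use.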

Why I am building around this

I am working on Armorer as a local ops layer for AI agents, and Armorer Guard as the boundary layer that can reason about tool actions and decisions.

The current direction is:

Armorer tracks runs, jobs, setup state, and recovery.
Guard produces structured decisions around action boundaries.
Receipts become the common artifact that connects agent runs, evals, approvals, and debugging.

The GitHub discussion for the evolving receipt shape is here:

https://github.com/ArmorerLabs/Armorer/discussions/43

And the repo is here:

https://github.com/ArmorerLabs/Armorer

Open questions

I am still working through a few design choices:

Should receipts be emitted by the agent framework, a wrapper, or the control plane?
How small can the schema stay before it becomes too vague?
Should MCP tools advertise action classes directly in metadata?
How much state-change detail is useful without creating a privacy problem?
Should eval harnesses consume receipts as first-class inputs?
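One data point for the MCP question: the MCP specification already defines optional tool annotations such as `readOnlyHint` and `destructiveHint`. The sketch below derives a coarse action class from them. The class names `read`, `write`, and `destructive` and the function name are assumptions for illustration. The hints are advisory, so a receipt emitter would probably still apply its own policy on top.

```python
from typing import Optional


def action_class_from_annotations(annotations: Optional[dict]) -> str:
    """Derive a coarse action class from MCP tool annotation hints.

    Missing hints fall back to the spec's defaults: not read-only,
    and destructive unless stated otherwise.
    """
    annotations = annotations or {}
    if annotations.get("readOnlyHint", False):
        return "read"
    if annotations.get("destructiveHint", True):
        return "destructive"
    return "write"


# Usage
print(action_class_from_annotations({"readOnlyHint": True}))
print(action_class_from_annotations({"readOnlyHint": False, "destructiveHint": False}))
print(action_class_from_annotations(None))
```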

My current bias: receipts should be boring, append-only, and easy to diff.
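A minimal sketch of what "boring, append-only, and easy to diff" could look like in practice: one JSON object per line in a log file, written with sorted keys, and compared with the standard library's `difflib`. The file name, function names, and the `"failed"` status are assumptions for illustration.

```python
import difflib
import json
import os
import tempfile


def append_receipt(path: str, receipt: dict) -> None:
    """Append one receipt as a single JSON line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(receipt, sort_keys=True) + "\n")


def load_receipts(path: str) -> list:
    """Read every receipt back in the order it was appended."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def diff_receipts(old: dict, new: dict) -> list:
    """Line-based diff of two receipts rendered with a stable key order."""
    a = json.dumps(old, sort_keys=True, indent=2).splitlines()
    b = json.dumps(new, sort_keys=True, indent=2).splitlines()
    return list(difflib.unified_diff(
        a, b, fromfile=old["receipt_id"], tofile=new["receipt_id"], lineterm=""
    ))


# Usage: append two runs to one log, then compare them.
log_path = os.path.join(tempfile.mkdtemp(), "receipts.jsonl")
append_receipt(log_path, {"receipt_id": "rcpt_01",
                          "outcome": {"status": "completed", "recovery": None}})
append_receipt(log_path, {"receipt_id": "rcpt_02",
                          "outcome": {"status": "failed", "recovery": None}})
previous, latest = load_receipts(log_path)
receipt_diff = diff_receipts(previous, latest)
print("\n".join(receipt_diff))
```

Sorting keys before writing is what makes the diff stable: two receipts with the same content always serialize to the same text, so any line in the diff reflects a real difference between runs.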

If agents are going to act on our behalf, "it said it completed the task" is too weak. We need a small artifact that says what actually happened.