Armorer Labs

Posted on Jun 1

Five Fields AI Agent Run Receipts Probably Need

#observability

I have been asking agent builders what they want in a run receipt after an AI agent finishes a task.

The answers were better than my original schema.

The common theme:

A receipt should not be a transcript summary. It should help someone decide whether the run can be trusted, resumed, reverted, or handed to another agent.

Here are the five fields that kept coming up.

1. Resume verdict

Outcome is not enough.

An agent can finish with code changes in place while the run is still unsafe to continue, for example because a migration it depends on was never checked.

A useful receipt should say whether the next human or agent can safely resume:

{
  "resume_verdict": "complete | partial | unsafe_to_resume",
  "next_safe_action": "run migration check before continuing",
  "resume_from": "commit-or-checkpoint-id"
}

This field answers:

can I keep going?
should I review first?
should I roll back?
should the next agent start from a specific checkpoint?
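As a rough illustration of how a consumer might act on these three fields, here is a minimal Python sketch. The function name `plan_resume`, the decision labels, and the choice to treat a missing or unknown verdict as unsafe are all my assumptions for this example, not part of any existing tool.

```python
from typing import Optional, TypedDict


class ResumePlan(TypedDict):
    decision: str               # "done", "resume", or "review_or_roll_back"
    start_from: Optional[str]   # checkpoint or commit to resume from, if any
    first_do: Optional[str]     # action to take before continuing, if any


def plan_resume(receipt: dict) -> ResumePlan:
    """Turn the resume fields of a receipt into a next step (hypothetical policy)."""
    verdict = receipt.get("resume_verdict")
    if verdict == "complete":
        return {"decision": "done", "start_from": None, "first_do": None}
    if verdict == "partial":
        return {
            "decision": "resume",
            "start_from": receipt.get("resume_from"),
            "first_do": receipt.get("next_safe_action"),
        }
    # "unsafe_to_resume", a missing verdict, or an unknown value:
    # assume the conservative path and stop for review.
    return {"decision": "review_or_roll_back", "start_from": None, "first_do": None}
```

The point of the sketch is that the next agent never has to read the transcript to decide whether to continue. It only reads the verdict.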

2. Claimed vs verified

Agent transcripts often blur the line between what the agent changed and what it proved.

A receipt should keep that boundary explicit:

{
  "claim": "API now requires header X",
  "verifier": "integration-test api-auth",
  "status": "passed | failed | skipped",
  "assumption": "client rollout not verified"
}

This is especially useful when resuming a coding run.

The next agent should not inherit an unverified claim as fact.
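One way to enforce that boundary when handing a run over is to split the claims before the next agent sees them. This is a minimal Python sketch, assuming claims are dicts shaped like the example above; the helper name `split_claims` is hypothetical.

```python
def split_claims(claims: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate claims whose verifier passed from everything else.

    Only status == "passed" counts as verified here. "failed", "skipped",
    or a missing status all stay on the unverified side.
    """
    verified: list[dict] = []
    unverified: list[dict] = []
    for claim in claims:
        if claim.get("status") == "passed":
            verified.append(claim)
        else:
            unverified.append(claim)
    return verified, unverified
```

A resuming agent could then be given the verified list as context and the unverified list as a to-do list.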

3. Assumptions the agent could not verify

This may be one of the highest-value fields for people who work with agents every day.

Agents make assumptions constantly:

a database column exists
a migration ran
a service is reachable
a test covers the changed behavior
a user workflow still works

The problem is not that assumptions exist.

The problem is that they disappear into the transcript.

A receipt should surface them directly.
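A minimal Python sketch of surfacing them, assuming assumptions are stored in an `assumption` field on each claim as in the example under "Claimed vs verified". The function name `surface_assumptions` is hypothetical.

```python
def surface_assumptions(receipt: dict) -> list[str]:
    """Collect every non-empty assumption from the receipt's claims.

    Order is preserved and duplicates are dropped, so the result can be
    shown at the top of a handoff note.
    """
    found: list[str] = []
    for claim in receipt.get("claims", []):
        text = (claim.get("assumption") or "").strip()
        if text and text not in found:
            found.append(text)
    return found
```

Usage could be as small as printing the list before anything else in the handoff, so the assumptions are read before the diff.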

4. Tried and reverted

Long agent runs often contain dead ends.

The final diff hides those attempts.

But when another human or agent resumes the task, the dead ends matter. If they are not recorded, the next session can spend time rediscovering the same failed approach.

Example:

{
  "approach": "changed middleware order",
  "reason": "broke auth fallback",
  "reverted": true
}

This is not just history. It is useful working memory.
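To show how that working memory might be consulted, here is a minimal Python sketch. It assumes the receipt keeps these entries in a `tried_and_reverted` list and uses a naive case-insensitive exact match; the helper name `already_tried` is hypothetical, and a real system would likely need fuzzier matching.

```python
from typing import Optional


def already_tried(receipt: dict, proposed_approach: str) -> Optional[str]:
    """Return the recorded failure reason if this approach was tried before.

    Returns None when there is no matching entry. Matching is a simple
    case-insensitive comparison of the "approach" text.
    """
    needle = proposed_approach.strip().lower()
    for attempt in receipt.get("tried_and_reverted", []):
        if attempt.get("approach", "").strip().lower() == needle:
            return attempt.get("reason")
    return None
```

A resuming agent could call this before committing to a plan and skip, or at least justify, anything that returns a reason.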

5. Programmatic evidence

The receipt should not be something the model writes about itself.

Where possible, it should be assembled from the harness, runtime, git, CI, tool layer, and approval system.

Git already tells us what changed.

CI already tells us which checks passed.

The agent receipt should join that with agent-specific evidence:

tools used
action classes
approvals requested
approvals granted or denied
skipped checks
external side effects
trace IDs
next safe action

Claude summarizing its own work can be helpful, but it should not be the source of truth.
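Here is a minimal Python sketch of that idea: the receipt is built from records the harness already holds, and the git fields come from git itself. The function names and the shape of the CI, tool, and approval records are assumptions for illustration. The only external calls are `git rev-parse --abbrev-ref HEAD` and `git rev-parse HEAD`.

```python
import subprocess


def read_git_state(repo_dir: str) -> dict:
    """Ask git, not the model, which branch and commit the run ended on."""
    def git(*args: str) -> str:
        result = subprocess.run(
            ["git", *args], cwd=repo_dir,
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    return {
        "branch": git("rev-parse", "--abbrev-ref", "HEAD"),
        "head_commit": git("rev-parse", "HEAD"),
    }


def assemble_receipt(run_id, git_state, ci_checks, tool_events, approvals) -> dict:
    """Join evidence from git, CI, the tool layer, and the approval system.

    Nothing here is model-written text. Every value is copied from a
    record produced by the harness (record shapes are assumed).
    """
    action_keys = ("tool", "action_class", "command", "status", "trace_id")
    return {
        "run_id": run_id,
        "git": dict(git_state),
        "actions": [{k: e.get(k) for k in action_keys} for e in tool_events],
        "checks": [dict(c) for c in ci_checks],
        "skipped_checks": [c["name"] for c in ci_checks if c.get("status") == "skipped"],
        "approvals": [dict(a) for a in approvals],
    }
```

Any model-written summary can still sit next to this, but as a separate, clearly labeled field.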

A small schema sketch

This is the shape I am leaning toward now:

{
  "run_id": "run_123",
  "workspace": "repo-name",
  "git": {
    "branch": "feature/x",
    "base_commit": "abc",
    "head_commit": "def"
  },
  "resume_verdict": "partial",
  "next_safe_action": "run migration check",
  "actions": [
    {
      "tool": "shell",
      "action_class": "exec",
      "command": "pnpm test",
      "status": "failed",
      "trace_id": "span_1"
    }
  ],
  "claims": [
    {
      "claim": "API now requires header X",
      "verifier": "integration-test api-auth",
      "status": "skipped",
      "assumption": "not verified against staging"
    }
  ],
  "tried_and_reverted": [
    {
      "approach": "changed middleware order",
      "reason": "broke auth fallback"
    }
  ],
  "checks": [
    {
      "name": "unit tests",
      "status": "passed"
    }
  ],
  "external_state_changes": [],
  "approvals": []
}
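To make "structured enough to automate" concrete, here is a minimal Python sketch of a validator for this shape. The required keys and allowed values are taken from the sketch above. The last rule, that a `complete` verdict should not coexist with claims that did not pass, is a hypothetical consistency check I am adding for illustration.

```python
REQUIRED_KEYS = (
    "run_id", "workspace", "git", "resume_verdict", "next_safe_action",
    "actions", "claims", "tried_and_reverted", "checks",
    "external_state_changes", "approvals",
)
ALLOWED_VERDICTS = {"complete", "partial", "unsafe_to_resume"}
ALLOWED_CLAIM_STATUSES = {"passed", "failed", "skipped"}


def validate_receipt(receipt: dict) -> list[str]:
    """Return a list of problems; an empty list means the receipt looks sane."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in receipt]

    verdict = receipt.get("resume_verdict")
    if verdict not in ALLOWED_VERDICTS:
        problems.append(f"unknown resume_verdict: {verdict!r}")

    claims = receipt.get("claims", [])
    for index, claim in enumerate(claims):
        if claim.get("status") not in ALLOWED_CLAIM_STATUSES:
            problems.append(f"claims[{index}] has unknown status: {claim.get('status')!r}")

    # Hypothetical rule: a run should not be "complete" while claims are unverified.
    if verdict == "complete" and any(c.get("status") != "passed" for c in claims):
        problems.append("verdict is 'complete' but some claims did not pass")

    return problems
```

A harness could refuse to hand a run to another agent until this returns an empty list.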

Small enough to read.

Structured enough to automate.

Grounded enough to trust.

That is the kind of receipt I want around local coding agents, MCP-heavy workflows, and multi-agent systems.

I am exploring this in Armorer and Armorer Guard, but I think the pattern is bigger than one tool.

The open question I still have:

Should the resume verdict live only at the whole-run level, or should every subtask have its own verdict too?
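If every subtask did carry its own verdict, one possible answer is to derive the run-level verdict rather than store it separately. This is a minimal Python sketch of a worst-case roll-up. The severity ordering, the function name `roll_up`, and the decision to treat an empty or unknown input as unsafe are all assumptions, not a settled design.

```python
SEVERITY = {"complete": 0, "partial": 1, "unsafe_to_resume": 2}


def roll_up(subtask_verdicts: list[str]) -> str:
    """Derive a whole-run verdict as the most severe subtask verdict.

    Unknown values are treated as "unsafe_to_resume", and so is an empty
    list, on the assumption that no evidence should not read as success.
    """
    normalized = [v if v in SEVERITY else "unsafe_to_resume" for v in subtask_verdicts]
    if not normalized:
        return "unsafe_to_resume"
    return max(normalized, key=SEVERITY.__getitem__)
```

Under this scheme a single unsafe subtask makes the whole run unsafe, which may be too strict for independent subtasks. That trade-off is part of the open question.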

DEV Community