Most AI agent demos optimize for the first successful run.
Production teams care about the tenth failed run.
Once you have more than one agent, the hard questions change:
- Which agent touched this file?
- Which tool call created this artifact?
- Which MCP server was available at the time?
- Which model and prompt version produced the decision?
- Did a human approve the action, or was it auto-allowed?
- What changed between the last good run and this bad one?
- Can we pause, replay, repair, or roll back the run?
That is why I think every serious agent system needs an agent run record.
What is an agent run record?
It is a compact, inspectable record of what happened during an agent run.
Not a giant log dump. Not only traces. Not a chat transcript.
A useful run record should answer:
Given this result, what exactly happened, under which configuration, with which tools, and what evidence do we have?
For a coding, browser, or MCP-backed agent, I would want at least:
- runId: the full agent or workflow run
- turnId: the user turn that triggered work
- agentId: the agent or sub-agent responsible
- model: provider, model id, and configuration
- promptVersion: template or instruction hash
- toolRegistry: which tools or MCP servers were available
- toolCallId: a stable id for every tool invocation
- sideEffect: read, write, exec, network, deploy, payment, etc.
- approvalState: not required, requested, approved, denied, expired
- inputRefs: references to inputs without storing sensitive payloads forever
- outputRefs: artifacts, files, PRs, browser actions, or generated data
- retryState: retries, timeouts, fallback model routes
- finalStatus: succeeded, failed, paused, escalated, rolled back
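For concreteness, here is a minimal TypeScript sketch of that record shape. The field names mirror the list above; the specific union values are illustrative choices, not a fixed spec.

```typescript
// Sketch of a run record type. Field names mirror the list above;
// the union values are illustrative, not a fixed spec.
type SideEffect = "read" | "write" | "exec" | "network" | "deploy" | "payment";
type ApprovalState =
  | "not_required" | "requested" | "approved" | "denied" | "expired";

interface ToolCallRecord {
  toolCallId: string;                 // stable id for every invocation
  tool: string;                       // tool name, or MCP server + tool
  sideEffect: SideEffect;
  approvalState: ApprovalState;
  inputRefs: string[];                // references, not raw payloads
  outputRefs: string[];               // files, PRs, browser actions, data
  status: "success" | "error" | "timeout";
}

interface AgentRunRecord {
  runId: string;                      // the full agent or workflow run
  turnId: string;                     // the user turn that triggered work
  agentId: string;                    // the agent or sub-agent responsible
  model: { provider: string; modelId: string; config?: Record<string, unknown> };
  promptVersion: string;              // template or instruction hash
  toolRegistry: string[];             // tools / MCP servers available at the time
  toolCalls: ToolCallRecord[];
  retryState?: { retries: number; timeouts: number; fallbackModel?: string };
  finalStatus: "succeeded" | "failed" | "paused" | "escalated" | "rolled_back";
}
```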
This sounds boring until an agent does something surprising.
Then it becomes the only thing anyone wants.
Traces are not enough
OpenTelemetry-style traces are useful. They help with latency, errors, retries, and service boundaries.
But an agent operator often needs a different object.
A trace can tell you which span was slow.
A run record should tell you:
- what the agent believed it was doing
- which tools it was allowed to use
- which actions had side effects
- which policy or approval state applied
- which artifact resulted
- whether this run differs from a known-good run
In other words:
Traces explain execution.
Run records explain responsibility.
You need both.
MCP makes this more important
MCP is great because it gives agents a common way to access tools and context.
It also means agents can suddenly interact with many more systems:
- databases
- browsers
- repos
- cloud APIs
- internal tools
- local files
- long-running services
That makes the tool boundary the operational boundary.
If a model calls an MCP tool, I want to know:
- which host/client initiated the call
- which MCP server executed it
- which exact tool schema was active
- which arguments were passed
- whether the call was read-only or had side effects
- whether approval was required
- what the tool returned
Without that, debugging becomes archaeology.
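One place to capture this is a thin wrapper at the tool boundary itself. This is a sketch, not a specific SDK: `McpClientLike` and its `callTool` signature are assumptions standing in for whatever MCP client you actually use.

```typescript
// Sketch: record every MCP tool call at the boundary where it happens.
// McpClientLike is a hypothetical stand-in for your real MCP client.
interface McpClientLike {
  serverName: string;
  callTool(name: string, args: Record<string, unknown>): Promise<unknown>;
}

interface McpCallRecord {
  host: string;                       // which host/client initiated the call
  server: string;                     // which MCP server executed it
  tool: string;                       // which exact tool was called
  args: Record<string, unknown>;      // which arguments were passed
  sideEffect: "read" | "write";       // read-only vs. side-effecting
  approvalRequired: boolean;          // whether approval was required
  resultSummary: string;              // what the tool returned, truncated
  at: string;
}

async function recordedToolCall(
  client: McpClientLike,
  host: string,
  tool: string,
  args: Record<string, unknown>,
  opts: { sideEffect: "read" | "write"; approvalRequired: boolean },
  sink: (rec: McpCallRecord) => void,
): Promise<unknown> {
  const result = await client.callTool(tool, args);
  sink({
    host,
    server: client.serverName,
    tool,
    args,
    sideEffect: opts.sideEffect,
    approvalRequired: opts.approvalRequired,
    resultSummary: JSON.stringify(result ?? null).slice(0, 200),
    at: new Date().toISOString(),
  });
  return result;
}
```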
The multi-agent version is harder
Single-agent runs are already tricky.
Multi-agent runs add handoffs.
Now you also need:
- parent and child run ids
- supervisor/sub-agent relationships
- shared state versions
- artifacts passed between agents
- cost attribution per agent
- escalation and retry ownership
If Agent A delegates to Agent B, which calls Tool C, which writes File D, the run record should preserve that chain.
Otherwise your cost dashboard says "agent run cost $8" and your logs say "tool call succeeded," but nobody can explain why the final output is wrong.
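As an illustration (not a prescribed schema), that chain can be preserved with explicit parent/child ids and artifact provenance:

```typescript
// Illustrative only: Agent A delegates to Agent B, which calls Tool C,
// which writes File D. Every link in the chain stays recoverable from ids.
const runA = {
  runId: "run_A",
  agentId: "agent-a",
  childRunIds: ["run_B"],
  costUsd: 4.9,                       // cost attribution per agent
};

const runB = {
  runId: "run_B",
  parentRunId: "run_A",               // supervisor/sub-agent relationship
  agentId: "agent-b",
  toolCalls: [
    { toolCallId: "call_C", tool: "mcp.filesystem.write", sideEffect: "write" },
  ],
  artifacts: [{ path: "file_D.md", producedBy: "call_C" }],
  costUsd: 3.1,
};
```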
What I would build first
If you are building agents today, I would start with a small run-record schema before adding more autonomy.
The first version can be simple:
```json
{
  "runId": "run_123",
  "turnId": "turn_456",
  "agentId": "research-agent",
  "model": "provider/model",
  "toolsAvailable": ["web.search", "github.read", "mcp.filesystem.write"],
  "toolCalls": [
    {
      "toolCallId": "call_001",
      "tool": "github.read",
      "sideEffect": "read",
      "status": "success"
    },
    {
      "toolCallId": "call_002",
      "tool": "mcp.filesystem.write",
      "sideEffect": "write",
      "approvalState": "approved",
      "status": "success"
    }
  ],
  "artifacts": ["notes.md"],
  "finalStatus": "succeeded"
}
```
Then make it easy to ask:
- show me all runs that wrote files
- show me all runs with denied approvals
- show me all runs that used this MCP server
- compare this failed run to the last successful one
- show me every artifact created by this user turn
That is when agents start to feel operable.
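A sketch of those queries, assuming records shaped like the JSON example above and loaded into a plain array (`runs` here is a hypothetical stand-in for however you store them):

```typescript
// Sketch: once run records share a schema, these questions become plain filters.
// The Run shape mirrors the JSON example above; `runs` is a stand-in for storage.
interface Run {
  runId: string;
  turnId: string;
  toolsAvailable: string[];
  toolCalls: { tool: string; sideEffect: string; approvalState?: string; status: string }[];
  artifacts: string[];
  finalStatus: string;
}

declare const runs: Run[];

// all runs that wrote files
const wroteFiles = runs.filter(r =>
  r.toolCalls.some(c => c.sideEffect === "write" && c.status === "success"));

// all runs with denied approvals
const deniedApprovals = runs.filter(r =>
  r.toolCalls.some(c => c.approvalState === "denied"));

// all runs that used a given MCP server
const usedServer = (server: string) =>
  runs.filter(r => r.toolsAvailable.some(t => t.startsWith(server + ".")));

// every artifact created by a given user turn
const artifactsForTurn = (turnId: string) =>
  runs.filter(r => r.turnId === turnId).flatMap(r => r.artifacts);
```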
Where Armorer fits
This is the direction we are building toward with Armorer.
Armorer is a local control plane for AI agents. The goal is to make agent runs, tools, approvals, sandboxes, audit trails, and artifacts inspectable on your own machine instead of treating every agent as an opaque chat window.
Repo: https://github.com/ArmorerLabs/Armorer
The bet is simple:
As agents get more capable, the bottleneck moves from "can it do the task?" to "can I understand, govern, and repair what it did?"
That layer is still early.
But I think it is where a lot of practical agent engineering is heading.