Most AI agent demos optimize for the first successful run.
Production teams care about the tenth failed run.
Once you have more than one agent, the hard questions change:
- Which agent touched this file?
- Which tool call created this artifact?
- Which MCP server was available at the time?
- Which model and prompt version produced the decision?
- Did a human approve the action, or was it auto-allowed?
- What changed between the last good run and this bad one?
- Can we pause, replay, repair, or roll back the run?
That is why I think every serious agent system needs an agent run record.
What is an agent run record?
It is a compact, inspectable record of what happened during an agent run.
Not a giant log dump. Not only traces. Not a chat transcript.
A useful run record should answer:
Given this result, what exactly happened, under which configuration, with which tools, and what evidence do we have?
For a coding, browser, or MCP-backed agent, I would want at least:
- runId: the full agent or workflow run
- turnId: the user turn that triggered work
- agentId: the agent or sub-agent responsible
- model: provider, model id, and configuration
- promptVersion: template or instruction hash
- toolRegistry: which tools or MCP servers were available
- toolCallId: a stable id for every tool invocation
- sideEffect: read, write, exec, network, deploy, payment, etc.
- approvalState: not required, requested, approved, denied, expired
- inputRefs: references to inputs without storing sensitive payloads forever
- outputRefs: artifacts, files, PRs, browser actions, or generated data
- retryState: retries, timeouts, fallback model routes
- finalStatus: succeeded, failed, paused, escalated, rolled back
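For concreteness, here is a minimal TypeScript sketch of that record shape. The field names mirror the list above; the specific union values are illustrative choices, not a fixed spec.

```typescript
// Sketch of a run record type. Field names mirror the list above;
// the union values are illustrative, not a fixed spec.
type SideEffect = "read" | "write" | "exec" | "network" | "deploy" | "payment";
type ApprovalState =
  | "not_required" | "requested" | "approved" | "denied" | "expired";

interface ToolCallRecord {
  toolCallId: string;                 // stable id for every invocation
  tool: string;                       // tool name, or MCP server + tool
  sideEffect: SideEffect;
  approvalState: ApprovalState;
  inputRefs: string[];                // references, not raw payloads
  outputRefs: string[];               // files, PRs, browser actions, data
  status: "success" | "error" | "timeout";
}

interface AgentRunRecord {
  runId: string;                      // the full agent or workflow run
  turnId: string;                     // the user turn that triggered work
  agentId: string;                    // the agent or sub-agent responsible
  model: { provider: string; modelId: string; config?: Record<string, unknown> };
  promptVersion: string;              // template or instruction hash
  toolRegistry: string[];             // tools / MCP servers available at the time
  toolCalls: ToolCallRecord[];
  retryState?: { retries: number; timeouts: number; fallbackModel?: string };
  finalStatus: "succeeded" | "failed" | "paused" | "escalated" | "rolled_back";
}
```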
This sounds boring until an agent does something surprising.
Then it becomes the only thing anyone wants.
Traces are not enough
OpenTelemetry-style traces are useful. They help with latency, errors, retries, and service boundaries.
But an agent operator often needs a different object.
A trace can tell you which span was slow.
A run record should tell you:
- what the agent believed it was doing
- which tools it was allowed to use
- which actions had side effects
- which policy or approval state applied
- which artifact resulted
- whether this run differs from a known-good run
In other words:
Traces explain execution.
Run records explain responsibility.
You need both.
MCP makes this more important
MCP is great because it gives agents a common way to access tools and context.
It also means agents can suddenly interact with many more systems:
- databases
- browsers
- repos
- cloud APIs
- internal tools
- local files
- long-running services
That makes the tool boundary the operational boundary.
If a model calls an MCP tool, I want to know:
- which host/client initiated the call
- which MCP server executed it
- which exact tool schema was active
- which arguments were passed
- whether the call was read-only or had side effects
- whether approval was required
- what the tool returned
Without that, debugging becomes archaeology.
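One place to capture this is a thin wrapper at the tool boundary itself. This is a sketch, not a specific SDK: `McpClientLike` and its `callTool` signature are assumptions standing in for whatever MCP client you actually use.

```typescript
// Sketch: record every MCP tool call at the boundary where it happens.
// McpClientLike is a hypothetical stand-in for your real MCP client.
interface McpClientLike {
  serverName: string;
  callTool(name: string, args: Record<string, unknown>): Promise<unknown>;
}

interface McpCallRecord {
  host: string;                       // which host/client initiated the call
  server: string;                     // which MCP server executed it
  tool: string;                       // which exact tool was called
  args: Record<string, unknown>;      // which arguments were passed
  sideEffect: "read" | "write";       // read-only vs. side-effecting
  approvalRequired: boolean;          // whether approval was required
  resultSummary: string;              // what the tool returned, truncated
  at: string;
}

async function recordedToolCall(
  client: McpClientLike,
  host: string,
  tool: string,
  args: Record<string, unknown>,
  opts: { sideEffect: "read" | "write"; approvalRequired: boolean },
  sink: (rec: McpCallRecord) => void,
): Promise<unknown> {
  const result = await client.callTool(tool, args);
  sink({
    host,
    server: client.serverName,
    tool,
    args,
    sideEffect: opts.sideEffect,
    approvalRequired: opts.approvalRequired,
    resultSummary: JSON.stringify(result ?? null).slice(0, 200),
    at: new Date().toISOString(),
  });
  return result;
}
```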
The multi-agent version is harder
Single-agent runs are already tricky.
Multi-agent runs add handoffs.
Now you also need:
- parent and child run ids
- supervisor/sub-agent relationships
- shared state versions
- artifacts passed between agents
- cost attribution per agent
- escalation and retry ownership
If Agent A delegates to Agent B, which calls Tool C, which writes File D, the run record should preserve that chain.
Otherwise your cost dashboard says "agent run cost $8" and your logs say "tool call succeeded," but nobody can explain why the final output is wrong.
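As an illustration (not a prescribed schema), that chain can be preserved with explicit parent/child ids and artifact provenance:

```typescript
// Illustrative only: Agent A delegates to Agent B, which calls Tool C,
// which writes File D. Every link in the chain stays recoverable from ids.
const runA = {
  runId: "run_A",
  agentId: "agent-a",
  childRunIds: ["run_B"],
  costUsd: 4.9,                       // cost attribution per agent
};

const runB = {
  runId: "run_B",
  parentRunId: "run_A",               // supervisor/sub-agent relationship
  agentId: "agent-b",
  toolCalls: [
    { toolCallId: "call_C", tool: "mcp.filesystem.write", sideEffect: "write" },
  ],
  artifacts: [{ path: "file_D.md", producedBy: "call_C" }],
  costUsd: 3.1,
};
```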
What I would build first
If you are building agents today, I would start with a small run-record schema before adding more autonomy.
The first version can be simple:
```json
{
  "runId": "run_123",
  "turnId": "turn_456",
  "agentId": "research-agent",
  "model": "provider/model",
  "toolsAvailable": ["web.search", "github.read", "mcp.filesystem.write"],
  "toolCalls": [
    {
      "toolCallId": "call_001",
      "tool": "github.read",
      "sideEffect": "read",
      "status": "success"
    },
    {
      "toolCallId": "call_002",
      "tool": "mcp.filesystem.write",
      "sideEffect": "write",
      "approvalState": "approved",
      "status": "success"
    }
  ],
  "artifacts": ["notes.md"],
  "finalStatus": "succeeded"
}
```
Then make it easy to ask:
- show me all runs that wrote files
- show me all runs with denied approvals
- show me all runs that used this MCP server
- compare this failed run to the last successful one
- show me every artifact created by this user turn
That is when agents start to feel operable.
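A sketch of those queries, assuming records shaped like the JSON example above and loaded into a plain array (`runs` here is a hypothetical stand-in for however you store them):

```typescript
// Sketch: once run records share a schema, these questions become plain filters.
// The Run shape mirrors the JSON example above; `runs` is a stand-in for storage.
interface Run {
  runId: string;
  turnId: string;
  toolsAvailable: string[];
  toolCalls: { tool: string; sideEffect: string; approvalState?: string; status: string }[];
  artifacts: string[];
  finalStatus: string;
}

declare const runs: Run[];

// all runs that wrote files
const wroteFiles = runs.filter(r =>
  r.toolCalls.some(c => c.sideEffect === "write" && c.status === "success"));

// all runs with denied approvals
const deniedApprovals = runs.filter(r =>
  r.toolCalls.some(c => c.approvalState === "denied"));

// all runs that used a given MCP server
const usedServer = (server: string) =>
  runs.filter(r => r.toolsAvailable.some(t => t.startsWith(server + ".")));

// every artifact created by a given user turn
const artifactsForTurn = (turnId: string) =>
  runs.filter(r => r.turnId === turnId).flatMap(r => r.artifacts);
```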
Where Armorer fits
This is the direction we are building toward with Armorer.
Armorer is a local control plane for AI agents. The goal is to make agent runs, tools, approvals, sandboxes, audit trails, and artifacts inspectable on your own machine instead of treating every agent as an opaque chat window.
Repo: https://github.com/ArmorerLabs/Armorer
The bet is simple:
As agents get more capable, the bottleneck moves from "can it do the task?" to "can I understand, govern, and repair what it did?"
That layer is still early.
But I think it is where a lot of practical agent engineering is heading.