Everyone is comparing agent frameworks: LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Claude Code, Codex, MCP routers, custom harnesses.
That comparison matters, but it misses the layer that starts hurting once the demo works.
The framework creates the workflow. It does not automatically answer:
- what is installed and running locally?
- which tools, MCP servers, skills, and providers are mounted?
- what repo, files, or workspace state were in scope?
- what did the agent change?
- which actions created side effects?
- which actions required approval, warning, redaction, block, or review?
- what evidence came from tests, evals, traces, or browser checks?
- what can be retried, resumed, rolled back, or cleaned up safely?
That is the layer we are building Armorer for: a local control plane around agents.
The split we are converging on:
- Armorer: sessions, jobs, tool inventory, config, approvals, run records, and recovery
- Armorer Guard: fast runtime decisions on proposed tool calls and model/tool-output transitions
The goal is not to replace agent frameworks. It is to make agents operable once they exist.
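To make "runtime decisions on proposed tool calls" concrete, here is a minimal sketch of what such a check could look like. This is not Armorer Guard's actual API: `Verdict`, `ToolCall`, `Rule`, `Decision`, `decide`, and the two example rules are all hypothetical names invented for illustration, and first-match-wins is just one possible evaluation order.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    # Hypothetical outcome set, loosely following the post's list
    # (approval, warning, block); not taken from Armorer Guard.
    ALLOW = "allow"
    WARN = "warn"
    REQUIRE_APPROVAL = "require_approval"
    BLOCK = "block"


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict


@dataclass(frozen=True)
class Rule:
    rule_id: str
    matches: Callable[[ToolCall], bool]
    verdict: Verdict


@dataclass(frozen=True)
class Decision:
    verdict: Verdict
    rule_id: Optional[str]  # None when no rule matched


def decide(call: ToolCall, rules: list[Rule]) -> Decision:
    """Return the verdict of the first matching rule, else ALLOW."""
    for rule in rules:
        if rule.matches(call):
            return Decision(rule.verdict, rule.rule_id)
    return Decision(Verdict.ALLOW, None)


# Illustrative rules only (assumed tool names "shell" and "write_file").
RULES = [
    Rule(
        "no-recursive-delete",
        lambda c: c.tool == "shell" and "rm -rf" in c.args.get("command", ""),
        Verdict.BLOCK,
    ),
    Rule(
        "approve-file-writes",
        lambda c: c.tool == "write_file",
        Verdict.REQUIRE_APPROVAL,
    ),
]

decision = decide(ToolCall("shell", {"command": "rm -rf build/"}), RULES)
print(decision.verdict.value, decision.rule_id)
```

Returning the matching `rule_id` alongside the verdict means the decision can be stored and reviewed later, not just enforced in the moment.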
The artifact I keep coming back to is a run receipt.
A useful agent run receipt should capture:
- the agent/app, version, and config
- the mounted tools, MCP servers, skills, and providers
- the workspace/repo/files in scope
- checkpoints before and after the run
- tool calls and side effects
- approval and review decisions
- test/eval/check evidence
- retry, resume, rollback, and cleanup state
Without this, debugging agent runs turns into transcript archaeology.
With it, operating agents starts to feel more like operating software again.
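As a rough sketch, the receipt fields listed above could be modeled as a plain data structure that serializes to JSON. Everything here is an assumption for illustration: the class names, field names, and example values are hypothetical and do not reflect Armorer's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToolCallRecord:
    tool: str
    args: dict
    side_effects: list[str] = field(default_factory=list)


@dataclass
class ApprovalRecord:
    action: str
    decision: str  # e.g. "approved", "blocked", "redacted"
    decided_by: str


@dataclass
class RunReceipt:
    # Hypothetical schema: one field per bullet in the list above.
    agent: str
    agent_version: str
    config: dict
    mounted: dict            # tools, MCP servers, skills, providers
    workspace: dict          # repo, files, or other state in scope
    checkpoint_before: str
    checkpoint_after: str
    tool_calls: list[ToolCallRecord] = field(default_factory=list)
    approvals: list[ApprovalRecord] = field(default_factory=list)
    evidence: list[dict] = field(default_factory=list)   # tests, evals, checks
    recovery: dict = field(default_factory=dict)          # retry/rollback state

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Example values are made up for the sketch.
receipt = RunReceipt(
    agent="example-agent",
    agent_version="0.1.0",
    config={"model": "example-model"},
    mounted={"tools": ["shell"], "mcp_servers": [], "skills": [], "providers": []},
    workspace={"repo": "example/repo", "files": ["README.md"]},
    checkpoint_before="abc123",
    checkpoint_after="def456",
    tool_calls=[ToolCallRecord("shell", {"command": "pytest"}, ["wrote .pytest_cache"])],
    approvals=[ApprovalRecord("shell:pytest", "approved", "human")],
    evidence=[{"kind": "test", "passed": True}],
    recovery={"rollback_to": "abc123", "resumable": False},
)
print(receipt.to_json())
```

Keeping the receipt as flat, serializable data makes it easy to store next to the run and to diff across runs.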
Repos:
- Armorer: https://github.com/ArmorerLabs/Armorer
- Armorer Guard: https://github.com/ArmorerLabs/Armorer-Guard
Questions I would love feedback on:
- What is the minimum useful run receipt for an agent session?
- Which approval events should become first-class history?
- Where should MCP/tool metadata stop and runtime policy begin?
- What recovery action do you wish your agent harness exposed after a bad run?
Top comments (7)
I'd argue receipts and containment are two halves of the same maturity step: receipts tell you what the agent did, containment bounds what it could have done. Frameworks give you neither by default.
Curious if you've found a clean way to make receipts tamper-evident. That's where I keep getting stuck.
On "where should MCP/tool metadata stop and runtime policy begin": the boundary gets clearer when tools return structured, verifiable payloads instead of prose.
On Helium MCP (finance/news), a news-bias call returns 37 numeric dimensions per source plus a corpus timestamp; a top-strategies call returns forecast %, lower/upper bounds, and a resolution date. Those map cleanly to run receipts: tool name, params, numeric outputs, corpus freshness.
We built browser dashboards on the same CORS-open REST endpoints for human review before trusting agent output:
connerlambden.github.io/helium-new...
Curious if you'd model receipt schema differently for numeric vs narrative tool outputs.
Great breakdown of the control plane gap. I ran into the same problem: frameworks handle the workflow but don't give you fine-grained control over when agents should think vs. act. I built a small hook-based plugin called Brainstorm-Mode (mehmetcanfarsak/Brainstorm-Mode on GitHub) that uses PreToolUse hooks to block tool calls during ideation. Three modes (divergent, actionable, academic) keep the agent in the right headspace instead of jumping straight to execution. It feels like a natural extension of the runtime decision layer you described.
That is a nice example of the same boundary at a smaller scope. I like the “think vs act” split because it turns a fuzzy instruction into a runtime mode the agent has to respect. The next useful step might be recording which mode was active when a tool call was blocked, so the hook decision becomes reviewable later too.
On the minimum-useful question: the filter is whatever you can't reconstruct later. Half your eight items are recoverable after the fact: tool inventory, config, and workspace state can mostly be re-derived. The three you can't recover are the diff (what the agent changed), the before/after checkpoint (so you can actually roll back, not just read about it), and the approval/block decision together with the rule that produced it. That last pairing is the one teams drop: a receipt that records "approved" without recording what it was approved against is a receipt you can't audit, because you can't tell whether Guard checked the right thing or nothing at all. That ties your two components together. Guard makes the decision; the receipt is only trustworthy if it captures the decision and its triggering rule, not just the outcome. Outcome without the rule is transcript archaeology with extra steps.
That filter is very helpful: whatever you cannot reconstruct later. I agree the approval/block result is not enough by itself; the triggering rule and the evaluated artifact need to travel with it. Otherwise the record proves a decision happened, but not whether the right decision procedure ran.