DEV Community

Armorer Labs

Multi-agent runs need a handoff receipt, not just a shared trace

When a single agent does something dangerous, the audit problem is small. You have one run, one set of tool calls, one receipt stream, and one place to ask who, what, and why.

When a team of agents works on the same task, the audit problem is suddenly much harder, and the most common reaction is to glue everything together with a shared log. That is usually the wrong answer.

The thing that breaks first in a multi-agent run

In our own work, the thing that breaks first is not tool correctness. It is the handoff.

Concretely: agent A reads a ticket, plans a fix, and decides that the actual file edits should be done by agent B because B has the right tooling and a tighter permission scope. A asks B to do the edits. B does them. The user wakes up the next morning, looks at the PR, and asks "who changed this and why?"

If A and B share a single trace stream, the answer is "well, A asked, and B did it, somewhere in the log." That is technically true, but it is not operationally useful. You cannot easily answer:

  • Which sub-agent produced the specific diff?
  • Which sub-agent's session held the write credential when the diff was applied?
  • If the diff is wrong, whose approval scope covered that exact write?
  • Where in the chain did an untrusted instruction get passed from A's context into B's prompt?

A shared trace hides these answers inside a single blob. A handoff receipt keeps them separate.

What a handoff receipt actually is

A handoff receipt is a small structured record produced at the moment one agent delegates work to another. At minimum it carries:

  • Parent run id and child run id
  • The exact task string the parent handed to the child (not a summary, the actual prompt)
  • The scope object the child inherited vs. the scope object the child used (often different, and the difference is the audit point)
  • The credential identity the child used to act (per-agent service account, scoped OAuth token, ephemeral key — whatever the runtime supports)
  • A pointer to the parent's reasoning trail at the moment of delegation, so reviewers can see what the parent was thinking when it chose this child for this task
  • A short list of policy decisions taken during the handoff: was the child's scope narrower than the parent's? Was the action reversible? Did the handoff itself require human approval under your tier rules?

The key idea is that the handoff is the seam between two distinct sessions, and the seam deserves its own record. If you only have a shared trace, the seam is invisible.
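As a rough illustration, the fields listed above could be captured in a record like the one below. This is a minimal sketch, not a fixed schema: the class names, field names, and the choice to model scopes as sets of permission strings are all assumptions made for the example.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class PolicyDecisions:
    """Policy outcomes recorded at the moment of handoff (hypothetical shape)."""
    scope_narrowed: bool           # was the child's scope narrower than the parent's?
    reversible: bool               # can the delegated action be undone?
    human_approval_required: bool  # did the tier rules call for a human sign-off?


@dataclass(frozen=True)
class HandoffReceipt:
    parent_run_id: str
    child_run_id: str
    task_prompt: str                 # the exact prompt handed over, not a summary
    inherited_scope: frozenset[str]  # what the child was given
    used_scope: frozenset[str]       # what the child actually exercised
    credential_id: str               # e.g. a per-agent service account name
    parent_trace_ref: str            # pointer into the parent's reasoning trail
    policy: PolicyDecisions
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def unused_scope(self) -> frozenset[str]:
        """Permissions that were granted but never exercised."""
        return self.inherited_scope - self.used_scope

    def to_json(self) -> str:
        """Serialize to one JSON line; sets become sorted lists for stable output."""
        record = asdict(self)
        record["inherited_scope"] = sorted(self.inherited_scope)
        record["used_scope"] = sorted(self.used_scope)
        return json.dumps(record, sort_keys=True)
```

Keeping inherited and used scope as separate fields makes the difference between them a one-line query rather than something a reviewer has to infer from the trace.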

Why per-sub-agent session identity matters here

This builds directly on the per-agent session identity pattern we wrote about yesterday. If every sub-agent has its own credential, its own scope object, and its own receipt stream, then a handoff is the moment those identities are explicitly related — parent run id, child run id, inherited scope, actual scope. That relation is what lets you reconstruct the chain after the fact.

If your sub-agents share a single credential and a single scope, you cannot tell whose action produced which side effect. You can only tell "the agent did it," which collapses the audit trail into a single, hard-to-investigate blob.
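To make the reconstruction step concrete, here is a small sketch that walks handoff records from a child run back to the root. It assumes each record is a dict with `parent_run_id` and `child_run_id` keys and that each child run has exactly one parent; both are assumptions for the example, not requirements of the pattern.

```python
def delegation_chain(receipts, run_id):
    """Return the handoff records leading to run_id, ordered root-first.

    receipts: iterable of dicts with "parent_run_id" and "child_run_id".
    A run with no handoff record is treated as the root of the chain.
    """
    by_child = {r["child_run_id"]: r for r in receipts}
    chain = []
    seen = set()  # guards against a malformed log that contains a cycle
    while run_id in by_child and run_id not in seen:
        seen.add(run_id)
        receipt = by_child[run_id]
        chain.append(receipt)
        run_id = receipt["parent_run_id"]
    chain.reverse()
    return chain


receipts = [
    {"parent_run_id": "run-root", "child_run_id": "run-A"},
    {"parent_run_id": "run-A", "child_run_id": "run-B"},
]
for hop in delegation_chain(receipts, "run-B"):
    print(hop["parent_run_id"], "->", hop["child_run_id"])
```

With a shared credential and a shared scope there is nothing to key this lookup on, which is the practical meaning of the audit trail collapsing into a blob.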

Where this fits alongside a guard

A policy guard that runs at the tool-call boundary still has work to do. The handoff receipt is not a replacement for tool-call receipts. They are different layers:

  • Tool-call receipt: which capability was invoked, on which target, with which arguments, and what was the policy decision.
  • Handoff receipt: which sub-agent was created or invoked, with which scope, to satisfy which part of which parent task.

A guard that only sees tool calls can answer "did this MCP call get approved?" but cannot answer "why was this sub-agent allowed to make this call at all?" That second question is where the most interesting failures live in multi-agent systems: prompt injection in a parent's context contaminating a child's tool calls, scope drift where a child quietly uses a wider scope than it was handed, and approval theater where the parent "approved" something it never had the context to evaluate.
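One of the failures named above, scope drift, becomes checkable once both layers exist. The sketch below joins tool-call receipts against handoff receipts by run id and flags calls that fall outside what the child was handed. The record shapes, the `capability` field, and the `root_run_ids` parameter are hypothetical and exist only to keep the example self-contained.

```python
def find_scope_drift(handoffs, tool_calls, root_run_ids=frozenset()):
    """Flag tool calls that are not covered by the caller's handoff receipt.

    handoffs:   dicts with "child_run_id" and "inherited_scope" (list of strings)
    tool_calls: dicts with "run_id" and "capability"
    root_run_ids: top-level runs that have no parent and are exempt here
    """
    granted = {h["child_run_id"]: set(h["inherited_scope"]) for h in handoffs}
    findings = []
    for call in tool_calls:
        run_id = call["run_id"]
        if run_id in root_run_ids:
            continue
        if run_id not in granted:
            # A sub-agent acted without any recorded delegation.
            findings.append((run_id, call["capability"], "no handoff receipt"))
        elif call["capability"] not in granted[run_id]:
            # The sub-agent used a capability it was never handed.
            findings.append((run_id, call["capability"], "outside inherited scope"))
    return findings
```

A guard at the tool-call boundary could approve every one of these calls individually and still miss the drift, because the question being asked here is about the delegation, not the call.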

A starting pattern that does not require a fork

You do not need to build a full multi-agent runtime to get value out of this. A pragmatic starting point:

  • Give every sub-agent a stable id you can search for.
  • When a sub-agent is created or invoked, write one handoff record before its first tool call.
  • When the sub-agent finishes, write a close-out record that points back to the parent run id and the resulting side effects.
  • Treat the handoff record as a first-class artifact in your run history. Make it greppable. Make it part of your post-run review checklist.

That is not glamorous, but it is the difference between "we have a shared log somewhere" and "we can answer who did what."
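The steps above can be sketched with nothing more than an append-only JSON Lines file. Everything here is illustrative: the `handoff` context manager, the record fields, and the file layout are assumptions, and a real setup would likely write to whatever run history store is already in place.

```python
import json
import uuid
from contextlib import contextmanager
from datetime import datetime, timezone


def _append(log_path, record):
    """Append one timestamped record as a single greppable JSON line."""
    record["ts"] = datetime.now(timezone.utc).isoformat()
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


@contextmanager
def handoff(log_path, parent_run_id, child_agent_id, task_prompt, scope):
    """Write a handoff record before the child acts and a close-out after."""
    # Stable agent id plus a per-run suffix keeps the run id searchable.
    child_run_id = f"{child_agent_id}-{uuid.uuid4().hex[:8]}"
    _append(log_path, {
        "kind": "handoff",
        "parent_run_id": parent_run_id,
        "child_run_id": child_run_id,
        "child_agent_id": child_agent_id,
        "task_prompt": task_prompt,
        "scope": sorted(scope),
    })
    side_effects = []  # the caller appends to this while the child works
    status = "ok"
    try:
        yield child_run_id, side_effects
    except Exception:
        status = "error"
        raise
    finally:
        # The close-out is written even if the child fails partway through.
        _append(log_path, {
            "kind": "close_out",
            "parent_run_id": parent_run_id,
            "child_run_id": child_run_id,
            "status": status,
            "side_effects": side_effects,
        })
```

Usage would look roughly like `with handoff("handoffs.jsonl", "run-A", "agent-b", prompt, {"repo:write"}) as (run_id, effects): ...`, with the child's tool calls made inside the block. Searching the file for a child run id then returns both ends of the seam.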

An open question we are still working through

Where should the handoff record be produced? Three plausible places:

  • By the orchestrating parent, as part of its planning output.
  • By the runtime that hosts the sub-agent, at the moment the sub-agent is spawned.
  • By a shared control plane that both parent and child register with.

We are currently leaning toward the runtime, because the runtime is the one place that actually knows both sides of the seam and is the natural place to enforce per-sub-agent credential and scope separation. The orchestrating parent can narrate the handoff, but it should not be the authoritative source of truth: its narration is model output, so anything injected into the parent's context could also rewrite the record.
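A sketch of what the runtime-produced option might look like follows. The `spawn_child` function and its fields are hypothetical. The point it illustrates is that the runtime computes the authoritative fields itself and stores the parent's narration as explicitly untrusted text.

```python
import uuid


def spawn_child(parent_run_id, parent_scope, requested_scope, task_prompt, parent_note):
    """Runtime-side spawn: derive the child's scope and emit the handoff record.

    The parent may request a scope and supply a free-text note, but the
    granted scope is computed here, never taken from the parent's word.
    """
    parent_scope = set(parent_scope)
    requested_scope = set(requested_scope)
    granted = parent_scope & requested_scope  # a child never exceeds its parent
    denied = requested_scope - granted
    return {
        "parent_run_id": parent_run_id,
        "child_run_id": f"run-{uuid.uuid4().hex[:8]}",
        "task_prompt": task_prompt,
        "inherited_scope": sorted(granted),
        "denied_scope": sorted(denied),
        "scope_narrowed": granted < parent_scope,  # strict subset of the parent's scope
        # Narration is kept for reviewers but marked as untrusted model output.
        "parent_note": {"trusted": False, "text": parent_note},
    }
```

Intersecting the requested scope with the parent's scope is one possible policy, chosen here only because it is easy to reason about; tier rules or human approval could sit in the same place.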

If you have seen this work well in production, we would be curious where the handoff record lives in your stack.


Disclosure: This post is from Armorer Labs. We build Armorer, a local control plane for AI agents that runs on your machine or server, and Armorer Guard, a Rust scanner that runs policy at the tool-call boundary. The handoff-receipt pattern above is the same shape we use internally, but the post is operator-level guidance rather than a product announcement. Nothing here is a benchmark, customer count, or availability claim.
