Keesan

Posted on Jul 2

Subagent Teams Need Handoff Receipts

#ai #devops #opensource #programming

The first time agent teams work well, it feels like cheating.

You give one agent the main task. You spin up another to inspect the codebase. Another checks tests. Another looks for edge cases. One researches docs. One writes a patch.

Suddenly the work feels parallel.

That feeling is addictive.

It is also where a lot of the trouble starts.

I learned this the hard way. First through earlier AI product work during the Amari AI days, then through Torram, and eventually through the much more intense loop of building with Claude, Codex, browser automation, scheduled automations, and subagent teams. We spent close to $10K across Claude and OpenAI credits, and a very real chunk of that was tuition for learning how agent orchestration actually fails.

The short version:

Subagents are useful. Subagent teams are powerful. But without handoff receipts, they quietly turn into chaos with better branding.

The problem is not delegation

I am very pro-delegation.

Claude Code's subagent model is directionally right: separate context windows, specialized roles, focused tools, and summaries back to the main session. Codex's multi-agent/worktree approach is right in the same way: parallel work, isolated tasks, repo-grounded execution, and background progress.

That is exactly how serious AI coding work should evolve.

The issue is what happens between the agents.

A child agent can start late. It can fail before doing anything useful. It can do the wrong version of the task. It can duplicate work another agent already did. It can return a confident summary that hides weak evidence. It can time out. It can keep running after the parent has moved on. It can finish correctly but leave no useful handoff.

From the outside, all of those can look similar:

"The agent is working."

That sentence is not evidence.

It is a vibe.

And vibes get expensive.

Liveness is not usefulness

One lesson we kept running into: liveness is not usefulness.

An agent can be alive and still not be moving the task forward.

It can be reading files forever. It can be stuck in a local loop. It can be rechecking the same assumption. It can be waiting on a command that should have failed fast. It can be producing a long summary of a short mistake.

This is especially painful with subagents because the parent agent often wants to keep going.

The parent says, "I dispatched a worker." Great. Did the worker start? Did it read the right files? Did it produce a patch? Did the verifier pass? Did it hit a blocker? Did it overlap with another worker? Did it leave a clean result?

If those answers are not visible, the orchestration layer is basically asking you to trust a black box inside another black box.

That is not a workflow.

That is a prayer with logs.
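One way to make the liveness/usefulness gap visible is to track two clocks per worker instead of one. This is a minimal sketch, not how any particular tool does it: `WorkerStatus`, `classify`, and both timeout values are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkerStatus:
    """Hypothetical per-worker snapshot kept by the parent (times in seconds)."""
    last_heartbeat: float  # last time the worker showed any activity at all
    last_progress: float   # last time it produced an artifact (diff, test result, finding)


def classify(status: WorkerStatus, now: float,
             heartbeat_timeout: float = 60.0,
             progress_timeout: float = 600.0) -> str:
    """Separate 'alive' from 'useful' using two independent clocks.

    The timeout defaults are assumptions; tune them per task type.
    """
    if now - status.last_heartbeat > heartbeat_timeout:
        return "dead"       # no sign of life
    if now - status.last_progress > progress_timeout:
        return "stalled"    # alive, but nothing new to show for it
    return "progressing"
```

A worker that is still emitting output but has not produced a new artifact within the progress window comes back as "stalled" rather than "working", which is the distinction the parent actually needs.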

What a handoff receipt should contain

We started thinking about every delegated task as needing a small receipt.

Not a giant report. Not a novel. Just enough state that a human or parent agent can make the next decision without guessing.

The receipt should answer:

Task: what was the worker actually asked to do?
Owner: which agent or role owned it?
Scope: which files, surfaces, or decision area did it touch?
Start proof: did it actually begin, and what context did it load?
Result: what changed or what did it learn?
Verifier: what check proves the result?
Blocker: what stopped it, if anything?
Stop reason: why is the task ending now?
Next action: what should happen next, if anything?

That is it.
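As a rough sketch, the nine questions above map onto a small record type. The class name `HandoffReceipt`, the field names, and the choice of which fields count as required are my assumptions for illustration, not a fixed schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HandoffReceipt:
    """Illustrative receipt; one field per question in the list above."""
    task: str                          # what the worker was actually asked to do
    owner: str                         # which agent or role owned it
    scope: list[str]                   # files, surfaces, or decision area touched
    start_proof: str                   # evidence it began and what context it loaded
    result: str                        # what changed or what it learned
    verifier: str                      # the check that proves the result
    stop_reason: str                   # why the task is ending now
    blocker: Optional[str] = None      # what stopped it, if anything
    next_action: Optional[str] = None  # what should happen next, if anything

    def missing_fields(self) -> list[str]:
        """Names of required fields that were left empty."""
        required = ("task", "owner", "scope", "start_proof",
                    "result", "verifier", "stop_reason")
        return [name for name in required if not getattr(self, name)]


# Hypothetical example: a worker that stopped on a blocker and reported no result.
receipt = HandoffReceipt(
    task="Check the login tests",
    owner="test-runner",
    scope=["tests/test_login.py"],
    start_proof="loaded tests/test_login.py",
    result="",
    verifier="pytest tests/test_login.py",
    stop_reason="blocked",
    blocker="missing auth token",
    next_action="ask a human for credentials",
)
print(receipt.missing_fields())  # ['result']
```

The `missing_fields` check is the useful part: a parent can refuse to accept a handoff until the list comes back empty.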

The magic is not the format. The magic is forcing the system to distinguish between states that otherwise blur together.

"Still working" is different from "blocked on auth."

"Done" is different from "patch written but untested."

"No findings" is different from "did not inspect the relevant path."

"Failed" is different from "failed because the verifier is stale."

Once those states are explicit, the parent agent can make a better call. So can the human.
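To make that concrete, here is a minimal sketch of the states as an enum, with one suggested parent action per state. The state names follow the pairs above; the action strings are my own assumptions, not a prescribed policy.

```python
from enum import Enum


class WorkerState(Enum):
    WORKING = "still working"
    BLOCKED = "blocked"
    UNTESTED = "patch written but untested"
    DONE = "done and verified"
    NO_FINDINGS = "inspected, nothing found"
    NOT_INSPECTED = "did not inspect the relevant path"
    FAILED = "failed"
    VERIFIER_STALE = "failed because the verifier is stale"


# Hypothetical mapping from state to the parent's next move.
NEXT_STEP = {
    WorkerState.WORKING: "wait, within budget",
    WorkerState.BLOCKED: "resolve the blocker or escalate to a human",
    WorkerState.UNTESTED: "run the verifier before accepting the patch",
    WorkerState.DONE: "accept the result",
    WorkerState.NO_FINDINGS: "record the negative result",
    WorkerState.NOT_INSPECTED: "re-dispatch with the correct scope",
    WorkerState.FAILED: "classify the failure before retrying",
    WorkerState.VERIFIER_STALE: "fix the verifier, then re-check",
}


def next_step(state: WorkerState) -> str:
    """Look up the suggested next move for a reported state."""
    return NEXT_STEP[state]
```

The point of the table is that "done" and "patch written but untested" can no longer collapse into the same branch.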

Why this matters more as teams scale

When you are using one agent in one repo, you can compensate with attention.

You watch the terminal. You read the diff. You nudge it back. You catch the weirdness.

Once you have multiple agents, background tasks, browser automations, scheduled runs, or cross-tool workflows, attention stops scaling.

This showed up hard in our MartinLoop growth work too.

The automations were supposed to run morning, midday, and evening sweeps. On paper, the workflow was clear: search GitHub, Reddit, Hacker News, Product Hunt, the OpenAI community, student forums, and other surfaces. Post value-first comments where auth and thread quality supported it. Log candidates. Update watchlists. Learn from live results.

But the real world is messy.

Browser auth might exist visually but not be agent-controllable. A Reddit composer might appear but not expose a writable editor. HN might accept one comment and then rate-limit the next. The OpenAI community might hide replies that read as too promotional. LinkedIn might need queue hygiene but no live sends until identity and account safety are verified.

None of those are "the model is bad."

They are workflow state problems.

And if the automation does not leave receipts, the next run starts from folklore.

What happened? Was the channel blocked? Was auth missing? Was the browser bridge broken? Was the thread closed? Did we post? Did we only draft? Did we learn something?

Without a receipt, every run becomes a little archaeology dig.

That is where drift compounds.
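For scheduled runs, the simplest fix I know is an append-only log that the next run reads before it does anything. The sketch below assumes one JSON object per line and a `channel` key on every receipt; the function names and the file layout are hypothetical.

```python
import json
from pathlib import Path


def append_receipt(log_path: Path, receipt: dict) -> None:
    """Append one run receipt as a single JSON line."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(receipt) + "\n")


def last_receipt_per_channel(log_path: Path) -> dict[str, dict]:
    """Return the most recent receipt for each channel, or {} if no log exists."""
    latest: dict[str, dict] = {}
    if not log_path.exists():
        return latest
    with log_path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                receipt = json.loads(line)
                latest[receipt["channel"]] = receipt  # later lines overwrite earlier ones
    return latest
```

A midday sweep could call `last_receipt_per_channel(Path("runs.jsonl"))` first and skip a channel whose last receipt says it was blocked on auth, instead of rediscovering that from scratch.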

The parent agent needs to be boring

In a good multi-agent workflow, the parent agent should not be the hero.

The parent should be boring.

It should know what work exists, what state each worker is in, what evidence came back, and whether another attempt is justified.

That is not glamorous, but it is the difference between orchestration and noise.
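Boring can be as small as a lookup table. This is a sketch under the assumption that each worker reports an `id` and a `state` string; `build_board`, the worker ids, and the state labels are illustrative names, not part of any tool.

```python
from collections import defaultdict


def build_board(workers: list[dict]) -> dict[str, list[str]]:
    """Group worker ids by reported state so the parent can scan them at a glance."""
    board: dict[str, list[str]] = defaultdict(list)
    for worker in workers:
        board[worker["state"]].append(worker["id"])
    return dict(board)


# Hypothetical snapshot of a four-worker team.
workers = [
    {"id": "docs-research", "state": "active"},
    {"id": "browser-auth", "state": "blocked"},
    {"id": "patch-writer", "state": "needs_review"},
    {"id": "edge-cases", "state": "active"},
]
board = build_board(workers)
print(board["active"])  # ['docs-research', 'edge-cases']
```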

For subagent teams, I now care less about how impressive the individual worker sounds and more about whether the parent can answer:

Which child tasks are active?
Which ones are blocked?
Which ones produced verified work?
Which ones need review?
Which ones should be killed?
Which ones should not be retried?

That last question matters.

People love retrying agents. Sometimes that is right. Sometimes it is just budget burn wearing a clever hat.

Before retrying, I want to know whether the failure class changed, whether the verifier improved, whether the remaining budget justifies another attempt, and whether the next worker has a different plan than the last one.

If not, you are not orchestrating.

You are rerolling.
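Those four checks fit in a small gate. How to combine them is my assumption: in this sketch budget is a hard limit, and at least one of the other three has to be true. `RetryContext` and `should_retry` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RetryContext:
    """Illustrative inputs to a retry decision."""
    failure_class_changed: bool  # did the last failure differ in kind from the one before?
    verifier_improved: bool      # has the check itself been fixed or strengthened?
    plan_differs: bool           # does the next worker have a different plan?
    remaining_budget: float      # same unit as expected_cost
    expected_cost: float         # estimated cost of one more attempt


def should_retry(ctx: RetryContext) -> tuple[bool, str]:
    """Return (decision, reason) so the decision itself leaves a receipt."""
    if ctx.expected_cost > ctx.remaining_budget:
        return False, "budget does not cover another attempt"
    if not (ctx.failure_class_changed or ctx.verifier_improved or ctx.plan_differs):
        return False, "nothing changed since the last attempt; this would be a reroll"
    return True, "something material changed and budget allows it"
```

Returning the reason alongside the boolean means the retry decision can be logged like any other handoff.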

The rule we ended up with

The rule I like now:

No receipt, no trust.

That sounds harsh, but it is actually freeing.

It means the agent does not have to be perfect. It just has to leave enough evidence for the next decision to be sane.

A subagent can fail usefully if it tells you exactly what it tried, what blocked it, and what should happen next.

A subagent can succeed dangerously if it returns a confident summary without proof.

That is the mindset shift.

The goal is not to make agents sound more senior. The goal is to make their work inspectable.

That is what we are trying to capture with MartinLoop: not replacing Claude, Codex, or any other coding agent, but wrapping the loop with budgets, verifier gates, stop reasons, and run records so the human is not left guessing after the fact.

Because the future is not one perfect agent doing everything.

It is probably a bunch of imperfect agents doing useful pieces of work, with humans and runtime systems deciding what is actually allowed to continue.