DEV Community: Keesan

Subagent Teams Need Handoff Receipts

Keesan — Thu, 02 Jul 2026 21:38:41 +0000

The first time agent teams work well, it feels like cheating.

You give one agent the main task. You spin up another to inspect the codebase. Another checks tests. Another looks for edge cases. One researches docs. One writes a patch.

Suddenly the work feels parallel.

That feeling is addictive.

It is also where a lot of the trouble starts.

I learned this the hard way. First through earlier AI product work during the Amari AI days, then through Torram, and eventually through the much more intense loop of building with Claude, Codex, browser automation, scheduled automations, and subagent teams. We spent close to $10K across Claude and OpenAI credits, and a very real chunk of that was tuition for learning how agent orchestration actually fails.

The short version:

Subagents are useful. Subagent teams are powerful. But without handoff receipts, they quietly turn into chaos with better branding.

The problem is not delegation

I am very pro-delegation.

Claude Code's subagent model is directionally right: separate context windows, specialized roles, focused tools, and summaries back to the main session. Codex's multi-agent/worktree direction is also directionally right: parallel work, isolated tasks, repo-grounded execution, and background progress.

That is exactly how serious AI coding work should evolve.

The issue is what happens between the agents.

A child agent can start late. It can fail before doing anything useful. It can do the wrong version of the task. It can duplicate work another agent already did. It can return a confident summary that hides weak evidence. It can time out. It can keep running after the parent has moved on. It can finish correctly but leave no useful handoff.

From the outside, all of those can look similar:

"The agent is working."

That sentence is not evidence.

It is a vibe.

And vibes get expensive.

Liveness is not usefulness

One lesson we kept running into: liveness is not usefulness.

An agent can be alive and still not be moving the task forward.

It can be reading files forever. It can be stuck in a local loop. It can be rechecking the same assumption. It can be waiting on a command that should have failed fast. It can be producing a long summary of a short mistake.

This is especially painful with subagents because the parent agent often wants to keep going.

The parent says, "I dispatched a worker." Great. Did the worker start? Did it read the right files? Did it produce a patch? Did the verifier pass? Did it hit a blocker? Did it overlap with another worker? Did it leave a clean result?

If those answers are not visible, the orchestration layer is basically asking you to trust a black box inside another black box.

That is not a workflow.

That is a prayer with logs.

What a handoff receipt should contain

We started thinking about every delegated task as needing a small receipt.

Not a giant report. Not a novel. Just enough state that a human or parent agent can make the next decision without guessing.

The receipt should answer:

Task: what was the worker actually asked to do?
Owner: which agent or role owned it?
Scope: which files, surfaces, or decision area did it touch?
Start proof: did it actually begin, and what context did it load?
Result: what changed or what did it learn?
Verifier: what check proves the result?
Blocker: what stopped it, if anything?
Stop reason: why is the task ending now?
Next action: what should happen next, if anything?

That is it.

The magic is not the format. The magic is forcing the system to distinguish between states that otherwise blur together.

"Still working" is different from "blocked on auth."

"Done" is different from "patch written but untested."

"No findings" is different from "did not inspect the relevant path."

"Failed" is different from "failed because the verifier is stale."

Once those states are explicit, the parent agent can make a better call. So can the human.
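One way to keep those states from blurring is to give the receipt a fixed shape. The sketch below is a minimal Python version of the nine fields listed above. The class name, the status names, and the `is_trustworthy` rule are assumptions for illustration, not a format any particular tool uses.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    # Hypothetical state names; what matters is that they are distinct.
    WORKING = "working"
    BLOCKED = "blocked"
    UNVERIFIED = "patch_written_untested"
    VERIFIED = "verified"
    FAILED = "failed"


@dataclass
class HandoffReceipt:
    task: str                          # what the worker was asked to do
    owner: str                         # which agent or role owned it
    scope: list[str]                   # files, surfaces, or decision area touched
    start_proof: str                   # evidence it began, e.g. context it loaded
    result: str                        # what changed or what it learned
    status: Status
    verifier: Optional[str] = None     # the check that proves the result
    blocker: Optional[str] = None      # what stopped it, if anything
    stop_reason: str = ""              # why the task is ending now
    next_action: Optional[str] = None  # what should happen next, if anything


def is_trustworthy(r: HandoffReceipt) -> bool:
    """Count work as done only if it started, names a verifier, and is marked verified."""
    return r.status is Status.VERIFIED and bool(r.verifier) and bool(r.start_proof)
```

A receipt in the `UNVERIFIED` state ("patch written but untested") fails this check no matter how confident the summary sounds, which is the distinction the parent needs.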

Why this matters more as teams scale

When you are using one agent in one repo, you can compensate with attention.

You watch the terminal. You read the diff. You nudge it back. You catch the weirdness.

Once you have multiple agents, background tasks, browser automations, scheduled runs, or cross-tool workflows, attention stops scaling.

This showed up hard in our MartinLoop growth work too.

The automations were supposed to run morning, midday, and evening sweeps. On paper, the workflow was clear: search GitHub, Reddit, Hacker News, Product Hunt, OpenAI community, student forums, and other surfaces. Post value-first comments where auth and thread quality supported it. Log candidates. Update watchlists. Learn from live results.

But the real world is messy.

Browser auth might exist visually but not be agent-controllable. A Reddit composer might appear but not expose a writable editor. HN might accept one comment and then rate-limit the next. OpenAI community might hide replies that read as too promotional. LinkedIn might need queue hygiene but no live sends until identity and account safety are verified.

None of those are "the model is bad."

They are workflow state problems.

And if the automation does not leave receipts, the next run starts from folklore.

What happened? Was the channel blocked? Was auth missing? Was the browser bridge broken? Was the thread closed? Did we post? Did we only draft? Did we learn something?

Without a receipt, every run becomes a little archaeology dig.

That is where drift compounds.

The parent agent needs to be boring

In a good multi-agent workflow, the parent agent should not be the hero.

The parent should be boring.

It should know what work exists, what state each worker is in, what evidence came back, and whether another attempt is justified.

That is not glamorous, but it is the difference between orchestration and noise.

For subagent teams, I now care less about how impressive the individual worker sounds and more about whether the parent can answer:

Which child tasks are active?
Which ones are blocked?
Which ones produced verified work?
Which ones need review?
Which ones should be killed?
Which ones should not be retried?

That last question matters.

People love retrying agents. Sometimes that is right. Sometimes it is just budget burn wearing a clever hat.

Before retrying, I want to know whether the failure class changed, whether the verifier improved, whether the remaining budget justifies another attempt, and whether the next worker has a different plan than the last one.

If not, you are not orchestrating.

You are rerolling.
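Those retry questions can be written down as a small gate. This is a sketch under assumptions: the `Attempt` fields and the rule "affordable, and at least one thing changed" are my reading of the questions above, and "verifier changed" stands in for "verifier improved", which code cannot judge on its own.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    failure_class: str   # e.g. "auth_blocked" or "test_failed"
    plan: str            # short description of the approach tried
    verifier: str        # identifies the check that judged the attempt


def should_retry(earlier: Attempt, last: Attempt, next_plan: str, next_verifier: str,
                 remaining_budget: float, estimated_cost: float) -> bool:
    # Budget gate first: an attempt we cannot afford is never justified.
    if estimated_cost > remaining_budget:
        return False
    # Otherwise require that something about the situation is different.
    return (
        last.failure_class != earlier.failure_class  # the failure class changed
        or next_verifier != last.verifier            # the verifier changed
        or next_plan != last.plan                    # the next worker has a new plan
    )
```

If the function returns `False`, another attempt would be a reroll: same failure, same check, same plan.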

The rule we ended up with

The rule I like now:

No receipt, no trust.

That sounds harsh, but it is actually freeing.

It means the agent does not have to be perfect. It just has to leave enough evidence for the next decision to be sane.

A subagent can fail usefully if it tells you exactly what it tried, what blocked it, and what should happen next.

A subagent can succeed dangerously if it returns a confident summary without proof.

That is the mindset shift.

The goal is not to make agents sound more senior. The goal is to make their work inspectable.

That is what we are trying to capture with MartinLoop: not replacing Claude, Codex, or any other coding agent, but wrapping the loop with budgets, verifier gates, stop reasons, and run records so the human is not left guessing after the fact.

Because the future is not one perfect agent doing everything.

It is probably a bunch of imperfect agents doing useful pieces of work, with humans and runtime systems deciding what is actually allowed to continue.

That future needs receipts.

Download MartinLoop today and keep your agent loops accountable.

Open source repo: https://github.com/Keesan12/Martin-Loop

2 min download:
npm install -g martin-loop
npx -y martin-loop@latest doctor

npx -y martin-loop@latest start
npx -y martin-loop@latest demo

What 12 failure classes and 30 billion tokens spent taught us about trusting AI coding agents

Keesan — Tue, 30 Jun 2026 20:41:47 +0000

We've been watching AI coding agents fail in production for long enough that we started keeping a taxonomy.

Not "the agent hallucinated" — that's not a failure class, it's a category. The real failure modes are specific, they repeat, and crucially, they each require a different fix.

Here's what we found across hundreds of real runs, and why it changed how we think about agent governance.

The failure modes that actually kill agent runs:

1. Hallucination —
The agent generates code that looks right and tests that confirm it, but the tests are checking the wrong thing. This is the scariest class because it has a green result.

The fix is grounding: forcing the agent back to the actual repo state before the next attempt.

2. Scope creep — The agent modifies files outside the task boundary. Usually well-intentioned — it "fixes" something adjacent — always dangerous.

The fix is file scope enforcement: writes to deny-listed paths are rolled back automatically on violation.
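A minimal sketch of the check behind file scope enforcement, assuming the runtime can list the paths an attempt changed. The patterns and the function name are hypothetical, and the rollback itself (for example, restoring the working tree) is left to the caller.

```python
from fnmatch import fnmatchcase

# Hypothetical deny-list; a real one would come from the task's run contract.
DENY_PATTERNS = ["migrations/*", ".github/*", "*.env"]


def scope_violations(changed_paths, allowed_patterns, deny_patterns=DENY_PATTERNS):
    """Return every changed path that is deny-listed or outside the allowed scope."""
    violations = []
    for path in changed_paths:
        denied = any(fnmatchcase(path, p) for p in deny_patterns)
        allowed = any(fnmatchcase(path, p) for p in allowed_patterns)
        if denied or not allowed:
            violations.append(path)
    return violations
```

Any non-empty result is the signal to roll back before the attempt's changes are admitted. Note that `fnmatchcase` lets `*` match across `/`, so `src/auth/*` covers nested files too.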

3. Fake-passing tests —
The agent writes tests that pass but don't test the actual behavior. Closely related to hallucination but distinct: the code is often correct, the test just isn't covering the right cases.

The fix is verifier separation — your test command is the ground truth, not the agent's confidence level.

4. Budget pressure shortcuts —
When a run is approaching its token budget, agent behavior degrades. It starts making confident guesses instead of reading files. Results get worse as context gets longer.

The fix is pre-execution budget preflight: stop the attempt before it starts if it's projected to breach remaining budget, rather than letting it run degraded.
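The preflight itself is simple arithmetic. In this sketch the projected cost is passed in as a number; how that projection is produced is out of scope here, and the function name is hypothetical.

```python
def preflight(spent: float, cap: float, projected_cost: float) -> tuple[bool, str]:
    """Decide whether to admit the next attempt before any tokens are spent on it."""
    remaining = cap - spent
    if projected_cost > remaining:
        return False, f"projected {projected_cost:.2f} exceeds remaining {remaining:.2f}"
    return True, "admitted"
```

The refusal reason is returned alongside the decision so it can go straight into the receipt as the stop reason.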

5. Context bloat —
By attempt 5, the agent is paying to resend everything that failed four times. Token cost per attempt keeps growing across retries while signal stays flat.

The fix is context distillation: compress prior attempt history into a structured summary before the next attempt, not a raw failure dump.
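To put rough numbers on that growth, here is a back-of-the-envelope model. It assumes every attempt resends a fixed base context plus the full history of earlier failures, and that distillation replaces that history with one fixed-size summary. The token counts are made up for illustration.

```python
def raw_resend_tokens(base: int, per_failure: int, attempts: int) -> int:
    """Input tokens when attempt k resends the base plus all k-1 earlier failures."""
    return sum(base + (k - 1) * per_failure for k in range(1, attempts + 1))


def distilled_tokens(base: int, summary: int, attempts: int) -> int:
    """Input tokens when every attempt after the first carries one fixed-size summary."""
    return base + (attempts - 1) * (base + summary)


print(raw_resend_tokens(10_000, 8_000, 5))   # 130000
print(distilled_tokens(10_000, 1_000, 5))    # 54000
```

Under this model the per-attempt cost grows linearly with the attempt number, so the cumulative bill grows roughly with its square, while the distilled version stays linear.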

6. Environment mismatch —
The agent's change passes in CI, but the verifier runs in a different environment: a different Node version, pnpm vs npm, or missing env vars.

The fix is environment canonicalization in the run contract.

7. Approval boundary violations —

The agent modifies files that should require human sign-off: config, migrations, CI definitions. Often not malicious, just overambitious.

The fix is policy routing — flag these attempts for a different approval path before execution.

8. Injection in tool output —
Tool call results (file reads, search results) contain content that looks like instructions. The agent follows them.

The fix is a safety leash that scans for injection patterns before admitting tool results into context.

9. Secret exposure —
The agent picks up .env values or API keys in file reads and includes them in output.

The fix is pre-execution scanning for secret-like values in task text and tool results.
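A sketch of what that scan can look like. The patterns below are illustrative heuristics I chose, not a complete or authoritative list, and they will produce both false positives and misses.

```python
import re

# Heuristic patterns only; real scanners use much larger rule sets.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # shape of an AWS access key ID
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
    # A key-like name followed by "=" or ":" and a value of 8+ characters.
    re.compile(r"(?i)(api[_-]?key|secret|token|password)\w*\s*[=:]\s*\S{8,}"),
]


def redact_secrets(text: str) -> tuple[str, int]:
    """Replace secret-like spans with a marker and report how many were found."""
    hits = 0
    for pattern in SECRET_PATTERNS:
        text, n = pattern.subn("[REDACTED]", text)
        hits += n
    return text, hits
```

Running this over task text and tool results before they enter context means a non-zero hit count can block or flag the attempt instead of leaking the value into output.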

10. Repo grounding failure —
The agent makes changes that conflict with current HEAD because it's working from a stale view of the repo.

The fix is repo-state verification before each attempt.

11. Verifier command exploitation —
The agent modifies the test itself to make it pass rather than fixing the code. More common than you'd expect.

The fix is read-only verification: the verifier command runs in a scope where test files can't be modified.
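True read-only enforcement depends on the sandbox or filesystem. As a lighter stand-in, the sketch below swaps enforcement for detection: hash the test files before the attempt and refuse the result if any hash differs afterwards. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(path):
    """SHA-256 of a file's bytes, or None if the file no longer exists."""
    p = Path(path)
    return hashlib.sha256(p.read_bytes()).hexdigest() if p.is_file() else None


def snapshot(paths):
    """Record a fingerprint for each test file before the agent runs."""
    return {str(p): fingerprint(p) for p in paths}


def tampered_files(before):
    """List files whose content changed, or that were deleted, since the snapshot."""
    return sorted(p for p, digest in before.items() if fingerprint(p) != digest)
```

A non-empty list from `tampered_files` means the green result cannot be trusted, because the check itself was edited during the run.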

12. Terminal failure —
A class of errors where retrying won't help: the task is malformed, or the repo is in a state that can't satisfy the objective.

The fix is hard exit — don't retry, roll back, log the terminal state, stop spending.

Why this matters for how you govern agents
The common pattern across all 12: they require different responses.

Most agent frameworks treat failure as binary — it passed or it didn't, retry or stop. But a hallucination needs a grounding check.

A scope creep needs a rollback. Budget pressure needs an early exit. Context bloat needs compression. Treating them all as "retry" is how you burn $4,200 over a long weekend.
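In code, "different responses" can start as a lookup table instead of a single retry branch. The class and action names below are hypothetical labels for the pairings described above, and sending unknown classes to a human is my assumption.

```python
# Failure class -> response, following the pairings in the list above.
RESPONSES = {
    "hallucination": "grounding_check",
    "scope_creep": "rollback",
    "budget_pressure": "early_exit",
    "context_bloat": "compress_history",
    "approval_boundary": "policy_routing",
    "terminal": "hard_exit",
}


def next_step(failure_class: str) -> str:
    """Pick a response for a classified failure; never default to a blind retry."""
    return RESPONSES.get(failure_class, "escalate_to_human")
```

The useful property is the default: an unclassified failure escalates instead of quietly becoming another attempt.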

The other pattern: most of these are detectable before the next attempt runs, not after. Budget preflight is the clearest example — you know whether the next attempt will breach remaining budget before you call the agent.

Injection scanning can happen before the tool result enters context.

File scope can be enforced before any write is admitted.

That's the shift we made building MartinLoop: pre-execution enforcement as the primary defense, post-execution logging as the audit trail. Not the other way around.

What this looks like in practice
Before a run starts, MartinLoop prints a governed run plan: per-phase cost estimates, routing decisions, burn percentage against session budget, and priority ordering.

After a run completes, it prints a receipt: every commit, every repo, every feature.

A session we ran last week on our own codebase: $9.60 estimated, $16 cap, 13 commits across 3 repos, 9 new features, estimate held.

The agent calculated the budget itself — that's not a number you type in. It's the governance layer doing pre-execution cost estimation before any attempt is admitted.

Try it:

npx -y martin-loop@latest demo

Full install:

npm install -g martin-loop
martin run "fix the auth regression" --budget 3 --verify "pnpm test"

MCP for Claude Code:

claude mcp add --scope user martin-loop -- npx -y @martinloop/mcp

Open source, Apache 2.0. GitHub repo: https://github.com/Keesan12/Martin-Loop
(Please do us a favor and star the repo if you like it, so we can keep it OSS.)

What failure modes have you hit that aren't on this list?

We're still building the taxonomy — genuinely curious what's showing up in real runs.

Why the retry loop is usually the expensive part of agent work

Keesan — Wed, 17 Jun 2026 01:20:28 +0000

The first failure usually is not the expensive one.

The expensive part is what happens after the first failure when the system keeps trying, keeps spending, and keeps producing the same outcome because nothing about the situation changed.

We kept running into a simple pattern: the agent would miss a step, the runtime would retry, the next attempt would see the same state, and the loop would repeat until the cost was visible in the bill or the operator log. At that point the problem stops being a model-quality issue and becomes a control-system issue.

Why the loop hurts more than the mistake

A single bad step is recoverable. An unbounded retry loop compounds the mistake.

That is true for token spend, API calls, and operator attention. It is also true for trust. Once a system gets a reputation for wandering, people stop letting it touch real work.

The failure mode is boring, which is why it gets missed. Nobody looks at a happy-path demo and thinks about what happens after the third identical error. But that is where the real cost lives.

What we tried first

The obvious moves are usually the wrong ones:

make the prompt longer
add a generic retry
increase the timeout
let the model reason more
rerun the same command with slightly different wording

Those changes can make a demo look better, but they do not fix a stuck loop.

If the environment is unchanged, a retry is often just a second copy of the same mistake.

What actually worked

The fix was not smarter language. It was stricter boundaries.

We had to make the runtime answer four questions before it kept going:

What is the budget?
What counts as success?
What is the verifier?
What happens when the same failure repeats?

A small policy block is often enough to make that concrete:

{
  "budget_cap": 250,
  "max_attempts": 3,
  "stop_on_same_error": true,
  "require_verifier": true,
  "emit_receipt": true
}

That does not sound ambitious. That is the point.

The biggest reliability gain came from refusing to treat repeated failure as progress. Once the runtime could detect the same blocker two or three times in a row, it had permission to stop instead of pretending the next rerun would somehow be different.
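Here is a minimal sketch of a loop that honors a policy block like that one. The semantics I give each key are assumptions, as is the `attempt_fn` callable, which stands in for "run the agent once and run the verifier" and returns `(cost, verified, error)`.

```python
import json

# The policy block from the text, parsed as-is.
POLICY = json.loads("""
{
  "budget_cap": 250,
  "max_attempts": 3,
  "stop_on_same_error": true,
  "require_verifier": true,
  "emit_receipt": true
}
""")


def run_with_policy(attempt_fn, policy):
    """Run attempts until verified or until the policy says stop; always return a receipt."""
    spent, last_error = 0.0, None
    attempts, stop_reason = 0, "max_attempts_reached"
    for attempts in range(1, policy["max_attempts"] + 1):
        cost, verified, error = attempt_fn(attempts)
        spent += cost
        if verified:  # only the verifier's verdict counts as success
            stop_reason = "verified"
            break
        if spent >= policy["budget_cap"]:  # checked after spending, so it can overshoot
            stop_reason = "budget_cap_reached"
            break
        if policy["stop_on_same_error"] and error is not None and error == last_error:
            stop_reason = "same_error_repeated"
            break
        last_error = error
    return {"attempts": attempts, "spent": spent, "stop_reason": stop_reason}
```

With an `attempt_fn` that always fails with the same error, the loop stops after the second attempt with `same_error_repeated` instead of using its third.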

Why receipts matter

Receipts turn a run from a vague story into a checkable fact.

A receipt should show:

what the agent tried
what changed
what failed
why the run stopped

Without that, a loop can hide inside a confidence-generating summary. With it, you can see the exact stopping point and decide whether the next action should be a human intervention, a different tool, or no action at all.
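A receipt like that can be as small as one JSON line appended per run. The sketch below assumes a JSON Lines file and uses field names that mirror the four bullets above; it is not MartinLoop's actual record format.

```python
import json
import time
from pathlib import Path


def append_receipt(path, tried, changed, failed, stop_reason):
    """Append one run receipt as a single JSON line and return the record."""
    record = {
        "ts": time.time(),           # when the receipt was written
        "tried": tried,              # what the agent tried
        "changed": changed,          # what changed
        "failed": failed,            # what failed
        "stop_reason": stop_reason,  # why the run stopped
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def read_receipts(path):
    """Load every receipt from the file, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because each run only appends, the next run or the next human can read the last line and start from what actually happened.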

That is also why this kind of work ends up feeling less like prompt engineering and more like operations.

The tradeoff

Stricter control means the system stops earlier.

That can feel annoying when you want the agent to push through friction. But earlier stopping is cheaper than a long blind retry sequence. More importantly, it preserves operator trust.

A bounded agent is less flashy than an agent that never gives up. It is also much more usable.

That is the core of the control-layer approach we keep coming back to in MartinLoop: the runtime should know when to stop, when to ask for help, and when to write down what happened.

What we are watching next

The next improvement is not more retries.

It is better failure classification so the runtime can separate:

missing permission
stale state
tool mismatch
external outage
real task completion

When those are distinct, the system can choose a better next step instead of recycling the same command.
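A first pass at that classification can be plain pattern matching on the failure output. The regexes below are illustrative guesses at what each class tends to look like, and the fallback to "unclassified" is my assumption.

```python
import re

# Ordered rules: the first matching pattern wins. Patterns are heuristics only.
RULES = [
    ("missing_permission", re.compile(r"permission denied|403|unauthorized|EACCES", re.I)),
    ("stale_state", re.compile(r"not up to date|merge conflict|stale", re.I)),
    ("tool_mismatch", re.compile(r"command not found|unsupported engine", re.I)),
    ("external_outage", re.compile(r"timed out|ECONNREFUSED|502|503", re.I)),
]


def classify(output: str, verifier_passed: bool) -> str:
    """Map a run's outcome to one of the failure classes, or to real completion."""
    if verifier_passed:
        return "real_task_completion"
    for name, pattern in RULES:
        if pattern.search(output):
            return name
    return "unclassified"
```

Even this crude a classifier changes the next step: a `tool_mismatch` points at fixing the environment, not at rewording the prompt.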

That is the line between an agent that looks autonomous and an agent that is actually operable.

What failure shape are you still letting your runtime retry too many times?

The expensive part of an AI agent failure is usually the retry loop

Keesan — Sat, 13 Jun 2026 01:19:23 +0000

The first failure usually is not the expensive one.

The expensive part is what happens after the first failure when the system keeps trying, keeps spending, and keeps producing the same outcome because nothing about the situation changed.

We kept running into a simple pattern: the agent would miss a step, the runtime would retry, the next attempt would see the same state, and the loop would repeat until the cost was visible in the bill or the operator log. That is the point where the problem stops being a model-quality issue and becomes a control-system issue.

Why the loop hurts more than the mistake

A single bad step is recoverable. An unbounded retry loop compounds the mistake.

That is true for token spend, API calls, and operator attention. It is also true for trust. Once a system gets a reputation for wandering, people stop letting it touch real work.

The failure mode is boring, which is why it gets missed. Nobody looks at a happy-path demo and thinks about what happens after the third identical error. But that is where the real cost lives.

What we tried first

The obvious moves are usually the wrong ones:

make the prompt longer
add a generic retry
increase the timeout
let the model "reason more"
rerun the same command with slightly different wording

Those changes can make a demo look better, but they do not fix a stuck loop.

If the environment is unchanged, a retry is often just a second copy of the same mistake.

What actually worked

The fix was not smarter language. It was stricter boundaries.

We had to make the runtime answer four questions before it kept going:

What is the budget?
What counts as success?
What is the verifier?
What happens when the same failure repeats?

A small policy block is often enough to make this concrete:

{
  "budget_cap": 250,
  "max_attempts": 3,
  "stop_on_same_error": true,
  "require_verifier": true,
  "emit_receipt": true
}

That does not sound ambitious. That is the point.

Why receipts matter

Receipts turn a run from a vague story into a checkable fact.

A receipt should show:

what the agent tried
what changed
what failed
why the run stopped

That is also why this kind of work ends up feeling less like prompt engineering and more like operations.

The tradeoff

Stricter control means the system stops earlier.

That can feel annoying when you want the agent to push through friction. But earlier stopping is cheaper than a long blind retry sequence. More importantly, it preserves operator trust.

A bounded agent is less flashy than an agent that "never gives up." It is also much more usable.

That is the core of the control-layer approach we keep coming back to in MartinLoop: the runtime should know when to stop, when to ask for help, and when to write down what happened.

What we are watching next

The next improvement is not more retries.

It is better failure classification so the runtime can separate:

missing permission
stale state
tool mismatch
external outage
real task completion

When those are distinct, the system can choose a better next step instead of recycling the same command.

That is the line between an agent that looks autonomous and an agent that is actually operable.

What failure shape are you still letting your runtime retry too many times?

The most expensive AI agent failures are boring

Keesan — Fri, 05 Jun 2026 06:07:11 +0000

Most AI coding agent failures are boring.

Not dramatic.
Not cinematic.
Just the same wrong step repeated until the bill gets weird and someone asks what happened.

That is why I think the most important control is not “use a cheaper model.”
It is “before another retry, show what changed.”

If nothing changed, stop.

That one rule kills a surprising amount of fake progress.

The other three controls I would put in early are:

a hard budget cap
one real verification gate
a receipt that explains why the run stopped

That is the class of problem we have been working on with MartinLoop.

Not making agents feel magical.
Making them easier to trust when the loop gets messy.

AI coding agents don't fail because they're dumb. They fail because they don't know when to stop.

Keesan — Wed, 03 Jun 2026 16:05:07 +0000

Yesterday we launched MartinLoop on Product Hunt.

The biggest thing we keep seeing with AI coding agents is simple:

They do not fail because they are "bad at coding."

They fail because they do not know when to stop.

That creates a very specific kind of pain:

the same mistake gets retried over and over
a small bug turns into a weirdly expensive afternoon
someone still has to explain what happened after the run is over

That is the whole reason we built MartinLoop.

The job is not to make an agent feel smarter.
The job is to give it a budget, a finish line, and a receipt.

The pattern we keep hearing from teams is basically:

"It wasn't one catastrophic failure. It was 40 small dumb retries that nobody caught fast enough."

That is a systems problem, not a prompt problem.

If you are using coding agents already, the 3 controls that matter most are:

A hard budget cap before the run starts.
A real verification gate before the run counts as done.
A receipt you can read later when somebody asks, "why did this cost so much?"

If that pain sounds familiar, that is exactly what we are working on.

If you want to support the Product Hunt launch, I would appreciate it.
More importantly, I would love to hear the story of the most annoying AI-agent failure you have seen in the wild.

If your coding agent can retry forever, it will

Keesan — Tue, 02 Jun 2026 04:59:17 +0000

If an AI coding agent can keep retrying with no budget cap, no finish line, and no check before it exits, the problem is not the model.

The problem is the missing operating system around it.

Three simple things make a huge difference:

Put a real dollar cap on the run.
Require one clear verification step before calling it done.
Keep a receipt of what actually happened.

Most teams do not need more autonomy.
They need a clean stop condition.

That is the difference between a helpful agent and a very expensive loop.

What Actually Makes Social Automation Reliable

Keesan — Sun, 31 May 2026 20:03:46 +0000

A reliable social automation stack is not built by stacking more retries on top of brittle behavior.

The durable pattern is simpler:

use official APIs where they exist
keep browser execution as a controlled fallback
require both a receipt and a verified postcondition before counting a run
fail closed when the platform state does not match the reported result

That discipline matters more than raw surface area. A smaller set of lanes with honest verification is worth more than a wider setup that quietly reports false success.
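The "receipt plus verified postcondition" rule fits in a few lines. In this sketch, `receipt` is what the worker reported and `observed_post_ids` is what a separate check found on the platform afterwards. Both shapes are hypothetical.

```python
def count_run(receipt, observed_post_ids) -> bool:
    """Fail closed: a run counts only if the reported post is actually observed."""
    if not receipt or receipt.get("status") != "posted":
        return False
    post_id = receipt.get("post_id")
    return post_id is not None and post_id in observed_post_ids
```

A worker that reports success for a post the platform does not show gets `False`, which is the false-success case the rule exists to catch.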

Receipts beat scheduled optimism

Keesan — Sun, 31 May 2026 20:00:30 +0000

The fastest way to lose trust in an automation is to mistake a schedule for a result.

We have been rebuilding our execution stack around one rule: if a worker cannot show the exact action it took or the exact blocker it hit, it did not finish the job.

That has forced us to simplify a lot. Fewer lanes. Better proofs. More honest failure states.

The upside is that the system gets easier to trust once every action has to survive real verification.

MartinLoop: a control plane for AI coding agents

Keesan — Wed, 27 May 2026 01:39:14 +0000

MartinLoop

MartinLoop is an open-source control plane for AI coding agents.

It adds hard budget stops, JSONL run records, and verify-gated completion so autonomous coding stays accountable.

We built it because agent loops are powerful, but most teams still do not have enough control over cost, retries, or proof of completion.

If you are using AI coding agents in production, I would love to hear how you are handling governance, cost ceilings, and verification.

AI Coding Agents Are Burning Budgets. The Next Layer Is Control

Keesan — Tue, 12 May 2026 01:08:29 +0000

AI coding agents are becoming useful, but they still burn budgets, loop on bad strategies, and finish without enough evidence. The next layer is trace intelligence, model routing, and control.

AI coding agents are getting better.

They can read a repo, edit files, run tests, inspect errors, and try again.

That is useful.

But the problem showing up in real workflows is not just whether agents can write code.

The problem is that agents can spend budget without producing finished work.

They loop.

They retry weak strategies.

They switch files without explaining why.

They chase unrelated errors.

They claim completion without enough proof.

And when the run ends, the human still has to ask:

What actually happened?

That is the gap the next generation of agent infrastructure has to solve.

Not more autonomy first.

Control first.

The Problem Is Not Just Bad Code

A bad patch is easy to see.

A bad agent run is harder.

The agent may do a lot of work that looks productive:

read many files
generate a long plan
edit several modules
run commands
inspect failures
produce a confident summary

But at the end, the task is still not done.

The budget is gone.

The repo is messy.

The logs are unclear.

The next engineer has to reconstruct the run from fragments.

This is why agentic coding needs a better unit of accountability.

Not just the final diff.

The full trace.

The Trace Becomes The Product

A coding agent trace should not be an afterthought.

It should be the primary artifact of the run.

A useful trace answers:

What did the agent try first?
Where did it get stuck?
Which files did it touch?
Which commands did it run?
Which verifier failed?
Did it repeat the same strategy?
Did it switch models?
Did it exceed budget?
Why did it stop?
What should a human do next?

This is what I think of as trace intelligence.

Not just raw logs.

Not just token usage.

Not just a transcript.

Trace intelligence means turning the run into something a human, system, or second agent can reason about.

The trace should explain the work.

Why Model Routing Matters

Most agent workflows still treat model choice too casually.

One model may be good at planning.

Another may be better at code edits.

Another may be cheaper for search, summarization, or test-output analysis.

Another may be stronger for final review.

But without a control layer, model routing becomes guesswork.

A better system should ask:

Is this step worth a premium model?
Can a cheaper model classify this failure?
Should a stronger model review the plan before execution?
Should the run downgrade when budget is tight?
Should the run escalate when repeated failures appear?

Model routing should not just optimize quality.

It should optimize quality within budget.
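A sketch of what "quality within budget" can mean as a routing rule. The tier names, per-step costs, and step labels are all made up for illustration. The logic is the part that matters: start from the preferred tier, escalate after repeated failures, and downgrade when the budget cannot cover it.

```python
# Hypothetical tiers with an assumed cost per step, in dollars.
TIERS = {"cheap": 0.10, "standard": 1.00, "premium": 5.00}
ORDER = ["cheap", "standard", "premium"]

# Hypothetical default tier per kind of step.
PREFERRED = {
    "classify_failure": "cheap",
    "summarize": "cheap",
    "edit_code": "standard",
    "review_plan": "premium",
    "final_review": "premium",
}


def route(step: str, remaining_budget: float, repeated_failures: int = 0):
    """Return the tier to use for this step, or None if no tier is affordable."""
    idx = ORDER.index(PREFERRED.get(step, "standard"))
    if repeated_failures >= 2:
        idx = min(idx + 1, len(ORDER) - 1)  # escalate when failures repeat
    while idx > 0 and TIERS[ORDER[idx]] > remaining_budget:
        idx -= 1                            # downgrade when budget is tight
    if TIERS[ORDER[idx]] > remaining_budget:
        return None                         # nothing fits: stop instead of overspending
    return ORDER[idx]
```

Returning `None` when nothing is affordable keeps routing from becoming a way around the budget cap.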

That matters because the most painful agent failure is not always wrong code.

Sometimes it is expensive unfinished work.

Headless Agents Need More Guardrails, Not Fewer

Headless coding agents are especially interesting.

They can run without a constant human in the loop.

They can process tasks, inspect repos, execute commands, and produce outputs asynchronously.

That is powerful.

But headless execution increases the need for control.

If an agent is running without a developer watching every step, the system needs stronger answers to basic questions:

What is this agent allowed to do?
What budget can it spend?
What commands are blocked?
What verifier defines success?
When should it stop?
When should it ask for approval?
What trace does it leave behind?

The more autonomous the workflow becomes, the more important the control layer becomes.

Autonomy without traceability is not leverage.

It is invisible execution.

Agent Teams Make The Problem Bigger

The next step is not one agent.

It is teams of agents.

A planner agent.

A coding agent.

A reviewer agent.

A test agent.

A documentation agent.

A security agent.

A release agent.

That sounds useful, but it also creates a new coordination problem.

If one agent produces a bad plan, another may execute it.

If the reviewer misses the issue, the system may mark the run complete.

If the test agent checks the wrong verifier, the whole workflow may look successful while still being wrong.

Agent-to-agent workflows need shared state, shared budgets, shared traces, and shared stop conditions.

Otherwise, teams of agents can become teams of budget-burning loops.

The question becomes:

Who governs the team?

That is where a control layer becomes necessary.

What MartinLoop 360 Is Pointing Toward

The direction I am exploring with MartinLoop is a control layer for agentic coding workflows.

The current idea is simple:

Every agent run should be bounded, inspectable, and test-verifiable.

The next layer expands that into a broader loop:

Trace intelligence to understand what happened during a run
Model routing to choose the right model for the right step
HeadlessOS for controlled background execution
MartinLoop 360 as a higher-level view of agent runs, budgets, traces, policies, and outcomes

The goal is not to make agents look more magical.

The goal is to make them easier to trust.

If an agent burns budget and fails, that should be visible.

If an agent loops, that should be classified.

If an agent completes a task, that should be verified.

If multiple agents collaborate, the team should leave one coherent trace.

The Core Loop

A governed agent workflow should look less like this:


Prompt → Agent runs → Agent says done

I’m exploring these ideas while building MartinLoop, an open-source control layer for AI coding agents.

GitHub: https://github.com/Keesan12/Martin-Loop 

Website: https://martinloop.com

AI coding agents need receipts, not just better prompts

Keesan — Mon, 11 May 2026 17:46:15 +0000

AI coding agents are getting good enough to run real engineering tasks, but not safe enough to run without guardrails.

The failure mode is not always dramatic.

Sometimes the agent just keeps working.

It retries.
It rewrites.
It spends tokens.
It changes files.
It says it is done.

Then another engineer opens the diff and realizes the agent solved the wrong problem.

That creates a new engineering question:

Can another engineer audit this run later?

That is why I’m building MartinLoop.

MartinLoop is an open-source control plane for AI coding agents. The goal is to make every agent run bounded, inspectable, and test-verifiable.

The first version focuses on:

hard budget caps
JSONL run records
audit trails
failure classification
test-verified completion
reproducible agent runs

The thesis is simple:

The next layer of AI coding is not only better prompts.

It is governance.

Before agents touch serious repos, teams need receipts:

what the agent tried
what it changed
how much it spent
what commands it ran
what tests passed
what failed
why it stopped
whether a human can resume, revert, or rerun it

I’m looking for feedback from developers using Claude Code, Codex, Cursor, Devin-style agents, or custom coding agents in real repos.

What would you want in the default “agent receipt”?

GitHub: https://github.com/Keesan12/Martin-Loop
Site: https://martinloop.com