keesan.eth
AI coding agents need receipts, not just better prompts

AI coding agents are getting good enough to run real engineering tasks, but not safe enough to run without guardrails.

The failure mode is not always dramatic.

Sometimes the agent just keeps working.

It retries.
It rewrites.
It spends tokens.
It changes files.
It says it is done.

Then another engineer opens the diff and realizes the agent solved the wrong problem.

That creates a new engineering question:

Can another engineer audit this run later?

That is why I’m building MartinLoop.

MartinLoop is an open-source control plane for AI coding agents. The goal is to make every agent run bounded, inspectable, and test-verifiable.

The first version focuses on:

  • hard budget caps
  • JSONL run records
  • audit trails
  • failure classification
  • test-verified completion
  • reproducible agent runs

The thesis is simple:

The next layer of AI coding is not only better prompts.

It is governance.

Before agents touch serious repos, teams need receipts:

  • what the agent tried
  • what it changed
  • how much it spent
  • what commands it ran
  • what tests passed
  • what failed
  • why it stopped
  • whether a human can resume, revert, or rerun it
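As a starting point for discussion, here is a rough sketch of that receipt as a data structure, mapping the fields above one for one. Every name and type is an assumption, not the project's actual format.

```python
# A rough sketch of a default "agent receipt"; names and types are illustrative.
from dataclasses import dataclass, field

@dataclass
class AgentReceipt:
    run_id: str
    goal: str                                               # what the agent was asked to do
    attempts: list[str] = field(default_factory=list)       # what it tried
    files_changed: list[str] = field(default_factory=list)  # what it changed
    tokens_spent: int = 0                                    # how much it spent
    commands_run: list[str] = field(default_factory=list)   # what commands it ran
    tests_passed: list[str] = field(default_factory=list)
    tests_failed: list[str] = field(default_factory=list)
    stop_reason: str = "unknown"                             # why it stopped
    resumable: bool = False                                  # can a human resume, revert, or rerun it?
```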

I’m looking for feedback from developers using Claude Code, Codex, Cursor, Devin-style agents, or custom coding agents in real repos.

What would you want in the default “agent receipt”?

GitHub: https://github.com/Keesan12/Martin-Loop
Site: https://martinloop.com

Top comments (1)

keesan.eth

The hardest part I’m thinking through right now is safe halting.

A dumb token cap can stop an agent mid-change and leave the repo inconsistent.

The better model may be halt boundaries: only check budget at clean state transitions, then stop with an actionable diagnostic.
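Roughly what I have in mind, as a sketch only; the names are placeholders, not the actual implementation:

```python
# Sketch of a "halt boundary" check, assuming a token budget and a notion of
# clean repo state. Function and field names are hypothetical.
class BudgetExceeded(Exception):
    def __init__(self, diagnostic):
        super().__init__(diagnostic)
        self.diagnostic = diagnostic

def maybe_halt(tokens_used, token_cap, repo_is_clean, last_completed_step):
    """Only enforce the cap at a clean state transition, so a halt
    never strands a half-applied change."""
    if repo_is_clean and tokens_used >= token_cap:
        raise BudgetExceeded(
            f"Halted at clean boundary after '{last_completed_step}': "
            f"{tokens_used}/{token_cap} tokens used. Safe to resume, revert, or rerun."
        )
```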

Curious how others would design that.

Trace intelligence is also interesting: once every run emits a structured JSONL record, those records can be mined for recurring failure patterns.