DEV Community: keesan.eth

Why the retry loop is usually the expensive part of agent work

keesan.eth — Wed, 17 Jun 2026 01:20:28 +0000

The first failure usually is not the expensive one.

The expensive part is what happens after the first failure when the system keeps trying, keeps spending, and keeps producing the same outcome because nothing about the situation changed.

We kept running into a simple pattern: the agent would miss a step, the runtime would retry, the next attempt would see the same state, and the loop would repeat until the cost was visible in the bill or the operator log. At that point the problem stops being a model-quality issue and becomes a control-system issue.

Why the loop hurts more than the mistake

A single bad step is recoverable. An unbounded retry loop compounds the mistake.

That is true for token spend, API calls, and operator attention. It is also true for trust. Once a system gets a reputation for wandering, people stop letting it touch real work.

The failure mode is boring, which is why it gets missed. Nobody looks at a happy-path demo and thinks about what happens after the third identical error. But that is where the real cost lives.

What we tried first

The obvious moves are usually the wrong ones:

make the prompt longer
add a generic retry
increase the timeout
let the model reason more
rerun the same command with slightly different wording

Those changes can make a demo look better, but they do not fix a stuck loop.

If the environment is unchanged, a retry is often just a second copy of the same mistake.

What actually worked

The fix was not smarter language. It was stricter boundaries.

We had to make the runtime answer four questions before it kept going:

What is the budget?
What counts as success?
What is the verifier?
What happens when the same failure repeats?

A small policy block is often enough to make that concrete:

{
  "budget_cap": 250,
  "max_attempts": 3,
  "stop_on_same_error": true,
  "require_verifier": true,
  "emit_receipt": true
}

That does not sound ambitious. That is the point.

The biggest reliability gain came from refusing to treat repeated failure as progress. Once the runtime could detect the same blocker two or three times in a row, it had permission to stop instead of pretending the next rerun would somehow be different.
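Here is a minimal sketch of how a runtime might enforce that policy. It reuses the field names from the policy block above, but everything else is an assumption for illustration: `run_bounded`, the shape of an attempt (a callable returning a cost and an optional error string), and the unit of `budget_cap` are hypothetical, not MartinLoop's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Policy:
    budget_cap: float = 250.0       # unit is an assumption (dollars, tokens, ...)
    max_attempts: int = 3
    stop_on_same_error: bool = True
    require_verifier: bool = True


# Hypothetical attempt shape: returns (cost, error); error is None on apparent success.
Attempt = Callable[[], Tuple[float, Optional[str]]]


def _result(status: str, reason: str, attempts: int, spent: float, error: Optional[str]) -> dict:
    return {"status": status, "reason": reason, "attempts": attempts,
            "spent": spent, "last_error": error}


def run_bounded(attempt: Attempt, verifier: Callable[[], bool], policy: Policy) -> dict:
    spent = 0.0
    last_error: Optional[str] = None
    for n in range(1, policy.max_attempts + 1):
        cost, error = attempt()
        spent += cost
        # Checked after the attempt, so the cap can be overshot by one attempt's cost.
        if spent > policy.budget_cap:
            return _result("stopped", "budget_cap", n, spent, error)
        # Apparent success only counts once the verifier agrees.
        if error is None and policy.require_verifier and not verifier():
            error = "verifier_failed"
        if error is None:
            return _result("done", "success", n, spent, None)
        # The same blocker twice in a row is not progress: stop.
        if policy.stop_on_same_error and error == last_error:
            return _result("stopped", "same_error", n, spent, error)
        last_error = error
    return _result("stopped", "max_attempts", policy.max_attempts, spent, last_error)
```

The returned dict doubles as a very small receipt: it records how many attempts ran, what was spent, and why the run stopped.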

Why receipts matter

Receipts turn a run from a vague story into a checkable fact.

A receipt should show:

what the agent tried
what changed
what failed
why the run stopped

Without that, a loop can hide inside a confident-sounding summary. With it, you can see the exact stopping point and decide whether the next action should be a human intervention, a different tool, or no action at all.
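As a rough sketch, the four receipt fields listed above could be written as one JSON line per run. The `Receipt` class, its field names, and the `to_jsonl_line` helper are assumptions made up for this example, not a documented format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Receipt:
    tried: List[str] = field(default_factory=list)    # what the agent tried
    changed: List[str] = field(default_factory=list)  # what changed
    failed: List[str] = field(default_factory=list)   # what failed
    stop_reason: str = ""                             # why the run stopped


def to_jsonl_line(receipt: Receipt) -> str:
    # One compact JSON object per line, so receipts can be appended to a log file.
    return json.dumps(asdict(receipt), sort_keys=True)
```

One line per run keeps the log append-only and easy to search when someone asks why a run stopped.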

That is also why this kind of work ends up feeling less like prompt engineering and more like operations.

The tradeoff

Stricter control means the system stops earlier.

That can feel annoying when you want the agent to push through friction. But earlier stopping is cheaper than a long blind retry sequence. More importantly, it preserves operator trust.

A bounded agent is less flashy than an agent that never gives up. It is also much more usable.

That is the core of the control-layer approach we keep coming back to in MartinLoop: the runtime should know when to stop, when to ask for help, and when to write down what happened.

What we are watching next

The next improvement is not more retries.

It is better failure classification so the runtime can separate:

missing permission
stale state
tool mismatch
external outage
real task completion

When those are distinct, the system can choose a better next step instead of recycling the same command.
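A minimal sketch of what that classification might look like, assuming plain substring matching on error messages. The patterns, the `FailureKind` names, and the suggested next steps are all illustrative guesses. Task completion is left out here on the assumption that the verifier decides it, not the classifier.

```python
from enum import Enum


class FailureKind(Enum):
    MISSING_PERMISSION = "missing_permission"
    STALE_STATE = "stale_state"
    TOOL_MISMATCH = "tool_mismatch"
    EXTERNAL_OUTAGE = "external_outage"
    UNKNOWN = "unknown"


# Hypothetical patterns; a real runtime would tune these to its own tools.
PATTERNS = [
    (FailureKind.MISSING_PERMISSION, ("permission denied", "forbidden", "unauthorized")),
    (FailureKind.EXTERNAL_OUTAGE, ("service unavailable", "connection refused", "timed out")),
    (FailureKind.STALE_STATE, ("conflict", "out of date", "stale")),
    (FailureKind.TOOL_MISMATCH, ("command not found", "unknown option")),
]

# Hypothetical next steps; only two of them involve running anything again.
NEXT_STEP = {
    FailureKind.MISSING_PERMISSION: "ask_human",
    FailureKind.STALE_STATE: "refresh_state_then_retry",
    FailureKind.TOOL_MISMATCH: "switch_tool",
    FailureKind.EXTERNAL_OUTAGE: "wait_then_retry",
    FailureKind.UNKNOWN: "stop",
}


def classify(message: str) -> FailureKind:
    text = message.lower()
    for kind, needles in PATTERNS:
        if any(needle in text for needle in needles):
            return kind
    return FailureKind.UNKNOWN


def next_step(message: str) -> str:
    return NEXT_STEP[classify(message)]
```

Unrecognized failures map to `stop` rather than to another retry, which keeps the default consistent with the rest of the policy.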

That is the line between an agent that looks autonomous and an agent that is actually operable.

What failure shape are you still letting your runtime retry too many times?

The expensive part of an AI agent failure is usually the retry loop

keesan.eth — Sat, 13 Jun 2026 01:19:23 +0000

The first failure usually is not the expensive one.

The expensive part is what happens after the first failure when the system keeps trying, keeps spending, and keeps producing the same outcome because nothing about the situation changed.

We kept running into a simple pattern: the agent would miss a step, the runtime would retry, the next attempt would see the same state, and the loop would repeat until the cost was visible in the bill or the operator log. That is the point where the problem stops being a model-quality issue and becomes a control-system issue.

Why the loop hurts more than the mistake

A single bad step is recoverable. An unbounded retry loop compounds the mistake.

That is true for token spend, API calls, and operator attention. It is also true for trust. Once a system gets a reputation for wandering, people stop letting it touch real work.

The failure mode is boring, which is why it gets missed. Nobody looks at a happy-path demo and thinks about what happens after the third identical error. But that is where the real cost lives.

What we tried first

The obvious moves are usually the wrong ones:

make the prompt longer
add a generic retry
increase the timeout
let the model "reason more"
rerun the same command with slightly different wording

Those changes can make a demo look better, but they do not fix a stuck loop.

If the environment is unchanged, a retry is often just a second copy of the same mistake.

What actually worked

The fix was not smarter language. It was stricter boundaries.

We had to make the runtime answer four questions before it kept going:

What is the budget?
What counts as success?
What is the verifier?
What happens when the same failure repeats?

A small policy block is often enough to make this concrete:

{
  "budget_cap": 250,
  "max_attempts": 3,
  "stop_on_same_error": true,
  "require_verifier": true,
  "emit_receipt": true
}

That does not sound ambitious. That is the point.

Why receipts matter

Receipts turn a run from a vague story into a checkable fact.

A receipt should show:

what the agent tried
what changed
what failed
why the run stopped

That is also why this kind of work ends up feeling less like prompt engineering and more like operations.

The tradeoff

Stricter control means the system stops earlier.

That can feel annoying when you want the agent to push through friction. But earlier stopping is cheaper than a long blind retry sequence. More importantly, it preserves operator trust.

A bounded agent is less flashy than an agent that "never gives up." It is also much more usable.

That is the core of the control-layer approach we keep coming back to in MartinLoop: the runtime should know when to stop, when to ask for help, and when to write down what happened.

What we are watching next

The next improvement is not more retries.

It is better failure classification so the runtime can separate:

missing permission
stale state
tool mismatch
external outage
real task completion

When those are distinct, the system can choose a better next step instead of recycling the same command.

That is the line between an agent that looks autonomous and an agent that is actually operable.

What failure shape are you still letting your runtime retry too many times?

The most expensive AI agent failures are boring

keesan.eth — Fri, 05 Jun 2026 06:07:11 +0000

Most AI coding agent failures are boring.

Not dramatic.
Not cinematic.
Just the same wrong step repeated until the bill gets weird and someone asks what happened.

That is why I think the most important control is not “use a cheaper model.”
It is “before another retry, show what changed.”

If nothing changed, stop.

That one rule kills a surprising amount of fake progress.
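The rule is small enough to sketch directly. This assumes the runtime can snapshot the relevant state as a flat dict with string keys before each attempt; `changed_keys` and `may_retry` are hypothetical helper names.

```python
_MISSING = object()  # sentinel so "key absent" differs from "key set to None"


def changed_keys(before: dict, after: dict) -> list:
    """Return the sorted keys whose values differ between two state snapshots."""
    keys = set(before) | set(after)
    return sorted(
        k for k in keys
        if before.get(k, _MISSING) != after.get(k, _MISSING)
    )


def may_retry(before: dict, after: dict) -> bool:
    # If nothing changed since the last attempt, another retry is not allowed.
    return bool(changed_keys(before, after))
```

For example, `changed_keys({"branch": "main", "tests": "failing"}, {"branch": "main", "tests": "failing"})` returns an empty list, so the runtime stops instead of rerunning the same step.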

The other three controls I would put in early are:

a hard budget cap
one real verification gate
a receipt that explains why the run stopped

That is the class of problem we have been working on with MartinLoop.

Not making agents feel magical.
Making them easier to trust when the loop gets messy.

AI coding agents don't fail because they're dumb. They fail because they don't know when to stop.

keesan.eth — Wed, 03 Jun 2026 16:05:07 +0000

Yesterday we launched MartinLoop on Product Hunt.

The biggest thing we keep seeing with AI coding agents is simple:

They do not fail because they are "bad at coding."

They fail because they do not know when to stop.

That creates a very specific kind of pain:

the same mistake gets retried over and over
a small bug turns into a weirdly expensive afternoon
someone still has to explain what happened after the run is over

That is the whole reason we built MartinLoop.

The job is not to make an agent feel smarter.
The job is to give it a budget, a finish line, and a receipt.

The pattern we keep hearing from teams is basically:

"It wasn't one catastrophic failure. It was 40 small dumb retries that nobody caught fast enough."

That is a systems problem, not a prompt problem.

If you are using coding agents already, the 3 controls that matter most are:

A hard budget cap before the run starts.
A real verification gate before the run counts as done.
A receipt you can read later when somebody asks, "why did this cost so much?"

If that pain sounds familiar, that is exactly what we are working on.

If you want to support the Product Hunt launch, I would appreciate it.
More importantly, I would love to hear the story of the most annoying AI-agent failure you have seen in the wild.

If your coding agent can retry forever, it will

keesan.eth — Tue, 02 Jun 2026 04:59:17 +0000

If an AI coding agent can keep retrying with no budget cap, no finish line, and no check before it exits, the problem is not the model.

The problem is the missing operating system around it.

Three simple things make a huge difference:

Put a real dollar cap on the run.
Require one clear verification step before calling it done.
Keep a receipt of what actually happened.

Most teams do not need more autonomy.
They need a clean stop condition.

That is the difference between a helpful agent and a very expensive loop.

What Actually Makes Social Automation Reliable

keesan.eth — Sun, 31 May 2026 20:03:46 +0000

A reliable social automation stack is not built by stacking more retries on top of brittle behavior.

The durable pattern is simpler:

use official APIs where they exist
keep browser execution as a controlled fallback
require both a receipt and a verified postcondition before counting a run
fail closed when the platform state does not match the reported result

That discipline matters more than raw surface area. A smaller set of lanes with honest verification is worth more than a wider setup that quietly reports false success.
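A rough sketch of the "receipt plus verified postcondition, fail closed" rule from the list above. The `count_run` helper and the idea of passing the postcondition as a callable are assumptions for illustration. In practice the postcondition would re-read the platform state, for example by checking that the post actually exists.

```python
from typing import Callable, Optional


def count_run(receipt: Optional[dict], postcondition: Callable[[], bool]) -> bool:
    """Count a run only if it has a receipt AND the postcondition verifies."""
    if not receipt:
        return False  # no receipt, no credit
    try:
        # Only an explicit True counts; anything else is treated as failure.
        return postcondition() is True
    except Exception:
        # Fail closed: if verification itself breaks, the run does not count.
        return False
```

The important design choice is the default: every uncertain path returns `False`, so the system can under-report success but never quietly over-report it.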

Receipts beat scheduled optimism

keesan.eth — Sun, 31 May 2026 20:00:30 +0000

The fastest way to lose trust in an automation is to mistake a schedule for a result.

We have been rebuilding our execution stack around one rule: if a worker cannot show the exact action it took or the exact blocker it hit, it did not finish the job.

That has forced us to simplify a lot. Fewer lanes. Better proofs. More honest failure states.

The upside is that the system gets easier to trust once every action has to survive real verification.

MartinLoop: a control plane for AI coding agents

keesan.eth — Wed, 27 May 2026 01:39:14 +0000

MartinLoop

MartinLoop is an open-source control plane for AI coding agents.

It adds hard budget stops, JSONL run records, and verify-gated completion so autonomous coding stays accountable.

We built it because agent loops are powerful, but most teams still do not have enough control over cost, retries, or proof of completion.

If you are using AI coding agents in production, I would love to hear how you are handling governance, cost ceilings, and verification.

AI Coding Agents Are Burning Budgets. The Next Layer Is Control

keesan.eth — Tue, 12 May 2026 01:08:29 +0000

AI coding agents are becoming useful, but they still burn budgets, loop on bad strategies, and finish without enough evidence. The next layer is trace intelligence, model routing, and control.

AI coding agents are getting better.

They can read a repo, edit files, run tests, inspect errors, and try again.

That is useful.

But the problem showing up in real workflows is not just whether agents can write code.

The problem is that agents can spend budget without producing finished work.

They loop.

They retry weak strategies.

They switch files without explaining why.

They chase unrelated errors.

They claim completion without enough proof.

And when the run ends, the human still has to ask:

What actually happened?

That is the gap the next generation of agent infrastructure has to solve.

Not more autonomy first.

Control first.

The Problem Is Not Just Bad Code

A bad patch is easy to see.

A bad agent run is harder.

The agent may do a lot of work that looks productive:

read many files
generate a long plan
edit several modules
run commands
inspect failures
produce a confident summary

But at the end, the task is still not done.

The budget is gone.

The repo is messy.

The logs are unclear.

The next engineer has to reconstruct the run from fragments.

This is why agentic coding needs a better unit of accountability.

Not just the final diff.

The full trace.

The Trace Becomes The Product

A coding agent trace should not be an afterthought.

It should be the primary artifact of the run.

A useful trace answers:

What did the agent try first?
Where did it get stuck?
Which files did it touch?
Which commands did it run?
Which verifier failed?
Did it repeat the same strategy?
Did it switch models?
Did it exceed budget?
Why did it stop?
What should a human do next?

This is what I think of as trace intelligence.

Not just raw logs.

Not just token usage.

Not just a transcript.

Trace intelligence means turning the run into something a human, system, or second agent can reason about.

The trace should explain the work.
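As one possible illustration, a few of the questions above can be answered mechanically if the trace is stored as JSON lines. The event schema here (`type`, `path`, `command`, `reason`) is invented for this sketch and is not MartinLoop's record format.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_trace(lines: Iterable[str]) -> dict:
    """Summarize a JSONL trace using a hypothetical event schema."""
    events = [json.loads(line) for line in lines if line.strip()]
    # Which files did it touch?
    files = sorted({e["path"] for e in events if e.get("type") == "edit"})
    # Which commands did it run, and did it repeat any of them?
    commands = [e["command"] for e in events if e.get("type") == "command"]
    repeated = {cmd: n for cmd, n in Counter(commands).items() if n > 1}
    # Why did it stop? Use the last stop event, if there is one.
    stop_reason = next(
        (e.get("reason") for e in reversed(events) if e.get("type") == "stop"),
        None,
    )
    return {
        "files_touched": files,
        "commands_run": commands,
        "repeated_commands": repeated,
        "stop_reason": stop_reason,
    }
```

A repeated command is only a hint that the agent reused a strategy, but it is a cheap signal for a human or a second agent to start from.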

Why Model Routing Matters

Most agent workflows still treat model choice too casually.

One model may be good at planning.

Another may be better at code edits.

Another may be cheaper for search, summarization, or test-output analysis.

Another may be stronger for final review.

But without a control layer, model routing becomes guesswork.

A better system should ask:

Is this step worth a premium model?
Can a cheaper model classify this failure?
Should a stronger model review the plan before execution?
Should the run downgrade when budget is tight?
Should the run escalate when repeated failures appear?

Model routing should not just optimize quality.

It should optimize quality within budget.
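To make "quality within budget" concrete, here is a small routing sketch. The tier names, per-step costs, step names, and the two-failure escalation threshold are all made-up assumptions, not measured prices or a real routing table.

```python
from typing import Optional

ORDER = ["cheap", "standard", "premium"]                  # hypothetical tiers
COSTS = {"cheap": 1.0, "standard": 5.0, "premium": 20.0}  # hypothetical per-step costs
DEFAULT_TIER = {                                          # hypothetical step names
    "classify_failure": "cheap",
    "summarize": "cheap",
    "edit_code": "standard",
    "review_plan": "premium",
}


def choose_tier(step: str, remaining_budget: float, repeated_failures: int = 0) -> Optional[str]:
    idx = ORDER.index(DEFAULT_TIER.get(step, "standard"))
    # Escalate one tier when the same step keeps failing.
    if repeated_failures >= 2:
        idx = min(idx + 1, len(ORDER) - 1)
    # Downgrade while the chosen tier does not fit the remaining budget.
    while idx > 0 and COSTS[ORDER[idx]] > remaining_budget:
        idx -= 1
    # If even the cheapest tier is unaffordable, route nowhere: the run should stop.
    if COSTS[ORDER[idx]] > remaining_budget:
        return None
    return ORDER[idx]
```

Budget wins over escalation in this sketch: a step that has failed twice still gets downgraded if the premium tier no longer fits.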

That matters because the most painful agent failure is not always wrong code.

Sometimes it is expensive unfinished work.

Headless Agents Need More Guardrails, Not Fewer

Headless coding agents are especially interesting.

They can run without a constant human in the loop.

They can process tasks, inspect repos, execute commands, and produce outputs asynchronously.

That is powerful.

But headless execution increases the need for control.

If an agent is running without a developer watching every step, the system needs stronger answers to basic questions:

What is this agent allowed to do?
What budget can it spend?
What commands are blocked?
What verifier defines success?
When should it stop?
When should it ask for approval?
What trace does it leave behind?
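Two of those questions, which commands are blocked and when to ask for approval, can be sketched as a small gate in front of the shell. The command lists are hypothetical examples, and this naive check only looks at the first words of the command, so it does not catch pipes, subshells, or `sh -c` wrappers.

```python
import shlex

BLOCKED_PROGRAMS = {"rm", "sudo"}                        # hypothetical blocklist
NEEDS_APPROVAL = {("git", "push"), ("npm", "publish")}   # hypothetical approval list


def decide(command: str) -> str:
    """Return "allow", "ask_approval", or "block" for a shell command string."""
    try:
        argv = shlex.split(command)
    except ValueError:
        return "block"  # unparseable input (e.g. unbalanced quotes): fail closed
    if not argv:
        return "block"
    if argv[0] in BLOCKED_PROGRAMS:
        return "block"
    if tuple(argv[:2]) in NEEDS_APPROVAL:
        return "ask_approval"
    return "allow"
```

A stricter variant would flip the default and allow only an explicit list of programs, which is usually the safer starting point for unattended runs.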

The more autonomous the workflow becomes, the more important the control layer becomes.

Autonomy without traceability is not leverage.

It is invisible execution.

Agent Teams Make The Problem Bigger

The next step is not one agent.

It is teams of agents.

A planner agent.

A coding agent.

A reviewer agent.

A test agent.

A documentation agent.

A security agent.

A release agent.

That sounds useful, but it also creates a new coordination problem.

If one agent produces a bad plan, another may execute it.

If the reviewer misses the issue, the system may mark the run complete.

If the test agent checks the wrong verifier, the whole workflow may look successful while still being wrong.

Agent-to-agent workflows need shared state, shared budgets, shared traces, and shared stop conditions.

Otherwise, teams of agents can become teams of budget-burning loops.
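A shared budget is the easiest of those to sketch. This assumes all agents run in one process and charge a single ledger before each step. `SharedBudget` and the agent names are hypothetical.

```python
import threading


class SharedBudget:
    """One spending cap shared by every agent on the team."""

    def __init__(self, cap: float) -> None:
        self.cap = cap
        self.spent = 0.0
        self.by_agent: dict = {}
        self._lock = threading.Lock()

    def charge(self, agent: str, cost: float) -> bool:
        """Record the cost and return True, or refuse and return False."""
        with self._lock:
            if self.spent + cost > self.cap:
                return False  # shared stop condition: the team is out of budget
            self.spent += cost
            self.by_agent[agent] = self.by_agent.get(agent, 0.0) + cost
            return True
```

Because every agent charges the same ledger, a planner that overspends leaves less for the reviewer, and the per-agent totals show where the budget went.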

The question becomes:

Who governs the team?

That is where a control layer becomes necessary.

What MartinLoop 360 Is Pointing Toward

The direction I am exploring with MartinLoop is a control layer for agentic coding workflows.

The current idea is simple:

Every agent run should be bounded, inspectable, and test-verifiable.

The next layer expands that into a broader loop:

Trace intelligence to understand what happened during a run
Model routing to choose the right model for the right step
HeadlessOS for controlled background execution
MartinLoop 360 as a higher-level view of agent runs, budgets, traces, policies, and outcomes

The goal is not to make agents look more magical.

The goal is to make them easier to trust.

If an agent burns budget and fails, that should be visible.

If an agent loops, that should be classified.

If an agent completes a task, that should be verified.

If multiple agents collaborate, the team should leave one coherent trace.

The Core Loop

A governed agent workflow should look less like this:


Prompt → Agent runs → Agent says done

I’m exploring these ideas while building MartinLoop, an open-source control layer for AI coding agents.

GitHub: https://github.com/Keesan12/Martin-Loop 

Website: https://martinloop.com

AI coding agents need receipts, not just better prompts

keesan.eth — Mon, 11 May 2026 17:46:15 +0000

AI coding agents are getting good enough to run real engineering tasks, but not safe enough to run without guardrails.

The failure mode is not always dramatic.

Sometimes the agent just keeps working.

It retries.
It rewrites.
It spends tokens.
It changes files.
It says it is done.

Then another engineer opens the diff and realizes the agent solved the wrong problem.

That creates a new engineering question:

Can another engineer audit this run later?

That is why I’m building MartinLoop.

MartinLoop is an open-source control plane for AI coding agents. The goal is to make every agent run bounded, inspectable, and test-verifiable.

The first version focuses on:

hard budget caps
JSONL run records
audit trails
failure classification
test-verified completion
reproducible agent runs

The thesis is simple:

The next layer of AI coding is not only better prompts.

It is governance.

Before agents touch serious repos, teams need receipts:

what the agent tried
what it changed
how much it spent
what commands it ran
what tests passed
what failed
why it stopped
whether a human can resume, revert, or rerun it

I’m looking for feedback from developers using Claude Code, Codex, Cursor, Devin-style agents, or custom coding agents in real repos.

What would you want in the default “agent receipt”?

GitHub: https://github.com/Keesan12/Martin-Loop
Site: https://martinloop.com

Come Build on Concordium

keesan.eth — Tue, 30 Aug 2022 19:57:03 +0000

Concordium is the only public blockchain with a privacy-based ID layer at the protocol level, and it is built in Rust. It is the only blockchain with user attributes accessible from smart contracts, and it is built to be enterprise-grade and compliant by design.

We welcome all #rustdevs to test us out and help us build out our bounties as we look to create new tools for integrations, interoperability, and dApps.

The blockchain for the future has arrived!