Ernesto Herrera Salinas

Posted on Jun 26

The Iron Man Protocol: Turning Agent Mistakes Into Durable Engineering Feedback

#ai #agents #openai #productivity

How a Team Can Make a Coding Agent Improve Without Retraining It

Tony Stark does not get stronger by force of will. He gets stronger because he is an engineer who treats every failed suit as evidence: he records what broke, changes the armor, adds a safeguard, and tests the next version under harder conditions. The suit does not learn. Stark learns, and the suit carries the lesson forward.

That distinction is the whole point of this article, so it is worth being precise about who plays which role. A coding agent like Codex is the suit. Your team is Stark. When Codex misunderstands a repository convention, skips a test, or patches the wrong file, the interaction does not update the underlying model's weights. But the system around the agent can be re-engineered after each failure so the next run carries the lesson: capture the failure, diagnose it, encode the fix as a durable artifact, verify it, and retire the artifact once it stops earning its place.

That loop is the Iron Man Protocol. The rest of this article is how to run it without it collapsing under its own weight.

Who This Is For

This article is for teams using a coding agent for repository work, code review, CI fixes, migrations, generated-code workflows, and other repeated engineering tasks, especially anyone who has watched the same agent mistake happen twice and wants something stronger than "please remember next time."

The named surfaces below (AGENTS.md, skills, hooks, memories, evals) are spelled the way Codex spells them. Many coding-agent platforms expose comparable primitives, though names, locations, and enforcement behavior differ. Treat "Codex" throughout as the concrete example, not as the only possible implementation.

The One Idea That Matters

Make the next failure harder to repeat by changing the environment, not by exhorting the model.

This rests on a single research finding: language models are unreliable at correcting their own reasoning when the only signal is their own reflection. They improve far more dependably when given an external, executable signal: a failing test, a trace, a CI log, a reviewer comment, a concrete rule. [4][5] The nuance matters: self-correction is not worthless; it works when the model can check its work against something real. [1] So the engineering goal is not "tell the agent to reflect harder." It is "put a real signal in front of the agent."

That is the difference between a weak loop and a strong one:

Weak:   Think harder and avoid that mistake again.

Strong: Reproduce the failing case, find the root cause, patch the smallest
        responsible code path, add a regression test, update the repository
        rule if this convention recurs, and rerun the relevant suite.

The strong loop hands the agent evidence. It turns a mistake into a constraint. Everything below is machinery for producing those constraints, and for removing them once they stop paying rent.

The Research, Briefly

A few results carry this argument:

Self-correction has limits. Models often fail to fix their own reasoning without a reliable external signal. [4][5] This is the reason the protocol refuses to depend on the model "noticing" its mistake.
Verbal feedback can persist across attempts. Reflexion showed agents improving without weight updates by storing linguistic feedback from prior tries. [2] That is the mechanism the protocol industrializes.
A growing, reusable skill library compounds. Voyager paired execution feedback and self-verification with an ever-expanding skill set. [8] Tests, skills, hooks, evals, and AGENTS.md are that skill library for a coding agent.

Software-engineering benchmarks reinforce the practical shape: coding agents need repository context, usable tools, good localization, and validation loops. [9][10] Agentless adds a useful caution: simpler structured pipelines for localization, repair, and patch validation can outperform more autonomous loops. [11] The lesson is not "maximize autonomy." It is "build reliable feedback channels."

Adjacent research on iterative refinement, tool-using agents, and search-style reasoning is relevant to the broader agent landscape, but it is not the core proof for this protocol. [3][6][7] Model-level approaches such as process supervision, reasoning bootstrapping, and reinforcement learning for self-correction may improve future models, but most teams cannot touch training-time systems. [18][19][20] For day-to-day engineering teams, the leverage stays in the environment.

What "Learning" Actually Means Here

Three kinds of learning, only one of which you control durably:

Kind	Where it lives	Lifespan	Your leverage
Training-time	The model weights	Permanent	Almost none
In-context	The current thread	Until the thread ends	Real but temporary
System-level	The repo, CI, instructions, skills	Until you retire it	High: this is the protocol

If the agent fixes a bug after seeing a failing test, it improved in the session. If the repository now carries a regression test that blocks that bug from returning, the system improved. The Iron Man Protocol is entirely about that third row.

The Protocol

Seven steps. The seventh is the one most "lessons learned" systems forget, and the reason most of them rot.

1. Capture the failure

Do not let the failure live only in a chat transcript. Record it while the evidence is fresh:

## Mistake Record
- Date / Task:
- What the agent did / What was wrong:
- How it was detected:
- Root cause:
- Correct behavior:
- Durable artifact needed (test / rule / skill / hook / memory / eval):

This is compression, not bureaucracy: a confusing interaction becomes a reusable lesson.
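If the record is kept as data rather than free text, it can be validated and rendered consistently. The following is a minimal sketch, not a prescribed format: the class name `MistakeRecord`, its field names, and the `ARTIFACT_KINDS` set are hypothetical and simply mirror the template above.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical set of artifact kinds, copied from the template's last field.
ARTIFACT_KINDS = {"test", "rule", "skill", "hook", "memory", "eval"}


@dataclass(frozen=True)
class MistakeRecord:
    """One captured failure; fields mirror the Mistake Record template."""
    when: date
    task: str
    what_went_wrong: str
    detected_by: str
    root_cause: str
    correct_behavior: str
    artifact: str  # must be one of ARTIFACT_KINDS

    def __post_init__(self) -> None:
        # Reject records that do not commit to a known durable artifact.
        if self.artifact not in ARTIFACT_KINDS:
            raise ValueError(f"unknown artifact kind: {self.artifact!r}")

    def to_markdown(self) -> str:
        """Render the record in the same shape as the template."""
        return "\n".join([
            "## Mistake Record",
            f"- Date / Task: {self.when.isoformat()} / {self.task}",
            f"- What the agent did / What was wrong: {self.what_went_wrong}",
            f"- How it was detected: {self.detected_by}",
            f"- Root cause: {self.root_cause}",
            f"- Correct behavior: {self.correct_behavior}",
            f"- Durable artifact needed: {self.artifact}",
        ])
```

Forcing the artifact field to a known kind means a record cannot be filed without deciding where the lesson will live.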

2. Diagnose along two axes

Earlier versions of this protocol used a flat list of seven failure "types." That list leaks: real failures rarely sit in one bucket. The worked example below is a context failure and a verification failure at once. Diagnose along two orthogonal axes instead:

Context axis: did the agent lack something it needed to know? A convention, dependency, product rule, hidden workflow step?
Verification axis: did the agent fail to check what it changed? Wrong test, skipped lint/typecheck, no reproduction, missed regression?

Most mistakes score on both. The axes matter because they point to different artifacts: context gaps become knowledge (an AGENTS.md rule, a skill); verification gaps become enforcement (a test, a hook, a CI check). The familiar labels, such as bad localization, over-broad change, tool misuse, review miss, and memory gap, are common positions on these two axes, not separate categories.
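The two-axis mapping can be written down in a few lines. This is an illustrative sketch only; the names `Diagnosis` and `artifact_families` are hypothetical, and the two family labels are taken from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    context_gap: bool       # the agent lacked something it needed to know
    verification_gap: bool  # the agent did not check what it changed


def artifact_families(d: Diagnosis) -> list[str]:
    """Map a two-axis diagnosis to the artifact families it points to."""
    families = []
    if d.context_gap:
        families.append("knowledge")    # e.g. an AGENTS.md rule or a skill
    if d.verification_gap:
        families.append("enforcement")  # e.g. a test, a hook, or a CI check
    return families
```

A failure that scores on both axes returns both families, which matches the claim that most mistakes need a knowledge artifact and an enforcement artifact together.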

3. Choose the durable artifact

The rule: never store an important lesson only as prose in a conversation. Match the artifact to the failure:

Failure	Durable upgrade	Use when	Do not use when
Recurring repo convention	`AGENTS.md` rule	It should govern future work in the repo/subtree.	It is temporary, personal, or better enforced by a test.
Skipped validation	Hook / CI check / test command	The same check keeps getting missed.	It is noisy, slow, or not yet well understood.
Behavioral bug	Regression test	The issue is user-visible and reproducible.	It is not yet deterministic enough to encode.
Repeated multi-step task	Skill	The task has repeatable steps, references, or scripts.	One sentence in `AGENTS.md` is enough.
Stable local preference	Memory	It helps future local sessions only.	It must bind a team or repo; use checked-in rules.
Agent-behavior failure	Eval case	You need to measure whether behavior improved.	You have no harness, expected behavior, or scoring.
Sequence/handoff failure	Trace + postmortem	The lesson depends on ordering or context.	A small test or rule already captures it.

Codex-specific surfaces have sharp edges. AGENTS.md is durable repository guidance. [12] Memories are optional local recall, not a system of record for mandatory team policy. [13] Skills are reusable workflows. [14] Hooks are lifecycle automation and require configuration and trust review. [15] Evals require a defined task, expected behavior, harness, and scoring method. [16][17]

Every artifact you add is a liability as well as an asset: it must be maintained, it consumes attention or context, and it can go stale. Add it deliberately, and tag it so Step 7 can find it later.

4. Make the lesson verifiable

A lesson is only as good as the evidence the agent can check it against. This is the highest-leverage move in the protocol, so be ruthless about specificity:

Weak:    - Be careful with authentication.
Strong:  - When editing OAuth callback handling, run `npm test -- auth callback`
           and verify expired-state rejection.

Weak:    - Do not break migrations.
Strong:  - When changing DB migrations, include rollback coverage and run
           `make test-migrations`.

A strong instruction names a trigger ("when editing X"), an action ("run Y"), and an observable result ("verify Z"). The agent can follow specific rules and check specific evidence; it cannot reliably operationalize a vague warning. If you can write only one artifact per failure, write this one.
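The trigger / action / observable-result test is mechanical enough to lint for. The sketch below is a rough heuristic under stated assumptions: it treats a backticked command as the action, and it treats either an explicit verb such as "verify" or a backticked test/check command as the observable result. The function name `missing_parts` and the keyword lists are hypothetical and would need tuning for a real instruction file.

```python
import re

# Heuristic patterns; the keyword choices are assumptions, not a standard.
TRIGGER = re.compile(r"\b(?:when|after|before|if)\b", re.IGNORECASE)
COMMAND = re.compile(r"`[^`]+`")
# An explicit verification verb, or a command that carries its own pass/fail.
OBSERVABLE = re.compile(
    r"\b(?:verify|confirm|expect)\b|`[^`]*(?:test|check)[^`]*`",
    re.IGNORECASE,
)


def missing_parts(rule: str) -> list[str]:
    """Return which of trigger / action / observable result a rule lacks."""
    missing = []
    if not TRIGGER.search(rule):
        missing.append("trigger")
    if not COMMAND.search(rule):
        missing.append("action")
    if not OBSERVABLE.search(rule):
        missing.append("observable result")
    return missing
```

Run against the examples above, both weak rules are flagged on all three parts and both strong rules come back clean. A lint like this cannot judge whether a rule is correct, only whether it is specific enough to act on.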

5. Patch the workflow, not just the code

After every fix, ask: what should exist now so this class of mistake is less likely next time? Step 3 chooses the artifact; Step 5 wires it in: checked into the repo, referenced from AGENTS.md, enforced in CI, packaged as a skill, or run by a configured hook. A lesson that is not wired into the workflow is just a nicer transcript.

6. End with evidence, when the change warrants it

Consequential or correctness-affecting changes should close with evidence. Trivial edits should not; forcing a ceremony on every keystroke just trains people to rubber-stamp it. When it matters, the final report answers:

What changed, and what failure was reproduced?
What check now passes that previously failed?
What durable artifact was added, updated, or retired?
What risk remains?

7. Retire what no longer earns its place

This is the step that keeps the protocol from defeating itself. A pure "capture every lesson" ratchet ends exactly where the agent performs worst: a bloated AGENTS.md that dilutes its own context window, a slow and flaky regression suite, and rules that contradict each other. Durability without pruning is just accumulation.

Give the system a maintenance budget and a retirement rule:

Budget the surfaces. Cap AGENTS.md, for example by requiring the active rules to fit on one screen of scoped, action-oriented lines. When you add a rule that pushes past the cap, retire, merge, or demote an older one.
Date and review. Tag each artifact with the failure it came from and the last time it fired. Review instruction files monthly, or whenever a rule causes confusion. Sweep for artifacts that have not triggered, no longer apply, or duplicate a stronger check.
Resolve conflicts at write time. Before adding a rule, search the existing ones. If the new rule contradicts or supersedes an existing rule, edit in place rather than appending a second voice.
Prefer enforcement over prose. A CI check announces its own failure with a visible signal; a prose rule can be ignored silently. But checks also get flaky, slow, and obsolete, so they need the same retirement discipline as instructions.

Retirement is not optional cleanup; it is what makes "durable" mean load-bearing rather than permanent.
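The budget and the review sweep can both be computed from the tags. A minimal sketch follows, assuming each artifact is tagged with an id, the date it was added, and the date it last fired. The names `Artifact`, `retirement_candidates`, and `over_budget`, the 90-day idle window, and the 40-rule cap are all hypothetical placeholders; pick values that fit your own repository.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Artifact:
    artifact_id: str            # e.g. "schema-client-drift-2026-06"
    added: date
    last_fired: Optional[date]  # None if it has never caught anything


def retirement_candidates(artifacts, today, max_idle_days=90):
    """Return ids of artifacts that have not fired within max_idle_days.

    An artifact that never fired is measured from the day it was added.
    """
    cutoff = today - timedelta(days=max_idle_days)
    return [
        a.artifact_id
        for a in artifacts
        if (a.last_fired or a.added) < cutoff
    ]


def over_budget(rule_lines, max_rules=40):
    """True when the number of top-level '- ' rule lines exceeds the cap."""
    active = [ln for ln in rule_lines if ln.startswith("- ")]
    return len(active) > max_rules
```

A candidate is a prompt for review, not an automatic deletion: a rule that never fires may be guarding a rare but costly failure.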

A Minimal Template

# Iron Man Protocol Entry
## Failure        - what happened / how detected
## Diagnosis      - root cause / context-axis + verification-axis scoring
## Upgrade        - code or test change / instruction update / skill|hook|memory|eval
## Verification   - command run / result / remaining risk
## Bookkeeping    - artifact id + date added; rule it replaces, if any

The point is not a long incident report for every small issue. It is to avoid losing the lesson, and to leave a trail Step 7 can prune.

How You Know It Is Working

The protocol's own thesis is that assertion is weak and evidence is strong, so it must hold itself to the same standard. Do not claim the protocol works. Measure it.

The minimum viable version is a mistake log with tags:

- 2026-06-10: schema-client-drift / context + verification / fixed by check-generated
- 2026-06-18: missed-auth-test / verification / fixed by AGENTS rule + test command
- 2026-06-25: bad-localization / context / fixed by routing note in AGENTS.md

Once the log exists, track three simple signals:

Repeat-failure rate: how often a previously captured failure class recurs.
Artifact hit-rate: how often each durable artifact catches a real problem.
Surface budget: size of AGENTS.md, suite runtime, and number of active rules.

If repeat failures stay flat, the artifacts are not changing behavior. If artifacts never fire, they are retirement candidates, not trophies. If surface budget climbs monotonically, your retirement step is broken.
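The first signal falls straight out of the tagged log. The sketch below assumes entries use the `- DATE: class / axes / fix` shape shown above and are listed in chronological order; the function names `parse_entry` and `repeat_failure_rate` are hypothetical.

```python
def parse_entry(line: str) -> tuple[str, str]:
    """Split '- DATE: class / axes / fix' into (date, failure_class)."""
    day, rest = line.lstrip("- ").split(": ", 1)
    failure_class = rest.split(" / ")[0].strip()
    return day, failure_class


def repeat_failure_rate(lines: list[str]) -> float:
    """Fraction of log entries whose failure class was already captured.

    Assumes the lines are in chronological order.
    """
    seen = set()
    repeats = 0
    for line in lines:
        _, failure_class = parse_entry(line)
        if failure_class in seen:
            repeats += 1
        seen.add(failure_class)
    return repeats / len(lines) if lines else 0.0
```

This definition counts any recurrence of a class as a repeat, which is deliberately strict. A team may prefer to count only recurrences that happen after the durable artifact was wired in.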

A worked example follows. It is illustrative, not evidence. Substitute your own numbers from a real incident, because a real before/after is the only thing that proves the loop pays off.

Worked Example (illustrative)

A task adds a preferred_locale field to an account API response. The code looks fine and backend tests pass, but CI fails later because the generated TypeScript client still expects the old shape. A reviewer notes this has happened before.

Diagnosis. Context axis: the schema-to-client generation step was invisible in the prompt and the repo rules. Verification axis: no drift check ran before the change was considered done. Both axes, one incident.

AGENTS.md, before -> after.

Before:

- Run backend tests before opening a PR.

After:

- When changing API schema files, run `make clients` and commit the regenerated
  client in the same change.
- After clients regenerate, run `make check-generated` before stopping.

Enforcement. Add or confirm a make check-generated drift check in CI that fails when schema changes are not reflected in generated client files. The check is the primary guard; the prose rule tells the agent when to run it before CI does.
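The article does not say how `make check-generated` is implemented, so the following is only one plausible shape for a drift check: hash the generated files, rerun the generator, and report anything that changed. The function names `snapshot` and `drifted_files` are hypothetical, and `regenerate` stands in for whatever actually produces the client (here, presumably `make clients`).

```python
import hashlib
from pathlib import Path
from typing import Callable


def snapshot(directory: Path) -> dict[str, str]:
    """Map each file's relative path to a SHA-256 digest of its contents."""
    return {
        str(p.relative_to(directory)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(directory.rglob("*"))
        if p.is_file()
    }


def drifted_files(directory: Path, regenerate: Callable[[], None]) -> list[str]:
    """Regenerate in place and list files that were added, removed, or changed."""
    before = snapshot(directory)
    regenerate()  # assumption: writes fresh output into `directory`
    after = snapshot(directory)
    return sorted(
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )
```

A CI wrapper would exit non-zero whenever the returned list is non-empty, which is what turns the convention into an executable signal.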

Verification.

- Reproduced: `make check-generated` failed before regeneration.
- Repair: ran `make clients`.
- Result: `make check-generated` passed; schema, mapper, and generated client all changed.
- Remaining risk: a future schema package with a different generator would need the rule extended to name it.
- Bookkeeping: schema-client-drift-2026-06; supersedes the bare "run backend tests" line.

Step 7 follow-through. Because the drift check now enforces the behavior in CI, the prose rule is the weaker, higher-maintenance copy. At the next sweep, consider trimming the AGENTS.md lines to a single pointer: "schema changes are gated by make check-generated in CI." That keeps the green check, not the paragraph, closest to the source of truth.

A 30-Minute Start

Pick one recurring failure from the last week. Write one mistake entry. Add or update one regression test, validation command, or CI check that would have caught it. Add one concrete, triggered AGENTS.md rule that says exactly when to run that check. Rerun the validation and record the result.

Then, if instructions already exist, delete, merge, demote, or mark one stale rule for later review. You have now run the full loop, including retirement, on a single failure.

What to Avoid

"Be more careful" prompts. Too vague to enforce; this is the entire reason for Step 4.
Memory as the system of record. Mandatory project rules belong in checked-in instructions, tests, CI, hooks, or skills, not in per-session memory.
Append-only instruction files. Capturing every lesson and keeping files small are only compatible if you actually run Step 7. Adding a rule without budgeting for its retirement is how AGENTS.md becomes noise.
Evals from imaginary failures. The best eval cases come from real bugs, user corrections, production incidents, CI failures, and reviewer comments.
Automating too early. Understand the failure first. A hook that runs the wrong check has negative value.

The Final Idea

A coding agent does not permanently retrain itself after a bad interaction. But the environment around it can be re-engineered after failures, and pruned when the lessons go stale. The repo gains tests; CI gains checks; the agent gains skills and sharper rules; the eval suite gains real failure cases; and the instruction files stay small because what no longer fires gets removed.

Stark survives because the next suit carries the lesson from the last fight, and because he strips the parts that stopped working. Build the agent's environment the same way.

References

[1] OpenAI Developers, "Prompting - Codex." https://developers.openai.com/codex/prompting

[2] Noah Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning." https://arxiv.org/abs/2303.11366

[3] Aman Madaan et al., "Self-Refine: Iterative Refinement with Self-Feedback." https://arxiv.org/abs/2303.17651

[4] Jie Huang et al., "Large Language Models Cannot Self-Correct Reasoning Yet." https://arxiv.org/abs/2310.01798

[5] Ryo Kamoi et al., "When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs." https://arxiv.org/abs/2406.01297

[6] Shunyu Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models." https://arxiv.org/abs/2210.03629

[7] Shunyu Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." https://arxiv.org/abs/2305.10601

[8] Guanzhi Wang et al., "Voyager: An Open-Ended Embodied Agent with Large Language Models." https://arxiv.org/abs/2305.16291

[9] Carlos E. Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" https://arxiv.org/abs/2310.06770

[10] John Yang et al., "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering." https://arxiv.org/abs/2405.15793

[11] Chunqiu Steven Xia et al., "Agentless: Demystifying LLM-based Software Engineering Agents." https://arxiv.org/abs/2407.01489

[12] OpenAI Developers, "Custom instructions with AGENTS.md - Codex." https://developers.openai.com/codex/guides/agents-md

[13] OpenAI Developers, "Memories - Codex." https://developers.openai.com/codex/memories

[14] OpenAI Developers, "Agent Skills - Codex." https://developers.openai.com/codex/skills

[15] OpenAI Developers, "Hooks - Codex." https://developers.openai.com/codex/hooks

[16] OpenAI Developers, "Working with evals." https://developers.openai.com/api/docs/guides/evals

[17] OpenAI Cookbook, "Build an Agent Improvement Loop with Traces, Evals, and Codex." https://developers.openai.com/cookbook/examples/agents_sdk/agent_improvement_loop

[18] Hunter Lightman et al., "Let's Verify Step by Step." https://arxiv.org/abs/2305.20050

[19] Eric Zelikman et al., "STaR: Bootstrapping Reasoning With Reasoning." https://arxiv.org/abs/2203.14465

[20] Aviral Kumar et al., "Training Language Models to Self-Correct via Reinforcement Learning" (SCoRe). https://arxiv.org/abs/2409.12917
