shrey

Posted on Jun 8 • Originally published at shreyshahh.substack.com

The agent that fixes bugs by running the code

#agents #ai #debugging #productivity

I want to tell you about the first time I watched a coding agent fix a bug correctly on the first try.

It wasn't the model. The model was the same one I'd been using for months. It was the loop around the model that had changed.

The agent didn't open three files, skim them, write a confident theory in its head, and edit something that "looked like" the cause. That was the old loop. The one where you run the code, see the same bug, and watch the agent shrug and try another file.

This time the agent did something different. It listed five hypotheses. Picked one. Added some logging. Then it stopped and asked me to reproduce the failure. I clicked the button. The bug fired. The agent read the logs, confirmed the hypothesis, wrote a two-line fix, and pulled its logging back out.

The whole thing took maybe four minutes.

This is the difference between an agent that guesses from source code and an agent that gets evidence from runtime. And it is the difference between the kind of AI-assisted coding most people are doing today and the kind that actually ships.

There's a name forming for the discipline behind this. Martin Fowler's site, in an article by Birgitta Böckeler from earlier this year, uses the term that's been quietly circulating in the agent community for the last few months: harness engineering.

The premise is simple. An agent is a model plus a harness. The model is the language model itself. The harness is everything else. Tools, prompts, feedback loops, error handling, retries, the scaffold of code that decides what the model sees and when it gets to act.

For a long time the conversation in AI coding has been about the model. Bigger model. Smarter model. Newer model. Different provider. Different prompt. Different system message. People have been tuning the inside of the box.

The harness is the box itself. And it turns out that the box is doing more work than anyone wanted to admit.

Watch any agent debug

If you have used any AI coding tool, including the most popular ones, here is the loop you have probably seen.

You give the agent a bug. It opens a handful of files. It reads them. It writes a theory in its head about what's wrong. It edits a line that resembles the kind of code that would cause this kind of bug. It tells you the fix is in. You run the failing scenario. The bug fires again. The agent says "let me look at another file" and the cycle repeats.

Sometimes it gets there. Often it doesn't. And the failures have a particular flavor. They are not the agent being confused. They are the agent being confident.

The reason is structural. The model is reasoning about the code. It is not reasoning about the program. It has read what the code says it does, but it has not seen what the code actually does under your specific inputs, your specific load, your specific race condition.

The code is a hypothesis. The program is the truth. And the model has only ever seen the hypothesis.

Every "obvious fix" that doesn't work is a hypothesis that never met reality.

The shift no one is talking about loudly enough

David Gomes, who works at Cursor, wrote a piece a few weeks back about what they call Debug Mode. The argument he makes is interesting because it is not really about the feature. It is about the principle underneath the feature.

Cursor's Debug Mode does one thing. It teaches the agent to add logging, run the code, and read the logs before it tries to write a fix. There's a clean little loop: hypothesize, instrument, reproduce, read, fix, clean up. The agent doesn't edit your implementation in the same pass it adds logging. It treats logging as a probe and code changes as interventions. It keeps them separate.

Gomes argues, correctly I think, that they could have called this feature Instrumentation Mode. The novelty isn't that the agent debugs. The novelty is that the agent has runtime context.

And once you see this pattern, you start to see it everywhere. The agents that quietly ship working code are the ones whose harness gives them runtime information. The agents that loop forever on the same bug are the ones whose harness only gives them source code.

This isn't a Cursor story. It's the same idea you find in Tejas Kumar's IBM talk on harnesses, in Ryan Lopopolo's work at OpenAI, in Manjeet's writing on the evolution from prompts to context to harness, in Karpathy's repeated point that the apparatus around the model is doing the heavy lifting once you leave the playground.

A model alone is a closed book. A harness is what gives the model a window into the running world.

What runtime evidence actually catches

Here is the bit that makes harness engineering not just a fancy idea but an immediate, measurable improvement to how your agents work.

There are entire categories of bugs that no amount of code reading will surface. They live in the program, not in the source. They show themselves only when the program runs.

The 1-in-20 race condition that corrupts state under concurrency. No file will tell you about it. You have to see two log lines arrive in the wrong order.

The memory leak that takes thirty minutes of traffic to show up. You will never see it by staring at object lifetimes. You see it in a heap snapshot.

The native crash inside a compiled dependency. The agent can read your Python all day. The crash is in C.

The rendering bug that has been sitting in your ticket queue for six months because every senior on your team gave up. The DOM under load doesn't look like the DOM in the source.

The Heisenbug that vanishes the moment you put a breakpoint near it. The only way to find these is by leaving the program alone and observing it.

I have an agent loop that catches these now. It is not because I changed models. It is because I built the harness around the model to give it runtime evidence by default. Hypotheses get ranked, logging gets added, scenarios get reproduced, logs get read, fixes get written, instrumentation gets cleaned up.

Same model. Different harness. Different outcome.

The discipline that makes it work

If you only take one thing from this, take this one.

Never let your agent log and edit in the same pass.

The whole evidence loop falls apart the moment you mix them. If the agent adds logging and changes the implementation at the same time, and then the bug behaves differently, you cannot tell which change moved the needle. Was it the new logic? Was it the side effect of the instrumentation? You don't know. You're guessing again. With extra steps.

Logging is a probe. Code changes are interventions. Treat them as different operations. Run them as different passes. The agent observes first. The agent intervenes second. Never both in one swing.

This sounds obvious. It is not what most agents do by default. Most agents, given a task labeled "fix this bug," will read three files and start writing a patch. The patch and the print statements and the experiment all go in at once. And then the agent reads the new output and either declares victory or starts over, with no clean signal about what its own change did.

A real harness enforces the separation. The agent is not allowed to write to a file labeled "implementation" during a logging pass. That is a rule the harness applies, not a rule the model has to remember.

The part you cannot automate away

Here's the uncomfortable part. The reproducer in this loop is you.

You are the one who can click the button, deploy to staging, exercise the failing test, send the request that triggers the bug. The agent can do a lot. It usually can't reliably re-trigger a bug that lives in a system more complicated than a unit test.

This is the same point Gomes makes about Debug Mode adoption. The feature is genuinely useful. Most users don't use it. It requires a human who is engaged enough to do the reproduction step. And most code from here on is going to be written by humans who are not particularly engaged.

I keep coming back to this. The future of AI coding is not "press a button and the agent ships." It is a tight, fast loop where the agent does the tedious instrumentation and inference work and you do the part of the work that requires a body in front of a screen and a finger on a button.

That isn't a downgrade. It's an upgrade. You stop being a code-typer and start being a debugger-of-debuggers. Your judgment matters more, not less.

The mental model shift

Stop trying to make the model smarter. The smartest model in the world still cannot see what it cannot see.

What you're actually building, when you build a serious agent system, is the apparatus around the model. Tools. Logs. Retries. State persistence. Verification steps. Ways for the agent to test its own hypotheses against reality instead of asking you to confirm them.

Fowler's article frames this as feedforward controls (guides that shape what the agent does before it acts) and feedback controls (sensors that catch mistakes after the agent acts). Both matter. Without feedforward, the agent is unsupervised. Without feedback, the agent repeats the same mistakes forever.

The bug-fixing loop I described above is one specific instance of this. The hypotheses are feedforward, shaped by your codebase and the bug description. The logs are feedback, telling the agent which hypothesis survived contact with the running program.

You can build the same pattern for code review, for refactoring, for migrations, for performance work, for anything where the model needs to be told what reality looks like instead of inventing reality from source.

That is the work. That is harness engineering.

The model didn't get smarter. The harness did.

What to do this week

Don't wait for a vendor to ship this for you. If you're using an agent that doesn't have a built-in evidence loop, build one yourself. The pieces are not exotic.

Start with these three:

A logging convention your agent can read. Pick a format, write a one-line skill or AGENTS.md note explaining where logs go and how to read them, and make sure your agent always emits in that format when instrumenting.

A separation rule. Tell your agent, in whatever instruction file it reads, that logging and implementation changes must happen in different passes. Add it as a checkbox in your task templates if you need to.

A reproducer you can actually run. The faster your reproduction loop, the faster the evidence loop. If you're sitting in front of a 90-second test cycle, your agent is going to be sitting too. Cut that to 10 seconds and you'll fix bugs five times faster, regardless of which model you use.

These three changes are not a feature release. They are an attitude. They are what it looks like to take the harness as seriously as you take the model.

If you do, I promise you the next bug your agent claims to have fixed will actually be fixed.

Sources I found useful while writing this:

Martin Fowler / Birgitta Böckeler, Harness engineering for coding agent users (April 2026)
David Gomes, Cursor's Debug Mode Is Arguably Its Best Feature
Cursor team, Introducing Debug Mode: Agents with runtime logs
Eric Zakariasson (Cursor), original announcement thread on X
Manjeet, Context → Harness Engineering: The Evolution of Building AI Agents

If this resonated, I'd love to hear what your reproducer looks like and where your evidence loop is breaking. Reply on Substack or drop a comment here.

Originally published on my Substack.

— Shrey

Top comments (1)

xulingfeng • Jun 9

The "model vs program" distinction hit hard. I've watched agents confidently explain a bug from static source analysis that turned out to be completely wrong once the code actually ran. The gap between "what the code says" and "what the program does" is exactly where most agent debugging loops die.

Harness engineering is a good name for what's missing. The hypothesis-driven loop you described — list, pick, instrument, reproduce, confirm — feels like the difference between an intern guessing and a senior engineer running experiments. Most agents stop at step 2 (pick one) and skip straight to editing.

Curious how you handle the harness cost — does instrumenting for every hypothesis eat into response time enough to be noticeable?