If you're building tools that use large language models to write or fix code, there's something you should know:
LLMs lose most of their debugging ability after just 1 or 2 attempts.
This isn't intuition; it's measurable.
A new paper from Oxford and McGill introduces something called the Debugging Decay Index.
It tracks how LLMs perform as they try to fix the same code, over and over.
What it shows is simple, but important:
LLMs Plateau After 2 Fixes
- The first fix attempt usually helps.
- The second one adds marginal value.
- By the third, things start breaking down: repeated edits, semantic drift, hallucinations.
- Performance drops by up to 80% by the fifth round.
So if your tool sends the same broken code back to the LLM again and again, hoping it'll "get it right" eventually, it probably won't.
Why It Happens
- LLMs don't really "understand" what went wrong; they often repeat patterns that merely look like a fix.
- They get overconfident, even when wrong.
- They lose track of the original intent.
- The more they edit their own output, the worse it gets.
It's like copying answers from your own wrong homework, again and again.
Why This Matters for LLM Builders
If you're working on:
- Autonomous coding agents
- LLM dev assistants
- Auto-debug pipelines
- Or anything where models fix their own mistakes...
This research is a warning.
You need to limit how many times the model "self-fixes" before restarting, re-rolling, or using an external signal (like a test suite).
More isnāt better.
More is worse.
Rethink the Loop
Instead of letting the model rewrite broken code over and over, try this:
- 1-2 fix attempts max
- Then restart from the original prompt, or
- Try a different approach entirely
- Or use tests + heuristics to guide the fix
Debugging with LLMs isn't a straight line. It's a decision tree, and going in circles doesn't help.
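Here's roughly what that loop can look like in practice. This is a minimal sketch of the advice above, not the paper's method; `ask_llm`, `run_tests`, and `TestReport` are hypothetical placeholders you'd wire up to your own model client and test runner.

```python
from dataclasses import dataclass

@dataclass
class TestReport:
    passed: bool
    failures: str = ""

# Hypothetical adapters: connect these to your own LLM client and test suite.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("call your model here")

def run_tests(code: str) -> TestReport:
    raise NotImplementedError("run your tests against the code here")

MAX_FIX_ATTEMPTS = 2  # past this point, extra fixes tend to add noise, not value

def generate_and_repair(original_prompt: str) -> str | None:
    """One generation plus at most MAX_FIX_ATTEMPTS targeted fixes."""
    code = ask_llm(original_prompt)
    for _ in range(MAX_FIX_ATTEMPTS):
        report = run_tests(code)  # external signal, not the model's own judgment
        if report.passed:
            return code
        # Ground the fix request in the actual failure instead of just "try again"
        code = ask_llm(
            f"{original_prompt}\n\nThe code below fails these tests:\n"
            f"{report.failures}\n\nCode:\n{code}\n\nReturn a corrected version."
        )
    return code if run_tests(code).passed else None

def solve(original_prompt: str, restarts: int = 3) -> str | None:
    """Prefer a few fresh restarts over many rounds of self-editing."""
    for _ in range(restarts):
        result = generate_and_repair(original_prompt)
        if result is not None:
            return result
    return None  # hand off to a human, or try a different approach entirely
```

The exact numbers matter less than the shape: let the tests deliver the verdict, cap the self-fix loop early, and spend the rest of your budget on fresh attempts instead of deeper spirals.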
Source
Paper: The Debugging Decay Index (Oxford & McGill, 2024)
Top comments (3)
This is great honestly. At some point you have to roll up your sleeves and read some docs. XD. I have definitely gotten into some dead-end situations figuring stuff out with AI. And yeah, sometimes it gets stuck on stuff that isn't the real problem! I am guilty of swapping around, asking different AIs the same question. For example, Copilot in VS Code will give different answers than Gemini and GPT. It has context. But I think it's limited in some other ways.
Haha yep, I've been there too. Sometimes you just have to stop and read the docs.
I've also tried asking the same thing to different AIs, hoping one gives a better answer. Copilot is cool with its context, but yeah, it still misses stuff. It's funny how they sometimes get stuck on the wrong thing. Makes you realize we still need to do some of the thinking ourselves. Glad I'm not the only one going through this!
I am currently in the middle of cleaning up my own mess on a personal side project. So I'm pretty comfortable with Ruby on Rails, but when it comes to React and Node I am just learning. I had AI make a front end for me, thinking it wouldn't be that bad to set up the backend. What a nightmare. I'm about to take a step back and work on an even smaller project for understanding. I jumped WAY off the deep end. Not to mention using MongoDB for the first time, and a VPS. Wooooo! Errors everywhere! XD
I actually got to a point where I looked at my computer and was like "No." AI is not really great for learning a completely new stack on the fly. You are definitely not alone! XD
I was using ChatGPT, the AI in the "inspect" tools, Copilot in the editor, and Gemini. It is/was part of the confused chaos. One thing I did notice when I first started using Gemini: it's a little wordy and a bit abrasive. Of course, I've been tempering my GPT for over a year. LOL