Vahid A.Nezhad

šŸ” LLMs Get Worse at Debugging Their Own Code

If you’re building tools that use large language models to write or fix code, there’s something you should know:

LLMs lose most of their debugging ability after just 1 or 2 attempts.

This isn’t intuition — it’s measurable.

A new paper from Oxford and McGill introduces something called the Debugging Decay Index.
It tracks how LLMs perform as they try to fix the same code, over and over.

What it shows is simple, but important:

šŸ“‰ LLMs Plateau After 2 Fixes


  • The first fix attempt usually helps.
  • The second adds only marginal value.
  • By the third, things start breaking down: repeated edits, semantic drift, hallucinations.
  • Performance drops by up to 80% by the fifth round.

So if your tool sends the same broken code back to the LLM again and again hoping it’ll ā€œget it rightā€ eventually — it probably won’t.

🧠 Why It Happens

  • LLMs don’t really ā€œunderstandā€ what went wrong; they repeat patterns that look like a fix.
  • They get overconfident, even when wrong.
  • They lose track of the original intent.
  • The more they edit their own output, the worse it gets.

It’s like copying answers from your own wrong homework — again and again.

šŸ¤– Why This Matters for LLM Builders

If you're working on:

  • Autonomous coding agents
  • LLM dev assistants
  • Auto-debug pipelines
  • Or anything where models fix their own mistakes…

This research is a warning.

You need to limit how many times the model ā€œself-fixesā€ before restarting, re-rolling, or using an external signal (like a test suite).
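
To make that concrete, here's a minimal Python sketch of a bounded self-fix loop. `run_tests` and `llm_fix` are hypothetical stand-ins for your own test harness and model call, not a real API; the point is the hard cap plus an external pass/fail signal instead of the model's own confidence.

```python
from typing import Optional, Tuple

MAX_FIX_ATTEMPTS = 2  # gains flatten after one or two fixes, per the paper

def run_tests(code: str) -> Tuple[bool, str]:
    """Hypothetical stand-in for a real test harness (e.g. pytest in a sandbox).
    Returns (passed, error_output)."""
    raise NotImplementedError

def llm_fix(code: str, error: str) -> str:
    """Hypothetical stand-in for one LLM repair call given failing code + error."""
    raise NotImplementedError

def repair(code: str) -> Optional[str]:
    """Let the model self-fix at most MAX_FIX_ATTEMPTS times.
    The test suite, not the model, decides when we're done."""
    for _ in range(MAX_FIX_ATTEMPTS):
        passed, error = run_tests(code)
        if passed:
            return code
        code = llm_fix(code, error)
    passed, _ = run_tests(code)      # check the final attempt too
    return code if passed else None  # stop here: more rounds tend to hurt
```

The design choice that matters: the test suite decides success, and the loop simply gives up once the fix budget is spent.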

More isn’t better.
More is worse.

šŸ”„ Rethink the Loop

Instead of letting the model rewrite broken code over and over, try this:

  • 1–2 fix attempts max
  • Then restart from the original prompt, or
  • Try a different approach entirely
  • Or use tests + heuristics to guide the fix (see the sketch below)

Debugging with LLMs isn’t a straight line. It’s a decision tree — and going in circles doesn’t help.
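
To make that decision tree concrete, here's one sketch that wires a restart around the bounded `repair` loop from the earlier snippet. `llm_generate` is again a hypothetical stand-in for your generation call; the idea is that an exhausted fix budget triggers a fresh roll from the original prompt, not another edit of broken output.

```python
from typing import Optional

def llm_generate(prompt: str) -> str:
    """Hypothetical stand-in for a fresh generation from the original prompt."""
    raise NotImplementedError

def generate_and_repair(prompt: str, max_restarts: int = 3) -> Optional[str]:
    """Branch instead of spiraling: generate, allow a short fix budget,
    then restart from scratch if the budget runs out."""
    for _ in range(max_restarts):
        code = llm_generate(prompt)  # fresh context, no stale broken code
        fixed = repair(code)         # the capped loop sketched earlier
        if fixed is not None:
            return fixed
        # budget exhausted: re-roll from the prompt (optionally with a
        # different temperature or approach hint) instead of re-editing
    return None
```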

🧾 Source
Paper: The Debugging Decay Index (Oxford & McGill, 2024)

Top comments (3)

Anna Villarreal

This is great honestly. At some point you have to roll up your sleeves and read some docs. XD. I have definitely gotten into some dead-end situations figuring stuff out with AI. And yeah, sometimes it gets stuck on stuff that isn't the real problem! I am guilty of swapping around asking different AIs the same question. For example, Copilot in VS Code will give different answers than Gemini and GPT. It has context, but I think it's limited in some other ways.

Vahid A.Nezhad

Haha yep, I’ve been there too. Sometimes you just have to stop and read the docs. šŸ˜…
I’ve also tried asking the same thing to different AIs, hoping one gives a better answer. Copilot is cool with its context, but yeah, it still misses stuff. It’s funny how they sometimes get stuck on the wrong thing; makes you realize we still need to do some of the thinking ourselves. Glad I’m not the only one going through this!

Anna Villarreal

I am currently in the middle of cleaning up my own mess on a personal side project. I'm pretty comfortable with Ruby on Rails, but when it comes to React and Node I am just learning. I had AI make a front end for me, thinking it wouldn't be that bad to set up the backend. What a nightmare. I'm about to take a step back and work on an even smaller project for understanding. I jumped WAY off the deep end. Not to mention using MongoDB for the first time, and a VPS. Wooooo! Errors everywhere! XD

I actually got to a point where I looked at my computer and was like "No." AI is not really great for learning a completely new stack on the fly. You are definitely not alone! XD

I was using ChatGPT, the AI in the "inspect", Copilot in the editor, and Gemini. It is/was a bit of confused chaos. One thing I did notice when I first started using Gemini: it's a little wordy and a bit abrasive. Of course, I've been tempering my GPT for over a year. LOL