shy The

Posted on Jun 1

Why AI Code Review Tools Keep Commenting on Lines That Don’t Exist

#ai #llm #softwareengineering #tooling

While experimenting with AI-powered code review systems, I kept running into a strange problem.

The model would generate a perfectly reasonable review comment.

The code issue was real.

The explanation made sense.

But the comment was attached to a line that didn’t exist in the pull request.

At first, I assumed this was just another example of LLM hallucination.

After digging deeper, I found something more specific.

The Problem Isn’t Code Understanding

Most modern LLMs are surprisingly good at understanding code changes.

They can often identify:

Potential bugs
Missing edge cases
Naming issues
Logic problems

The strange part was that many review comments correctly identified a problem while referencing the wrong line.

The model understood what was wrong.

It failed to pin down where that problem was.

The Unified Diff Trap

Most AI review systems operate on unified diffs.

A simplified example looks like this:

@@ -120,2 +120,2 @@
-const timeout = 3000;
+const timeout = 10000;
 initialize();

Humans rarely think about line coordinates because GitHub handles them automatically.

For an LLM, however, the situation is different.

The model must reconstruct file positions using:

Hunk headers
Added lines
Deleted lines
Context lines
Running offsets

A single counting mistake can shift every coordinate that follows.
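To make that bookkeeping concrete, here is a minimal Python sketch that walks one hunk and assigns old-side and new-side line numbers to every line. It reuses the timeout hunk from above, with the header counts set to match the lines shown. The function name `number_hunk_lines` is my own for illustration and is not taken from any particular tool.

```python
import re
from typing import List, Optional, Tuple

# Matches "@@ -old_start[,old_count] +new_start[,new_count] @@"
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def number_hunk_lines(hunk: str) -> List[Tuple[Optional[int], Optional[int], str]]:
    """Return (old_line, new_line, text) for each body line of one hunk."""
    lines = hunk.splitlines()
    match = HUNK_HEADER.match(lines[0])
    if match is None:
        raise ValueError(f"not a unified diff hunk header: {lines[0]!r}")
    old_no, new_no = int(match.group(1)), int(match.group(3))

    numbered = []
    for text in lines[1:]:
        if text.startswith("-"):
            # Deleted line: exists only on the old side.
            numbered.append((old_no, None, text))
            old_no += 1
        elif text.startswith("+"):
            # Added line: exists only on the new side.
            numbered.append((None, new_no, text))
            new_no += 1
        elif text.startswith("\\"):
            # "\ No newline at end of file" marker, not a real line.
            continue
        else:
            # Context line: present on both sides, so both counters advance.
            numbered.append((old_no, new_no, text))
            old_no += 1
            new_no += 1
    return numbered


hunk = "\n".join([
    "@@ -120,2 +120,2 @@",
    "-const timeout = 3000;",
    "+const timeout = 10000;",
    " initialize();",
])

rows = number_hunk_lines(hunk)
for old_no, new_no, text in rows:
    print(old_no, new_no, text)
```

Each branch advances a different counter, which is exactly the kind of state a text-predicting model has to carry implicitly.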

What I Observed

Across repeated testing, several patterns appeared frequently:

Deleted-Line References

The model sometimes generated comments that pointed to deleted lines.

The feedback itself was often valid.

The target location wasn’t.

Coordinate Drift

Large diffs increased the error rate significantly.

After enough additions and deletions, line references would gradually drift away from the intended location.
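One way to see why drift compounds is to track the net shift between old-side and new-side line numbers after each hunk. The sketch below does that from hunk headers alone. The three headers are invented for illustration, and `running_offsets` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# Matches "@@ -old_start[,old_count] +new_start[,new_count] @@"
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def running_offsets(headers: List[str]) -> List[Tuple[str, int]]:
    """Pair each hunk header with the net line shift in effect after that hunk."""
    offset = 0
    result = []
    for header in headers:
        match = HUNK_HEADER.match(header)
        if match is None:
            raise ValueError(f"not a hunk header: {header!r}")
        # An omitted count means 1 in the unified diff format.
        old_count = int(match.group(2)) if match.group(2) is not None else 1
        new_count = int(match.group(4)) if match.group(4) is not None else 1
        offset += new_count - old_count
        result.append((header, offset))
    return result


# Hypothetical headers for one file with three hunks.
headers = [
    "@@ -10,3 +10,5 @@",   # net +2 lines
    "@@ -40,6 +42,4 @@",   # net -2 lines, new start already shifted by +2
    "@@ -90,2 +90,7 @@",   # net +5 lines
]

offsets = [offset for _, offset in running_offsets(headers)]
print(offsets)  # [2, 0, 5]
```

A reader that tracks positions by counting, rather than re-reading each header, carries any miscount in an early hunk into every position that follows.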

Out-of-Range Targets

Occasionally, comments referenced line numbers that simply didn’t exist inside the patch.

These comments could not be attached to the pull request at all.

Why Prompt Engineering Wasn’t Enough

My first instinct was to improve prompting.

I tried:

More explicit instructions
Structured outputs
Additional examples
Coordinate reminders

The error rate improved.

It never disappeared.

The reason seems straightforward.

Predicting text and maintaining exact positional bookkeeping are fundamentally different tasks.

A model can understand a code issue while simultaneously making a counting error.

A Different Approach

Eventually I stopped treating coordinates as trusted output.

Instead of assuming the model was correct, I added a deterministic verification step.

Every generated review comment is checked against the actual diff structure before being returned.

The validator verifies:

File existence
Valid patch coordinates
Added-line targets
Hunk boundaries
Out-of-range references

If a comment fails validation, it is either corrected or discarded.
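A minimal sketch of such a check in Python follows. It is not the DiffLens implementation. The `ReviewComment` type, the function names, the file path `src/config.js`, and the "snap to the nearest added line within a few lines" correction rule are all assumptions made for illustration. It also assumes git-style diffs with `+++ b/path` headers.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional, Set

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


@dataclass
class ReviewComment:
    path: str
    line: int  # new-side line number claimed by the model
    body: str


def added_lines_by_file(diff_text: str) -> Dict[str, Set[int]]:
    """Map each changed file to the new-side numbers of its added lines."""
    added: Dict[str, Set[int]] = {}
    path: Optional[str] = None
    old_left = new_left = 0  # body lines still expected in the current hunk
    new_no = 0
    for text in diff_text.splitlines():
        if old_left > 0 or new_left > 0:
            # Inside a hunk body: the header counts tell us when it ends.
            if text.startswith("+"):
                added[path].add(new_no)
                new_no += 1
                new_left -= 1
            elif text.startswith("-"):
                old_left -= 1
            elif not text.startswith("\\"):
                new_no += 1
                old_left -= 1
                new_left -= 1
            continue
        if text.startswith("+++ "):
            path = text[4:]
            if path.startswith("b/"):
                path = path[2:]
            added.setdefault(path, set())
            continue
        match = HUNK_HEADER.match(text)
        if match and path is not None:
            old_left = int(match.group(2)) if match.group(2) is not None else 1
            new_left = int(match.group(4)) if match.group(4) is not None else 1
            new_no = int(match.group(3))
    return added


def validate(comment: ReviewComment, added: Dict[str, Set[int]],
             max_snap: int = 3) -> Optional[ReviewComment]:
    """Return the comment, a corrected copy, or None if it must be discarded."""
    lines = added.get(comment.path)
    if not lines:
        return None  # unknown file, or a file with no added lines
    if comment.line in lines:
        return comment
    # Assumed correction rule: move to the closest added line if it is near.
    nearest = min(lines, key=lambda n: (abs(n - comment.line), n))
    if abs(nearest - comment.line) <= max_snap:
        return ReviewComment(comment.path, nearest, comment.body)
    return None


diff_text = "\n".join([
    "diff --git a/src/config.js b/src/config.js",
    "--- a/src/config.js",
    "+++ b/src/config.js",
    "@@ -120,2 +120,2 @@",
    "-const timeout = 3000;",
    "+const timeout = 10000;",
    " initialize();",
])
added = added_lines_by_file(diff_text)

ok = validate(ReviewComment("src/config.js", 120, "Why 10s?"), added)
snapped = validate(ReviewComment("src/config.js", 121, "Why 10s?"), added)
dropped = validate(ReviewComment("src/config.js", 500, "Why 10s?"), added)
```

Here `ok` passes through unchanged, `snapped` is moved from the context line 121 to the added line 120, and `dropped` is `None`. Snapping by distance is a simplification: a stricter version would refuse to move a comment across a hunk boundary.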

The goal isn’t to make the reviewer smarter.

The goal is to prevent invalid comments from reaching the pull request.

Final Thoughts

One lesson stood out during this project:

Semantic understanding and coordinate accuracy are different problems.

LLMs are often better at the first than the second.

As AI tooling becomes more integrated into developer workflows, deterministic validation layers may become just as important as the models themselves.

I ended up open-sourcing the implementation here:

GitHub: https://github.com/ywu593412-afk/DiffLens

I’m curious whether other developers building AI review systems have encountered similar coordinate-mapping issues.