While experimenting with AI-powered code review systems, I kept running into a strange problem.
The model would generate a perfectly reasonable review comment.
The code issue was real.
The explanation made sense.
But the comment was attached to a line that didn’t exist in the pull request.
At first, I assumed this was just another example of LLM hallucination.
After digging deeper, I found something more specific.
The Problem Isn’t Code Understanding
Most modern LLMs are surprisingly good at understanding code changes.
They can often identify:
- Potential bugs
- Missing edge cases
- Naming issues
- Logic problems
The strange part was that many review comments correctly identified a problem while referencing the wrong line.
The model understood what was wrong.
It failed to understand where it was.
The Unified Diff Trap
Most AI review systems operate on unified diffs.
A simplified example looks like this:
@@ -120,2 +120,2 @@
-const timeout = 3000;
+const timeout = 10000;
 initialize();
Humans rarely think about line coordinates because GitHub handles them automatically.
For an LLM, however, the situation is different.
The model must reconstruct file positions using:
- Hunk headers
- Added lines
- Deleted lines
- Context lines
- Running offsets
A single counting mistake can shift every coordinate that follows.
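To make that bookkeeping concrete, here is a minimal Python sketch that walks a single hunk and assigns old-file and new-file line numbers to each line. The function name `number_hunk_lines` and the sample hunk are illustrative assumptions, not code taken from any particular review tool.

```python
import re

# Hunk header: "@@ -old_start[,old_count] +new_start[,new_count] @@"
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def number_hunk_lines(hunk_text):
    """Return (kind, old_line, new_line, text) for each line of one hunk."""
    lines = hunk_text.splitlines()
    match = HUNK_RE.match(lines[0])
    if match is None:
        raise ValueError("not a hunk header: " + lines[0])
    old_no = int(match.group(1))  # running position in the old file
    new_no = int(match.group(3))  # running position in the new file
    rows = []
    for line in lines[1:]:
        if line.startswith("\\"):  # "\ No newline at end of file"
            continue
        marker, text = line[:1], line[1:]
        if marker == "-":
            # Deleted lines exist only in the old file.
            rows.append(("deleted", old_no, None, text))
            old_no += 1
        elif marker == "+":
            # Added lines exist only in the new file.
            rows.append(("added", None, new_no, text))
            new_no += 1
        else:
            # Context lines advance both counters.
            rows.append(("context", old_no, new_no, text))
            old_no += 1
            new_no += 1
    return rows


sample_hunk = "\n".join([
    "@@ -120,3 +120,4 @@",
    " const retries = 3;",
    "-const timeout = 3000;",
    "+const timeout = 10000;",
    "+const backoff = 2;",
    " initialize();",
])

for row in number_hunk_lines(sample_hunk):
    print(row)
# The last row is ("context", 122, 123, "initialize();"):
# the same line sits at 122 in the old file and 123 in the new one.
```

Two counters have to be kept in step across every line. Forgetting to advance one of them, or advancing the wrong one, is enough to misplace everything after that point.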
What I Observed
Across repeated testing, several patterns appeared frequently:
- Deleted-Line References
The model sometimes generated comments that pointed to deleted lines.
The feedback itself was often valid.
The target location wasn’t.
- Coordinate Drift
Large diffs increased the error rate significantly.
After enough additions and deletions, line references would gradually drift away from the intended location.
- Out-of-Range Targets
Occasionally, comments referenced line numbers that simply didn’t exist inside the patch.
These comments could not be attached to the pull request at all.
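The drift pattern can be reproduced with plain arithmetic. The sketch below compares a correct new-file line counter with one that mistakenly advances on deleted lines as well. It only illustrates how one kind of slip accumulates; it is not a claim about how a model counts internally, and `new_line_numbers` is a hypothetical helper name.

```python
def new_line_numbers(start, markers, count_deleted=False):
    """Assign a new-file line number to every non-deleted hunk line.

    `markers` holds the first character of each hunk line: " ", "+" or "-".
    With count_deleted=True the counter also advances on deleted lines,
    which is wrong and is the slip being illustrated.
    """
    numbers = []
    current = start
    for marker in markers:
        if marker == "-":
            if count_deleted:
                current += 1  # the bookkeeping mistake
            continue
        numbers.append(current)
        current += 1
    return numbers


# One context line, a replacement, a context line, two deletions
# followed by one addition, and a final context line.
markers = [" ", "-", "+", " ", "-", "-", "+", " "]

correct = new_line_numbers(120, markers)
drifted = new_line_numbers(120, markers, count_deleted=True)
drift = [d - c for c, d in zip(correct, drifted)]

print(correct)  # [120, 121, 122, 123, 124]
print(drifted)  # [120, 122, 123, 126, 127]
print(drift)    # [0, 1, 1, 3, 3]
```

The offset grows by one for every deleted line passed. In this example the new side of the hunk covers lines 120 to 124, so the last two drifted values (126 and 127) have already left the hunk and become out-of-range targets.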
Why Prompt Engineering Wasn’t Enough
My first instinct was to improve prompting.
I tried:
- More explicit instructions
- Structured outputs
- Additional examples
- Coordinate reminders
The error rate dropped.
It never reached zero.
The reason seems straightforward.
Predicting text and maintaining exact positional bookkeeping are fundamentally different tasks.
A model can understand a code issue while simultaneously making a counting error.
A Different Approach
Eventually I stopped treating coordinates as trusted output.
Instead of assuming the model was correct, I added a deterministic verification step.
Every generated review comment is checked against the actual diff structure before being returned.
The validator verifies:
- File existence
- Valid patch coordinates
- Added-line targets
- Hunk boundaries
- In-range line references
If a comment fails validation, it is either corrected or discarded.
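A validation step along these lines might look like the Python sketch below. It is an illustration under stated assumptions, not the DiffLens implementation: the names `Comment`, `index_diff` and `validate` are hypothetical, the `quoted` field assumes the model is asked to echo the code it is commenting on, and the correction rule (re-anchor only when exactly one added line matches that quote) is one possible policy among several.

```python
import re
from dataclasses import dataclass, replace

HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


@dataclass(frozen=True)
class Comment:
    path: str
    line: int         # new-file line number chosen by the model
    body: str
    quoted: str = ""  # hypothetical field: the code text the model says it targets


def index_diff(diff_text):
    """Return {path: {new_line: (kind, text)}} for lines visible in the new file."""
    files, current = {}, None
    old_left = new_left = new_no = 0
    for line in diff_text.splitlines():
        if old_left <= 0 and new_left <= 0:
            # Outside a hunk body: only file and hunk headers matter here.
            if line.startswith("+++ "):
                path = line[4:].strip()
                current = files.setdefault(path[2:] if path.startswith("b/") else path, {})
            else:
                match = HUNK_RE.match(line)
                if match and current is not None:
                    old_left = int(match.group(2) or 1)  # count defaults to 1
                    new_no = int(match.group(3))
                    new_left = int(match.group(4) or 1)
            continue
        marker, text = line[:1], line[1:]
        if marker == "\\":  # "\ No newline at end of file"
            continue
        if marker == "-":
            old_left -= 1  # deleted lines have no new-file coordinate
        elif marker == "+":
            current[new_no] = ("added", text)
            new_no, new_left = new_no + 1, new_left - 1
        else:
            current[new_no] = ("context", text)
            new_no, old_left, new_left = new_no + 1, old_left - 1, new_left - 1
    return files


def validate(comment, files):
    """Return (status, comment_or_None); status is 'ok', 'corrected' or 'discarded'."""
    lines = files.get(comment.path)
    if lines is None:
        return "discarded", None  # file is not part of the diff
    quote = comment.quoted.strip()
    hit = lines.get(comment.line)  # None when the line is outside every hunk
    if hit and hit[0] == "added" and (not quote or hit[1].strip() == quote):
        return "ok", comment
    if quote:
        # Assumed correction policy: re-anchor only on an unambiguous match.
        matches = [n for n, (kind, text) in lines.items()
                   if kind == "added" and text.strip() == quote]
        if len(matches) == 1:
            return "corrected", replace(comment, line=matches[0])
    return "discarded", None


diff_text = "\n".join([
    "--- a/src/config.js",
    "+++ b/src/config.js",
    "@@ -120,3 +120,4 @@",
    " const retries = 3;",
    "-const timeout = 3000;",
    "+const timeout = 10000;",
    "+const backoff = 2;",
    " initialize();",
])
files = index_diff(diff_text)

drifted = Comment("src/config.js", 123, "Timeout more than tripled.", "const timeout = 10000;")
print(validate(drifted, files))                                # corrected to line 121
print(validate(Comment("src/config.js", 400, "???"), files))   # discarded: out of range
print(validate(Comment("src/other.js", 1, "???"), files))      # discarded: unknown file
```

Accepting only added lines mirrors the "Added-line targets" check in the list above. Discarding ambiguous matches instead of guessing keeps the step deterministic: the validator either proves where a comment belongs or drops it.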
The goal isn’t to make the reviewer smarter.
The goal is to prevent invalid comments from reaching the pull request.
Final Thoughts
One lesson stood out during this project:
Semantic understanding and coordinate accuracy are different problems.
LLMs are often better at the first than the second.
As AI tooling becomes more integrated into developer workflows, deterministic validation layers may become just as important as the models themselves.
I ended up open-sourcing the implementation here:
GitHub: https://github.com/ywu593412-afk/DiffLens
I’m curious whether other developers building AI review systems have encountered similar coordinate-mapping issues.