DEV Community: shy The

shy The — Thu, 04 Jun 2026 16:09:11 +0000

GitHub “Finish-Up-A-Thon” Challenge Submission

shy The

Jun 3

I Thought My AI Code Reviewer Was Finished. Then a Single Hallucinated Line Number Broke Everything.

#devchallenge #githubchallenge #ai #typescript

shy The — Wed, 03 Jun 2026 11:04:20 +0000

What I Built

Difflens is a LangGraph-powered automated code review pipeline that analyzes pull requests and posts review comments through GitHub Actions.

I thought building a multi-agent reviewer would be the hard part. I was wrong.

The real challenge was making sure AI-generated comments could actually survive GitHub's API validation.

Why My PR Comments Kept Disappearing

Getting the pipeline to trigger GitHub Actions felt like a huge win. But then, I noticed a fatal issue: the comments were disappearing.

The logs showed the LLM was generating brilliant security and logic feedback, but on the actual Pull Request page, nothing was posted. After digging into the Octokit error logs, I found a subtle reliability bug: Coordinate Hallucination.

Even when the model correctly identified a vulnerability, it couldn't reliably anchor that feedback to a valid line in the Git diff. My system had a strict VerifierNode designed to block invalid API requests. When a comment carried one of these hallucinated out-of-bounds coordinates, Octokit threw a hard 422 Unprocessable Entity error: Validation Failed: {"resource":"PullRequestReviewComment","code":"invalid","field":"line"}

The pipeline silently dropped those comments. A single hallucinated number was wiping out the entire review pipeline.
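
One way to keep a rejected comment from vanishing is to classify the failure and record it instead of swallowing it. The following is a minimal sketch, not Difflens's actual error handling: `classifyFailure`, `recordRejection`, and the `ApiErrorLike` shape are hypothetical names, and the string check is modeled only on the error message quoted above.

```typescript
// Hypothetical error shape (an assumption): a numeric HTTP status plus a message,
// modeled on the 422 error quoted above.
interface ApiErrorLike {
  status?: number;
  message?: string;
}

type FailureKind = "invalid-line" | "other-validation" | "unrelated";

function classifyFailure(err: unknown): FailureKind {
  const e = (err ?? {}) as ApiErrorLike;
  if (e.status !== 422) return "unrelated";
  const message = e.message ?? "";
  return message.indexOf('"field":"line"') !== -1 ? "invalid-line" : "other-validation";
}

// Rejections are collected so that lost comments stay visible in logs.
const rejected: { line: number; kind: FailureKind }[] = [];

function recordRejection(line: number, err: unknown): void {
  const kind = classifyFailure(err);
  if (kind === "unrelated") throw err; // auth or network problems should still fail loudly
  rejected.push({ line, kind });
}

// Usage with an error shaped like the one in the logs.
recordRejection(999, {
  status: 422,
  message:
    'Validation Failed: {"resource":"PullRequestReviewComment","code":"invalid","field":"line"}',
});
console.log(rejected);
```

Recording the rejection does not fix the bad coordinate, but it turns a silent failure into something that can be counted and investigated.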

The Fix: Deterministic Guardrails

During early testing, I discovered that many otherwise valid review comments were never reaching GitHub because their generated coordinates failed validation.

The core issue isn't just the AI being 'wrong'; it's that the system trusts non-deterministic output to interact with a strict API. I implemented a deterministic layer to bridge this gap. Instead of hoping the model gets line numbers right, I parse the raw diff into an index map that serves as the source of truth. The VerifierNode uses this map to intercept and sanitize agent outputs before they ever hit the GitHub API.

At the boundary of this layer, I enforce a hard check that strips any out-of-bounds coordinates or malformed responses. To generate the source of truth for this check, I implemented the parser below. It reads each hunk header, including the short form where a count of 1 is omitted, and records every new-file line number the hunk covers.

function parseDiffToValidLines(filePatch: string): Set<number> {
  const validLines = new Set<number>();

  // 1. Unified diff compatibility: when a hunk side spans exactly one line, ',1' is omitted.
  // The ^ anchor and the m flag pin the match to the start of a line, preventing false matches within code content.
  // The old-side ',count' is an optional non-capturing group; the new-side count is captured when present.
  const hunkHeader = /^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@/gm;
  let match: RegExpExecArray | null;

  while ((match = hunkHeader.exec(filePatch)) !== null) {
    const start = parseInt(match[1], 10);
    // If the count group did not match, the hunk covers exactly one new-file line. Default to 1.
    const count = match[2] !== undefined ? parseInt(match[2], 10) : 1;

    // 2. Design choice: why extract only the line numbers after '+'?
    // Comments here are anchored to the NEW file (the right side), and GitHub rejects lines outside the diff.
    // The range start..start+count-1 covers every context and added line in the current hunk.
    for (let i = start; i < start + count; i++) {
      validLines.add(i);
    }
  }
  return validLines;
}

Implementing this parseDiffToValidLines parser was the real engineering bottleneck. Translating raw unified diffs into a stable TypeScript Set requires meticulous edge-case handling: accounting for context lines, additions, and the GitHub API's strict positioning rules.

Copilot turned out to be most useful when dealing with repetitive regex iterations, edge cases, and test scaffolding. Wrestling with standard diff protocols is a nightmare; handling multi-hunk shifts and avoiding encoding offsets is incredibly tedious.

Instead of debugging regex boundary errors for days, I laid out the core mathematical constraints, and Copilot did the heavy lifting. We co-authored the precise regex patterns for hunk headers and built an exhaustive suite of unit tests to smoke out boundary shifts. This tight feedback loop compressed days of frustrating manual diff-parsing into a single afternoon of rapid iteration.

(If the diff is empty or malformed, the regex simply finds no hunk headers and the parser returns an empty set, which blocks every line-anchored comment for that patch.)
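
To show how the resulting set can be used, here is a minimal sketch of the filtering step. `AgentComment` and `filterComments` are hypothetical names rather than Difflens's actual VerifierNode, and the parser is repeated in compact form only so the snippet runs on its own.

```typescript
// Compact copy of the parser above so this snippet is self-contained.
function parseDiffToValidLines(filePatch: string): Set<number> {
  const validLines = new Set<number>();
  const hunkHeader = /^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@/gm;
  let match: RegExpExecArray | null;
  while ((match = hunkHeader.exec(filePatch)) !== null) {
    const start = parseInt(match[1], 10);
    const count = match[2] !== undefined ? parseInt(match[2], 10) : 1;
    for (let i = start; i < start + count; i++) validLines.add(i);
  }
  return validLines;
}

// Hypothetical comment shape, for illustration only.
interface AgentComment {
  line: number;
  body: string;
}

// Split agent output into comments that can be anchored and comments that cannot.
function filterComments(patch: string, comments: AgentComment[]) {
  const validLines = parseDiffToValidLines(patch);
  const kept: AgentComment[] = [];
  const dropped: AgentComment[] = [];
  for (const c of comments) {
    (validLines.has(c.line) ? kept : dropped).push(c);
  }
  return { kept, dropped };
}

const patch = [
  "@@ -10,3 +10,4 @@",
  " const a = 1;",
  "+const b = 2;",
  " const c = 3;",
  " const d = 4;",
  "@@ -40 +41 @@", // single-line hunk: both counts omitted
  "-old();",
  "+updated();",
].join("\n");

const result = filterComments(patch, [
  { line: 11, body: "Unused variable b" },
  { line: 41, body: "Check the return value" },
  { line: 120, body: "Hallucinated coordinate" },
]);
console.log(result.kept.map((c) => c.line), result.dropped.map((c) => c.line));
```

The two hunks yield the valid lines 10 to 13 and 41, so the comment on line 120 lands in `dropped`. An empty patch produces an empty set, and every comment is dropped.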

The Result: From Defensive Filtering to Proactive Injection

During early testing across 5 intentionally varied dummy pull requests, 9 out of 14 generated comments were rejected due to invalid coordinates. While the sample size is small, it was enough to expose a structural reliability issue. That's a 64% silent failure rate—brilliant engineering ideas lost in the ether simply because the AI couldn't read the Git Diff index accurately.

The breakthrough came when I inverted the architecture. By shifting the deterministic validLines map upstream, I injected it directly into the LLM's prompt context as a hard constraint. The system transitioned from a "try-and-fail" model to a "pre-verified" execution model. Forcing the agent to anchor its reasoning to the deterministic list before generating coordinates made the coordinate hallucination issue disappear in my test runs.
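
As an illustration of the injection step, the sketch below turns a set of valid lines into a compact prompt constraint. The prompt wording, `toRanges`, and `buildLineConstraint` are assumptions made for this example; the post does not show the exact text Difflens sends to the model.

```typescript
// Collapse line numbers into compact ranges so the constraint stays short.
function toRanges(lines: Set<number>): string {
  const sorted = Array.from(lines).sort((a, b) => a - b);
  const parts: string[] = [];
  let i = 0;
  while (i < sorted.length) {
    let j = i;
    // Extend the run while the next number is consecutive.
    while (j + 1 < sorted.length && sorted[j + 1] === sorted[j] + 1) j++;
    parts.push(i === j ? `${sorted[i]}` : `${sorted[i]}-${sorted[j]}`);
    i = j + 1;
  }
  return parts.join(", ");
}

// Hypothetical prompt fragment (wording is an assumption, not the author's prompt).
function buildLineConstraint(path: string, validLines: Set<number>): string {
  if (validLines.size === 0) {
    return `File ${path}: no commentable lines. Do not emit line-anchored comments.`;
  }
  return [
    `File ${path}: anchor comments only to these new-file lines: ${toRanges(validLines)}.`,
    "If an issue does not map to one of these lines, report it without a line number.",
  ].join("\n");
}

const constraint = buildLineConstraint("src/config.ts", new Set([10, 11, 12, 13, 41]));
console.log(constraint);
```

Even with the constraint in the prompt, it seems prudent to keep the downstream check in place, since the model's output is still non-deterministic.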

The Next Horizon: From Hackathon to Production-Grade

While the deterministic guardrail solved our immediate coordinate hallucination crisis, taking Difflens to a true enterprise-grade standard requires a structured architectural evolution.

I am currently mapping out the next phase: implementing cross-file comment deduplication using path + line + content hashing, and building a hallucination telemetry pipeline to systematically trace prompt drift. These aren't just features—they are the safety nets required to transition from a single-agent prototype to a production-ready, multi-agent orchestrator.
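
A rough idea of what path + line + content hashing could look like is sketched below. This is not the planned Difflens implementation: `ReviewComment`, `commentKey`, and `dedupe` are hypothetical names, and the small FNV-1a-style hash over UTF-16 code units is only a dependency-free stand-in for whatever hash the project ends up using.

```typescript
// Hypothetical comment shape, for illustration only.
interface ReviewComment {
  path: string;
  line: number;
  body: string;
}

// FNV-1a-style 32-bit hash over UTF-16 code units (a stand-in, not a cryptographic hash).
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

function commentKey(c: ReviewComment): string {
  // Normalize whitespace and case so trivially different wordings share a key.
  const normalized = c.body.trim().replace(/\s+/g, " ").toLowerCase();
  return `${c.path}:${c.line}:${fnv1a(normalized)}`;
}

// Keep the first comment seen for each key and drop later duplicates.
function dedupe(comments: ReviewComment[]): ReviewComment[] {
  const seen = new Set<string>();
  const unique: ReviewComment[] = [];
  for (const c of comments) {
    const key = commentKey(c);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(c);
    }
  }
  return unique;
}

const unique = dedupe([
  { path: "src/auth.ts", line: 5, body: "Possible  null dereference" },
  { path: "src/auth.ts", line: 5, body: "possible null dereference " },
  { path: "src/auth.ts", line: 9, body: "Possible null dereference" },
]);
console.log(unique.length);
```

The first two comments differ only in case and spacing, so they share a key; the third sits on a different line and is kept.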

Copilot was helpful here as well, particularly for prototyping validation flows and data structures around comment deduplication and telemetry collection.

The Impact (Before vs. After)

Before (initial benchmark across 5 multi-file PRs): 14 suggestions generated ➔ 5 posted (64% lost due to hallucination)
After: 14 suggestions generated ➔ 14 posted (0% lost during testing)

Demo

[Screenshot: Automated PR Feedback (GitHub Action)]
[Screenshot: Engineering Implementation (VS Code)]

(Note on the output: The system successfully intercepted an empty validation array. This demonstrates the deterministic guardrail in action—rather than firing a broken API call, the bot executed a graceful fallback.)

Conclusion

I spent days wondering why brilliant code reviews were disappearing into the void, all because of a single hallucinated line number. Building Difflens taught me that we need to stop expecting LLMs to be perfect. The turning point wasn't a better prompt; it was accepting the AI's flaws.

Once I let the agents do the creative reasoning but forced their output through a ruthless, old-school validation layer, the silent failures finally stopped. Blindly trusting AI "magic" is a production hazard. We have to build a solid, deterministic box for that magic to safely operate in. This is where Copilot was most useful: it acted as an engineering accelerator, helping me build the rigid boundaries and safety nets needed to actually make a multi-agent system production-ready.

I'm not done yet, though. The next major headache is handling coordinate drift when a PR gets rebased mid-review. If you are building LLM pipelines and wrestling with similar Git parsing nightmares, I would love to hear how you are tackling them in the comments below.

Source Code

You can check out the full implementation of this deterministic guardrail here: Explore Difflens on GitHub. If this deep-dive helped you debug your own pipeline, I'd appreciate a ⭐️ on the project.

Why AI Code Review Tools Keep Commenting on Lines That Don’t Exist

shy The — Mon, 01 Jun 2026 11:51:51 +0000

While experimenting with AI-powered code review systems, I kept running into a strange problem.

The model would generate a perfectly reasonable review comment.

The code issue was real.

The explanation made sense.

But the comment was attached to a line that didn’t exist in the pull request.

At first, I assumed this was just another example of LLM hallucination.

After digging deeper, I found something more specific.

The Problem Isn’t Code Understanding

Most modern LLMs are surprisingly good at understanding code changes.

They can often identify:

Potential bugs
Missing edge cases
Naming issues
Logic problems

The strange part was that many review comments correctly identified a problem while referencing the wrong line.

The model understood what was wrong.

It failed to understand where it was.

The Unified Diff Trap

Most AI review systems operate on unified diffs.

A simplified example looks like this:

@@ -120,7 +120,8 @@
-const timeout = 3000;
+const timeout = 10000;
 initialize();

Humans rarely think about line coordinates because GitHub handles them automatically.

For an LLM, however, the situation is different.

The model must reconstruct file positions using:

Hunk headers
Added lines
Deleted lines
Context lines
Running offsets

A single counting mistake can shift every coordinate that follows.
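
The bookkeeping listed above is mechanical, which is why it suits ordinary code better than a model. Below is a minimal sketch that walks a single-file unified diff and assigns new-file line numbers. `indexPatch` and `LineKind` are illustrative names, and the sketch assumes the patch contains only hunks, with no `---` or `+++` file headers.

```typescript
type LineKind = "added" | "context";

// Map each new-file line number in the patch to whether it was added or is context.
function indexPatch(patch: string): Map<number, LineKind> {
  const index = new Map<number, LineKind>();
  const header = /^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@/;
  let newLine = 0;
  let inHunk = false;

  for (const raw of patch.split("\n")) {
    const m = header.exec(raw);
    if (m) {
      newLine = parseInt(m[1], 10); // the running offset resets at every hunk header
      inHunk = true;
      continue;
    }
    if (!inHunk) continue; // ignore anything before the first hunk
    const marker = raw.charAt(0);
    if (marker === "+") {
      index.set(newLine++, "added");
    } else if (marker === " ") {
      index.set(newLine++, "context");
    }
    // "-" lines exist only in the old file, and "\ No newline at end of file"
    // is a marker, so neither advances the new-file counter.
  }
  return index;
}

const index = indexPatch(
  [
    "@@ -120,7 +120,8 @@",
    "-const timeout = 3000;",
    "+const timeout = 10000;",
    " initialize();",
  ].join("\n"),
);
console.log(Array.from(index.entries()));
```

In this example the deleted line never receives a new-file number, the added line becomes 120, and the context line becomes 121. Skipping the increment on a deleted line is exactly the step where a manual count tends to go wrong.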

What I Observed

Across repeated testing, several patterns appeared frequently:

Deleted-Line References

The model sometimes generated comments that pointed to deleted lines.

The feedback itself was often valid.

The target location wasn’t.

Coordinate Drift

Large diffs increased the error rate significantly.

After enough additions and deletions, line references would gradually drift away from the intended location.

Out-of-Range Targets

Occasionally, comments referenced line numbers that simply didn’t exist inside the patch.

These comments could not be attached to the pull request at all.

Why Prompt Engineering Wasn’t Enough

My first instinct was to improve prompting.

I tried:

More explicit instructions
Structured outputs
Additional examples
Coordinate reminders

The error rate improved.

It never disappeared.

The reason seems straightforward.

Predicting text and maintaining exact positional bookkeeping are fundamentally different tasks.

A model can understand a code issue while simultaneously making a counting error.

A Different Approach

Eventually I stopped treating coordinates as trusted output.

Instead of assuming the model was correct, I added a deterministic verification step.

Every generated review comment is checked against the actual diff structure before being returned.

The validator verifies:

File existence
Valid patch coordinates
Added-line targets
Hunk boundaries
Out-of-range references

If a comment fails validation, it is either corrected or discarded.
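
A validator along these lines could be sketched as follows. The shapes and names (`Candidate`, `Verdict`, `validate`) are hypothetical, and the correction strategy, snapping to the nearest valid line within a small tolerance, is an assumption chosen for illustration; the real implementation may correct comments differently.

```typescript
// Hypothetical shapes, for illustration only.
interface Candidate {
  path: string;
  line: number;
  body: string;
}

type Verdict =
  | { status: "ok"; comment: Candidate }
  | { status: "corrected"; comment: Candidate; from: number }
  | { status: "discarded"; reason: string };

// `diffIndex` maps each changed file to the new-file lines present in its patch.
function validate(
  c: Candidate,
  diffIndex: Map<string, Set<number>>,
  maxSnapDistance = 2, // assumed tolerance; set to 0 to disable correction
): Verdict {
  const lines = diffIndex.get(c.path);
  if (!lines) return { status: "discarded", reason: "file not in diff" };
  if (!Number.isInteger(c.line)) return { status: "discarded", reason: "line is not an integer" };
  if (lines.has(c.line)) return { status: "ok", comment: c };

  // Look for the closest valid line within the tolerance.
  let best: number | undefined;
  for (const candidate of Array.from(lines)) {
    const distance = Math.abs(candidate - c.line);
    if (distance > maxSnapDistance) continue;
    if (best === undefined || distance < Math.abs(best - c.line)) best = candidate;
  }
  if (best === undefined) return { status: "discarded", reason: "line outside every hunk" };
  return { status: "corrected", comment: { ...c, line: best }, from: c.line };
}

const diffIndex = new Map<string, Set<number>>([["src/a.ts", new Set([10, 11, 12])]]);
console.log(validate({ path: "src/a.ts", line: 13, body: "Off by one" }, diffIndex));
console.log(validate({ path: "src/a.ts", line: 40, body: "Far away" }, diffIndex));
```

Snapping is a trade-off: it rescues comments that are off by a line or two, but it can also attach feedback to the wrong statement, so a conservative tolerance (or none at all) may be the safer default.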

The goal isn’t to make the reviewer smarter.

The goal is to prevent invalid comments from reaching the pull request.

Final Thoughts

One lesson stood out during this project:

Semantic understanding and coordinate accuracy are different problems.

LLMs are often better at the first than the second.

As AI tooling becomes more integrated into developer workflows, deterministic validation layers may become just as important as the models themselves.

I ended up open-sourcing the implementation here:

GitHub: https://github.com/ywu593412-afk/DiffLens

I’m curious whether other developers building AI review systems have encountered similar coordinate-mapping issues.