DEV Community

shy The


I Thought My AI Code Reviewer Was Finished. Then a Single Hallucinated Line Number Broke Everything.

GitHub “Finish-Up-A-Thon” Challenge Submission

What I Built

Difflens is a LangGraph-powered automated code review pipeline that analyzes pull requests and posts review comments through GitHub Actions.

I thought building a multi-agent reviewer would be the hard part. I was wrong.

The real challenge was making sure AI-generated comments could actually survive GitHub's API validation.

Why My PR Comments Kept Disappearing

Getting the pipeline to trigger GitHub Actions felt like a huge win. But then, I noticed a fatal issue: the comments were disappearing.

The logs showed the LLM was generating brilliant security and logic feedback, but on the actual Pull Request page, nothing was posted. After digging into the Octokit error logs, I found a subtle reliability bug: Coordinate Hallucination.

Even when the model correctly identified a vulnerability, it couldn't reliably anchor that feedback to a valid line in the Git diff. My system had a strict VerifierNode designed to block invalid API requests. When hallucinated, out-of-bounds coordinates reached the API, Octokit threw a hard 422 Unprocessable Entity error: Validation Failed: {"resource":"PullRequestReviewComment","code":"invalid","field":"line"}

The comments were silently dropped. A single hallucinated number was wiping out the entire review pipeline.

The Fix: Deterministic Guardrails

During early testing, I discovered that many otherwise valid review comments were never reaching GitHub because their generated coordinates failed validation.

The core issue isn't just the AI being 'wrong'; it's that the system trusts non-deterministic output to interact with a strict API. I implemented a deterministic layer to bridge this gap. Instead of hoping the model gets line numbers right, I parse the raw diff into an index map as a source of truth. The VerifierNode uses this map to intercept and sanitize agent outputs before they ever hit the GitHub API.

At the boundary of this layer, I enforce a hard check that strips out-of-bounds coordinates and malformed responses. To generate the 'source of truth' for this check, I implemented the parser below. It reads each hunk header in the raw diff and expands the new-file range into a set of valid line numbers, so the check accounts for every hunk's starting offset rather than a single global bound.

function parseDiffToValidLines(filePatch: string): Set<number> {
  const validLines = new Set<number>();

  // 1. Unified diff compatibility: when a range covers exactly one line, the ',1' count is omitted.
  // The ^ anchor and the m flag pin each match to the start of a line, preventing false matches inside code content.
  // The ',count' part is optional on both sides; only the new-file start and count are captured.
  const hunkHeader = /^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@/gm;
  let match: RegExpExecArray | null;

  while ((match = hunkHeader.exec(filePatch)) !== null) {
    const start = parseInt(match[1], 10);
    // If the second group (count) did not match, the range covers a single line, so default to 1.
    const count = match[2] ? parseInt(match[2], 10) : 1;

    // 2. Design choice: why extract only the line numbers after '+'?
    // Because these review comments are anchored to lines in the NEW file (right side), which the GitHub PR review API validates against the diff.
    // The loop adds every context and added line in the current hunk's new-file range.
    for (let i = start; i < start + count; i++) {
      validLines.add(i);
    }
  }
  return validLines;
}
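
For readers who want a per-line view rather than header-derived ranges, here is a minimal sketch of a line-walking variant. It is not part of Difflens: `indexPatchLines` and `LineKind` are hypothetical names, and the sketch assumes the patch text starts at a hunk header (as the `patch` field returned by GitHub's API does) and that empty context lines keep their leading space.

```typescript
type LineKind = "added" | "context";

// Hypothetical helper (not from Difflens): walks each hunk body and records
// which new-file (right-side) line numbers exist and whether each one is an
// added line or an unchanged context line.
function indexPatchLines(filePatch: string): Map<number, LineKind> {
  const index = new Map<number, LineKind>();
  const hunkHeader = /^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@/;
  let newLine = 0; // next right-side line number to assign
  let inHunk = false;

  for (const line of filePatch.split("\n")) {
    const header = hunkHeader.exec(line);
    if (header) {
      newLine = parseInt(header[1], 10);
      inHunk = true;
      continue;
    }
    if (!inHunk) continue; // ignore anything before the first hunk header

    if (line.startsWith("+")) {
      index.set(newLine++, "added");
    } else if (line.startsWith(" ")) {
      index.set(newLine++, "context");
    }
    // '-' lines exist only in the old file, and markers such as
    // '\ No newline at end of file' do not advance the counter.
  }
  return index;
}

// Usage: one hunk with three context lines, one removal, and two additions.
const samplePatch = [
  "@@ -10,4 +10,5 @@ function foo() {",
  " const a = 1;",
  "-const b = 2;",
  "+const b = 3;",
  "+const c = 4;",
  " return a;",
  " }",
].join("\n");

const sampleIndex = indexPatchLines(samplePatch);
```

For this sample, `sampleIndex` holds lines 10 through 14, with 11 and 12 marked as added. The keys match what the header-based parser produces for the same hunk; the extra classification is only useful if a later step wants to treat comments on context lines differently.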

Implementing this parseDiffToValidLines parser was the real engineering bottleneck. Translating raw unified diffs into a stable TypeScript Set requires meticulous edge-case handling—accounting for context lines, additions, and GitHub API's strict positioning rules.

Copilot turned out to be most useful when dealing with repetitive regex iterations, edge cases, and test scaffolding. Wrestling with standard diff protocols is a nightmare; handling multi-hunk shifts and avoiding encoding offsets is incredibly tedious.

Instead of debugging regex boundary errors for days, I laid out the core mathematical constraints, and Copilot did the heavy lifting. We co-authored the precise regex patterns for hunk headers and built an exhaustive suite of unit tests to smoke out boundary shifts. This tight feedback loop compressed days of frustrating manual diff-parsing into a single afternoon of rapid iteration.

If the diff is empty or malformed, no hunk header matches, so the parser returns an empty set and the comment pipeline is blocked entirely.
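
To make the empty-set behavior concrete, here is a rough sketch of how a verification step could consume the parser's output. The `ReviewComment` shape and the `verifyComments` function are assumptions for illustration, not the actual VerifierNode from Difflens.

```typescript
// Hypothetical comment shape: field names are illustrative.
interface ReviewComment {
  path: string;
  line: number;
  body: string;
}

interface VerifiedComments {
  accepted: ReviewComment[];
  rejected: ReviewComment[];
}

// Splits agent output into comments that are safe to post and comments whose
// coordinates are not in the valid-line set for their file. Rejected comments
// are returned instead of discarded so the caller can log or re-anchor them.
function verifyComments(
  comments: ReviewComment[],
  validLinesByPath: Map<string, Set<number>>,
): VerifiedComments {
  const result: VerifiedComments = { accepted: [], rejected: [] };
  for (const comment of comments) {
    const validLines = validLinesByPath.get(comment.path);
    const ok =
      validLines !== undefined &&
      Number.isInteger(comment.line) &&
      validLines.has(comment.line);
    (ok ? result.accepted : result.rejected).push(comment);
  }
  return result;
}
```

With an empty set for a file, `validLines.has` is false for every line, so every comment on that file lands in `rejected` and nothing is sent to GitHub. Returning the rejected list rather than dropping it is a deliberate choice in this sketch: it keeps the failure visible instead of silent.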

The Result: From Defensive Filtering to Proactive Injection

During early testing across 5 intentionally varied dummy pull requests, 9 out of 14 generated comments were rejected due to invalid coordinates. While the sample size is small, it was enough to expose a structural reliability issue. That's a 64% silent failure rate—brilliant engineering ideas lost in the ether simply because the AI couldn't read the Git Diff index accurately.

The breakthrough came when I inverted the architecture. By shifting the deterministic validLines map upstream, I injected it directly into the LLM's prompt context as a hard constraint. The system transitioned from a "try-and-fail" model to a "pre-verified" execution model. By forcing the agent to anchor its reasoning to the deterministic list before generating coordinates, the coordinate hallucination issue was effectively neutralized.
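
As an illustration of what injecting the valid-line set into the prompt could look like, here is a minimal sketch. The function names and the prompt wording are assumptions; the post does not show the actual prompt Difflens uses.

```typescript
// Collapse a set of line numbers into compact ranges to keep the prompt short.
function formatLineRanges(validLines: Set<number>): string {
  const sorted = Array.from(validLines).sort((a, b) => a - b);
  const ranges: string[] = [];
  let i = 0;
  while (i < sorted.length) {
    const start = sorted[i];
    let end = start;
    // Extend the current range while the next number is consecutive.
    while (i + 1 < sorted.length && sorted[i + 1] === end + 1) {
      end = sorted[++i];
    }
    ranges.push(start === end ? `${start}` : `${start}-${end}`);
    i++;
  }
  return ranges.join(", ");
}

// Hypothetical prompt fragment: the wording is illustrative only.
function buildLineConstraint(path: string, validLines: Set<number>): string {
  if (validLines.size === 0) {
    return `File ${path}: no commentable lines. Do not emit line-anchored comments for this file.`;
  }
  return (
    `File ${path}: anchor comments only to these new-file line numbers: ` +
    `${formatLineRanges(validLines)}.`
  );
}

// Example: formatLineRanges(new Set([12, 10, 11, 40])) returns "10-12, 40".
```

A constraint like this narrows what the model is asked to produce, but it is still a request rather than a guarantee, so keeping the downstream validation check in place remains a sensible safety net.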

The Next Horizon: From Hackathon to Production-Grade

While the deterministic guardrail solved our immediate coordinate hallucination crisis, taking Difflens to a true enterprise-grade standard requires a structured architectural evolution.

I am currently mapping out the next phase: implementing cross-file comment deduplication using path + line + content hashing, and building a hallucination telemetry pipeline to systematically trace prompt drift. These aren't just features—they are the safety nets required to transition from a single-agent prototype to a production-ready, multi-agent orchestrator.
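
The deduplication idea can be sketched as follows. This is a guess at one possible shape, not the planned Difflens design: the key format, the body normalization, and the use of a small FNV-1a style hash (computed over UTF-16 code units, standing in for whatever hash the project ends up using) are all assumptions.

```typescript
// Hypothetical comment shape: field names are illustrative.
interface ReviewComment {
  path: string;
  line: number;
  body: string;
}

// FNV-1a style 32-bit hash over UTF-16 code units. Non-cryptographic; used
// here only to build a compact content fingerprint.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Key = path + line + hash of the normalized body. Normalizing whitespace and
// case is an assumption, so near-identical comments collapse to one key.
function commentKey(comment: ReviewComment): string {
  const normalized = comment.body.trim().replace(/\s+/g, " ").toLowerCase();
  return `${comment.path}:${comment.line}:${fnv1a(normalized)}`;
}

// Keeps the first comment seen for each key and drops later duplicates.
function dedupeComments(comments: ReviewComment[]): ReviewComment[] {
  const seen = new Set<string>();
  const unique: ReviewComment[] = [];
  for (const comment of comments) {
    const key = commentKey(comment);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(comment);
    }
  }
  return unique;
}
```

Because the path and line are part of the key, the same remark on two different lines is kept twice. Whether that is the desired behavior for cross-file deduplication is a design question the sketch leaves open.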

Copilot was helpful here as well, particularly for prototyping validation flows and data structures around comment deduplication and telemetry collection.

The Impact (Before vs. After)

  • Before (initial benchmark across 5 multi-file PRs): 14 suggestions generated ➔ 5 posted (64% lost due to hallucination)
  • After: 14 suggestions generated ➔ 14 posted (0% lost during testing)

Demo

  1. Automated PR Feedback (GitHub Action):

  2. Engineering Implementation (VS Code):

(Note on the output: The system successfully intercepted an empty validation array. This demonstrates the deterministic guardrail in action—rather than firing a broken API call, the bot executed a graceful fallback.)

Conclusion

I spent days wondering why brilliant code reviews were disappearing into the void, all because of a single hallucinated line number. Building Difflens taught me that we need to stop expecting LLMs to be perfect. The turning point wasn't a better prompt; it was accepting the AI's flaws.

By letting the agents do the creative reasoning, but forcing their output through a ruthless, old-school validation layer, the silent failures finally stopped. Blindly trusting AI "magic" is a production hazard. We have to build a solid, deterministic box for that magic to safely operate in. This is where Copilot continues to be useful: it acts as an engineering accelerator, helping me build the rigid boundaries and safety nets needed to actually make a multi-agent system production-ready.

I'm not done yet, though. The next major headache is handling coordinate drift when a PR gets rebased mid-review. If you are building LLM pipelines and wrestling with similar Git parsing nightmares, I would love to hear how you are tackling them in the comments below.

Source Code

You can check out the full implementation of this deterministic guardrail here: Explore Difflens on GitHub. If this deep-dive helped you debug your own pipeline, I'd appreciate a ⭐️ on the project.
