# Why AI Code Review Misses Logic Bugs — And How Structured Execution Tracing Fixes It

Your CI is green. Your linter is happy. The PR has three approvals. And yet — three weeks later, 2 a.m. PagerDuty.

Sound familiar?

The bugs that cause real production outages rarely look wrong. They pass lint. They pass review. They often pass tests. They emerge from the interaction between two functions, where neither author anticipated what happens when their assumptions silently collide.

This is the problem Logic-Lens was built to solve.


## The Root Cause: Pattern Matching vs. Reasoning

When you ask an AI to review code without structure, it pattern-matches. It compares your code to patterns it has seen before. This works well for style. It fails for logic bugs — because logic bugs live in syntax-clean, lint-passing code that looks perfectly normal.

The research is unambiguous. Models using structured semi-formal reasoning achieve 87–93% accuracy on interprocedural code semantics tasks; unstructured chain-of-thought manages 76–78%. And the gap is widest on exactly the class of bugs that causes production incidents.

The difference isn't model capability. It's methodology.


## The Fix: Structured Execution Tracing

The key insight: force the model to build an explicit execution trace before reaching any conclusion.

Instead of "does this look right?", the model must:

  1. State premises — every assumption about types, nullability, and preconditions
  2. Trace execution — follow the actual path step by step, crossing function boundaries
  3. Identify divergence — find the exact point where a premise breaks
  4. Prescribe remedy — fix the root cause, not just the symptom
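
If you want to approximate this without a plugin, the review scaffold implied by these four steps looks roughly like this (a sketch of the idea, not Logic-Lens's actual prompt):

```
For each function touched by the change:
1. PREMISES: list every assumption about argument types, nullability,
   callee return values, and preconditions.
2. TRACE: walk the execution path step by step, following calls into
   other functions instead of summarizing them.
3. DIVERGENCE: mark the exact step where a premise stops holding.
4. REMEDY: fix the broken premise, not the symptom.
No finding may be reported unless all four sections are filled in.
```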

This is the methodology behind Logic-Lens — an open-source plugin for Claude Code, Codex CLI, and Gemini CLI that enforces structured execution tracing on every code review.


## What It Looks Like in Practice

Here's a deceptively ordinary Python function:

```python
def process_order(order_id, items, discount_code=None):
    order = db.get_order(order_id)
    total = sum(item['price'] * item['qty'] for item in items)

    if discount_code:
        discount = coupon_service.get_discount(discount_code)
        total = total * (1 - discount)

    order['total'] = total
    order['items'] = items
    db.save_order(order)
    email_service.send_confirmation(order['email'], total)
```

Looks fine. Three approvals. Ships to production. Here's what Logic-Lens produces:

```
Logic Health: 31/100

🔴 L6 — Callee Contract Mismatch
   Premises: coupon_service.get_discount(discount_code) → assumed float
   Trace: get_discount returns None for expired codes (documented in coupon_service.py:47)
   Divergence: total * (1 - None) raises TypeError at runtime
   Remedy: Guard with `if discount is not None` before applying. Add contract test.

🔴 L3 — Boundary Blindspot
   Premises: items assumed non-empty
   Trace: sum() over [] returns 0 → order saved with total = $0.00
   Divergence: No validation before db.save_order
   Remedy: Assert len(items) > 0 or raise ValueError("Order must have at least one item")

🟡 L5 — Control Flow Escape
   Premises: email_service.send_confirmation assumed non-throwing
   Trace: SMTPException propagates before db connection cleanup
   Divergence: Connection pool exhausted under sustained email failures
   Remedy: Wrap email send in try/finally; release connection unconditionally
```

Every finding includes all four sections — Premises, Trace, Divergence, Remedy. That's the Iron Law of Logic-Lens: no finding ships without showing its work.
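
For concreteness, here's the same function with all three remedies applied. This is a sketch: `db`, `coupon_service`, and `email_service` are the services assumed above, and `db.release_connection()` is a hypothetical cleanup hook, since the snippet doesn't show the actual pooling API.

```python
def process_order(order_id, items, discount_code=None):
    # L3 remedy: reject empty orders before any side effects
    if not items:
        raise ValueError("Order must have at least one item")

    order = db.get_order(order_id)
    total = sum(item['price'] * item['qty'] for item in items)

    if discount_code:
        discount = coupon_service.get_discount(discount_code)
        # L6 remedy: get_discount returns None for expired codes
        if discount is not None:
            total = total * (1 - discount)

    order['total'] = total
    order['items'] = items
    db.save_order(order)

    # L5 remedy: an SMTP failure must not leak the pooled connection
    try:
        email_service.send_confirmation(order['email'], total)
    finally:
        db.release_connection()  # hypothetical cleanup hook
```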


## Nine Logic Risk Categories

Logic-Lens evaluates code across nine dimensions:

| Code | Name | What It Catches |
|------|------|-----------------|
| L1 | Shadow Override | Variable shadowing across scopes |
| L2 | Type Contract Breach | Type assumptions that break at runtime |
| L3 | Boundary Blindspot | Edge cases (empty, zero, max) |
| L4 | State Mutation Hazard | Shared mutable state side effects |
| L5 | Control Flow Escape | Exception paths that skip cleanup |
| L6 | Callee Contract Mismatch | Return value assumptions that fail |
| L7 | Concurrency/Async Hazard | Race conditions, await misuse |
| L8 | Resource Lifecycle Issue | Leaked connections, handles, memory |
| L9 | Time/Locale Hazard | Timezone, clock, and locale bugs |
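
To make one of these concrete, here's a minimal, lint-clean illustration of an L9 Time/Locale Hazard (my example, not Logic-Lens output):

```python
from datetime import datetime, timezone

deadline = datetime(2026, 1, 1)    # naive: implicitly server-local time
now = datetime.now(timezone.utc)   # aware: explicit UTC

# Comparing the two looks fine in review but fails at runtime:
# TypeError: can't compare offset-naive and offset-aware datetimes
expired = now > deadline.replace(tzinfo=timezone.utc)  # fix: make both aware
```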

## How It Compares

|  | Logic-Lens | ESLint/Pylint | GitHub Copilot Review | Plain AI |
|---|---|---|---|---|
| Explicit execution trace | ✅ | ❌ | ❌ | ❌ |
| Premises → Trace → Divergence → Remedy | ✅ | ❌ | ❌ | ❌ |
| Interprocedural bug detection | ✅ | ❌ | ~ | ~ |
| Zero config, any language | ✅ | ❌ | ✅ | ✅ |
| Auditable / reproducible reasoning | ✅ | ❌ | ❌ | ❌ |

Logic-Lens doesn't replace your linter. It catches what linters structurally cannot: behavioral bugs in syntax-clean code.


## Benchmark: 91% vs. 19%

Across three real-world codebases with documented production bugs:

- **Logic-Lens:** 91% pass rate on interprocedural, boundary, and state-mutation scenarios
- **Plain AI (unstructured):** 19%

The gap isn't about what the model can find with perfect prompting. It's about what it finds consistently, on every run, with a traceable reasoning chain that shows its work.


## Six Skills, One Install

| Skill | Purpose |
|-------|---------|
| `logic-lens` | Full structured trace (the full review) |
| `logic-lens-quick` | Fast path for time-sensitive reviews |
| `logic-lens-security` | OWASP-mapped security focus |
| `logic-lens-perf` | Bottleneck and complexity hunting |
| `logic-lens-diff` | PR diff review (interprocedural focus) |
| `logic-lens-report` | Team-ready output with severity scoring |

## Install in 60 Seconds

**Claude Code:**

```
/plugin marketplace add hyhmrright/logic-lens
/plugin install logic-lens@logic-lens-marketplace/logic-review
```

**Gemini CLI:**

```
/extensions install https://github.com/hyhmrright/logic-lens
```

**Codex CLI:**
See the [README](https://github.com/hyhmrright/logic-lens) for the skill installer command.

---

## Try It

If you've shipped a bug that passed review, it's worth running Logic-Lens on the function that caused it. The trace output is often illuminating even in retrospect.

⭐ [github.com/hyhmrright/logic-lens](https://github.com/hyhmrright/logic-lens)

**Which of the nine risk categories (L1–L9) have you hit most in production?** Drop a comment — happy to run Logic-Lens on a representative example and share the raw output.

---

## Related

If you also care about *why* your architecture has decay risks — not just where behavioral bugs live — I wrote a companion piece grounding AI code review in 12 classic engineering books:

[Show DEV: brooks-lint — an AI code reviewer that cites Fowler, Martin, and Brooks](https://dev.to/hyhmrright/i-synthesized-12-classic-engineering-books-into-an-ai-code-reviewer-heres-what-it-caught-3ed1)

The two tools cover different failure modes and work well together: Logic-Lens catches runtime behavioral bugs via execution tracing; brooks-lint diagnoses architectural decay against Fowler, Martin, Evans, and nine others.
