Your linter says 0 warnings.
Your type checker is clean.
Your AI reviewer says LGTM.
And yet the bug ships.
TL;DR — A category of bugs slips past linters, type checkers, and AI reviewers because they only show up when two functions interact in a way neither author anticipated. I built a free, open-source plugin for Claude Code / Codex CLI / Gemini CLI that uses a methodology called semi-formal execution tracing to catch them. Repo + install: github.com/hyhmrright/logic-lens — feedback and PRs welcome.
This post is about a class of bugs that pass every automated check we have, because the bug only appears when two functions interact in a way neither author anticipated — and how a methodology called semi-formal execution tracing can catch them with surprising consistency.
The bug your tools can't see
Here's a real-looking function. It passes lint, passes type checks, passes its unit tests:
```python
def process_order(order_id, items, discount_code=None):
    order = db.get_order(order_id)
    total = sum(item['price'] * item['qty'] for item in items)
    if discount_code:
        discount = coupon_service.get_discount(discount_code)
        total = total * (1 - discount)
    order['total'] = total
    order['items'] = items
    db.save_order(order)
    email_service.send_confirmation(order['email'], total)
```
Looks fine, right? There are at least three real bugs here, and none of them are visible without tracing across function boundaries:
- `get_discount` can return `None` for invalid or expired coupons. `1 - None` raises `TypeError`. The order is never saved — but the user might assume it was, because no exception is logged at the call site.
- An empty `items` list silently produces a $0.00 order. `sum([])` returns `0`. The order is saved and a confirmation email is sent. No invariant catches this.
- If `email_service.send_confirmation` raises, the database connection is never explicitly released. Under sustained SMTP failure, the pool exhausts.
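To make the first two concrete, here's a minimal sketch you can run in a plain Python REPL (the third bug needs a real connection pool to reproduce, so it's omitted):

```python
# Minimal repro of the first two divergences; plain Python, no services involved.
items = []  # an empty cart
total = sum(item['price'] * item['qty'] for item in items)
print(total)  # 0, so a $0.00 order would be saved and confirmed without complaint

discount = None  # what get_discount yields for an expired or invalid code
try:
    total = total * (1 - discount)
except TypeError as exc:
    print(exc)  # unsupported operand type(s) for -: 'int' and 'NoneType'
```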
A linter can't see any of this. A type checker without strict Optional annotations can't either. And in my experience, plain LLM code review catches maybe one of the three on a good day — usually the most surface-level one.
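The type-checker gap is closable, but only if the coupon service is annotated honestly. Here's a sketch of what that might look like; the signature for `get_discount` and the `apply_discount` helper are hypothetical, since the real `coupon_service` interface isn't shown above:

```python
from typing import Optional

def get_discount(code: str) -> Optional[float]:
    """Hypothetical annotated version: a rate in [0, 1], or None for bad codes."""
    ...

def apply_discount(total: float, code: str) -> float:
    discount = get_discount(code)
    # With the Optional return type declared, mypy rejects this subtraction
    # because discount may be None; without the annotation it sails through.
    return total * (1 - discount)
```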
Why AI reviewers miss interprocedural bugs
A 2026 paper, Agentic Code Reasoning (Ugare & Chandra), measured this directly. When LLMs are asked to review code with unstructured chain-of-thought, they catch interprocedural bugs at 76–78% accuracy. When the same models are forced to follow a structured reasoning template — stating premises explicitly before tracing execution — accuracy jumps to 87–93%.
The gap isn't capability. It's discipline. Without structure, the model pattern-matches on what the code looks like and anchors on the happy path. With structure, it has to actually trace what happens when assumptions break.
The Iron Law: four sections per finding
I built a small open-source project called Logic-Lens that operationalizes this methodology. It's a plugin that works with Claude Code, Codex CLI, and Gemini CLI. The core idea: no finding is reported unless it has all four of these sections.
- Premises — every assumption about name resolution, types, and preconditions, stated explicitly.
- Trace — the actual execution path, step by step, crossing function boundaries.
- Divergence — the exact point where a premise breaks and what happens after.
- Remedy — a fix that addresses the divergence, not its symptom.
For the process_order example above, here's an abbreviated version of what Logic-Lens produces for the first finding:
🔴 L6 — Callee Contract Mismatch: `get_discount` may return `None`

- Premises — `coupon_service.get_discount(code)` is assumed to always return a numeric discount rate between 0 and 1.
- Trace — `discount_code` is truthy → `get_discount(discount_code)` is called → the result is assigned to `discount` → `total * (1 - discount)` is evaluated.
- Divergence — `get_discount` returns `None` for expired or invalid codes (documented in its docstring). When `discount` is `None`, `1 - None` raises `TypeError`. The call reaches `db.save_order` only on the happy path.
- Remedy — Check `if discount is not None:` before applying. Or have `get_discount` raise a typed `InvalidCouponError` that the caller handles explicitly.
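Applied back to `process_order`, that remedy might look roughly like this. It's a sketch, not the plugin's output: `InvalidCouponError` is a hypothetical project-defined exception, and I've folded in a guard for the empty-items finding as well.

```python
class InvalidCouponError(Exception):
    """Hypothetical typed error for rejected coupon codes."""

def process_order(order_id, items, discount_code=None):
    order = db.get_order(order_id)
    if not items:
        raise ValueError(f"order {order_id} has no items")  # closes the $0.00-order hole
    total = sum(item['price'] * item['qty'] for item in items)
    if discount_code:
        discount = coupon_service.get_discount(discount_code)
        if discount is None:
            # Fail loudly at the point of divergence instead of a TypeError two lines later.
            raise InvalidCouponError(discount_code)
        total = total * (1 - discount)
    order['total'] = total
    order['items'] = items
    db.save_order(order)
    email_service.send_confirmation(order['email'], total)
```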
The difference from "this looks wrong" is that you can audit the reasoning. If the trace is wrong, you can point to which step is wrong. If the premise is bogus, you can correct it. The reasoning is a first-class artifact.
Six skills, one taxonomy
Logic-Lens defines six logic risk dimensions (L1–L6): Shadow Override, Type Contract Breach, Boundary Blindspot, State Mutation Hazard, Control Flow Escape, Callee Contract Mismatch. Every finding is labeled with one of these so reports are scannable.
The project ships six commands:
- `/logic-review` — find behavioral bugs via execution tracing
- `/logic-explain` — step-by-step execution explanation that crosses function boundaries
- `/logic-diff` — semantic equivalence check between two versions of a function
- `/logic-locate` — root cause localization for a failing test or crash
- `/logic-health` — aggregate logic health score across a codebase
- `/logic-fix-all` — autonomous audit-and-fix pipeline for an entire repo
Install on Claude Code:
```
/plugin marketplace add hyhmrright/logic-lens
/plugin install logic-lens@logic-lens-marketplace
```
Then run `/logic-review` and paste any function. Gemini CLI and Codex CLI have equivalent install paths in the README.
What this is not
It's not a replacement for your linter — your linter still catches the syntax-level stuff much faster. It's not a static analyzer in the academic sense; it doesn't execute code or build an AST. It's a structured prompting methodology that turns the LLM into a more disciplined reviewer.
What it's good at: callee contract violations, state mutation hazards, and control flow escapes — the bugs that cause production incidents in syntax-clean, lint-passing code.
I'd love help making it better
The project is MIT-licensed: github.com/hyhmrright/logic-lens
The most valuable contribution right now is new eval test cases — especially interprocedural bugs from real production incidents you've encountered. There's an issue template that walks you through the format: paste the minimal repro, label the L1–L6 category, and describe what the four sections (Premises/Trace/Divergence/Remedy) should contain.
If you try it on real code and it catches something useful — or misses something it shouldn't — please open an issue. The benchmark suite genuinely needs more coverage, and that's where the project gets sharper over time.
If this resonates, a ⭐ on the repo helps others find it. Thanks for reading.