AI Code Review: When to Trust the Suggestion

#codereview #aicoding #workflow #cursor

This article was originally published on aicoderscope.com

The core problem with AI coding tools is not that they produce bad code. It's that they produce plausible-looking bad code. A broken SQL query and a correct one look identical at first glance. An authentication bypass can fit in two lines. Confident, readable, wrong.

This is the trust problem. And "just review it carefully" is not a decision rule — it's a way of saying you haven't thought it through yet.

This article gives you a concrete framework: not "AI is good or bad" but a tiered decision system, broken down by suggestion type, so you can calibrate your attention where it actually matters.

Why Trust Is the Core Problem

GitHub's own research on Copilot found acceptance rates in the 30–35% range for inline completions across languages. Cursor Tab reports similar numbers internally. Roughly two-thirds of suggestions get rejected — and that's for developers who are already experienced with the tool and have trained themselves on what's plausible vs. what's real.

That 30–35% figure cuts both ways. It leaves room for two opposite failure modes:

Trusting too much: accepting broken suggestions, especially ones that look correct but have subtle logic errors, wrong method signatures from stale training data, or security flaws. You pay later in debugging time and, in the worst case, incidents.
Rejecting too much: dismissing good suggestions because you don't trust the tool. This is also waste — you're running an AI assistant and then ignoring most of its output. The throughput benefit evaporates.

Both failure modes waste time. The goal is accurate trust calibration, not maximizing acceptance or maximizing skepticism.

A 2022 Stanford study ("Do Users Write More Insecure Code with AI Assistants?") found that participants with access to an AI coding assistant wrote measurably less secure code than those without one, and were more likely to believe their code was secure. They trusted confident-sounding suggestions in exactly the domains where AI is least reliable: authentication logic, cryptography, and input validation. The SecurityEval dataset (Siddiq and Santos, 2022) was built to probe the same weakness: 130 prompts covering 75 vulnerability types (CWEs), designed to test whether code generators emit insecure code. Both predate the current generation of models, but the failure modes are architectural: LLMs optimize for plausibility, not correctness, and security code is full of non-obvious invariants that look fine to a pattern-matcher.

The fix isn't to distrust AI. It's to distrust it selectively and systematically.

The Trust Taxonomy

Here is a three-tier framework. Every suggestion you're about to accept fits into one of these buckets. The tier tells you how much attention to spend before hitting Tab or accepting the diff.

Tier 1 — High Trust: Accept with a Quick Scan

These suggestion types are low-risk. AI rarely breaks them, and when it does, the failure is obvious and easy to catch.

Boilerplate generation: Class scaffolding, interface declarations, test describe/it structure, standard import blocks. These follow rigid patterns with minimal variation. If the AI fills in a @Service-decorated Spring class or a pytest fixture, it's almost certainly correct. Quick-scan for obvious typos and move on.

Type annotation completion: LLMs are genuinely strong at inferring types from context. If you have a function that takes a User object and returns a list of Post objects, the AI's type signature is almost always right. In TypeScript and Python especially, these suggestions save real time with very low error rates.

Documentation and comments: The worst-case outcome is a comment that's slightly imprecise. It will not ship a bug. Accept freely, read once, adjust if it's wrong about intent.

Simple utility functions with obvious implementations: String formatting, date arithmetic, basic array filtering, number formatting — functions with one obvious correct implementation. If there are three ways to format a phone number and two of them are wrong, the AI usually picks the right one. For single-path-to-correctness implementations, accept and run the tests.

CSS and styling: Visual output is verifiable in under three seconds. If the suggestion makes the button the right color, it's correct. Styling is self-testing.

The quick-scan discipline for Tier 1: eyes on the variable names and any hardcoded values. The pattern is right; the values might not match your context.

Tier 2 — Medium Trust: Read Carefully Before Accepting

These are suggestions that look correct and usually are — but have a class of failure modes that's expensive if you miss them. Spend the time to actually read the suggestion before accepting.

Database queries: The structure is almost always syntactically correct. The problems are semantic and performance-related: N+1 query patterns that look fine until you hit 10,000 records, filters on columns that have no index, the wrong JOIN type (LEFT vs. INNER when it matters), or LIKE patterns with a leading wildcard that force a full scan on large tables. Read the query. If you're not immediately sure what it does to query volume, run EXPLAIN before shipping it.
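To make the N+1 shape concrete, here is a minimal sketch. Everything in it is hypothetical: FakeDb is an in-memory stand-in that only counts round trips, and the SQL in the comments is what each method would correspond to in a real data layer, not any particular ORM's API.

```typescript
type User = { id: number; name: string };
type Post = { id: number; userId: number; title: string };

// Hypothetical stand-in for a database client; it only counts queries.
class FakeDb {
  queryCount = 0;
  private users: User[];
  private posts: Post[];

  constructor(users: User[], posts: Post[]) {
    this.users = users;
    this.posts = posts;
  }

  // SELECT * FROM users
  allUsers(): User[] {
    this.queryCount++;
    return this.users.slice();
  }

  // SELECT * FROM posts WHERE user_id = ?
  postsForUser(userId: number): Post[] {
    this.queryCount++;
    return this.posts.filter((p) => p.userId === userId);
  }

  // SELECT * FROM posts WHERE user_id IN (...)
  postsForUsers(userIds: number[]): Post[] {
    this.queryCount++;
    const wanted = new Set(userIds);
    return this.posts.filter((p) => wanted.has(p.userId));
  }
}

// N+1 shape: one query for the users, then one more per user.
function titlesByUserNPlusOne(db: FakeDb): Map<number, string[]> {
  const result = new Map<number, string[]>();
  for (const user of db.allUsers()) {
    result.set(user.id, db.postsForUser(user.id).map((p) => p.title));
  }
  return result;
}

// Batched shape: two queries total, grouped in memory.
function titlesByUserBatched(db: FakeDb): Map<number, string[]> {
  const users = db.allUsers();
  const result = new Map<number, string[]>();
  for (const user of users) {
    result.set(user.id, []);
  }
  for (const post of db.postsForUsers(users.map((u) => u.id))) {
    result.get(post.userId)?.push(post.title);
  }
  return result;
}

// Test data: each user gets two posts.
function makeDb(userCount: number): FakeDb {
  const users: User[] = [];
  const posts: Post[] = [];
  for (let id = 1; id <= userCount; id++) {
    users.push({ id, name: `user${id}` });
    posts.push({ id: id * 2 - 1, userId: id, title: `first post by ${id}` });
    posts.push({ id: id * 2, userId: id, title: `second post by ${id}` });
  }
  return new FakeDb(users, posts);
}

const slow = makeDb(100);
titlesByUserNPlusOne(slow);
console.log(slow.queryCount); // 101

const fast = makeDb(100);
titlesByUserBatched(fast);
console.log(fast.queryCount); // 2
```

Both functions return the same data, which is why the first one survives review. The difference only shows up in the query count, and that is the number to check before accepting.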

Error handling: AI has a strong training signal toward "add a try/catch." The catch blocks it produces are often too broad (catch (e) {} that silently swallows everything) or log the error and then continue in a state that's invalid. Read every catch block the AI writes. Check that it re-throws when appropriate, that it doesn't catch exception types it doesn't understand, and that it doesn't log sensitive data (stack traces with connection strings, user data).
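A small sketch of the difference between a catch block that swallows everything and one that handles only what it understands. The function names and the config shape (a JSON object with a numeric port) are made up for illustration.

```typescript
// Too broad: every failure, including bugs unrelated to parsing, silently
// turns into "use the default port".
function loadPortSwallowing(raw: string): number {
  try {
    return JSON.parse(raw).port;
  } catch {
    return 8080;
  }
}

// Narrower: handle the one failure we understand, let everything else
// propagate, and validate the result instead of continuing in a bad state.
function loadPortNarrow(raw: string): number {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (e) {
    if (e instanceof SyntaxError) {
      // Deliberately does not echo the raw input, which may hold secrets.
      throw new Error("config is not valid JSON");
    }
    throw e; // not a failure this function knows how to handle
  }
  if (
    typeof parsed !== "object" ||
    parsed === null ||
    typeof (parsed as { port?: unknown }).port !== "number"
  ) {
    throw new Error("config must be an object with a numeric port");
  }
  return (parsed as { port: number }).port;
}

console.log(loadPortSwallowing("null")); // 8080, the TypeError was swallowed
console.log(loadPortNarrow('{"port": 3000}')); // 3000
```

In the first version, a config of null and a config with a typo both come back as port 8080 and nobody finds out. The second version fails loudly in both cases, which is usually what you want at startup.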

Algorithm implementations: The pseudocode logic is usually correct. The implementation often isn't optimal. O(n²) where O(n) is straightforward. Nested loops where a hash map would do. The AI wasn't penalized for inefficiency in its training data — most code that gets merged is correct first, fast second. For any algorithm with a non-trivial input size, check the time complexity before accepting.
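As one illustration of the nested-loop-versus-hash-map point, here are two versions of a duplicate check. The function names are hypothetical; the first is the shape that tends to get suggested, the second is the same logic with a Set.

```typescript
// O(n^2): compares every pair of elements.
function hasDuplicateQuadratic(ids: number[]): boolean {
  for (let i = 0; i < ids.length; i++) {
    for (let j = i + 1; j < ids.length; j++) {
      if (ids[i] === ids[j]) return true;
    }
  }
  return false;
}

// Expected O(n): one pass, remembering what has been seen in a hash set.
function hasDuplicateLinear(ids: number[]): boolean {
  const seen = new Set<number>();
  for (const id of ids) {
    if (seen.has(id)) return true;
    seen.add(id);
  }
  return false;
}

console.log(hasDuplicateQuadratic([3, 1, 4, 1, 5])); // true
console.log(hasDuplicateLinear([3, 1, 4, 1, 5])); // true
console.log(hasDuplicateLinear([3, 1, 4, 5])); // false
```

Both pass the same tests on small inputs. At 100,000 elements with no duplicates, the first does roughly five billion comparisons and the second does 100,000 set lookups.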

API client code: This is the training-data-staleness problem. The AI was trained on docs and Stack Overflow threads from some point in the past. SDK method signatures change. Auth flows evolve. Deprecated methods get removed. An AI suggestion that calls aws.s3.putObject() with parameter names from 2023 will pass the linter and fail at runtime. For any third-party API call, verify the method signature against the current official docs — not Stack Overflow, the official docs — before accepting.

Tier 3 — Low Trust: Always Verify Manually

These are domains where the cost of a wrong suggestion is high, the error is non-obvious, and AI failure rates are structurally elevated. Do not accept these without a deliberate manual review, regardless of how confident the suggestion looks.

Authentication and authorization logic: A single logic error here is a breach. The AI does not understand the threat model of your application. It understands the pattern of auth code. An if (user.role === 'admin' || user.id === id) check that should be && instead of || looks correct at a glance. So does an RBAC check that passes when the resource doesn't exist instead of failing safe. These errors are common in AI-generated auth code precisely because the shape of the pattern is right and only careful reading of the logic catches the inversion. Treat AI auth suggestions as a draft, not a final answer.
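Here is a sketch of both failure shapes side by side. The policy is an assumption made up for this example: only an admin may delete a document, and only one they own. The types and function names are hypothetical.

```typescript
type User = { id: string; role: "admin" | "member" };
type Doc = { id: string; ownerId: string };

// Draft shape: reads fine, but fails open twice.
function canDeleteDraft(user: User, doc: Doc | undefined): boolean {
  if (doc === undefined) return true; // missing resource is treated as allowed
  return user.role === "admin" || doc.ownerId === user.id; // || where the policy needs &&
}

// Reviewed shape: deny unless every condition is positively established.
function canDeleteReviewed(user: User, doc: Doc | undefined): boolean {
  if (doc === undefined) return false; // fail safe on a missing resource
  return user.role === "admin" && doc.ownerId === user.id;
}

const member: User = { id: "u1", role: "member" };
const admin: User = { id: "u2", role: "admin" };
const memberDoc: Doc = { id: "d1", ownerId: "u1" };

console.log(canDeleteDraft(member, memberDoc)); // true, a non-admin can delete
console.log(canDeleteReviewed(member, memberDoc)); // false
console.log(canDeleteDraft(admin, undefined)); // true, nothing was checked
console.log(canDeleteReviewed(admin, undefined)); // false
```

The two functions differ by one operator and one boolean literal. Whether || or && is right depends entirely on the policy, which is exactly the part the AI cannot see, so write the policy down first and test the deny cases explicitly.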

Cryptography: Never accept AI crypto code without a security audit. The failure modes are invisible. IV reuse in AES-CBC, ECB mode picked up from a tutorial in the training data, password hashes or derived keys computed without a unique salt: all of these produce code that functions correctly in tests and is catastrophically broken in production. Use established, audited libraries (argon2, libsodium, bcrypt) and only accept AI suggestions for the invocation of those libraries, not for any crypto logic itself.
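For a sense of what "invocation only" looks like, here is a minimal password-hashing sketch. It swaps in Node's built-in scrypt from node:crypto so the example runs without installing anything; it is not one of the libraries named above. The 16-byte salt, 64-byte output, and default cost settings are assumptions for illustration, not tuned recommendations.

```typescript
import { randomBytes, scryptSync, timingSafeEqual } from "node:crypto";

// The salt is stored next to the hash; both are needed to verify later.
type StoredPassword = { salt: Buffer; hash: Buffer };

function hashPassword(password: string): StoredPassword {
  const salt = randomBytes(16); // fresh random salt for every password
  const hash = scryptSync(password, salt, 64);
  return { salt, hash };
}

function verifyPassword(password: string, stored: StoredPassword): boolean {
  // Re-derive with the stored salt, then compare in constant time.
  const candidate = scryptSync(password, stored.salt, stored.hash.length);
  return timingSafeEqual(candidate, stored.hash);
}

const stored = hashPassword("correct horse battery staple");
console.log(verifyPassword("correct horse battery staple", stored)); // true
console.log(verifyPassword("wrong guess", stored)); // false
```

The things to check in a suggestion like this are the ones the code makes visible: the salt comes from a secure random source, it is generated per password rather than hardcoded, and the comparison uses timingSafeEqual rather than ===. Everything else is the library's job.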

Concurrency — mutexes, races, async coordination: AI models produce plausible concurrent c