I stopped writing code by hand a while ago. Claude writes it, I review it, it ships. It works, so why go back?
But here's the thing -- if AI writes all the code, who reviews it? Another AI, obviously. So I built brunt, an adversarial code review tool that throws LLMs at your diffs to find bugs and security issues.
The problem: which AI do you point it at? I have a Claude subscription (CLI access) and I have an API key. Same company, same models -- they should give the same results, right? (I also gave Ollama a try; it didn't make the cut.)
I tested this against a real refactor on my Rust/Axum backend -- replacing four old subsystems with a new AI scenarios feature. 20 commits, 77 files, +1,566 / -5,900 lines. I ran brunt three ways:
- **Claude CLI** -- uses your Claude subscription via `claude -p`
- **Anthropic API (Sonnet)** -- `claude-sonnet-4-6` via HTTP
- **Anthropic API (Opus)** -- `claude-opus-4-6` via HTTP
Same diff. Same tool. Same prompts. Wildly different results.
## The results
Seven findings vs eighty-four findings. Same model family, same prompts. What happened?
## More findings is not better
The CLI run found 7 issues. Every single one was a real, actionable bug. The best catch: a missing `.await` on an async function call that silently dropped a `Future` -- the scenario trigger would never fire.

```rust
// Bug: this creates a Future but never polls it
state.scenario_trigger.on_activity_created(
    user.tenant_id, &activity, &state
);

// Should be:
state.scenario_trigger.on_activity_created(
    user.tenant_id, &activity, &state
).await;
```
Rust compiles this without error -- futures are lazy, so the unpolled call silently does nothing (at most you get an `unused_must_use` warning). That is exactly the kind of bug you want an AI reviewer to catch.
Sonnet's 84 findings included a lot of noise. It flagged bugs in deleted code -- code that no longer exists in the codebase. It reported concerns about parameter binding in functions that were entirely removed in the same PR. Technically correct observations about the diff in isolation, but not real bugs.
Opus found 44 issues. Eight were marked critical -- but they were all "removed module declaration breaks dependents." True if you only see one file, false when you realize the dependents were also removed in the same PR. The model couldn't see across files.
The takeaway: A noisy reviewer that cries wolf on 84 issues trains you to ignore findings. A precise reviewer that surfaces 7 real concerns gets your attention.
## The debugging adventure
Before I got these results, I spent an hour (really, an hour?) debugging why the API runs returned zero findings.
The first API run completed in 3 seconds for 75 files and found nothing. That's suspicious -- 75 LLM calls in 3 seconds is physically impossible. Something was silently failing.
### Bug 1: Wrong model ID
The config had `claude-sonnet-4-6-20250514` as the model. The Claude CLI resolves this alias fine. The Anthropic API does not -- it returns a 404. Every single API call was failing.
But brunt uses `Promise.allSettled` to collect per-file results:

```typescript
const perFileResults = await Promise.allSettled(
  files.map((file) => vector.analyze([file], context, provider))
);
const findings = perFileResults.flatMap((r) =>
  r.status === "fulfilled" ? r.value : []
);
```
Rejected promises get silently mapped to empty arrays. Zero findings, zero errors shown to the user. The output? "No issues found." Completely misleading.
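One way out -- the names here are mine, not brunt's actual API -- is to keep the settled results but split them, so rejections stay visible to the caller:

```typescript
// Hypothetical sketch: keep findings from fulfilled analyses, but also
// collect the failures so the run can report them instead of claiming
// "no issues found" when the analysis never happened.
type Finding = { file: string; message: string };

function collectFindings(
  results: PromiseSettledResult<Finding[]>[]
): { findings: Finding[]; failures: unknown[] } {
  const findings: Finding[] = [];
  const failures: unknown[] = [];
  for (const r of results) {
    if (r.status === "fulfilled") findings.push(...r.value);
    else failures.push(r.reason);
  }
  return { findings, failures };
}
```

If `failures` is non-empty, the tool can exit non-zero or at least print a warning -- the difference between "no issues" and "analysis did not run" stays visible.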
### Bug 2: No concurrency limiting
The engine fires all 75 files simultaneously as parallel API calls. Per vector, that is 75 concurrent requests. Two vectors = 150 concurrent HTTP requests hitting Anthropic at once.
Result: mass rate limiting (HTTP 429). Same silent failure -- all rejected promises dropped.
### Bug 3: Output token limit too low

The default `max_tokens` was 4096. For a file that produces many findings, the response gets truncated mid-JSON. The parser fails. Zero findings.
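A guard that would have caught this: the Anthropic Messages API reports why generation stopped in a `stop_reason` field, and `"max_tokens"` means the output was cut off. A sketch (the `parseFindings` name is illustrative, not brunt's code):

```typescript
// Treat a truncated response as a loud error instead of parsing partial
// JSON into "zero findings".
function parseFindings(stopReason: string | undefined, text: string): unknown[] {
  if (stopReason === "max_tokens") {
    throw new Error(
      "response truncated at max_tokens -- raise the limit or split the file"
    );
  }
  return JSON.parse(text);
}
```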
### Bug 4: No retry on rate limiting
The provider threw immediately on any non-200 response. No backoff, no retry. One 429 and the analysis for that file is gone.
All four bugs shared one pattern: the tool silently produced empty results instead of erroring. "No issues found" when the analysis didn't actually run. This is worse than a crash -- it builds false confidence.
## The fixes
**Concurrency limiting** -- added a worker pool that limits API calls to 5 at a time:

```typescript
async function runWithConcurrency<T>(
  tasks: (() => Promise<T>)[],
  concurrency: number
): Promise<PromiseSettledResult<T>[]> {
  const results: PromiseSettledResult<T>[] = new Array(tasks.length);
  let nextIndex = 0;

  async function worker() {
    while (nextIndex < tasks.length) {
      const idx = nextIndex++;
      try {
        results[idx] = { status: "fulfilled", value: await tasks[idx]!() };
      } catch (reason) {
        results[idx] = { status: "rejected", reason };
      }
    }
  }

  const workers = Array.from(
    { length: Math.min(concurrency, tasks.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```
**Retry with backoff** -- exponential backoff on 429/529, respecting `retry-after` headers:

```typescript
if (response.status === 429 || response.status === 529) {
  if (attempt === MAX_RETRIES) {
    throw new Error(`API error (${response.status}) after ${MAX_RETRIES} retries`);
  }
  const retryAfter = response.headers.get("retry-after");
  const backoff = retryAfter
    ? parseInt(retryAfter, 10) * 1000
    : INITIAL_BACKOFF_MS * Math.pow(2, attempt);
  await sleep(backoff);
  continue;
}
```
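For context, here is a self-contained version of the full loop around that snippet. The constants and the injectable `doFetch` parameter are my illustrative choices, not brunt's exact code:

```typescript
// Sketch of the complete retry loop. doFetch is injected so the logic is
// testable without hitting a real API.
const MAX_RETRIES = 5;
const INITIAL_BACKOFF_MS = 250;

const sleep = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

type FetchLike = (url: string) => Promise<Response>;

async function fetchWithRetry(doFetch: FetchLike, url: string): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const response = await doFetch(url);
    // Retry only on rate limiting (429) and overloaded (529) responses
    if (response.status === 429 || response.status === 529) {
      if (attempt === MAX_RETRIES) {
        throw new Error(`API error (${response.status}) after ${MAX_RETRIES} retries`);
      }
      const retryAfter = response.headers.get("retry-after");
      const backoff = retryAfter
        ? parseInt(retryAfter, 10) * 1000
        : INITIAL_BACKOFF_MS * Math.pow(2, attempt);
      await sleep(backoff);
      continue;
    }
    return response;
  }
}
```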
Bumped `max_tokens` to 16384 and used correct model IDs (`claude-sonnet-4-6`, not `claude-sonnet-4-6-20250514`).
After these fixes, the API runs actually worked -- and produced real results.
## Non-determinism is real
I ran Claude CLI twice on the same diff. First run: 10 findings, canary detected. Second run: 7 findings, canary missed. That is a 30% variance between identical runs.
Brunt plants a synthetic bug (a "canary") in the diff and checks if the model catches it. It is a clever reliability signal -- if the model misses the canary, results are flagged as potentially unreliable. But the canary itself is non-deterministic. Claude caught it once, missed it the next time.
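A minimal sketch of the canary idea -- brunt's real marker and matching logic differ, and every name here is invented for illustration:

```typescript
// Append a known-bad hunk to the diff, then check whether any finding
// references it. A missed canary means the whole run is suspect.
const CANARY_LINE = `+    let api_key = "sk-test-canary"; // hardcoded secret`;

function plantCanary(diff: string): string {
  return diff + "\n" + CANARY_LINE;
}

function canaryCaught(findings: { snippet: string }[]): boolean {
  return findings.some((f) => f.snippet.includes("sk-test-canary"));
}
```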
This means you cannot use a single AI review run as a pass/fail gate. The same code will get different reviews depending on when you run it. If you are building CI pipelines around AI review tools, you need to account for this -- run multiple times, take the union of findings, or use a consensus mechanism.
## CLI vs API: the real differences
Claude CLI and the Anthropic API can use the same underlying model. But for tool builders, the experience is very different:
|  | Claude CLI | Anthropic API |
|---|---|---|
| Rate limiting | Handled for you | Build it yourself |
| Retries | Built in | Build it yourself |
| Model aliases | Work seamlessly | Must use exact model ID |
| Default model | Your subscription tier | Must specify explicitly |
| Cost | Included in subscription | Pay per token (~$1/run) |
| Speed | Slower (subprocess per call) | Faster once infrastructure is right |
| Concurrency | Managed internally | You control it |
The CLI is the easier path. You get rate limiting, retries, model resolution, and a generous context window for free. The API gives you more control but you are responsible for everything -- and as I learned, getting "everything" right is harder than it looks.
## What the models actually found
Across all runs, here are the real issues that held up to manual review:
- Missing `.await` on async call -- scenario trigger future silently dropped
- CSV formula injection in data export -- unsanitized strings like `=HYPERLINK("http://attacker.com")` execute in Excel
- TOCTOU race condition -- duplicate scenario executions between `has_pending` check and `create`
- Negative LIMIT in SQL -- `ListExecutionsQuery.limit` accepts negative values, PostgreSQL treats negative LIMIT as unlimited
- Cursor pagination bug -- paginating on non-unique `created_at` silently skips records with identical timestamps
- Shutdown signal ignored -- 180-second initial sleep blocks graceful shutdown
- No schema validation on `UpdateAiConfig` -- accepts arbitrary JSON, downstream code indexes into it blindly
Every one of these is a real bug that a human reviewer might miss. The missing `.await` is particularly nasty -- it compiles, it runs, it just silently does nothing.
## Lessons for AI tool builders
1. Fail loudly. "I couldn't analyze this file" is infinitely better than "I found nothing" when the analysis didn't run. Silent failures in AI tooling are worse than crashes because they build false confidence.
2. More findings is not better. Optimize for signal-to-noise ratio, not raw count. 7 actionable findings beat 84 noisy ones.
3. Account for non-determinism. LLM outputs vary between runs. If your tool is a CI gate, run it multiple times or implement consensus logic.
4. Per-file analysis has blind spots. Analyzing files independently misses cross-file issues. Consider a hybrid: per-file scan for depth, plus one cross-file summary pass for systemic issues.
5. Test your tool with the actual provider you ship. I had four bugs that only manifested with the API provider because development and testing happened with the CLI. If you support multiple backends, test all of them.
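The hybrid from point 4 could be sketched roughly like this (`hybridReview`, `analyzeFile`, and `crossFilePass` are invented names, not brunt's API):

```typescript
// Per-file passes for depth, then one pass over a condensed summary of all
// per-file findings to catch cross-file issues -- like noticing that a
// "removed module" and its "broken dependents" were deleted in the same PR.
type Finding = { file: string; message: string };

async function hybridReview(
  files: string[],
  analyzeFile: (file: string) => Promise<Finding[]>,
  crossFilePass: (summary: string) => Promise<Finding[]>
): Promise<Finding[]> {
  const perFile: Finding[] = [];
  for (const file of files) {
    perFile.push(...(await analyzeFile(file)));
  }
  const summary = perFile.map((f) => `${f.file}: ${f.message}`).join("\n");
  const systemic = await crossFilePass(summary);
  return [...perFile, ...systemic];
}
```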
## The numbers, one more time
|  | Claude CLI | API Sonnet | API Opus |
|---|---|---|---|
| Findings | 7 | 84 | 44 |
| Real bugs | 7 | ~15 | ~12 |
| False positives | 0 | ~69 | ~32 |
| Signal ratio | 100% | ~18% | ~27% |
| Cost | Subscription | $1.11 | $1.67 |
| Time | 1m 47s | 7m 20s | 8m |
The CLI run was the most useful: fastest, cheapest, highest signal. But it analyzed fewer files due to subprocess limits. The API runs were thorough but noisy -- they need better filtering to be practical.
The ideal setup might be: CLI for fast feedback in development, API with deduplication and filtering for thorough CI reviews.
This is just a demo and should not be used in production systems. Personally, I use something similar to test my own code, but with more features; this tool is more of a general harness for benchmarking models. Maybe it gives someone an idea or inspiration to try something similar in their stack. <--- This section was actually written by me.
No models were harmed during the test.
P.S. brunt can also generate failing tests and fixes for the issues that were found, but that's a story for another time. Thanks for reading!
The codebase reviewed is a real Rust/Axum backend with multi-tenant isolation, async patterns, and PostgreSQL. All findings shown are from actual brunt runs, not curated or cherry-picked.
