We recently tested three free models on Kilo's Code Reviews: Grok Code Fast 1, MiniMax M2, and Devstral 2. All three caught critical security vulnerabilities like SQL injection and path traversal. We wanted to see how state-of-the-art frontier models compare on the same test, so we ran GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro through the same pull request.
TL;DR: GPT-5.2 found the most issues (13) including a security bug no other model caught. Claude Opus 4.5 was fastest at 1 minute with perfect security detection. All three frontier models caught 100% of SQL injection vulnerabilities.
Testing Methodology
We used the same test PR from our previous evaluation. The base project is a TypeScript task management API built with Hono, Prisma, and SQLite. The feature branch adds user search, bulk operations, and CSV export functionality across 560 lines in four new files.
The PR contains 18 intentional issues across six categories.
Each model reviewed the PR with the Balanced review style and all focus areas enabled. We set the maximum review time to 10 minutes, though none of the models needed more than 3 minutes.
Results Overview
"Issues Found" includes both planted issues and additional findings the models caught on their own. Detection rates shown later in the post are based only on the 18 planted issues.
All three models correctly identified both SQL injection vulnerabilities, the path traversal risk, and the CSV formula injection. They also caught the loop bounds error that would cause undefined array access.
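For context on what the models were looking at, the planted injections follow the classic string-interpolation pattern. Here is a minimal sketch of that pattern in a Prisma-backed handler; the query text and column names are our illustration, not the exact PR code:

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Vulnerable: the search term is interpolated straight into raw SQL,
// so input like "%' OR '1'='1" changes the query's meaning.
async function searchTasksUnsafe(term: string) {
  return prisma.$queryRawUnsafe(
    `SELECT * FROM Task WHERE title LIKE '%${term}%'`
  );
}

// Safer: the tagged-template form binds the value as a parameter.
async function searchTasksSafe(term: string) {
  return prisma.$queryRaw`SELECT * FROM Task WHERE title LIKE ${`%${term}%`}`;
}
```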
None of the models produced false positives. Every issue flagged was a real problem in the code.
What Each Model Did Well
GPT-5.2
GPT-5.2 completed its review in 3 minutes and found the most issues (13 total). It was the only model to catch two issues that the others missed entirely.
Authorization bypass in task duplication. The API has a bulk duplicate endpoint that copies tasks. It accepts an optional parameter specifying who should own the copied tasks, but any user can set this to any other user's ID and create tasks in their account. GPT-5.2 flagged this as a critical authorization bypass.
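To make the bug concrete, here is a sketch of the vulnerable shape; the route, parameter names, and fields are our assumptions, not the PR's actual code:

```typescript
import { Hono } from "hono";
import { PrismaClient } from "@prisma/client";

const app = new Hono<{ Variables: { userId: string } }>();
const prisma = new PrismaClient();

app.post("/tasks/bulk-duplicate", async (c) => {
  const currentUserId = c.get("userId"); // set by auth middleware
  const { taskIds, targetUserId } = await c.req.json<{
    taskIds: string[];
    targetUserId?: string;
  }>();

  // Bug: the caller-supplied targetUserId is trusted as-is, so any
  // authenticated user can create copies in another user's account.
  const ownerId = targetUserId ?? currentUserId;

  const originals = await prisma.task.findMany({ where: { id: { in: taskIds } } });
  await Promise.all(
    originals.map((t) => prisma.task.create({ data: { title: t.title, ownerId } }))
  );
  return c.json({ duplicated: originals.length });
});
```

The fix is to derive ownership from the authenticated session (or verify the caller is allowed to act on the target account) rather than trusting the request body.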
Neither Claude Opus 4.5 nor Gemini 3 Pro caught this vulnerability.
The synchronous file write. The export endpoint uses fs.writeFileSync() which blocks the Node.js event loop. For large exports, this freezes all other request handling. GPT-5.2 was the only model to flag this performance issue.
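The difference is easy to illustrate; paths and function names below are ours. The async form keeps the event loop free while the OS does the write:

```typescript
import { writeFileSync } from "node:fs";
import { writeFile } from "node:fs/promises";

// Blocks the event loop: every other request waits until this returns.
function exportTasksSync(csv: string) {
  writeFileSync("/tmp/tasks-export.csv", csv);
}

// Non-blocking: the handler yields while the file is written.
async function exportTasksAsync(csv: string) {
  await writeFile("/tmp/tasks-export.csv", csv);
}
```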
GPT-5.2 also identified that the task search endpoint returns all tasks in the system instead of just the current user's tasks. Any authenticated user could search and view tasks belonging to other users.
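However the query is built, that kind of leak usually comes down to a missing owner filter. A sketch under that assumption:

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Leaky: matches every user's tasks, not just the caller's.
async function searchAllTasks(query: string) {
  return prisma.task.findMany({ where: { title: { contains: query } } });
}

// Scoped: the same search constrained to the authenticated user.
async function searchOwnTasks(query: string, currentUserId: string) {
  return prisma.task.findMany({
    where: { title: { contains: query }, ownerId: currentUserId },
  });
}
```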
Output format: GPT-5.2 posted inline comments on specific code lines with severity labels and a summary table. Each finding included impact assessment and recommended fixes with code examples.
Note that the code review summary shows 9 inline comments, but 4 additional issues were identified in the review description, bringing the total to 13.
Claude Opus 4.5
Claude Opus 4.5 completed its review in 1 minute, the fastest of the three frontier models. It found 8 issues total (6 critical, 2 lower severity).
The pagination offset bug. The search endpoint calculates offset as page * limit. For page 1 with limit 50, this skips the first 50 results and returns results 51-100 instead. Both GPT-5.2 and Claude Opus 4.5 caught this. Gemini 3 Pro did not.
Claude Opus 4.5's inline comment included a suggested change block with the corrected offset calculation.
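A minimal sketch of that fix, assuming a 1-based page number coming off the query string:

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

async function searchPage(page: number, limit: number) {
  // Buggy offset: page 1 with limit 50 skips rows 0-49 and returns rows 50-99.
  // const offset = page * limit;

  // Corrected: page 1 starts at row 0, page 2 at row 50, and so on.
  const offset = (page - 1) * limit;

  return prisma.task.findMany({ skip: offset, take: limit });
}
```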
Claude Opus 4.5 was the only model to flag the inconsistent naming convention. The bulk operations file uses updated_count and failed_count (snake_case) while the rest of the codebase uses camelCase. This is a minor style issue, but it indicates Claude Opus 4.5 was analyzing the code against codebase patterns, not just looking for security bugs.
Output format: Claude Opus 4.5 used inline comments with "Suggested change" diff blocks showing before/after code. The summary grouped findings by severity with collapsible details. The format is concise and immediately actionable.
Note that the code review summary shows 6 inline comments, but 2 additional issues were identified in the review description, bringing the total to 8.
Gemini 3 Pro
Gemini 3 Pro completed its review in 2 minutes with 9 issues found. It caught something important that Claude Opus 4.5 missed.
The N+1 query pattern. The task search iterates over results and executes a separate database query for each task to fetch assignee information. With 100 tasks, this runs 101 database queries instead of 1. Gemini 3 Pro flagged this.
GPT-5.2 also caught the N+1 pattern. Claude Opus 4.5 did not.
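A sketch of the N+1 shape and the usual Prisma fix, assuming the task model has an `assignee` relation and an `assigneeId` foreign key:

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// N+1: one query for the task list, then one more query per task.
async function searchWithNPlusOne(ownerId: string) {
  const tasks = await prisma.task.findMany({ where: { ownerId } });
  return Promise.all(
    tasks.map(async (task) => ({
      ...task,
      assignee: await prisma.user.findUnique({ where: { id: task.assigneeId } }),
    }))
  );
}

// Single round trip: let Prisma join the relation instead.
async function searchWithInclude(ownerId: string) {
  return prisma.task.findMany({
    where: { ownerId },
    include: { assignee: true },
  });
}
```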
Gemini 3 Pro found the swallowed error in the bulk update loop. The try-catch block catches exceptions but does nothing with them, making debugging impossible if updates fail.
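The loop below is our reconstruction rather than the PR's exact code, along with one way to surface the failures instead of discarding them:

```typescript
import { Prisma, PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

async function bulkUpdate(taskIds: string[], updates: Prisma.TaskUpdateInput) {
  const failedIds: string[] = [];
  for (const id of taskIds) {
    try {
      await prisma.task.update({ where: { id }, data: updates });
    } catch (err) {
      // The PR's version has an empty catch here, so failures vanish silently.
      // Recording them lets the response and the logs reflect what happened.
      console.error(`Failed to update task ${id}`, err);
      failedIds.push(id);
    }
  }
  return { updatedCount: taskIds.length - failedIds.length, failedIds };
}
```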
It also identified a separate CSV issue. When exporting tasks, the owner's name and email are written directly to the file without proper escaping. If someone's name contains a comma (like "Smith, John"), it breaks the CSV column alignment and corrupts the export.
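A small sketch of the missing escaping, using RFC 4180-style quoting (the helper is ours, not from the PR):

```typescript
// Quote fields that contain a comma, quote, or newline, doubling embedded quotes.
function escapeCsvField(value: string): string {
  if (/[",\n]/.test(value)) {
    return `"${value.replace(/"/g, '""')}"`;
  }
  return value;
}

// "Smith, John" stays in a single column instead of spilling into the next one.
const row = ["Write report", escapeCsvField("Smith, John"), escapeCsvField("smith@example.com")].join(",");
console.log(row); // Write report,"Smith, John",smith@example.com
```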
What Gemini 3 Pro missed: The missing admin authorization on the /export/all endpoint. This endpoint lets any authenticated user export any other user's tasks. The code comment even says "admin only" but there's no role check. Both GPT-5.2 and Claude Opus 4.5 caught this. Gemini 3 Pro did not.
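The missing piece is a role check before the export runs. A sketch assuming a `role` field on the authenticated user and a simple CSV helper:

```typescript
import { Hono } from "hono";
import { PrismaClient } from "@prisma/client";

type AuthedUser = { id: string; role: "admin" | "member" };

const app = new Hono<{ Variables: { user: AuthedUser } }>();
const prisma = new PrismaClient();

const toCsv = (rows: object[]) =>
  rows.map((r) => Object.values(r).join(",")).join("\n");

// "admin only" according to the comment in the PR, but nothing enforced it.
app.get("/export/all", async (c) => {
  const user = c.get("user");
  if (user.role !== "admin") {
    return c.json({ error: "Forbidden" }, 403);
  }

  const tasks = await prisma.task.findMany();
  return c.text(toCsv(tasks), 200, { "Content-Type": "text/csv" });
});
```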
Output format: Gemini 3 Pro posted inline comments with explanations and code suggestions. The summary separated issues by severity (CRITICAL, WARNING) with expandable file-by-file details.
Detection Rates by Category
Security detection was strong across all three models. GPT-5.2 and Claude Opus 4.5 achieved 100% on planted security issues. Gemini 3 Pro missed the admin authorization check.
Performance detection varied widely. GPT-5.2 caught two of three performance issues (N+1 queries and sync file writes). Gemini 3 Pro caught one (N+1 queries). Claude Opus 4.5 caught none, focusing instead on security and correctness bugs.
Security Issue Breakdown
For catching SQL injection and path traversal vulnerabilities, all three models performed equally. The difference appeared in authorization logic, where Gemini 3 Pro missed the admin check.
Additional Findings Beyond Planted Issues
Each model also identified issues we hadn't explicitly planted.
GPT-5.2 was the only model to catch the task duplication bypass. This is a real security vulnerability that would allow users to create data in other users' accounts.
What All Three Missed
A handful of planted issues slipped past every model.
The race condition is the biggest miss. The bulk assign endpoint first checks if the user owns a task, then updates it in a separate database call. If two requests hit the server at the same time, or if a task gets deleted between the check and the update, the data can become corrupted. Detecting this requires understanding that the two operations can interleave with other requests.
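For reference, the check-then-act shape looks roughly like this, and the usual fix is to fold the ownership check into the update itself so the two steps can't interleave (field names are illustrative):

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Racy: the task can be deleted or reassigned between the check and the update.
async function assignRacy(taskId: string, assigneeId: string, currentUserId: string) {
  const task = await prisma.task.findUnique({ where: { id: taskId } });
  if (task?.ownerId === currentUserId) {
    await prisma.task.update({ where: { id: taskId }, data: { assigneeId } });
  }
}

// Atomic: one conditional update; count === 0 means not found or not owned.
async function assignAtomic(taskId: string, assigneeId: string, currentUserId: string) {
  const result = await prisma.task.updateMany({
    where: { id: taskId, ownerId: currentUserId },
    data: { assigneeId },
  });
  return result.count === 1;
}
```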
How Do Frontier Models Compare to Free Models?
We ran the same test on three free models available in Kilo: Grok Code Fast 1, MiniMax M2, and Devstral 2. Here's how the results compare:
Claude Opus 4.5 tied with Grok Code Fast 1 on detection rate (44%) while being faster (1 min vs 2 min). Gemini 3 Pro's detection rate (39%) was lower than Grok's despite being a frontier model.
Security Detection: Free vs Frontier
For security issues specifically, Grok Code Fast 1 (free) matched the best frontier models. It caught all five planted security vulnerabilities: both SQL injections, the missing admin check, path traversal, and CSV injection.
Where Frontier Models Add Value
The frontier models showed advantages in two areas:
Performance pattern detection. GPT-5.2 and Gemini 3 Pro both caught the N+1 query pattern. None of the free models detected any performance issues.
Deeper authorization analysis. GPT-5.2 found the task duplication bypass that no other model (free or frontier) caught. This required understanding that the parameter allows users to create tasks in other users' accounts, not just that the parameter exists.
Where Free Models Hold Their Own
For the core job of catching SQL injection, path traversal, missing authorization, and obvious bugs, Grok Code Fast 1 performed at the same level as two of the three frontier models. The gap between free and frontier was smaller than we expected.
Verdict
GPT-5.2 found the most issues. It caught a security vulnerability (the task duplication bypass) and a performance issue (the synchronous file write) that no other model found. For pre-release audits or security-sensitive changes, it provides the widest coverage. The 3-minute review time is acceptable when thoroughness matters.
Claude Opus 4.5 offers the best balance of speed and security detection. A perfect security score in 1 minute makes it practical for high-velocity teams that want to run reviews on every commit without blocking developers.
Gemini 3 Pro caught performance patterns (N+1 queries) that Claude Opus 4.5 missed, but it also missed a critical authorization check that even the free models caught. The gap between its detection rate (39%) and Grok Code Fast 1's (44%) was unexpected for a frontier model. Consider pairing with manual review for authorization-heavy code.
Grok Code Fast 1 (free) matched Claude Opus 4.5's detection rate while catching all five planted security vulnerabilities. For teams that want security screening without cost, it delivers the same core value as the paid models.
The most interesting finding was how well the free models held up. Grok Code Fast 1 matched or beat two of the three frontier models on overall detection while catching 100% of security issues. For catching SQL injection, path traversal, and missing authorization, smaller models have become competitive with frontier options. The free tier catches the issues that matter most at the same rate as the expensive models.
For teams that need the widest coverage, GPT-5.2 is the best option. For everyone else, the free models do the job.
Testing performed using Code Reviews, a feature of Kilo, the free, open-source, end-to-end agentic engineering platform with IDE extensions for VS Code and JetBrains. Join the over 1M Kilo Coders already building at Kilo Speed.














