Vicente Junior

Posted on May 22

I gave Gemini 3.5 Flash a CVE-fix PR to review. It found another bug in the same file.

#devchallenge #googleiochallenge #ai #gemini

This is a submission for the Google I/O Writing Challenge

Across 3 real production PRs, I asked Gemini 3.5 Flash to do a code review. The model, announced this week at Google I/O 2026, caught 3 legitimate bugs and hallucinated 0, in roughly 4 seconds per PR. The middle PR was the patch for a known security vulnerability in Fastify (CVE-2026-25223, a validation bypass). The model flagged a second, unrelated regex bug in the exact file being patched.

Here's what I learned building a code-review agent in about 2 hours with Google's new model.

Why I tested this

At the I/O keynote, Sundar Pichai pitched Gemini 3.5 Flash as "frontier intelligence combined with action" — optimized for agentic coding and long-horizon tasks. Code review is the perfect stress test: it requires reasoning about code semantics, cross-file context, and judgment about what matters.

Reading another 50 hype threads on X felt pointless. So I built the smallest possible agent that could actually use the model on real code, ran it on three concrete PRs, and counted what it got right, what it made up, and what it missed.

The architecture

Three stages, ~80 lines of TypeScript, runs on Node 20+:

INPUT                  PROCESSING                       OUTPUT
─────                  ──────────                       ──────
owner/repo#N    →      1. fetch the .diff URL      →    stdout (colored summary)
                       2. truncate if > 150k chars      out/{slug}.json
                       3. build prompt + schema         out/{slug}.md
                       4. Gemini 3.5 Flash call
                       5. Zod-parse the response

No GitHub token (public PRs use the unauthenticated .diff URL). No octokit. No frameworks. Just the new @google/genai SDK with structured output.
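Steps 1 and 2 of the diagram can be sketched roughly as follows. This is a minimal sketch, not the repo's actual code: the function names (parsePrRef, diffUrl, truncateDiff, fetchDiff) and the truncation marker are hypothetical, and only the 150k-character limit and the unauthenticated .diff URL come from the description above. It assumes Node's built-in fetch.

```typescript
// Hypothetical helpers for pipeline steps 1 and 2.
// Only the 150k limit and the public .diff URL come from the article.
const MAX_DIFF_CHARS = 150_000;

interface PrRef {
  owner: string;
  repo: string;
  number: number;
}

// Parse "owner/repo#N" into its parts, or throw on malformed input.
function parsePrRef(ref: string): PrRef {
  const match = /^([\w.-]+)\/([\w.-]+)#(\d+)$/.exec(ref);
  if (!match) throw new Error(`Expected owner/repo#N, got "${ref}"`);
  return { owner: match[1]!, repo: match[2]!, number: Number(match[3]) };
}

// Public pull requests expose their diff at this URL without auth.
function diffUrl({ owner, repo, number }: PrRef): string {
  return `https://github.com/${owner}/${repo}/pull/${number}.diff`;
}

// Cut oversized diffs and leave a marker (the marker text is an assumption).
function truncateDiff(diff: string, max: number = MAX_DIFF_CHARS): string {
  return diff.length > max ? `${diff.slice(0, max)}\n[diff truncated]` : diff;
}

// Fetch and truncate in one call; uses the global fetch in Node 20+.
async function fetchDiff(ref: string): Promise<string> {
  const res = await fetch(diffUrl(parsePrRef(ref)));
  if (!res.ok) throw new Error(`GitHub returned ${res.status} for ${ref}`);
  return truncateDiff(await res.text());
}
```

Calling fetchDiff with a reference in the owner/repo#N form would return a string ready to hand to the review step. Private repositories would need authentication, which this sketch does not attempt.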

The core

The heart of the pipeline is a single review() function — pass it a diff, get back a typed array of issues:

import { GoogleGenAI } from "@google/genai";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY! });

const IssueSchema = z.object({
  file: z.string(),
  line: z.number().nullable(),
  severity: z.enum(["low", "medium", "high", "critical"]),
  category: z.enum(["bug", "security", "performance", "style", "logic", "maintainability"]),
  message: z.string(),
  suggestion: z.string().nullable(),
});

const ReviewSchema = z.object({
  summary: z.string(),
  issues: z.array(IssueSchema),
});

const SYSTEM_PROMPT = `You are a senior code reviewer. Analyze the unified git
diff below and produce a JSON review.

Rules:
- Flag REAL issues only — no nitpicks, no style preferences.
- Prefer fewer, higher-quality issues over volume.
- Each "message" must explain WHY it matters (impact, not just observation).
- If you cannot see enough context to be sure, lower the severity.

Return the full review as JSON matching the provided schema.`;

async function review(diff: string) {
  const res = await ai.models.generateContent({
    model: "gemini-3.5-flash",
    contents: `${SYSTEM_PROMPT}\n\n--- DIFF ---\n${diff}`,
    config: {
      responseMimeType: "application/json",
      responseJsonSchema: zodToJsonSchema(ReviewSchema),
    },
  });
  return ReviewSchema.parse(JSON.parse(res.text ?? "{}"));
}

A few details worth flagging:

Model string: "gemini-3.5-flash". GA since May 19, 2026.
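Structured output: use responseJsonSchema (not the older responseSchema). The API constrains the response to the JSON Schema derived from the Zod definitions and returns conformant JSON, so there is no regex-parsing of the response and no try/catch for malformed output. The ReviewSchema.parse call at the end is the final check.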
No temperature tuning: Google explicitly recommends not setting temperature, top_p, or top_k on the 3.5 family — the model handles sampling internally.
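Once the response has passed the schema, the output stage has little left to do. The sketch below shows one way the typed review could be turned into the text written to out/{slug}.md. The interfaces mirror the Zod schema above, but toMarkdown, SEVERITY_ORDER, and the exact line layout are hypothetical and not taken from the repo.

```typescript
// Plain TypeScript mirrors of the Zod schema fields shown above.
type Severity = "low" | "medium" | "high" | "critical";
type Category = "bug" | "security" | "performance" | "style" | "logic" | "maintainability";

interface Issue {
  file: string;
  line: number | null;
  severity: Severity;
  category: Category;
  message: string;
  suggestion: string | null;
}

interface Review {
  summary: string;
  issues: Issue[];
}

// Most severe first (hypothetical ordering choice).
const SEVERITY_ORDER: Severity[] = ["critical", "high", "medium", "low"];

// Render a review as plain lines: summary, then one entry per issue.
function toMarkdown(review: Review): string {
  const sorted = [...review.issues].sort(
    (a, b) => SEVERITY_ORDER.indexOf(a.severity) - SEVERITY_ORDER.indexOf(b.severity),
  );
  const lines: string[] = [review.summary, ""];
  for (const issue of sorted) {
    const location = issue.line === null ? issue.file : `${issue.file}:${issue.line}`;
    lines.push(`${issue.severity.toUpperCase()} · ${issue.category} (${location}): ${issue.message}`);
    if (issue.suggestion !== null) lines.push(`  Suggestion: ${issue.suggestion}`);
  }
  return lines.join("\n");
}
```

Sorting on a copy keeps the original issues array untouched, so the same object can still be serialized as-is to the JSON output.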

Full repo at the end. Now the interesting part.

The three PRs

I picked PRs with very different shapes to see how the model behaved across contexts.

PR	Type	Lines changed	Why picked
express#6190	Small refactor	~10	Baseline: clean code, no real issues
fastify#6414	Security-sensitive	+398 / −147	The patch for CVE-2026-25223
express#6100	Small refactor	~15	Different file, different style

Final scorecard

Legend: + verified hits, − hallucinations
PR #1 (express#6190):    +0  −0   Model agreed: no issues
PR #2 (fastify#6414):    +3  −0   3 hits, 0 hallucinations
PR #3 (express#6100):    +0  −0   Model agreed: no issues
──────────────────────────────────────────────────────────────
Total:                   +3  −0   Zero false positives.

What it caught — the headline

PR #2 is the one that mattered. Fastify pull #6414 rewrote the entire content-type parser to fix a security flaw (CVE-2026-25223) where attackers could bypass body validation by appending a tab character to Content-Type (e.g. application/json\tx). The fix introduced a new ContentType class and replaced the old loose string-matching logic.

This is exactly the kind of high-stakes, security-sensitive refactor where an automated reviewer either earns its place or doesn't.

The model flagged three issues. Here's each one, verified against the actual code.

Hit 1: inconsistent variable use in `existingParser`

MEDIUM · logic — The existingParser method checks contentType === "application/json" and this.customParsers.has(contentType) using the original contentType string instead of the newly calculated, normalized ct variable.

Looking at the new code in lib/content-type-parser.js:

ContentTypeParser.prototype.existingParser = function (contentType) {
  if (typeof contentType === 'string') {
    const ct = new ContentType(contentType).toString()
    if (contentType === 'application/json' && this.customParsers.has(contentType)) {
      return this.customParsers.get(ct).fn !== this[kDefaultJsonParse]
    }
    if (contentType === 'text/plain' && this.customParsers.has(contentType)) {
      return this.customParsers.get(ct).fn !== defaultPlainTextParser
    }
  }
  return this.hasParser(contentType)
}

The model is right. ct is the normalized version, but the conditional guards still test the raw contentType. Since customParsers only holds normalized keys (see line 85: this.customParsers.set(normalizedContentType, parser)), any header with a different case or trailing parameters silently skips the fast path. Subtle, easy to miss in review.
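The mismatch is easier to see in isolation. The sketch below is a simplified model, not Fastify's code: normalize is a hypothetical stand-in for the real ContentType class (which does more than lowercase and trim), and the parser map holds a placeholder string instead of a function.

```typescript
// Hypothetical stand-in for Fastify's ContentType normalization.
function normalize(contentType: string): string {
  return contentType.trim().toLowerCase();
}

// The map is keyed by normalized content types, as in the PR.
const customParsers = new Map<string, { fn: string }>();
customParsers.set(normalize("application/json"), { fn: "customJsonParser" });

// Guard shaped like the one in the PR: it tests the raw string.
function guardOnRaw(contentType: string): boolean {
  return contentType === "application/json" && customParsers.has(contentType);
}

// Guard that tests the normalized value, matching how the map is keyed.
function guardOnNormalized(contentType: string): boolean {
  const ct = normalize(contentType);
  return ct === "application/json" && customParsers.has(ct);
}
```

With the input "Application/JSON", guardOnRaw returns false while guardOnNormalized returns true. Both return true for the already-normalized "application/json", which is why the difference is easy to overlook in ordinary tests.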

Hit 2: a regex missing its end anchor

HIGH · security — The subtypeNameReg regular expression is missing a trailing $ anchor. Consequently, any string starting with a valid subtype will match successfully.

This one is the headline. In the brand new file lib/content-type.js, the patch defines two parallel regexes:

const typeNameReg     = /^[\w!#$%&'*+.^`|~-]+$/      // has $
const subtypeNameReg  = /^[\w!#$%&'*+.^`|~-]+\s*/    // no $

The subtype regex anchors at the start but not at the end. Inputs like application/json/extra pass the validation gate where they shouldn't. In a PR whose entire purpose is fixing a validation-bypass CVE, a senior reviewer would put this in red on the first pass. The model put it in HIGH on the first pass.

I am not claiming this is itself exploitable at the same severity as the original CVE — the downstream parsers may not be reachable in a way that materializes the bug. But the pattern is exactly the class of issue that did materialize as CVE-2026-25223. Pattern-recognition of dangerous shapes is half of what code review is.
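A quick check makes the anchoring difference concrete. The first pattern is the one quoted from the PR. The anchored variant is a hypothetical fix for illustration and not necessarily what Fastify shipped. The sketch also assumes the subtype string handed to the regex is everything after the first slash, which depends on how the surrounding parser splits the header.

```typescript
// Pattern quoted from the PR: anchored at the start only.
const subtypeNameReg = /^[\w!#$%&'*+.^`|~-]+\s*/;

// Hypothetical anchored variant (assumption, not the project's fix).
const anchoredSubtypeReg = /^[\w!#$%&'*+.^`|~-]+\s*$/;

// Report how each pattern treats a candidate subtype string.
function checkSubtype(subtype: string): { unanchored: boolean; anchored: boolean } {
  return {
    unanchored: subtypeNameReg.test(subtype),
    anchored: anchoredSubtypeReg.test(subtype),
  };
}

for (const candidate of ["json", "json/extra", "json;evil"]) {
  console.log(candidate, checkSubtype(candidate));
}
```

For "json" both patterns accept. For "json/extra" and "json;evil" the unanchored pattern still accepts, because the valid prefix "json" is enough to satisfy it, while the anchored variant rejects both.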

Hit 3: stateful global regex

MEDIUM · bug — The keyValuePairsReg regex is defined globally with the /g flag. Because of this, it is stateful and relies on lastIndex. If parsing throws an exception or future modifications exit the loop early, lastIndex will not reset to 0.

Confirmed at the top of lib/content-type.js:

const keyValuePairsReg = /([\w!#$%&'*+.^`|~-]+)=([^;]*)/gm

Used inside a class constructor with .exec() in a loop. In healthy execution, lastIndex resets to 0 when exec returns null. But the failure mode — exception inside the loop body, or any future break — silently corrupts every subsequent parse for the lifetime of the process. The model's suggested fix (use matchAll instead) is exactly the JavaScript-idiomatic answer.

This is a latent footgun, not a live bug. MEDIUM is arguably too high a severity for it. But it's a real thing the model saw.
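The failure mode can be reproduced in a few lines. This is an illustrative sketch, not Fastify's parsing loop: the regex is the one quoted above, while firstParamShared and firstParamIsolated are hypothetical functions that stand in for "a loop that exits early" and "a matchAll-based alternative".

```typescript
// Regex quoted from the PR: module-level and global, so it carries lastIndex.
const keyValuePairsReg = /([\w!#$%&'*+.^`|~-]+)=([^;]*)/gm;

// Early exit after the first match: lastIndex is left pointing mid-string.
function firstParamShared(input: string): string | null {
  const match = keyValuePairsReg.exec(input);
  return match ? match[1] ?? null : null;
}

// matchAll on a fresh regex per call: no state survives between calls.
// (matchAll copies lastIndex from the regex it is given, so a fresh
// instance is used here instead of the shared one.)
function firstParamIsolated(input: string): string | null {
  const fresh = new RegExp(keyValuePairsReg.source, "g");
  for (const match of input.matchAll(fresh)) {
    return match[1] ?? null;
  }
  return null;
}
```

Calling firstParamShared("charset=utf-8; boundary=x") returns "charset" and leaves lastIndex at 13. A following firstParamShared("a=1") then starts searching past the end of the new string and returns null, even though the input clearly contains a parameter. The isolated version returns "a" for the same input.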

What it didn't catch — the honest part

Two failure modes worth being honest about.

Cross-file context. The model only sees the diff. It can't tell whether a function called by the changed code is safe, whether a removed branch was load-bearing somewhere else, or whether tests actually cover the new behavior. For PR #6414 in particular, the upstream callers of the new ContentType class are not in the diff, and the model never reasoned about them.

Severity calibration is rough. The regex-without-anchor is HIGH. The stateful /g is MEDIUM. In practice, the gap between them should be wider: the regex one is a clear pattern with security relevance, while the global-regex one is a latent footgun unlikely to fire and probably belongs at LOW. Junior-reviewer instincts.

I also can't conclusively measure what the model missed without reviewing every comment thread on the PR by hand. The merged commit went through multiple rounds of feedback (commits like "address feedback", "refactor algorithm", "appease coverage"), so reviewers did catch things, but how many of those are in-diff issues a tool could have seen versus broader design decisions — I'd need another afternoon to know.

What I'd actually use this for

Three takeaways after running this on real code:

It earns a place as a first-layer pre-review. Specifically: PRs that touch parsers, validators, or anything that consumes external input. The cost is around $0.003 per PR. The cost of not running it is shipping a regex without an anchor on a security-sensitive code path.
It does not replace human reviewers. It cannot reason about distributed state, concurrency, transactions, or anything that requires understanding multiple files in concert.
Hallucination rate was zero in this sample — but the sample is tiny. The literature on similar models suggests false positives in the 15-25% range on real-world PRs. Three out of three being valid is great but is not a benchmark.

The 80 lines of TypeScript that produced this run are on GitHub. Two things that are non-obvious about the setup:

@google/genai v2 uses responseJsonSchema, not responseSchema. Easy to get wrong if you're translating tutorial code written for an older Gemini SDK.
Public GitHub PRs expose a .diff endpoint that requires no auth. You don't need octokit for an MVP.

If you try it on PRs with shapes I didn't test — concurrency-heavy, multi-file, generated code — tell me what you find. The interesting question is where the model breaks, not where it works.

Built and tested in May 2026 with Gemini 3.5 Flash, GA three days before publication.