Thoughts after playing around with GitHub's `/security-review` command

D Cairo — Fri, 29 May 2026 11:41:44 +0000

I was setting up Copilot CLI on my work account last week and came across an experimental /security-review command. I didn't see any announcement for it, so I was curious how it worked and poked around a little.

The short version of what it does: you finish your coding session, it reads the diff, and it produces a list of likely vulnerabilities. Useful on paper. The thing I couldn't tell from poking at it manually was how much the underlying model matters. Does picking Opus over Haiku actually buy you better security findings, or are you just paying for the same answer in a fancier wrapper?

So I built a small harness around OWASP Juice Shop to find out. This was a small-scale experiment funded by my work subscription. This post is what fell out of that.

The setup

I needed a target with a known answer key, and Juice Shop is an app I've poked at before. It's a deliberately vulnerable Node.js demo app that ships with a catalogue of known issues. I took the original app and created 10 changes, each of which reintroduces one or more catalogued vulnerabilities. There were 14 vulnerabilities in total across the 10 changes, covering these categories:

SQL injection
Weak crypto
SSRF
Path traversal
XXE
Insecure deserialization
Broken access control
Hardcoded credentials
Missing rate limiting
Open redirect

The ground truth, with file, CWE, and one-line explanation, lives in a catalogue.md. The AI reviewer never sees this file during the /security-review process.

For each change, I run /security-review non-interactively and capture the output.

The --no-ask-user flag matters. Without it the command seems to pause for input after its initial pass and never terminates in a script. With it, you get a clean JSON stream and a final result event that includes the credits the run consumed.
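A minimal sketch of how a script might pick the final result event out of such a stream. The event shape is an assumption made purely for illustration: one JSON object per line, a `type` field, and a `credits` field on the result event. The real field names may differ, so check them against actual output before relying on this.

```python
import json

def final_result(stream_lines):
    """Return the last event whose (assumed) "type" is "result", or None."""
    result = None
    for line in stream_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not valid JSON
        if isinstance(event, dict) and event.get("type") == "result":
            result = event
    return result

# Made-up sample stream for illustration, NOT real tool output.
sample = [
    '{"type": "message", "text": "reviewing diff"}',
    '',
    '{"type": "result", "credits": 0.33}',
]
print(final_result(sample))
```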

Then a separate, fixed LLM grader takes the catalogue and the reviewer's output and produces three counts per change: detected, missed, false positives. The grader sees the catalogue. The reviewer doesn't. The grading model stays constant across all runs so any grader bias is a constant offset. I decided to go big on this one and used Opus 4.6.
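To make the three counts concrete, here is a rough sketch of how per-change grades could roll up into one detection rate for a run. The `ChangeGrade` type and the sample grades are hypothetical; the actual harness may store things differently.

```python
from dataclasses import dataclass

@dataclass
class ChangeGrade:
    """Grader output for one change (hypothetical structure)."""
    detected: int
    missed: int
    false_positives: int

def detection_rate(grades):
    """Share of catalogued vulnerabilities that the reviewer found."""
    detected = sum(g.detected for g in grades)
    total = detected + sum(g.missed for g in grades)
    return detected / total if total else 0.0

# Hypothetical grades for three changes, not data from the experiment.
grades = [ChangeGrade(2, 0, 1), ChangeGrade(1, 0, 0), ChangeGrade(0, 1, 0)]
print(f"{detection_rate(grades):.0%}")  # 3 of 4 catalogued vulns -> 75%
```

False positives are tracked separately and do not enter the detection rate, which is why a model can score well on detection and still be noisy.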

I ran this across 5 models × 4 independent runs × 10 changes = 200 reviews. It's a small sample, but tokens are expensive these days and I was funding this out of curiosity, not a budget. I think it's enough to see the broad shape and maybe make plans for future work.

Models tested: Claude Haiku 4.5, Sonnet 4.6, Opus 4.6, GPT-5.4, GPT-5.5. These are all the ones currently selectable for Copilot CLI.
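As a quick sanity check on the size of the experiment, the job matrix can be enumerated like this. It is only a sketch: the model identifiers are the ones used in the results below, and numbering runs and changes from 1 is an arbitrary choice.

```python
from itertools import product

MODELS = [
    "claude-haiku-4.5",
    "claude-sonnet-4.6",
    "claude-opus-4.6",
    "gpt-5.4",
    "gpt-5.5",
]
RUNS = range(1, 5)      # 4 independent runs per model
CHANGES = range(1, 11)  # 10 changes per run

# One review job per (model, run, change) combination.
jobs = list(product(MODELS, RUNS, CHANGES))
print(len(jobs))  # 200
```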

What came out

Mean detection rate across 4 runs (each run is one sweep of all 10 changes), with range and standard deviation, mean false positives per run, and credits per run:

claude-opus-4.6     93%   (93–93,  σ 0.0)   2.5 FPs   30.0 credits/run
gpt-5.5             91%   (86–93,  σ 3.6)   0.8 FPs   75.0 credits/run
claude-sonnet-4.6   86%   (79–93,  σ 8.2)   0.8 FPs   10.0 credits/run
claude-haiku-4.5    86%   (79–93,  σ 5.8)   1.2 FPs    3.3 credits/run
gpt-5.4             77%   (71–79,  σ 3.6)   0.2 FPs   10.0 credits/run
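Each run is scored out of 14 catalogued vulnerabilities, so the percentages move in steps of roughly 7 points (13/14 is 93%, 12/14 is 86%, and so on). Below is a small sketch of how one row could be derived from per-run counts. Using the sample standard deviation is an assumption; the Opus counts are the ones reported in this post (13 of 14 on all four runs).

```python
from statistics import mean, stdev

TOTAL_VULNS = 14  # catalogued vulnerabilities across the 10 changes

def summarize(detected_per_run):
    """Mean, range and sample standard deviation of detection rate, in percent."""
    rates = [100 * d / TOTAL_VULNS for d in detected_per_run]
    return {
        "mean": round(mean(rates)),
        "low": round(min(rates)),
        "high": round(max(rates)),
        "sd": round(stdev(rates), 1),
    }

# Opus found 13 of 14 in every run.
print(summarize([13, 13, 13, 13]))
# {'mean': 93, 'low': 93, 'high': 93, 'sd': 0.0}
```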

Two things stood out enough that I felt they were worth writing up.

1. Haiku 4.5 ties Sonnet 4.6 on mean detection at ~1/3 the cost

Both landed at 86% mean detection. Haiku costs 3.3 credits per 10-change sweep; Sonnet costs 10. That's a 3× spread for the same outcome on this benchmark.

If you're planning to run /security-review on every PR in a busy repo, this feels like the line item to look at first. Sonnet has slightly fewer false positives on average (0.8 vs 1.2), but it's close enough that I'd consider using Haiku for this kind of task and then maybe a bigger model to fix or throw away the findings that don't matter.

2. Opus is the only model with zero variance across runs

Opus scored 13/14 every single time. Same detection rate, same missed vulnerability, four runs in a row.

Everything else moved. Sonnet ranged from 79% to 93% across its four runs. Haiku did the same. That's a 14-percentage-point swing for "the same model on the same input."

If your security gate is a single /security-review run on a mid-tier model, you're partly looking at noise. Re-running matters more than I'd assumed before doing this. There's a chance that running a cheaper model a few times gets you almost as good a result as a single run of a frontier model while still coming out cheaper.
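One way to picture "re-run and combine" is to take the union of what each run detected. This is a minimal sketch; the vulnerability IDs and the two result sets are hypothetical, not data from this experiment.

```python
def union_detection(runs, catalogue):
    """Detection rate when a vuln counts as found if ANY run found it."""
    found = set().union(*runs) & catalogue
    return len(found) / len(catalogue)

# 14 hypothetical catalogue IDs: V01 .. V14.
catalogue = {f"V{i:02d}" for i in range(1, 15)}

# Two hypothetical runs that miss different things.
run_a = catalogue - {"V03", "V07", "V11"}  # 11 of 14 on its own
run_b = catalogue - {"V07", "V12"}         # 12 of 14 on its own

print(f"{union_detection([run_a, run_b], catalogue):.0%}")  # only V07 missed -> 93%
```

The union only helps when runs miss different vulnerabilities, and false positives accumulate across runs as well, so the combined output would need some triage.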

The cost question

Credits are bucketed per model: every Haiku run (one 10-change sweep) cost roughly 3.3 credits, and every Opus run 30.0. So this isn't anecdotal cost data; it's effectively the price list as observed in these tests:

Model               Credits / 10-change sweep   vs Haiku
claude-haiku-4.5     3.3                         1.0×
claude-sonnet-4.6   10.0                         3.0×
gpt-5.4             10.0                         3.0×
claude-opus-4.6     30.0                         9.1×
gpt-5.5             75.0                        22.7×
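The "vs Haiku" column is just each model's credits divided by Haiku's. A short sketch that reproduces it from the figures in the table:

```python
# Credits per 10-change sweep, taken from the table above.
CREDITS_PER_SWEEP = {
    "claude-haiku-4.5": 3.3,
    "claude-sonnet-4.6": 10.0,
    "gpt-5.4": 10.0,
    "claude-opus-4.6": 30.0,
    "gpt-5.5": 75.0,
}

baseline = CREDITS_PER_SWEEP["claude-haiku-4.5"]

# Cost of each model relative to Haiku, rounded to one decimal place.
ratios = {model: round(credits / baseline, 1)
          for model, credits in CREDITS_PER_SWEEP.items()}

for model, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{model:<20}{ratio:>5.1f}x")
```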

The thing this benchmark left me genuinely uncertain about: is this kind of analysis worth the tokens it consumes with frontier models?

The credit figures above are per 10-change sweep, so a single review costs about a tenth of that. A team running 100 PRs/week with Haiku is spending ~33 credits/week on security review. The same workload on GPT-5.5 is ~750. That's a meaningful number if you're paying for credits out of an engineering budget, and at the Haiku end it's small enough that the cost-benefit case writes itself. At the GPT-5.5 end you're paying roughly 23× more than Haiku for about five percentage points more detection, and 2.5× more than Opus for about two points less, which is harder to defend. Especially when a lot of companies might already have some security tooling set up.

My take is: probably yes for high-stakes diffs, and no for the long tail. But the tool doesn't help you make that call right now. So be mindful of the model you pick.

Disclaimer: This is a fun side project, not deep research

I'd rather have a smaller true claim than a bigger shaky one, so:

n=4 is small. The "Haiku ties Sonnet" finding is consistent with these runs but is not statistically established. With a higher number of runs this could become clearer, but I'm not going to spend all the company's tokens on this. I also need some left to get the AI to do my job.
Juice Shop is well-known. It almost certainly appears in training data for all five models, which would inflate scores roughly uniformly. That's why the interesting comparisons here are between models, not the absolute detection rates. I know there are better benchmarks out there, but I was just playing around with something small, so I picked something I know. I spent a few more tokens building the benchmark.
The grader sees the catalogue. It's calibrated to answer "does this finding match a catalogued vuln," but matching isn't always a clean 1:1 and the grader itself could be wrong. I spot-checked a few matches and they were correct, so I trust the grader does OK, especially with Opus 4.6 as the grading model.
One workload. This is /security-review against Node.js diffs with common OWASP-class bugs. I don't know if things are different for other languages. I suspect that less popular languages might show bigger differences in the detection rates.
Models change. This is late May 2026. If you're reading this later, changes in model pricing and capabilities could lead to different conclusions.

What I'd do next if I were funding this properly

Push n to 10+ per model and settle the Haiku-vs-Sonnet question.
Add a private repo benchmark alongside Juice Shop to neutralise training-data effects.
Test "2× Haiku with union" head-to-head against "1× Opus." That's the most useful practical question this data raises and it's still open.
Add a second independent grader for inter-rater calibration, or do something more deterministic.

If you've played with /security-review and seen different patterns, or if you have ideas for what codebases would make better targets, I'd genuinely like to hear about it.

DEV Community: D Cairo