CodeRabbit’s code reviews help developers fix bugs and ship code. We recently wrote about benchmarking GPT-5 and opined that the model was a generational leap in reasoning for our use case of AI code reviews. As we rolled it out to our wider user base, we observed that the signal-to-noise ratio (SNR) dipped and users felt the reviews were too pedantic.
The release of GPT‑5 Codex, plus the product changes we made (severity tagging, stricter refactor gating, better filtering), brings our signal-to-noise ratio back without sacrificing the ability to find the hard bugs.
On our refreshed hard 25 PR set, GPT-5 Codex delivers about 35% higher per-comment precision than GPT‑5, maintains essentially the same error-pattern-level bug coverage, and cuts roughly a third of the comment volume. Combine that with the lower latency of the GPT-5 Codex model and the experience feels snappier and more focused.
What we measured (and why)
When testing GPT-5 Codex, we ran a fresh “hard 25” suite of OSS PRs (slightly tougher than the set in our previous post). These are 25 of the most difficult pull requests from our dataset, representing real-world bugs that span:
- Concurrency issues (e.g. TOCTOU races, incorrect synchronization)
- Object-oriented design flaws (e.g. virtual call pitfalls, refcount memory model violations)
- Performance hazards (e.g. runaway cache growth, tight loop stalls)
- Language-specific footguns (e.g. TypeScript misuses, C++ memory order subtleties)
We evaluated the following models:
- GPT‑5 Codex
- GPT‑5
- Claude (Sonnet 4 and Opus‑4.1)
What we looked for
We gave each of the models a score based on how they performed on these factors:
- EP (Error Pattern). The specific underlying defect seeded in a PR (e.g., lost wakeup on a condition variable, inconsistent lock order, logic bug hidden in boolean soup).
- EP PASS/FAIL (per PR). PASS if the model left at least one comment that directly fixes or credibly surfaces that PR’s EP. If it left no comment on that PR, it is counted as FAIL for that PR.
- Comment PASS/FAIL (per comment). PASS if the comment directly fixes or credibly surfaces the EP, otherwise FAIL.
- Per-comment precision. PASS comments ÷ all comments. This is our operational SNR for this dataset (a scoring sketch follows this list).
- Important share. Every PASS is Important. Comments that do not solve the EP but still flag a genuine critical or major bug (like a use after free, double free, lost wakeup, memory leak, null deref, path traversal, catastrophic regex) are also Important. Everything else is Minor.
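To make these metrics concrete, here is a minimal scoring sketch. It assumes a hypothetical `Comment` record with per-comment labels; the names and data shapes are illustrative only, not our actual evaluation harness.

```python
from dataclasses import dataclass

@dataclass
class Comment:
    pr_id: str
    fixes_ep: bool        # Comment PASS: directly fixes or credibly surfaces the PR's EP
    flags_critical: bool  # flags another genuine critical/major bug (Important, but not an EP hit)

def score(comments: list[Comment], all_pr_ids: set[str]) -> dict:
    ep_passed = {c.pr_id for c in comments if c.fixes_ep}
    important = [c for c in comments if c.fixes_ep or c.flags_critical]
    return {
        # EP pass rate: PRs with at least one EP-passing comment; PRs with no comments count as FAIL
        "ep_pass_rate": len(ep_passed) / len(all_pr_ids),
        # Per-comment precision (our operational SNR): PASS comments / all comments
        "per_comment_precision": sum(c.fixes_ep for c in comments) / len(comments),
        # Important share: direct EP hits plus other critical/major findings
        "important_share": len(important) / len(comments),
    }
```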
Scoreboard - Codex improves signal-to-noise
Takeaway: Codex finds essentially the same EPs as GPT‑5 but does it with fewer, tighter comments, so the signal-to-noise ratio is improved.
What this means: Codex covered 20 of the 25 PRs (the other 5 count as uncovered fails). Despite fewer comments overall, Codex passed slightly more EPs (16 vs. 15) and landed far more Important comments. Over half of its comments either were direct hits on the issue seeded in that PR or flagged another critical bug. GPT‑5 and Claude trailed in precision and Important share at about 40%.
The verdict: same EP coverage, less noise. Codex retains GPT-5’s bug-finding power but trims the chatter, with about 32% fewer comments than GPT‑5 (54 vs. 79) and about 35% higher per-comment precision (46.3% vs. 34.2%). Claude looks similar to GPT‑5 on coverage but is chattier, with lower precision.
Style and structure (why Codex reads like a patch)
Codex replies are consistently action-forward (diffs almost always included) and rarely hedge. That lines up with what reviewers want: suggestions that translate directly into a patch.
The kinds of bugs Codex is good at
Across the suite, all models did well on concurrency and synchronization, but Codex stood out for:
- Condition variable misuse and lost wakeups. Codex proposes the canonical patterns (wait under lock, check the predicate in a loop; a sketch follows this list) and supplies concrete diffs.
- Lock ordering and deadlocks. It calls out inconsistent acquisition order and suggests a lock hierarchy or moving work outside critical sections, again with actionable edits.
- Subtle API and performance traps. Examples include catastrophic regex backtracking and memory model orderings. Codex pinpoints and patches them cleanly.
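For readers unfamiliar with the lost-wakeup pattern, here is a minimal illustration of the canonical fix, written in Python for brevity; the producer/consumer structure and the flag name are illustrative assumptions, not code from any benchmark PR.

```python
import threading

cond = threading.Condition()
ready = False  # the predicate, guarded by cond's lock

def producer() -> None:
    global ready
    with cond:              # mutate the predicate and notify while holding the lock
        ready = True
        cond.notify()

def consumer() -> None:
    with cond:              # wait under the lock...
        while not ready:    # ...and re-check the predicate in a loop
            cond.wait()     # a notify that fires early, or a spurious wakeup, is never lost
        # the predicate holds here; safe to proceed
```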
Why GPT‑5 felt noisier, and how we fixed that
What we saw: When we moved from Sonnet and Opus to GPT‑5, our total comments per review nearly doubled. Even though hallucinations fell to under 1% and negative tone fell to under 1%, the acceptance rate (share of comments judged helpful) declined significantly from its pre-GPT‑5 baseline.
What changed with Codex: With GPT‑5 Codex plus some product changes we’ve implemented, acceptance climbed back to prior levels while overall comment volume stayed higher than in the pre-GPT‑5 era. Put simply: our tool is back to its prior helpfulness level while still finding as many real issues as GPT-5.
Two product levers helped with this:
We put severity and review-type tags front and center
Review types: We created review types so users can self-select which kinds of comments they want to read: ⚠️ Potential issue, 🛠️ Refactor suggestion, and 🧹 Nitpick (nitpicks are hidden unless you opt into Assertive mode)
Severity: We now tag comments by severity to signal which ones matter most. Our tags are: 🔴 Critical, 🟠 Major, 🟡 Minor, 🔵 Trivial, ⚪ Info
We always show bugs (Critical, Major, Minor) but don’t always show other types of comments. Refactors show only if the model marks them as essential. Users who want everything can still switch to Assertive mode.
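A simplified sketch of how this gating could work, assuming flags for review type, severity, an “essential” marker on refactors, and the user’s Assertive-mode setting; the function and field names are illustrative, not our production filter.

```python
SEVERITIES_ALWAYS_SHOWN = {"critical", "major", "minor"}

def should_show(comment_type: str, severity: str, essential: bool, assertive_mode: bool) -> bool:
    """Decide whether a review comment is surfaced by default."""
    if assertive_mode:
        return True                                  # users who opt in see everything, nitpicks included
    if comment_type == "potential_issue":
        return severity in SEVERITIES_ALWAYS_SHOWN   # bugs (Critical/Major/Minor) always show
    if comment_type == "refactor_suggestion":
        return essential                             # refactors only when the model marks them essential
    return False                                     # nitpicks and other comment types are opt-in
```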
We implemented stricter filtering and aggregation
We collapse duplicative notes and filter out “nice to have” suggestions unless they have clear ROI for the user. The result: fewer, denser comments, and fewer reasons to tune out.
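As a rough illustration of the collapsing step, here is a toy deduplication pass that groups notes by file and a coarse issue fingerprint and keeps the highest-severity note from each group; the field names and the fingerprint itself are assumptions for the sketch, not our actual aggregation logic.

```python
from collections import defaultdict

SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2, "trivial": 3, "info": 4}

def collapse_duplicates(comments: list[dict]) -> list[dict]:
    """Keep one note per (file, fingerprint) group of near-duplicates."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for c in comments:
        groups[(c["file"], c["fingerprint"])].append(c)
    # keep the highest-severity note from each group of duplicates
    return [min(group, key=lambda c: SEVERITY_ORDER[c["severity"]]) for group in groups.values()]
```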
Latency: Fast matters & Codex is faster
A five-minute review is fine. Thirty minutes is not. GPT‑5’s “always think hard” style significantly increased time to first token and overall review time. We recently shipped several pipeline optimizations, and Codex helps further reduce the latency that GPT-5 introduced.
Codex’s variable or elastic thinking uses less depth when it is not needed, improving time to first output and end-to-end review time in practice. Net: faster reviews, earlier feedback, better flow for the human in the loop.
What a CodeRabbit user should expect
Now that Codex is rolled out, how will your AI code reviews change?
The same raw bug-finding power
On the refreshed hard 25, Codex passed 64% at the EP level vs. 60% for GPT‑5 (our previous set of PRs had GPT-5 passing 77.3%). No loss of the important wins GPT-5 helped with.
Fewer but stronger comments
About 32% fewer total comments than GPT‑5, with about 35% higher SNR (per-comment precision). More patches, less prose.
Severity tags to focus your review
Critical and Major issues float to the top with our new severity tags. Refactors are gated. Nitpicks are opt-in. You will spend less time scanning comments and more time fixing.
A faster feedback loop
Codex’s leaner reasoning plus pipeline improvements bring time to first helpful comment down. You will feel it.
Quantitative appendix (for the curious)
We know you love data! Here are some other stats we found interesting:
- Per-comment precision (SNR) uplift: Codex 46.3% vs. GPT‑5 34.2%, about +35% relative (the arithmetic is sketched after this list).
- Comment volume delta: Codex 54 vs. GPT‑5 79, or 32% fewer comments, with EP passes essentially unchanged (16 vs. 15).
- Style: Codex includes diffs in 94% of comments and uses hedging far less than Claude and GPT‑5 on this set.
- Acceptance (real world): During the GPT‑5 rollout, acceptance dropped significantly. With Codex plus the product changes above, it rose by about 20–25% relative and returned to prior levels while still delivering more accepted comments than pre-GPT‑5.
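For anyone who wants to sanity-check the relative deltas, the arithmetic behind the headline numbers above works out as follows:

```python
# Reproducing the relative deltas from the reported raw numbers above.
codex_precision, gpt5_precision = 0.463, 0.342
codex_comments, gpt5_comments = 54, 79

precision_uplift = codex_precision / gpt5_precision - 1   # ~0.354, i.e. "about +35% relative"
comment_reduction = 1 - codex_comments / gpt5_comments    # ~0.316, i.e. "about 32% fewer comments"

print(f"{precision_uplift:.1%} higher per-comment precision, {comment_reduction:.1%} fewer comments")
```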
Where Codex still needs work (and what we are doing)
These improvements are great, but that doesn’t mean Codex is free of issues. Here are some we are actively working on:
- Coverage gaps. When a model leaves no comment on a PR, that is a hard fail for that EP. We are widening Codex’s search heuristics so it is less likely to miss entire classes of issues.
- Refactor over-eagerness (tuned, not solved). The “essential only” gate curbs refactor noise, but we will keep tightening the threshold, especially on large diffs where a high number of comments would be overwhelming.
- User-driven prioritization. We cannot change GitHub’s in-line ordering, but we annotate every comment with severity so you can triage from the top down without hunting.
GPT-5 Codex: all of the great bug-catching ability, fewer downsides
Our north star is simple: catch the bugs that matter, quickly, without making you sift through noise. Codex helps us do that. It keeps the bite of GPT‑5’s reasoning while restoring SNR and shaving latency down significantly. We will keep measuring, improving, and shipping a better product every release.