TL;DR
The React Hooks PR that changed every React application on earth? Three words in the commit message. One feature flag removed. It scored 91 out of 100 for deploy risk. The Svelte 5 release scored 99. A 65-line TypeScript change scored 79 and silently broke type inference in codebases worldwide. We ran 28 landmark open source pull requests through Koalr's deploy risk model. Here is what we found — and why it matters for the PRs your team ships every week.
The problem with code review
Modern code review answers one question well: is this code correct?
It answers a different question poorly: how likely is this to cause a production incident?
Those are not the same question. A PR can be clean, well-written, and thoroughly reviewed — and still wreck production because it touches a critical path nobody flagged, because the reviewer had twelve other PRs open, or because it is the fourth consecutive revert of a feature that never landed cleanly.
Most teams have no objective signal for the second question. They have green checkmarks.
What deploy risk scoring is
Koalr scores every pull request from 0 to 100 before it merges. The score is built from 36 signals:
Blast radius signals
- How many files changed
- What services those files belong to
- Whether shared libraries or interfaces were modified
- CODEOWNERS compliance — did the right people review the right files
Change quality signals
- File churn — how recently and how often these files have been modified
- Change entropy — how spread across the codebase the diff is
- Lines added vs deleted ratio
- Test coverage of changed files
Context signals
- Reviewer load — how many open PRs each reviewer currently has
- Author's recent incident rate
- Time since last deploy to the same service
- Revert history on the changed file set
History signals
- Consecutive reverts of the same feature
- Recent incident correlation with this file set
- PR age — how long the branch has been open
A score of 0–39 is Low. 40–69 is Medium. 70–89 is High. 90–100 is Critical.
The score does not replace review. It gives reviewers a number to orient around before they start reading.
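As a rough sketch of how a number like this comes together: normalize each signal to 0–1, combine them with weights, and map the result to the bands above. The signal names, weights, and the weighted-sum aggregation below are illustrative assumptions, not Koalr's actual model — only the score bands come from the description above.

```python
# Illustrative sketch only: signal names and weights are invented for this
# example. Koalr's real model combines 36 signals with its own weighting.

def risk_score(signals: dict[str, float], weights: dict[str, float]) -> int:
    """Combine normalized signals (each 0..1) into a 0-100 score."""
    total_weight = sum(weights.values())
    weighted = sum(signals.get(name, 0.0) * w for name, w in weights.items())
    return round(100 * weighted / total_weight)

def risk_band(score: int) -> str:
    """Map a 0-100 score to the bands described above."""
    if score <= 39:
        return "Low"
    if score <= 69:
        return "Medium"
    if score <= 89:
        return "High"
    return "Critical"

# Hypothetical PR: huge blast radius, quiet history, unloaded reviewers.
weights = {"blast_radius": 3.0, "file_churn": 1.0,
           "reviewer_load": 1.0, "revert_history": 2.0}
signals = {"blast_radius": 0.97, "file_churn": 0.4,
           "reviewer_load": 0.2, "revert_history": 0.0}
score = risk_score(signals, weights)
```

Note how a single dominant signal (blast radius here) can pull a score up even when every other signal is calm — which is exactly the pattern in the small-diff examples below.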
The experiment
We pulled 28 of the most consequential pull requests in open source history and ran them through the model. These are PRs the industry knows by name — the ones that shipped features used by millions of developers, or broke them.
Here is what the model said.
The obvious ones scored as expected
Svelte 5 release https://github.com/sveltejs/svelte/pull/13701 — score 99
The full runes rewrite merged to main. Thousands of files changed, the entire reactivity model replaced, years of migration work consolidated into one merge. Of course it scored critical. High blast radius, enormous file count, fundamental architecture change. The model does what you would expect.
TypeScript modules conversion https://github.com/microsoft/TypeScript/pull/51387 — score 98
Microsoft's conversion of the entire TypeScript compiler codebase from namespaces to ES modules. It touched every source file in the compiler, changed the build system, and dropped dependencies. If any PR in history deserved a mandatory all-hands review before merge, it was this one.
The surprising ones — small diffs, enormous blast radius
This is where it gets interesting.
React PR #14679 "Enable hooks!" https://github.com/facebook/react/pull/14679 — score 91
The commit message is three words. The diff is the removal of a single feature flag. You could read the entire change in thirty seconds.
It scored 91.
Why? Because the model does not count lines — it looks at what the changed code controls. A feature flag in a framework used by tens of millions of applications is not a small change. It is a detonation switch. The blast radius is every React application on earth. The model flagged it correctly.
Signals fired:

```
blast_radius_score: 0.97
feature_flag_detected: true
downstream_consumers: critical
reviewer_load: 0.2 (core team — low load)
```

Final score: 91 / Critical
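The feature-flag signal can be approximated with a crude diff heuristic. This is a hypothetical sketch, not Koalr's detector — the regex and the idea of scanning only removed lines are assumptions for illustration:

```python
# Hypothetical heuristic: treat a diff as high blast radius when it
# removes a line that references a feature-flag-like identifier,
# regardless of how small the diff is.

import re

FLAG_PATTERN = re.compile(r"(enable|disable)[A-Z_]\w*|feature[_ ]?flag",
                          re.IGNORECASE)

def removes_feature_flag(diff_lines: list[str]) -> bool:
    """True if any removed line ('-' prefix) references a feature flag."""
    return any(
        line.startswith("-") and FLAG_PATTERN.search(line)
        for line in diff_lines
    )

# A tiny, hooks-style flag removal (invented diff for illustration).
diff = [
    "-const enableHooks = __EXPERIMENTAL__;",
    "+// hooks are always on",
]
```

A thirty-second diff still trips the signal, because what the removed line controls matters more than how long the diff is.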
Node.js PR #41749 "lib: add fetch" https://github.com/nodejs/node/pull/41749 — score 82
One file changed: the bootstrap script that runs inside every Node.js process. Adding the global fetch API touched the most critical execution path in the runtime.
Single-file PR. High score. The file changed is what matters, not how many files changed.
TypeScript PR #57465 "Infer type predicates from function bodies" https://github.com/microsoft/TypeScript/pull/57465 — score 79
65 lines of new code. One function modified.
Those 65 lines changed type inference behavior across the entire checker, producing new type errors in codebases that had compiled cleanly for years. A reviewer looks at 65 lines, sees clean code, approves it. The model sees that those 65 lines live inside the type checker core and have cross-cutting effects on every downstream consumer.
This is the failure mode standard review misses every time.
The revert pattern
Next.js PR #45196 https://github.com/vercel/next.js/pull/45196 — score 88
Title: Revert "Revert "Revert "Revert "Initial metadata support""""
PR body: "Hopefully last time."
Four consecutive reverts of the same feature. The model has a specific signal for this: repeated churn on the same file set with revert commits in recent history. It is one of the strongest predictors of another rollback. The PR scored 88 before anyone read a single line of the diff.
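The revert-chain signal itself is easy to sketch. Git prepends `Revert "` to each revert's title, so the nesting depth can be read straight off the PR title. This is illustrative only — a real signal would also inspect revert commits on the changed file set, not just the title:

```python
# Illustrative sketch: count revert-chain depth from a PR title.
# Git's convention wraps each revert as: Revert "<previous title>".

def revert_depth(title: str) -> int:
    """Count how many times a change has been reverted, from its title."""
    prefix = 'Revert "'
    depth = 0
    while title.startswith(prefix):
        depth += 1
        title = title[len(prefix):]
    return depth

title = 'Revert "Revert "Revert "Revert "Initial metadata support""""'
```

A depth of four on the same feature is the kind of history signal that pushes a score into High before the diff is even opened.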
The one that surprised us most
The Jest-to-Vitest migration in tRPC — PR #3688 https://github.com/trpc/trpc/pull/3688 — scored 67. Medium risk.
At first glance, that sounds about right for a test runner swap. But look at what actually changed: every single test file in the repository, plus the root configuration, plus the CI pipeline. The surface area was enormous.
The score was “only” 67 because the risk model correctly identified that none of the changed files were production code paths — only test infrastructure. A test runner change cannot break a production deployment directly. What it can do is make future regressions invisible, which is a subtler and harder-to-measure risk.
The model is honest about what it can and cannot see. Broken test infrastructure does not score as a deploy risk — it scores as a coverage risk. Different signal, different response.
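A rough version of that deploy-risk-versus-coverage-risk distinction is a path classifier. The directory markers and filename patterns below are assumptions for illustration, not Koalr's rules: test directories and CI config count toward coverage risk, everything else toward deploy risk.

```python
# Hypothetical path classifier: separates production code paths from
# test/CI infrastructure. Marker lists are invented for this sketch.

from pathlib import PurePosixPath

TEST_MARKERS = {"test", "tests", "__tests__", "spec"}
CI_PREFIXES = (".github/", "ci/")

def is_production_path(path: str) -> bool:
    """False for test files and CI config; True for everything else."""
    if path.startswith(CI_PREFIXES):
        return False
    if any(part in TEST_MARKERS for part in PurePosixPath(path).parts):
        return False
    name = PurePosixPath(path).name
    return ".test." not in name and ".spec." not in name

# A tRPC-style migration touches mostly non-production paths.
changed = [
    "packages/server/src/router.ts",
    "packages/server/test/router.test.ts",
    ".github/workflows/ci.yml",
]
prod_files = [p for p in changed if is_production_path(p)]
```

When the production slice of a huge diff is small, the deploy-risk contribution stays modest — which is how an every-file migration lands at Medium instead of Critical.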
The score table
Here are the seven PRs discussed above, with the risk level and the primary driver behind each score:

| PR | Score | Risk | Primary driver |
|---|---|---|---|
| Svelte 5 release (sveltejs/svelte #13701) | 99 | Critical | Full reactivity rewrite across thousands of files |
| TypeScript ES modules conversion (microsoft/TypeScript #51387) | 98 | Critical | Every compiler source file plus the build system |
| React "Enable hooks!" (facebook/react #14679) | 91 | Critical | Feature flag gating every React application |
| Next.js metadata support (vercel/next.js #45196) | 88 | High | Fourth consecutive revert of the same feature |
| Node.js global fetch (nodejs/node #41749) | 82 | High | Single file on the runtime's bootstrap path |
| TypeScript type predicate inference (microsoft/TypeScript #57465) | 79 | High | 65 lines in the checker core with cross-cutting effects |
| tRPC Jest-to-Vitest migration (trpc/trpc #3688) | 67 | Medium | Enormous surface area, but test infrastructure only |
What this means for your PRs
The open source examples are useful because they are public and well-documented. But none of those teams needed a risk model — the React core team was reviewing the hooks PR. It still would have scored 91.
The real value is the ordinary PR your team ships on a Thursday afternoon, reviewed by one person in fifteen minutes, that quietly introduces a breaking change nobody caught. That team does not have the React core team. They have two engineers, a Monday morning deadline, and a PR that looks fine.
That is who Koalr is built for.
Try it
The live risk demo at koalr.com/live-risk-demo scores any public GitHub PR in seconds. No account, no install. Paste a URL, get a score.

If you want to score your own team's PRs — every PR, automatically, as part of your GitHub workflow — there is a free trial at app.koalr.com/signup.