DEV Community

Zhang Yao

Git Bayesect: Bayesian Git Bisection for When Bugs Are Playing Games with You

The problem nobody talks about: what happens when git bisect can't help you because your test suite has become a liar?


The Problem: When Your Test Suite Starts Gaslighting You

You've been there. A test that used to pass 100% of the time now fails 30% of the time. Your CI pipeline lights up red on some commits and not others — but unpredictably. You run git bisect, and it tells you the culprit is commit a1b2c3d, so you stare at that diff for two hours only to realize... the test still fails randomly on main. The bisect gave you a false positive.

This is the flaky test epidemic. In modern software engineering, especially with:

  • LLM-powered test suites (where outputs are inherently stochastic)
  • Performance regression tracking (where noisy benchmarks fluctuate by ±5%)
  • Race condition detectors (where timing-dependent bugs only surface occasionally)
  • Any test that talks to external services

...the bug isn't boolean. The probability of failure has changed, but git bisect only understands yes/no.

Traditional binary search assumes: if I test commit X, I'll always get the same result. When that assumption breaks, you're left running tests hundreds of times per commit — or just guessing.
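You can watch that assumption collapse in a small simulation. The sketch below is illustrative only (not the author's benchmark code): it runs a plain binary search that trusts every single observation from a flaky test.

```python
import random

def flaky_bisect(n_commits, breakpoint_idx, p_new, p_old, rng):
    """Plain binary search that trusts each single observation.
    Commits 0..breakpoint_idx-1 are old (fail w.p. p_old);
    commits breakpoint_idx..n-1 are new (fail w.p. p_new)."""
    lo, hi = 0, n_commits  # bisect's invariant: lo passes, hi fails -- assumed, not true here
    while hi - lo > 1:
        mid = (lo + hi) // 2
        failed = rng.random() < (p_new if mid >= breakpoint_idx else p_old)
        if failed:
            hi = mid   # bisect believes the bug is at or before mid
        else:
            lo = mid   # bisect believes mid is clean
    return hi          # reported "first bad" commit

rng = random.Random(0)
trials = 1000
hits = sum(flaky_bisect(100, 60, 0.7, 0.3, rng) == 60 for _ in range(trials))
print(f"found the true breakpoint in {hits}/{trials} searches")
```

With each probe correct only 70% of the time and ~7 probes per search, roughly 0.7⁷ ≈ 8% of searches land on the true breakpoint.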

The Solution: Enter Bayesian Inference

git bayesect, by hauntsaninja, is a Bayesian generalization of git bisect. Instead of asking "does this commit fail?", it models the probability that a commit fails, then uses Bayes' theorem to narrow down where that probability changed.

The Core Insight

Number the commits newest-first (index 0 = the tip), and assume there's some commit index B (the "breakpoint") such that:

  • For commits b ≤ B (newer, i.e. at or after the breaking change): P(fail) = p_new (e.g., 0.8)
  • For commits b > B (older): P(fail) = p_old (e.g., 0.2)

We don't know B, p_new, or p_old exactly — we just know something changed. Bayesian inference lets us update our beliefs about B after each flaky observation.

Step 1: The Prior — Where Do We Think the Bug Lives?

Start with a uniform prior over all commits: P(B = b) = 1/N for each commit index. You can also use custom priors if you have hunches — git bayesect lets you set weights based on filenames or commit messages:

# Boost commits that touch "timeout" in their message
git bayesect priors_from_text --text-callback "return 10 if 'timeout' in text.lower() else 1"

# Boost commits touching suspicious files
git bayesect priors_from_filenames --filenames-callback "return 10 if any('suspicious' in f for f in filenames) else 1"

Setting a prior to zero is equivalent to git bisect skip.
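Under the hood, the prior boils down to one weight per commit. A minimal numpy sketch of the idea (a sketch of the concept, not the tool's actual internals):

```python
import numpy as np

N = 10                    # commits under suspicion
W = np.full(N, 1.0 / N)   # uniform prior: P(B = b) = 1/N

W[3] *= 10   # hunch: commit 3 touched a suspicious file, so boost it 10x
W[7] = 0.0   # equivalent of `git bisect skip`: rule commit 7 out entirely

W /= W.sum()  # renormalize into a probability distribution
print(W.round(3))
```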

Step 2: The Bayesian Update — How Observations Change Our Beliefs

When you observe a yes (test failed) at commit index i:

# W = weights encoding our prior (numpy array); index 0 = newest commit
W[index:] *= p_new   # hypotheses B ≥ index: commit `index` is already broken, p_new explains the failure
W[:index] *= p_old   # hypotheses B < index: commit `index` is still old, p_old is in effect
W /= W.sum()         # normalize to get the posterior probability distribution

When you observe a no (test passed):

W[index:] *= (1 - p_new)
W[:index] *= (1 - p_old)
W /= W.sum()

Each observation is a likelihood-weighted scaling of our probability distribution. This is just Bayes' theorem in vectorized form.
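To make the update concrete, here is a toy run (with p_new and p_old assumed known for the moment, and the true breakpoint at index 5, so commits 0–5 are broken):

```python
import numpy as np

def update(W, index, failed, p_new=0.8, p_old=0.2):
    """Posterior update after one observation at `index`.
    Index 0 is the newest commit, so hypotheses B >= index
    mean commit `index` is already broken (rate p_new)."""
    W = W.copy()
    if failed:
        W[index:] *= p_new
        W[:index] *= p_old
    else:
        W[index:] *= 1 - p_new
        W[:index] *= 1 - p_old
    return W / W.sum()

W = np.full(10, 0.1)  # uniform prior over 10 commits
# Observations consistent with breakpoint B = 5:
for index, failed in [(2, True), (5, True), (7, False), (6, False)]:
    W = update(W, index, failed)
print(np.argmax(W))  # mass concentrates on the true breakpoint, index 5
```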

Step 3: Choosing the Next Commit — Greedy Entropy Minimization

Binary search picks the midpoint. Bayesian search picks the commit that maximizes expected information gain:

argmin_i  E[H(P(B | observation at i))]
        = argmin_i P(yes_i) * H(P(B | yes_i)) + P(no_i) * H(P(B | no_i))

Where H is Shannon entropy — measuring how uncertain we are about B.

# Naive O(n²) version — posterior() applies the Step 2 update for a
# hypothetical observation, entropy() is Shannon entropy:
entropies = np.empty(N)
for i in range(N):
    W_if_yes = posterior(W.copy(), i, observation=True)
    W_if_no  = posterior(W.copy(), i, observation=False)
    p_yes_i = p_new * W[i:].sum() + p_old * W[:i].sum()
    entropies[i] = p_yes_i * entropy(W_if_yes) + (1 - p_yes_i) * entropy(W_if_no)

selected_index = np.argmin(entropies)

The actual implementation uses prefix/suffix sums to compute this in O(n) time with full vectorization.

Why is this better than bisecting at the median CDF?

  • It optimizes for information gain, not just position
  • If p_new and p_old have different Bernoulli entropies, i.e. H(p_new) ≠ H(p_old) (asymmetric failure rates), a "yes" and a "no" aren't equally informative — entropy minimization adapts
  • The objective is mathematically grounded via the logarithmic scoring rule

Note: greedy entropy minimization isn't theoretically optimal (you can construct toy cases where it gets stuck), but in practice it converges extremely well.
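Fleshing the naive loop out into a self-contained sketch (again with p_new and p_old assumed known) gives a reassuring sanity check: on a uniform prior with symmetric rates, the entropy-minimizing pick is the middle of the range, so ordinary binary search falls out as a special case.

```python
import numpy as np

def entropy(W):
    """Shannon entropy, with 0 * log 0 treated as 0."""
    nz = W[W > 0]
    return -(nz * np.log(nz)).sum()

def posterior(W, i, failed, p_new=0.8, p_old=0.2):
    """Step 2's update: hypotheses B >= i see commit i as broken."""
    W = W.copy()
    W[i:] *= p_new if failed else 1 - p_new
    W[:i] *= p_old if failed else 1 - p_old
    return W / W.sum()

def next_commit(W, p_new=0.8, p_old=0.2):
    """Greedy choice: the index whose observation minimizes expected entropy."""
    entropies = np.empty(len(W))
    for i in range(len(W)):
        p_yes = p_new * W[i:].sum() + p_old * W[:i].sum()
        entropies[i] = (p_yes * entropy(posterior(W, i, True))
                        + (1 - p_yes) * entropy(posterior(W, i, False)))
    return int(np.argmin(entropies))

W = np.full(16, 1 / 16)
print(next_commit(W))  # uniform prior + symmetric rates: lands at the midpoint
```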

Step 4: The Beta-Bernoulli Conjugacy Trick — Handling Unknown Probabilities

Here's the catch: we don't actually know p_new and p_old either!

The naive solution — run 100 trials at the newest and oldest commits to estimate them — is wasteful. Instead, we put prior distributions on p_new and p_old:

p_new ~ Beta(α_new, β_new)
p_old ~ Beta(α_old, β_old)

We treat α as pseudo-counts of "yes" observations and β as pseudo-counts of "no" observations. Then we marginalize over all possible values of p_new and p_old to get the posterior distribution over B.

The conjugacy means the marginal likelihood integral evaluates to a closed form — no MCMC, no numerical integration, no approximation. It's just scaling by ratios of Beta functions, which is computationally trivial.
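To see the closed form in isolation (standalone math, not the tool's code): under p ~ Beta(α, β), the marginal probability of a single failure is B(α+1, β)/B(α, β), which collapses to α/(α+β).

```python
from math import exp, lgamma

def log_beta(a, b):
    """log B(a, b) = log Γ(a) + log Γ(b) − log Γ(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def p_fail(alpha, beta):
    """Marginal P(yes) = ∫ p · Beta(p; α, β) dp = B(α+1, β) / B(α, β) = α / (α + β)."""
    return exp(log_beta(alpha + 1, beta) - log_beta(alpha, beta))

print(p_fail(2.0, 3.0))      # α/(α+β) = 2/5 = 0.4
print(p_fail(2.0 + 1, 3.0))  # conjugacy: observing a failure just increments α
```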

Installation and Usage

pip install git_bayesect
# or: uv tool install git_bayesect

Basic Workflow

# Start a bayesection between main and an older commit
git bayesect start --old $OLD_COMMIT

# Record observations as you test commits
git bayesect fail    # test failed on current commit
git bayesect pass    # test passed on current commit

# Or specify a commit explicitly
git bayesect fail --commit abc123

# Check your current belief state
git bayesect status

# Auto-run with a command (git bayesect will interpret exit code)
git bayesect run python -m pytest tests/

# Let bayesect pick the next commit to checkout
git bayesect checkout

# Undo the last observation
git bayesect undo

# Reset when you're done
git bayesect reset

Demo Script

# Generate a fake repo with a known flaky breakpoint
python scripts/generate_fake_repo.py
cd fake_repo

# Inspect the history
git log --oneline

# Start bayesection
OLD_COMMIT=$(git rev-list HEAD --reverse | head -n 2 | tail -n 1)
git bayesect start --new main --old $OLD_COMMIT

# Run the automated bayesection
git bayesect run python flaky.py

The Numbers: When Does It Actually Help?

The author's benchmarks tell a stark story:

Scenario               Traditional Bisect    git bayesect
90% → 10% flakiness    ~44% accuracy         ~96% accuracy
70% → 30% flakiness    ~9% accuracy          ~67% accuracy

At extreme flakiness ratios (90/10), traditional bisect is basically a coin flip. Bayesect maintains high accuracy because it reasons probabilistically about each observation rather than treating them as ground truth.

Real-World Use Cases

LLM-Powered Test Suites

LLM-evaluated tests (e.g., "did the generated code pass my assertion?") are inherently stochastic. Running them through git bayesect lets you track down commits that shift the LLM's behavior — without needing 50 reruns per commit.

Noisy Performance Benchmarks

Perf regressions in compiled languages can have ±3% noise due to cache effects, OS scheduling, etc. Bayesect handles this gracefully: if a benchmark went from averaging 100ms to averaging 110ms with high variance, binary search fails but Bayesian inference works.
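One way to wire such a benchmark into `git bayesect run` is a small gate script that thresholds the timing into an exit code. Everything in this sketch — the 105 ms cutoff, the bench() placeholder workload, the script name — is hypothetical:

```python
import statistics
import time

THRESHOLD_MS = 105.0  # hypothetical cutoff between the old ~100 ms and new ~110 ms means

def bench() -> float:
    """Stand-in for the real benchmark; returns elapsed milliseconds."""
    start = time.perf_counter()
    sum(i * i for i in range(200_000))  # placeholder workload
    return (time.perf_counter() - start) * 1000

def gate(runs: int = 5) -> int:
    """Exit code for `git bayesect run`: 1 = fail (slow), 0 = pass.
    A median of a few runs damps (but never removes) the noise;
    bayesect tolerates the misclassifications that remain."""
    return 1 if statistics.median(bench() for _ in range(runs)) > THRESHOLD_MS else 0

print(gate())
```

A script ending in `raise SystemExit(gate())` could then be driven with `git bayesect run python bench_gate.py` (bench_gate.py being a made-up name).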

Race Conditions and Timing Bugs

Bugs that only surface under specific timing conditions are non-deterministic by nature. Bayesect can help isolate the commit that made them more likely, even if you can't make them 100% reproducible.

Future Directions

From HN discussions, promising extensions include:

  • Cost-aware selection: Different commits have different test run times (e.g., one triggers a full rebuild). Future versions could factor in expected time-to-information-gain.
  • Posterior visualization: Expose the full posterior distribution over B so you can see your confidence interval, not just the MAP estimate.
  • Multi-hypothesis tracking: Handle cases where multiple independent flakiness sources exist.

Conclusion

git bayesect is a beautiful application of Bayesian reasoning to a real engineering pain point. The math is elegant (Beta-Bernoulli conjugacy, greedy entropy minimization), the implementation is practical (pure Python, pip install), and the results are dramatically better than naive bisection on probabilistic bugs.

If you've ever spent hours manually bisecting through flaky tests, or given up and just rolled back half your codebase — this tool is for you. When your code starts gaslighting you, Bayesian inference fights back.


Filed under: debugging, git, python, bayesian-statistics
