Bayes' Theorem describes how to update the probability of a hypothesis when new data is obtained. It reverses conditional probability: instead of asking “what is the probability of the data given the hypothesis?”, we ask “what is the probability of the hypothesis given the observed data?”
P(A∣B) = P(B∣A)*P(A) / P(B)
In words: Posterior = Likelihood × Prior / Marginal likelihood
- P(A∣B) – posterior probability. The probability of hypothesis A after observing data B. This is what we want to find.
- P(B∣A) – likelihood. The probability of observing data B if hypothesis A is true.
- P(A) – prior probability. The initial probability of the hypothesis before seeing the data.
- P(B) – marginal likelihood, or evidence. A normalizing factor that ensures the sum of all posterior probabilities equals 1.
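In code, the formula with P(B) expanded by the law of total probability looks like this (a minimal Python sketch; the function name and the example numbers are our own):

```python
def bayes_posterior(likelihood, prior, likelihood_complement):
    """P(A|B) = P(B|A) * P(A) / P(B), where the evidence P(B) is
    expanded via the law of total probability:
    P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)."""
    evidence = likelihood * prior + likelihood_complement * (1 - prior)
    return likelihood * prior / evidence

# Illustrative numbers: P(B|A) = 0.8, P(A) = 0.3, P(B|not A) = 0.2
print(bayes_posterior(0.8, 0.3, 0.2))  # 0.24 / 0.38 ≈ 0.632
```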
Example breakdown:
- Disease affects 1% of the population: P(sick) = 0.01
- Test correctly detects the disease in 95% of sick patients: P(positive∣sick) = 0.95
- Test gives false positive results in 5% of healthy people: P(positive∣healthy) = 0.05
Question: A person gets a positive test result. What is the probability they are actually sick?
Solution
P(sick∣positive) = 0.95×0.01 / (0.95×0.01 + 0.05×0.99) =
= 0.0095 / (0.0095 + 0.0495) =
= 0.0095 / 0.059 ≈
≈ 0.161
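A quick numeric check of the solution above in plain Python:

```python
p_sick = 0.01                # prior: 1% of the population is sick
p_pos_given_sick = 0.95      # sensitivity of the test
p_pos_given_healthy = 0.05   # false-positive rate

# Evidence: total probability of a positive result
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos

print(round(p_sick_given_pos, 3))  # 0.161
```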
Answer: ~16.1%. Even with a positive test, the probability of being sick is low due to the low base rate of the disease and the relatively high rate of false positives. This is often counterintuitive to people (the base rate fallacy).
Even with an accurate test, because the disease is rare, most positive results will be false. That’s why in machine learning, especially on imbalanced data, we look at Precision and not just Accuracy.
Accuracy is the proportion of all correct answers made by the model.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision is the proportion of actually sick people among all those the model labeled as "sick".
Precision = TP / (TP + FP)
- TP (True Positive) – correctly predicted "sick"
- TN (True Negative) – correctly predicted "healthy"
- FP (False Positive) – incorrectly said "sick" to a healthy person
- FN (False Negative) – incorrectly said "healthy" to a sick person
So Bayes' theorem gives you a formula for calculating precision from known parameters: for the medical test, the posterior P(sick∣positive) is exactly the test's precision.
Precision ≈ 16.1%. That is, only about 16.1% of those the test says are "sick" are actually sick.
Now let’s calculate accuracy using the same numbers, for a city of 100,000 people (1,000 sick, 99,000 healthy):
TP = 0.95 × 1,000 = 950; TN = 0.95 × 99,000 = 94,050
Accuracy = (TP + TN) / 100,000 =
= (950 + 94,050) / 100,000 =
= 95,000 / 100,000 =
= 0.95 = 95%
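The same arithmetic expressed as confusion-matrix counts for the hypothetical city of 100,000 (a sketch; the variable names are ours):

```python
population = 100_000
sick = population // 100          # 1,000 sick (the 1% prior)
healthy = population - sick       # 99,000 healthy

tp = int(sick * 0.95)             # 950 sick correctly flagged
fn = sick - tp                    # 50 sick missed
fp = int(healthy * 0.05)          # 4,950 healthy wrongly flagged
tn = healthy - fp                 # 94,050 healthy correctly cleared

accuracy = (tp + tn) / population   # 0.95
precision = tp / (tp + fp)          # ≈ 0.161, matching the Bayes result
print(accuracy, round(precision, 3))
```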
Paradox: Accuracy is high (95%), but the test is nearly useless for finding the sick. Why? Because 99% of people are healthy, a model that simply always answers "healthy" would score 99% accuracy, so a 95% figure tells us almost nothing about how well sick people are detected.
Accuracy is misleading under severe class imbalance. In our example, Accuracy = 95% but Precision ≈ 16.1%: the vast majority of "sick" labels are wrong. Bayesian reasoning shows that even with an "accurate" test, the posterior probability of disease (i.e., precision) remains low due to a small prior probability and a large number of false positives.
Hiring Example
You have two recruiters:
- First (experienced) – evaluates correctly with probability 0.9 and makes a mistake with probability 0.1 (i.e., labels a good candidate "positive" with probability 0.9 and a bad one "positive" with probability 0.1).
- Second (trainee) – the same, but with probabilities 0.6 and 0.4.
- The probability that a random candidate is good is 0.1.
A candidate receives a positive evaluation. What is the probability that the evaluation was made by the experienced recruiter?
Assume the candidate was equally likely (0.5) to be evaluated by either recruiter. First find how often each recruiter gives a positive evaluation at all:
P(positive∣experienced) = 0.9×0.1 + 0.1×0.9 = 0.18
P(positive∣trainee) = 0.6×0.1 + 0.4×0.9 = 0.42
P(experienced∣positive) = 0.18×0.5 / (0.18×0.5 + 0.42×0.5) = 0.09 / 0.30 = 0.3
This is not the probability that the candidate is good. This is the probability that the positive evaluation came from the experienced recruiter (and not from the trainee), given that the evaluation is positive.
Now, what is the probability that the candidate is good?
Case 1: Experienced recruiter gave a positive evaluation
P(good | positive from experienced) = 0.9×0.1 / (0.9×0.1 + 0.1×0.9) =
= 0.09 / (0.09 + 0.09) =
= 0.09 / 0.18 =
= 0.5
Answer: 50%
Case 2: Trainee gave a positive evaluation
P(good | positive from trainee) = 0.6×0.1 / (0.6×0.1 + 0.4×0.9) =
= 0.06 / (0.06 + 0.36) =
= 0.06 / 0.42 ≈
≈ 0.143
Answer: ~14.3%
Case 3: Both recruiters gave a positive evaluation (independently)
P(both positive | good) = 0.9×0.6 = 0.54
P(both positive | bad) = 0.1×0.4 = 0.04
P(good | both positive) = 0.54×0.1 / (0.54×0.1 + 0.04×0.9) =
= 0.054 / (0.054 + 0.036) =
= 0.054 / 0.09 =
= 0.6
Answer: 60%
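All three cases run through the same helper (a sketch; the function name is ours, and for Case 3 the product of the two recruiters' likelihoods is passed in directly):

```python
def posterior_good(p_pos_given_good, p_pos_given_bad, prior_good=0.1):
    """P(good | positive) by Bayes' theorem, with a 10% prior
    that a random candidate is good."""
    numerator = p_pos_given_good * prior_good
    evidence = numerator + p_pos_given_bad * (1 - prior_good)
    return numerator / evidence

print(round(posterior_good(0.9, 0.1), 3))              # Case 1: 0.5
print(round(posterior_good(0.6, 0.4), 3))              # Case 2: 0.143
print(round(posterior_good(0.9 * 0.6, 0.1 * 0.4), 3))  # Case 3: 0.6
```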
Summary
Bayes' Theorem describes how to update the probability of a hypothesis when data appears. Formula: P(A∣B) = P(B∣A) * P(A) / P(B). In the context of machine learning, it is important for three reasons:
- The Naïve Bayes classifier applies exactly this relationship to each class and picks the class with the highest posterior;
- It explains why a rare event is hard to detect even with an accurate test — due to the low prior probability;
- Bayesian inference, used in Bayesian neural networks and Gaussian processes, is based on it.
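To make the Naïve Bayes point concrete, here is a toy hand-rolled Bernoulli Naive Bayes over binary features (a sketch with made-up "spam" data; a production implementation would work in log space for numerical stability):

```python
from collections import defaultdict

def train_nb(samples):
    """samples: list of (features, label), features a dict of 0/1 values.
    Returns class priors and P(feature = 1 | class) with add-one smoothing."""
    class_counts = defaultdict(int)
    ones = defaultdict(lambda: defaultdict(int))
    feature_names = set()
    for feats, label in samples:
        class_counts[label] += 1
        for name, value in feats.items():
            feature_names.add(name)
            ones[label][name] += value
    total = sum(class_counts.values())
    priors = {c: n / total for c, n in class_counts.items()}
    likelihoods = {c: {f: (ones[c][f] + 1) / (class_counts[c] + 2)
                       for f in feature_names}
                   for c in class_counts}
    return priors, likelihoods

def predict_nb(priors, likelihoods, feats):
    """Posterior over classes: prior times the 'naive' product of
    per-feature likelihoods, then normalized by the evidence."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for name, value in feats.items():
            p = likelihoods[c][name]
            score *= p if value else (1 - p)
        scores[c] = score
    evidence = sum(scores.values())
    return {c: s / evidence for c, s in scores.items()}

samples = [
    ({"free": 1, "meeting": 0}, "spam"),
    ({"free": 1, "meeting": 0}, "spam"),
    ({"free": 0, "meeting": 1}, "ham"),
    ({"free": 0, "meeting": 1}, "ham"),
]
priors, likelihoods = train_nb(samples)
post = predict_nb(priors, likelihoods, {"free": 1, "meeting": 0})
print(round(post["spam"], 3))  # 0.9
```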
Questions
Q: Why is P(B) a sum?
A: Because it is the total probability of event B across all possibilities (A and not A).
Q: "Naïve" in Naïve Bayes classifier — what does "naïve" mean?
A: The assumption of feature independence given the class. This assumption is usually violated, but the model still often works well.
Q: If P(A) is very small, what happens?
A: The posterior probability will also be small, even with strong evidence — this is the key intuition being tested.
Q: How is Bayes' Theorem related to regularization?
A: Maximizing posterior probability (MAP estimation) = maximizing likelihood (MLE) + logarithm of the prior distribution over weights. If you take a Laplace prior — you get L1 regularization (Lasso). If a Gaussian prior — L2 (Ridge). Bayes' Theorem directly gives us regularization as a way to incorporate our prior knowledge about the weights.
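The MAP-to-ridge equivalence can be checked numerically (a sketch; σ² = 0.25 and τ² = 1.0 are arbitrary assumed variances, the data is random, and additive constants of the log posterior are dropped):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # made-up design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=20)
w = rng.normal(size=3)                # an arbitrary weight vector

sigma2 = 0.25   # assumed noise variance (Gaussian likelihood)
tau2 = 1.0      # assumed prior variance on weights (Gaussian prior)

# Negative log posterior, up to additive constants:
neg_log_posterior = (np.sum((y - X @ w) ** 2) / (2 * sigma2)
                     + np.sum(w ** 2) / (2 * tau2))

# Ridge objective with lambda = sigma2 / tau2, on the same scale:
lam = sigma2 / tau2
ridge = (np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)) / (2 * sigma2)

print(np.isclose(neg_log_posterior, ridge))  # True
```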