Your spam filter does not know if an email is spam.
It cannot know for certain. Nobody can. The word "free" appears in legitimate emails. The word "urgent" appears in real emergencies. No single signal is definitive.
So the spam filter does not try to be certain. Instead it asks a different question entirely.
Given everything I can see about this email, how likely is it to be spam?
If the answer is 94%, it goes to junk. If it is 23%, it goes to your inbox. The decision is made under uncertainty, using probability.
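Here is that routing rule as a minimal sketch. The 0.5 threshold is a hypothetical choice for illustration, not a universal standard:
def route_email(spam_probability, threshold=0.5):
    # Hypothetical rule: at or above the threshold goes to junk
    return "junk" if spam_probability >= threshold else "inbox"
print(route_email(0.94))
print(route_email(0.23))
Output:
junk
inbox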
Every AI classifier works this way. Not yes or no. Not this or that. A number between 0 and 1. Confidence. Likelihood. Probability.
This post is about understanding that number.
What Probability Actually Is
Probability is a number between 0 and 1 that measures how likely something is.
0 means impossible. Will never happen.
1 means certain. Will definitely happen.
0.5 means equally likely to happen or not.
impossible = 0.0 # rolling a 7 on a standard die
certain = 1.0 # the sun rising tomorrow
fair_coin = 0.5 # heads on a fair coin flip
die_six = 1/6 # rolling a 6
That's the scale. Everything is somewhere on it.
For real events, you estimate probability by counting. How many times did it happen out of how many chances?
outcomes = ["spam", "spam", "not spam", "spam", "not spam",
"not spam", "spam", "not spam", "spam", "spam"]
spam_count = outcomes.count("spam")
total = len(outcomes)
prob_spam = spam_count / total
print(f"Spam emails: {spam_count}/{total}")
print(f"P(spam) = {prob_spam:.2f}")
Output:
Spam emails: 6/10
P(spam) = 0.60
6 out of 10 were spam, so the estimated probability is 0.60. A sixty percent chance that the next email is spam. Simple. Measurable.
Multiple Events: And, Or
What is the probability of two things both happening?
p_spam = 0.60
p_contains_free = 0.40
p_both = p_spam * p_contains_free
print(f"P(spam AND contains 'free') = {p_both:.2f}")
Output:
P(spam AND contains 'free') = 0.24
Multiply probabilities when events are independent and you want both to happen.
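You can sanity-check the multiplication rule with a quick simulation. The trial count and seed here are arbitrary choices:
import random
random.seed(42)
trials = 100_000
both = 0
for _ in range(trials):
    a = random.random() < 0.60   # event with probability 0.60
    b = random.random() < 0.40   # independent event with probability 0.40
    if a and b:
        both += 1
print(f"Simulated P(both) = {both / trials:.2f}")
The printed value should land close to the 0.24 computed above.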
What is the probability of at least one thing happening?
p_a = 0.3
p_b = 0.5
p_either = p_a + p_b - (p_a * p_b)
print(f"P(A OR B) = {p_either:.2f}")
Output:
P(A OR B) = 0.65
Add probabilities and subtract the overlap (both happening at once). Without that subtraction you would double-count cases where both happen.
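The same kind of simulation confirms the OR rule. Again, trial count and seed are arbitrary:
import random
random.seed(0)
trials = 100_000
either = 0
for _ in range(trials):
    a = random.random() < 0.3   # event A
    b = random.random() < 0.5   # independent event B
    if a or b:
        either += 1
print(f"Simulated P(A or B) = {either / trials:.2f}")
Expect a value near 0.65, not the naive 0.3 + 0.5 = 0.8.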
Conditional Probability: The Key to AI
This is where probability becomes genuinely powerful for machine learning.
Conditional probability asks: given that I already know something, how does that change my estimate?
Written as P(A | B). "Probability of A given B."
emails = {
    "total": 1000,
    "spam": 600,
    "has_free_and_spam": 280,
    "has_free": 320
}
p_spam = emails["spam"] / emails["total"]
p_spam_given_free = emails["has_free_and_spam"] / emails["has_free"]
print(f"P(spam) = {p_spam:.2f}")
print(f"P(spam | has 'free') = {p_spam_given_free:.2f}")
Output:
P(spam) = 0.60
P(spam | has 'free') = 0.88
Before seeing the word "free": 60% chance of spam.
After seeing the word "free": 88% chance of spam.
The evidence updated your belief. That is the core of probabilistic AI. You start with a prior estimate. You observe evidence. You update.
Bayes' Theorem: The Update Formula
There is a formula that formalizes this update. It is called Bayes' theorem.
P(A|B) = P(B|A) * P(A) / P(B)
Do not let the notation scare you. In plain English:
The probability of A given B equals the probability of B given A, times the prior probability of A, divided by the overall probability of B.
Let's use it on the spam example.
p_spam = 0.60 # prior: 60% of emails are spam
p_free_given_spam = 280/600 # how often does spam contain "free"
p_free_given_not_spam = 40/400 # how often does non-spam contain "free"
p_free = (p_free_given_spam * p_spam +
          p_free_given_not_spam * (1 - p_spam))
p_spam_given_free = (p_free_given_spam * p_spam) / p_free
print(f"P(free | spam) = {p_free_given_spam:.2f}")
print(f"P(free | not spam) = {p_free_given_not_spam:.2f}")
print(f"P(free) = {p_free:.2f}")
print(f"P(spam | free) = {p_spam_given_free:.2f}")
Output:
P(free | spam) = 0.47
P(free | not spam) = 0.10
P(free) = 0.32
P(spam | free) = 0.88
Same answer as before: 88%. Bayes' theorem is just a precise way to combine prior knowledge with new evidence.
This exact formula powers Naive Bayes classifiers, medical diagnosis tools, and spam filters. You will implement one later in the machine learning phase.
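Here is a preview of why that classifier is called "naive". With several words of evidence, it multiplies the per-word likelihoods together, treating the words as independent given the class. A minimal sketch, where the "free" numbers come from above and the likelihoods for a hypothetical second word "offer" are invented:
p_spam = 0.60
p_not_spam = 0.40
# P(word | class): "free" from the example above, "offer" made up
likelihoods = {
    "free":  {"spam": 0.47, "not_spam": 0.10},
    "offer": {"spam": 0.30, "not_spam": 0.05},
}
score_spam = p_spam
score_not_spam = p_not_spam
for word in ["free", "offer"]:
    score_spam *= likelihoods[word]["spam"]           # naive independence assumption
    score_not_spam *= likelihoods[word]["not_spam"]
p_spam_given_words = score_spam / (score_spam + score_not_spam)
print(f"P(spam | 'free', 'offer') = {p_spam_given_words:.2f}")
Output:
P(spam | 'free', 'offer') = 0.98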
Probability in Neural Networks: Softmax
Your neural network produces raw numbers. They can be anything. Negative, positive, any range.
You need probabilities. They need to sum to 1. All between 0 and 1.
The function that converts raw numbers into probabilities is called softmax.
import numpy as np
raw_scores = np.array([2.1, 0.8, -0.5, 1.3])
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # subtract max for numerical stability
    return exp_x / exp_x.sum()
probabilities = softmax(raw_scores)
print("Raw scores:", raw_scores)
print("Probabilities:", np.round(probabilities, 4))
print(f"Sum: {probabilities.sum():.4f}")
Output:
Raw scores: [ 2.1  0.8 -0.5  1.3]
Probabilities: [0.5568 0.1517 0.0414 0.2502]
Sum: 1.0000
Four classes. The model was most confident about class 0 (55.7%), second most about class 3 (25.0%). All four values sum to 1.0.
This is the output layer of every multi-class classification network. Raw numbers in, probabilities out. The class with the highest probability is the prediction.
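Picking the final prediction from those probabilities takes one line. A small sketch, reusing the numbers from the softmax example:
import numpy as np
probabilities = np.array([0.5568, 0.1517, 0.0414, 0.2502])
predicted_class = int(np.argmax(probabilities))   # index of the highest probability
print(f"Predicted class: {predicted_class}")
Output:
Predicted class: 0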
Log Probabilities: Why You Always See log() in Loss Functions
Probabilities get very small very fast.
p_word1 = 0.05
p_word2 = 0.03
p_word3 = 0.08
p_word4 = 0.02
joint_probability = p_word1 * p_word2 * p_word3 * p_word4
print(f"Joint probability: {joint_probability}")
Output:
Joint probability: 2.40e-06
Multiply enough small probabilities together and you get a number so close to zero that computers lose precision. Underflow errors. Silent incorrect results.
Logarithms fix this. Log converts multiplication into addition. Small numbers become manageable negative numbers.
import math
log_prob = (math.log(p_word1) + math.log(p_word2) +
            math.log(p_word3) + math.log(p_word4))
print(f"Sum of log probs: {log_prob:.4f}")
print(f"Equivalent to: {math.exp(log_prob):.2e}")
Output:
Sum of log probs: -12.9400
Equivalent to: 2.40e-06
Same result. No underflow. This is why cross-entropy loss uses logarithms. This is why language models talk about log-likelihood. The math is the same, the numbers are just more manageable.
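To watch the underflow actually happen, push the example further. Multiplying a 0.05 probability by itself 500 times (an arbitrary count, chosen for illustration) collapses to exactly 0.0 in 64-bit floats, while the log-space version stays an ordinary negative number:
import math
p, n = 0.05, 500
direct = p ** n               # underflows to 0.0
log_space = n * math.log(p)   # finite and manageable
print(f"Direct product: {direct}")
print(f"Log-space sum: {log_space:.1f}")
Output:
Direct product: 0.0
Log-space sum: -1497.9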
Cross-Entropy Loss: Probability Meets Training
When your model outputs probabilities and you know the correct answer, cross-entropy loss measures how wrong the probability distribution is.
import numpy as np
true_labels = np.array([0, 2, 1])
model_probs = np.array([
    [0.7, 0.2, 0.1],  # sample 1: predicted class 0 with 70% confidence
    [0.1, 0.3, 0.6],  # sample 2: predicted class 2 with 60% confidence
    [0.2, 0.5, 0.3]   # sample 3: predicted class 1 with 50% confidence
])
def cross_entropy_loss(probs, labels):
    n = len(labels)
    correct_probs = probs[np.arange(n), labels]    # probability assigned to each true class
    loss = -np.mean(np.log(correct_probs + 1e-8))  # small epsilon guards against log(0)
    return loss
loss = cross_entropy_loss(model_probs, true_labels)
print(f"Cross-entropy loss: {loss:.4f}")
Output:
Cross-entropy loss: 0.5202
For each sample, grab the probability the model assigned to the correct class. Take the log. Average them. Negate it.
High confidence in the right answer means low loss. Low confidence or wrong answer means high loss. Gradient descent then reduces this loss, pushing the model to be more confident in the right answers.
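To build intuition for that relationship, compare the per-sample loss at a few confidence levels. The probabilities here are made up for illustration:
import math
for p_correct in [0.9, 0.5, 0.1]:
    print(f"P(correct class) = {p_correct}: loss = {-math.log(p_correct):.4f}")
Output:
P(correct class) = 0.9: loss = 0.1054
P(correct class) = 0.5: loss = 0.6931
P(correct class) = 0.1: loss = 2.3026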
Cross-entropy loss is the standard loss function for classification. You will use it for everything from spam detection to image classification to language modeling.
Try This
Create probability_practice.py.
Part one: you have a deck of 52 cards. Calculate the probability of drawing a heart. Drawing a face card (J, Q, K). Drawing both a heart and a face card. Drawing either a heart or a face card. Use the formulas from this post, not a lookup.
Part two: implement a simple Naive Bayes text classifier from scratch. You have these training examples:
training = [
    ("buy cheap medicine now", "spam"),
    ("free prize winner click", "spam"),
    ("meeting tomorrow at 3pm", "ham"),
    ("project deadline this friday", "ham"),
    ("limited offer free discount", "spam"),
    ("can we reschedule the call", "ham"),
]
Count word frequencies in spam and ham emails separately. Then classify this new email using conditional probabilities:
new_email = "free meeting tomorrow offer"
For each word, calculate how much more likely it is to appear in spam vs ham. Combine the evidence. Print whether the model thinks it is spam or ham and the confidence.
Part three: implement softmax from scratch. Test it on these raw scores and verify the output sums to 1.
scores = np.array([3.2, 1.5, 0.8, 2.1, -0.5])
What's Next
You have been building intuition for uncertainty and probability. The next post takes this further into the shape that uncertainty most often takes in the real world: the normal distribution. That bell curve. Why it appears everywhere, what it tells you about your data, and how to use it in AI.