The One-Line Summary: The sigmoid function transforms any number from negative infinity to positive infinity into a probability between 0 and 1, doing so smoothly, symmetrically, and with a mathematically convenient derivative — making it perfect for converting linear predictions into probabilities.
Act I: The Kingdom of Infinite Predictions
Once upon a time, in the Kingdom of Predictionia, there lived a Royal Oracle named Linear.
Oracle Linear was brilliant at seeing patterns. Give her data about a person — their age, income, behavior — and she would proclaim a number representing how likely they were to buy the King's magical potions.
THE ORACLE'S PROCLAMATIONS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Citizen Alice: "Her buying score is +2.3"
Citizen Bob: "His buying score is -1.7"
Citizen Carol: "Her buying score is +15.8"
Citizen Dave: "His buying score is -847.2"
The King was confused.
"Oracle Linear," he said, "what does +15.8 mean? Is Carol 15.8% likely to buy? Or 158% likely? And Dave... is he NEGATIVE likely to buy? What does that even mean?!"
Oracle Linear shrugged. "I just find patterns, Your Majesty. I never promised my numbers would make sense as probabilities."
The Kingdom had a problem.
Act II: The Failed Solutions
The King summoned his advisors to solve the probability problem.
Advisor #1: Sir Clip-a-Lot
SIR CLIP-A-LOT'S SOLUTION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"Simple! If the number is below 0, call it 0.
If it's above 1, call it 1.
Clip the extremes!"
Score: -847.2 → Probability: 0.0
Score: -1.7 → Probability: 0.0
Score: +0.3 → Probability: 0.3
Score: +2.3 → Probability: 1.0
Score: +15.8 → Probability: 1.0
The King frowned. "But this means Bob with -1.7 and Dave with -847.2 both get 0% probability? Surely Bob is MORE likely to buy than Dave!"
Sir Clip-a-Lot's solution lost information at the extremes.
Advisor #2: Lady Linear-Scale
LADY LINEAR-SCALE'S SOLUTION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"Let's linearly scale everything between the
minimum and maximum we've seen!"
Scores: -847.2, -1.7, +0.3, +2.3, +15.8
Min: -847.2, Max: +15.8
Range: 863
Scaled:
-847.2 → 0.00
-1.7 → 0.98 (dragged near the top by the outlier!)
+0.3 → 0.98
+2.3 → 0.98
+15.8 → 1.00
The King was furious. "Now everyone except Dave looks identical! One extreme outlier ruined everything!"
Lady Linear-Scale's solution was too sensitive to outliers.
Advisor #3: Duke Threshold
DUKE THRESHOLD'S SOLUTION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"Forget probabilities. Just say YES or NO.
Above 0? YES. Below 0? NO."
Score: -847.2 → NO (0)
Score: -1.7 → NO (0)
Score: +0.3 → YES (1)
Score: +2.3 → YES (1)
Score: +15.8 → YES (1)
The King sighed. "But I don't want just YES or NO. I want to KNOW how confident we are! Is +0.3 the same as +15.8? Clearly not!"
Duke Threshold's solution destroyed all nuance.
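(If you'd like to see the three advisors' ideas in code, here's a small illustrative sketch on the story's example scores; the variable names are made up for this post.)

import numpy as np

scores = np.array([-847.2, -1.7, 0.3, 2.3, 15.8])   # the Oracle's raw buying scores

# Sir Clip-a-Lot: clamp everything into [0, 1]
clipped = np.clip(scores, 0, 1)

# Lady Linear-Scale: min-max scale between the observed min and max
scaled = (scores - scores.min()) / (scores.max() - scores.min())

# Duke Threshold: hard YES/NO at zero
thresholded = (scores > 0).astype(int)

for s, c, m, t in zip(scores, clipped, scaled, thresholded):
    print(f"score {s:>8.1f} | clipped {c:.2f} | scaled {m:.2f} | threshold {t}")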
Act III: The Mysterious Mathematician
One day, a mysterious mathematician arrived at the castle. She introduced herself only as σ (Sigma).
"I hear you need to convert any number into a probability," she said softly. "I can help. But I must warn you — I never say 'absolutely certain' or 'absolutely impossible.' I deal only in shades of likelihood."
The King was intrigued. "Show me."
σ smiled and drew a beautiful S-curve:
THE SIGMOID FUNCTION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
σ(z) = 1 / (1 + e^(-z))
1.0 │                     ●●●●●●●●●●●●●●●●●●●●●●
    │                   ●●
    │                 ●●
0.8 │                ●
    │               ●
    │              ●
0.6 │              ●
    │             ●
0.5 │─────────────●─────────────────────────────
    │            ●
0.4 │            ●
    │           ●
0.2 │         ●●
    │       ●●
    │     ●●
0.0 │●●●●●
    └───────────────────────────────────────────
     -6  -4  -2   0   2    4   6   8   10
                       z
"No matter what number you give me," said σ,
"I will return a probability between 0 and 1.
Always. Without exception. Forever."
Act IV: The Five Promises of Sigma
σ made five promises to the King:
Promise #1: "I Will Always Give Valid Probabilities"
σ'S FIRST PROMISE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"Give me ANY number — positive, negative, huge, tiny.
I will ALWAYS return something between 0 and 1."
Input Output
─────────────────────────────
-1,000,000 → 0.0000... (very close to 0)
-10 → 0.0000454
-2 → 0.119
0 → 0.500
+2 → 0.881
+10 → 0.9999546
+1,000,000 → 0.9999... (very close to 1)
"But notice — I never actually SAY 0 or 1.
I get arbitrarily close, but never touch.
There is always a sliver of doubt."
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Test with extreme values
# (for z = -1000000, np.exp overflows and NumPy warns, but the result still rounds to 0.0;
#  in exact arithmetic σ never actually reaches 0 or 1)
test_values = [-1000000, -10, -2, 0, 2, 10, 1000000]

print("PROMISE #1: Always between 0 and 1")
print("="*50)
for z in test_values:
    p = sigmoid(z)
    print(f"σ({z:>10}) = {p:.10f}")
Output:
PROMISE #1: Always between 0 and 1
==================================================
σ( -1000000) = 0.0000000000
σ( -10) = 0.0000453979
σ( -2) = 0.1192029220
σ( 0) = 0.5000000000
σ( 2) = 0.8807970780
σ( 10) = 0.9999546021
σ( 1000000) = 1.0000000000
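(A side note for practitioners: the 0.0000000000 and 1.0000000000 above are floating-point rounding rather than true 0 and 1, and the naive formula makes np.exp overflow for very large negative z. One common remedy, sketched below under those assumptions, is to branch on the sign of z so exp() only ever sees non-positive arguments; scipy.special.expit does the same job.)

import numpy as np

def stable_sigmoid(z):
    """Numerically stable sigmoid: exp() is only called on non-positive arguments."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))   # safe: -z <= 0 here
    exp_z = np.exp(z[~pos])                    # safe: z < 0, so exp(z) <= 1
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

print(stable_sigmoid([-1000000, 0, 1000000]))  # [0.  0.5 1. ] with no warnings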
Promise #2: "I Am Perfectly Balanced"
σ'S SECOND PROMISE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"I am symmetric around the center.
Whatever I do to positive numbers,
I do the mirror opposite to negative numbers."
σ(0) = 0.5 (exactly in the middle!)
σ(-2) = 0.119
σ(+2) = 0.881 → These sum to 1.0!
σ(-5) = 0.0067
σ(+5) = 0.9933 → These sum to 1.0!
The mathematical beauty:
σ(-z) = 1 - σ(z)
print("PROMISE #2: Perfect symmetry")
print("="*50)
for z in [1, 2, 3, 5, 10]:
pos = sigmoid(z)
neg = sigmoid(-z)
print(f"σ({z}) = {pos:.6f}, σ({-z}) = {neg:.6f}, Sum = {pos + neg:.6f}")
Output:
PROMISE #2: Perfect symmetry
==================================================
σ(1) = 0.731059, σ(-1) = 0.268941, Sum = 1.000000
σ(2) = 0.880797, σ(-2) = 0.119203, Sum = 1.000000
σ(3) = 0.952574, σ(-3) = 0.047426, Sum = 1.000000
σ(5) = 0.993307, σ(-5) = 0.006693, Sum = 1.000000
σ(10) = 0.999955, σ(-10) = 0.000045, Sum = 1.000000
Promise #3: "I Transition Smoothly"
σ'S THIRD PROMISE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"Unlike Duke Threshold who jumps abruptly from 0 to 1,
I transition gently. Small changes in input cause
small changes in output. No surprises."
DUKE THRESHOLD (step function):

1 │            ┌────────────
  │            │
0 │────────────┘
  └─────────────────────────
               0

σ (sigmoid function):

1 │                ●●●●●●●●●
  │              ●●
  │             ●
  │            ●
  │           ●
0 │●●●●●●●●●●●
  └─────────────────────────
               0
"I am differentiable everywhere —
which means I play nicely with calculus,
which means I can be optimized with gradient descent!"
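(A quick illustrative check of that promise, written just for this post: nudge the input a little and watch how each function responds.)

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def step(z):
    return (z >= 0).astype(float)

# Nudge the input by a tiny amount around z = 0
z = np.array([-0.01, 0.01])
print("sigmoid:", sigmoid(z))   # [0.4975 0.5025] -> output moves by about 0.005
print("step:   ", step(z))      # [0. 1.]         -> output jumps by a full 1.0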
Promise #4: "My Derivative Is Beautiful"
σ'S FOURTH PROMISE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"If you ever need to know how fast I'm changing
(my derivative), it's elegantly simple:
σ'(z) = σ(z) × (1 - σ(z))
I can compute my own derivative using just my output!
No complicated math needed."
z σ(z) σ'(z) = σ(z)×(1-σ(z))
──────────────────────────────────────────
-3 0.047 0.045 (slow change)
-1 0.269 0.197 (medium change)
0 0.500 0.250 (fastest change!)
1 0.731 0.197 (medium change)
3 0.953 0.045 (slow change)
"I change fastest at z=0 (where uncertainty is highest)
and slowest at the extremes (where I'm already confident)."
def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

print("PROMISE #4: Beautiful derivative")
print("="*50)
print(f"{'z':<8} {'σ(z)':<12} {'σ´(z)':<12}")
print("-"*32)
for z in [-3, -2, -1, 0, 1, 2, 3]:
    s = sigmoid(z)
    ds = sigmoid_derivative(z)
    print(f"{z:<8} {s:<12.6f} {ds:<12.6f}")
Output:
PROMISE #4: Beautiful derivative
==================================================
z σ(z) σ´(z)
--------------------------------
-3 0.047426 0.045177
-2 0.119203 0.104994
-1 0.268941 0.196612
0 0.500000 0.250000
1 0.731059 0.196612
2 0.880797 0.104994
3 0.952574 0.045177
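(If you want to convince yourself the identity really holds, a central-difference check takes a few lines; this sanity check is an illustrative addition, not part of the original demo.)

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

h = 1e-6
for z in [-3.0, 0.0, 2.0]:
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # finite-difference slope
    analytic = sigmoid(z) * (1 - sigmoid(z))                # the promised formula
    print(f"z = {z:>4}: numeric {numeric:.6f}  vs  analytic {analytic:.6f}")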
Promise #5: "I Represent Log-Odds Linearly"
σ'S FIFTH PROMISE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"Here's my deepest secret. If p = σ(z), then:
z = ln(p / (1-p))
This means z is the LOG-ODDS!
And the log-odds is a LINEAR function of features.
So underneath my curved exterior, I'm working with
good old linear regression — just on a different scale."
If σ(z) = 0.9, what is z?
z = ln(0.9 / 0.1) = ln(9) = 2.197
Check: σ(2.197) = 0.9 ✓
This is why logistic regression is called "regression"!
The log-odds (z) is being regressed linearly.
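(The round trip is easy to verify numerically; logit is the standard name for the inverse, and the small helper below is just for this check.)

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logit(p):
    return np.log(p / (1 - p))   # log-odds: the inverse of the sigmoid

p = 0.9
z = logit(p)
print(f"logit(0.9) = {z:.3f}")                  # 2.197
print(f"sigmoid({z:.3f}) = {sigmoid(z):.3f}")   # back to 0.900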
Act V: Why the Kingdom Chose Sigma
The King was convinced. Here's why σ was perfect:
THE KING'S SUMMARY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Problem σ's Solution
─────────────────────────────────────────────────────
Linear outputs go -∞ to +∞ → Squished to (0,1)
Need valid probabilities → Always 0 < p < 1
Need smooth transitions → Infinitely differentiable
Need to optimize with calculus → Simple derivative: σ(1-σ)
Need symmetric behavior → σ(-z) = 1 - σ(z)
Need interpretable model → Log-odds is linear
Need efficient computation → Just exp() and division
And so, σ the Sigmoid became the Royal Probability Converter, and the Kingdom of Predictionia prospered with sensible predictions forevermore.
The Mathematical Definition
THE SIGMOID FUNCTION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
σ(z) = 1 / (1 + e^(-z))
WHERE:
• z is any real number (the input)
• e is Euler's number (≈ 2.71828)
• σ(z) is always between 0 and 1 (the output)
ALTERNATIVE FORMS:

σ(z) = e^z / (1 + e^z)            (multiply top and bottom by e^z)

σ(z) = (1 + tanh(z/2)) / 2        (relationship to tanh)
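(A quick numerical sanity check that all three forms agree, included here as an illustrative aside:)

import numpy as np

z = np.linspace(-5, 5, 11)
form1 = 1 / (1 + np.exp(-z))            # standard form
form2 = np.exp(z) / (1 + np.exp(z))     # multiplied through by e^z
form3 = 0.5 * (1 + np.tanh(z / 2))      # tanh form

print(np.allclose(form1, form2), np.allclose(form1, form3))   # True True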
Code: The Complete Sigmoid
import numpy as np

def sigmoid(z):
    """The sigmoid function."""
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of sigmoid."""
    s = sigmoid(z)
    return s * (1 - s)

def inverse_sigmoid(p):
    """Inverse sigmoid (logit function)."""
    return np.log(p / (1 - p))

# Demonstrate all properties
print("THE SIGMOID FUNCTION: COMPLETE DEMONSTRATION")
print("="*60)

# Property 1: Always between 0 and 1
print("\n1. BOUNDED OUTPUT (always between 0 and 1):")
extreme_inputs = [-100, -10, -1, 0, 1, 10, 100]
for z in extreme_inputs:
    print(f"   σ({z:>4}) = {sigmoid(z):.10f}")

# Property 2: Symmetry
print("\n2. SYMMETRY (σ(-z) = 1 - σ(z)):")
for z in [1, 2, 5]:
    print(f"   σ({z}) + σ({-z}) = {sigmoid(z):.6f} + {sigmoid(-z):.6f} = {sigmoid(z) + sigmoid(-z):.6f}")

# Property 3: Center point
print("\n3. CENTER POINT:")
print(f"   σ(0) = {sigmoid(0)} (exactly 0.5)")

# Property 4: Derivative
print("\n4. DERIVATIVE (σ'(z) = σ(z) × (1-σ(z))):")
print(f"   Maximum derivative at z=0: σ'(0) = {sigmoid_derivative(0)}")

# Property 5: Inverse
print("\n5. INVERSE (logit function):")
for p in [0.1, 0.5, 0.9]:
    z = inverse_sigmoid(p)
    print(f"   If σ(z) = {p}, then z = {z:.4f}")
Why Sigmoid Over Other Options?
WHY NOT OTHER "SQUISHING" FUNCTIONS?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OPTION 1: Step Function
         ┌─ 1 if z ≥ 0
f(z) = ──┤
         └─ 0 if z < 0
❌ Not differentiable (can't use gradient descent)
❌ No nuance (just 0 or 1)
OPTION 2: Linear Clipping
         ┌─ 0 if z < 0
f(z) = ──┼─ z if 0 ≤ z ≤ 1
         └─ 1 if z > 1
❌ Not smooth (kinks at 0 and 1)
❌ Derivative is 0 outside [0,1] (vanishing gradient)
OPTION 3: Tanh (Hyperbolic Tangent)
f(z) = (e^z - e^(-z)) / (e^z + e^(-z))
Range: -1 to +1 (not 0 to 1!)
✓ Smooth and differentiable
⚠️ Needs rescaling for probabilities
OPTION 4: Sigmoid ✓
f(z) = 1 / (1 + e^(-z))
✓ Range exactly 0 to 1 (perfect for probabilities)
✓ Smooth and differentiable everywhere
✓ Simple, elegant derivative
✓ Natural probabilistic interpretation (log-odds)
✓ Computationally efficient
The Sigmoid Family Portrait
THE SIGMOID AND ITS RELATIVES:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SIGMOID (Logistic): σ(z) = 1/(1+e^(-z))
Range: (0, 1)
Use: Binary classification, output layer
TANH: tanh(z) = (e^z - e^(-z))/(e^z + e^(-z))
Range: (-1, 1)
Use: Hidden layers (zero-centered)
Relationship: tanh(z) = 2σ(2z) - 1
SOFTMAX: softmax(zᵢ) = e^(zᵢ) / Σe^(zⱼ)
Range: (0, 1) for each, sum to 1
Use: Multi-class classification
Relationship: Sigmoid is softmax for 2 classes!
THEY'RE ALL RELATED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
sigmoid(z) = (1 + tanh(z/2)) / 2
softmax([z, 0]) = [sigmoid(z), sigmoid(-z)]
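(Both family relationships are easy to confirm numerically; the tiny softmax helper below is written just for this check.)

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(v):
    e = np.exp(v - np.max(v))   # subtract the max for numerical stability
    return e / e.sum()

z = 1.7
print(np.isclose(np.tanh(z / 2), 2 * sigmoid(z) - 1))               # True
print(np.allclose(softmax([z, 0.0]), [sigmoid(z), sigmoid(-z)]))    # True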
When Sigmoid Struggles
Even our hero σ has weaknesses:
THE VANISHING GRADIENT PROBLEM:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
When z is very large or very small:
σ'(z) ≈ 0
z = -10: σ(-10) = 0.0000454, σ'(-10) = 0.0000454
z = +10: σ(+10) = 0.9999546, σ'(+10) = 0.0000454
The gradient is essentially ZERO!
In deep neural networks, this means:
• Gradients shrink exponentially through layers
• Weights stop updating
• Learning grinds to a halt
THIS IS WHY RELU REPLACED SIGMOID IN HIDDEN LAYERS:
ReLU(z) = max(0, z)
• Gradient is 1 for positive inputs
• No vanishing gradient problem
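(A tiny comparison makes the contrast concrete; this is an illustrative sketch, and ReLU's derivative at exactly z = 0 is taken as 0 here by convention.)

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"z = {z:>5}: sigmoid' = {sigmoid_grad(z):.7f}   relu' = {relu_grad(z):.0f}")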
BUT SIGMOID IS STILL PERFECT FOR:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✓ Output layer for binary classification
✓ Gates in LSTM/GRU (need 0-1 range)
✓ Logistic regression
✓ Any time you need a probability output
Quick Reference Card
THE SIGMOID FUNCTION: QUICK REFERENCE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
FORMULA: σ(z) = 1 / (1 + e^(-z))
DOMAIN: All real numbers (-∞, +∞)
RANGE: (0, 1) — perfect for probabilities!
CENTER: σ(0) = 0.5
SYMMETRY: σ(-z) = 1 - σ(z)
DERIVATIVE: σ'(z) = σ(z) × (1 - σ(z))
Maximum at z=0, where σ'(0) = 0.25
INVERSE: z = ln(p / (1-p)) [logit function]
LIMITS: lim(z→-∞) σ(z) = 0
lim(z→+∞) σ(z) = 1
SHAPE: S-curve (hence "sigmoid" = S-shaped)
USE CASES: • Logistic regression output
• Neural network output for binary classification
• LSTM/GRU gates
• Any probability conversion
WEAKNESS: Vanishing gradient for extreme inputs
(don't use in hidden layers of deep networks)
The Story's Moral
THE MORAL OF THE STORY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The world is full of unbounded quantities:
• Scores that go from -∞ to +∞
• Sums that can be any real number
• Linear combinations without limits
But probabilities must live in [0, 1].
The sigmoid function is the PERFECT TRANSLATOR:
• Takes any real number
• Returns a valid probability
• Does so smoothly and elegantly
• Has beautiful mathematical properties
She never says "impossible" (0) or "certain" (1).
She always leaves room for doubt.
And that humility is what makes her perfect.
In the words of σ herself:
"I transform infinity into certainty,
yet I never claim to be certain myself."
Key Takeaways
Sigmoid squishes (-∞, +∞) to (0, 1) — Any input becomes a valid probability
σ(z) = 1/(1+e⁻ᶻ) — Simple formula, profound implications
Symmetric around 0.5 — σ(-z) = 1 - σ(z)
Beautiful derivative — σ'(z) = σ(z)(1-σ(z)), computed from output alone
Represents log-odds linearly — Why logistic regression works
Perfect for output layers — When you need probability output
Avoid in hidden layers — Vanishing gradient problem; use ReLU instead
Never touches 0 or 1 — Always maintains a sliver of uncertainty
The One-Sentence Summary
The sigmoid function is the diplomatic mathematician who takes any number from negative infinity to positive infinity and transforms it into a probability between 0 and 1, doing so smoothly, symmetrically, and with a derivative so elegant (σ times 1-σ) that it makes calculus weep with joy — which is why it's the perfect function for converting linear predictions into the probabilities we need for classification.
What's Next?
Now that you understand the sigmoid, explore:
- Softmax Function — Sigmoid's multiclass cousin
- Activation Functions — ReLU, Tanh, and beyond
- Vanishing Gradients — Why deep networks struggled
- Cross-Entropy Loss — The perfect partner for sigmoid
Follow me for the next article in this series!
Let's Connect!
If the story of σ made the sigmoid click, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's your favorite mathematical function? Mine is now sigmoid — she's humble, elegant, and transforms chaos into probability! 🎭
Once upon a time, Oracle Linear gave predictions of +847 and -352, and the King didn't know what to do. Then σ arrived and said, "Let me translate those into probabilities: one a whisker below 100%, the other a whisker above 0%." And the Kingdom finally had probabilities that made sense.
Share this with someone who finds the sigmoid mysterious. After meeting σ, they'll never forget her.
Happy probability converting! 📊