From Zero to AI: A Complete Journey
π Table of Contents
- Introduction
- Step 0: Math Foundations for AI
- Step 1: Linear Regression
- Step 2: Perceptron
- Step 3: Logistic Regression
- Step 4: Multiple Neurons
- Step 5: Hidden Layers & XOR
- Step 6: PyTorch
- The Complete Picture
- Key Takeaways
Introduction
This article takes you on a complete journey from understanding basic mathematical concepts to building neural networks with PyTorch. Each step builds on the previous one, creating a solid foundation for understanding how AI really works.
What makes this journey special:
- No black boxes: Every concept is explained from first principles
- Code + Concepts: Both the "what" and "why" are covered
- Progressive learning: Each step naturally leads to the next
- Real understanding: You'll know how AI works, not just how to use it
Step 0: Math Foundations for AI
The Big Idea
AI doesn't think like humansβit performs mathematical operations. At its core, AI is just math with data.
The Fundamental Equation
Every neuron in AI follows this simple equation:
z = x Β· w + b
Breaking it down:
- x = inputs (features) - What you feed into the AI
- w = weights (importance) - How much each feature matters
- b = bias (starting push) - A constant adjustment
- z = score (before making a decision) - The final calculation
Real-World Analogy
Imagine deciding if a student passes a class:
import numpy as np
# Student's scores
math_score = 80 # xβ
science_score = 70 # xβ
english_score = 75 # xβ
# How important each subject is
math_weight = 0.6 # wβ (most important)
science_weight = 0.3 # wβ
english_weight = 0.1 # wβ
# Starting adjustment
bias = -50 # b
# Calculate final score
final_score = (80 Γ 0.6) + (70 Γ 0.3) + (75 Γ 0.1) - 50
= 48 + 21 + 7.5 - 50
= 26.5
Code Explanation:
-
import numpy as np: Imports NumPy for numerical operations - Each feature (math, science, english) has a weight
- The dot product multiplies each feature by its weight
- Bias adjusts the final score
- If
final_score β₯ 60, the student passes
Key Concepts
1. Vectors
A vector is a collection of numbers arranged in order:
# Single student's scores
x = np.array([80, 70, 75])
print("Student vector:", x)
print("Shape:", x.shape) # (3,) = 3 elements
Code Explanation:
-
np.array([80, 70, 75]): Creates a NumPy array (vector) - Each number represents one feature
- Shape
(3,)means 1D array with 3 elements
2. Weights
Weights determine how important each feature is:
# Weights (how important each subject is)
weights = np.array([0.6, 0.3, 0.1])
print("Weights:", weights)
Key Properties:
- Higher weight = more important feature
- Weights are learned during training (we'll see this later)
- Weights can be positive or negative
3. Dot Product
The dot product multiplies corresponding elements and sums them:
# Student scores
x = np.array([80, 70, 75])
# Weights
w = np.array([0.6, 0.3, 0.1])
# Dot product
z = np.dot(x, w)
# Equivalent to: (80Γ0.6) + (70Γ0.3) + (75Γ0.1) = 76.5
print("Dot product:", z)
Code Explanation:
-
np.dot(x, w): Calculates dot product - Alternative syntax:
x @ w(matrix multiplication operator) - Result: Single number representing weighted sum
4. Bias
Bias is a constant adjustment to the score:
b = -50 # Negative bias (makes it harder to pass)
z_with_bias = z + b
print("Score with bias:", z_with_bias)
Understanding Bias:
- Positive bias: Makes it easier to get a high score
- Negative bias: Makes it harder to get a high score
- Zero bias: No adjustment
5. Building a Mini Neuron
A neuron is the basic building block of AI:
def neuron(x, w, b):
"""
Simple neuron function
Parameters:
x: input features (vector)
w: weights (vector)
b: bias (scalar)
Returns:
z: output score
"""
return np.dot(x, w) + b
# Test the neuron
x = np.array([80, 70, 75])
w = np.array([0.6, 0.3, 0.1])
b = -50
result = neuron(x, w, b)
print(f"Neuron output: {result}")
Code Explanation:
-
def neuron(x, w, b):: Defines a reusable function -
return np.dot(x, w) + b: The core calculation - This is the foundation of all AI!
Matrix Operations for Multiple Students
When dealing with multiple students, use matrix multiplication:
# Multiple students (each row is a student)
X = np.array([
[90, 85, 70], # Student 1
[40, 50, 60], # Student 2
[75, 70, 80], # Student 3
])
# Weights (same for all students)
w = np.array([0.6, 0.3, 0.1])
b = -50
# Matrix multiplication: X @ w
scores = X @ w + b
print("Scores for all students:", scores)
Code Explanation:
-
X: 2D matrix (3 students Γ 3 features) -
X @ w: Matrix multiplication calculates dot product for each row -
+ b: Broadcasting adds bias to each score - Much faster than loops!
Key Takeaways from Step 0
β
Features: Real-world measurements used as inputs
β
Vectors: Collections of features
β
Weights: Importance of each feature
β
Dot Product: Efficient way to combine features and weights
β
Bias: Constant adjustment to the score
β
Neuron: Basic building block that calculates z = xΒ·w + b
β
Matrix Operations: Processing multiple examples efficiently
Step 1: Linear Regression
The Big Idea
Linear Regression answers: How can we predict a number as accurately as possible?
Real-World Examples:
- Study hours β Exam score
- Years of experience β Salary
- House size β Price
The Linear Model
The linear model is beautifully simple:
y = wΒ·x + b
Breaking it down:
- x = input (study hours)
- w = weight (slope of the line)
- b = bias (y-intercept, starting value)
- y = predicted output (exam score)
Dataset Example
import numpy as np
import matplotlib.pyplot as plt
# Input features (study hours)
X = np.array([1, 2, 3, 4], dtype=float)
# Target values (exam scores)
y = np.array([50, 60, 70, 80], dtype=float)
print("Study Hours:", X)
print("Exam Scores:", y)
Code Explanation:
-
X: Input features (what we know) -
y: Target values (what we want to predict) - Pattern: Each hour adds 10 points (perfect linear relationship)
First Bad Guess
Before learning, the AI starts with random values:
w = 0.0 # Weight (slope)
b = 0.0 # Bias (y-intercept)
# Make predictions
y_pred = w * X + b
print("Bad predictions:", y_pred) # [0. 0. 0. 0.]
Problem: The AI predicts 0 for everything, which is completely wrong!
Error Calculation
We use Mean Squared Error (MSE) to measure how wrong we are:
error = np.mean((y_pred - y) ** 2)
print(f"Mean Squared Error: {error:.2f}")
Code Explanation:
-
(y_pred - y): Prediction errors -
** 2: Square each error (always positive, penalizes large errors) -
np.mean(...): Average across all data points - Lower MSE = better predictions
Why Squared?
- Always positive (no cancellation)
- Penalizes large errors more
- Smooth function (easier to optimize)
Gradient Descent (How AI Learns)
Think of error as a hill:
- Top of hill: High error (bad predictions)
- Bottom of valley: Low error (good predictions)
- Goal: Find the lowest point
The gradient tells us which direction to move:
# Initialize
w = 0.0
b = 0.0
lr = 0.01 # Learning rate (how big steps to take)
errors = []
# Training loop
for epoch in range(1000):
# Forward pass: make predictions
y_pred = w * X + b
# Calculate gradients
dw = np.mean((y_pred - y) * X) # Gradient for weight
db = np.mean(y_pred - y) # Gradient for bias
# Update weights and bias
w -= lr * dw # Move in opposite direction of gradient
b -= lr * db
# Calculate and store error
error = np.mean((y_pred - y) ** 2)
errors.append(error)
print(f"Final w (weight): {w:.4f}")
print(f"Final b (bias): {b:.4f}")
print(f"Final error: {errors[-1]:.4f}")
Code Explanation:
-
Forward Pass:
y_pred = w * X + b- Makes predictions -
Gradient Calculation:
-
dw = np.mean((y_pred - y) * X): How error changes with weight -
db = np.mean(y_pred - y): How error changes with bias
-
-
Weight Updates:
-
w -= lr * dw: Move weight in direction that reduces error -
b -= lr * db: Move bias in direction that reduces error
-
-
Learning Rate (
lr): Controls step size- Too high: Overshoots optimal value
- Too low: Takes too long to converge
Expected Output:
Final w (weight): 10.0000
Final b (bias): 40.0000
Final error: 0.0000
Interpretation:
-
w = 10: For each additional hour, score increases by 10 points -
b = 40: Starting score (when hours = 0) -
Error β 0: Perfect fit! (Data is perfectly linear)
Learning Curve
The learning curve shows how error decreases over time:
plt.plot(errors)
plt.title("Learning Curve")
plt.xlabel("Epoch")
plt.ylabel("Error (MSE)")
plt.grid(True, alpha=0.3)
plt.show()
Key Patterns:
- Rapid decrease (early epochs): AI learns quickly
- Slower decrease (middle epochs): Fine-tuning
- Plateau (late epochs): Converged to optimal solution
Making Predictions
After training, we can predict for new data:
# New student: studied 5 hours
study_hours = 5
predicted_score = w * study_hours + b
print(f"Predicted score for {study_hours} hours: {predicted_score:.2f}")
# Output: Predicted score for 5 hours: 90.00
Code Explanation:
- Use learned
wandbvalues - Apply to new input (not in training set)
- This tests if the model can generalize
Key Takeaways from Step 1
β
Linear Regression: Predicts numbers using a straight line
β
Error Measurement: MSE quantifies prediction quality
β
Gradient Descent: How AI learns by reducing error
β
Training Process: Iterative weight updates
β
Learning Curves: Visualize training progress
β
Making Predictions: Use learned model for new data
Limitations:
- β Only works for linear relationships
- β Cannot handle non-linear patterns
Step 2: Perceptron
The Big Idea
Linear Regression predicts numbers:
- "This student will score 85 points"
Perceptron makes decisions:
- "This student will PASS" or "This student will FAIL"
The Perceptron Model
The perceptron uses the same equation as before:
z = x Β· w + b
But then applies a decision rule:
If z β₯ 0 β output = 1 (YES/PASS)
If z < 0 β output = 0 (NO/FAIL)
Dataset Example
# Input features (study hours)
X = np.array([1, 2, 3, 4])
# Target labels (0 = Fail, 1 = Pass)
y = np.array([0, 0, 1, 1])
print("Study Hours:", X)
print("Pass (0=Fail, 1=Pass):", y)
Pattern: Students who study 3+ hours pass, others fail.
Step Function (Decision Maker)
The step function converts any number into a binary decision:
def step_function(z):
"""
Converts a score into a decision
Parameters:
z: calculated score
Returns:
1 if z >= 0 (YES/PASS)
0 if z < 0 (NO/FAIL)
"""
return 1 if z >= 0 else 0
# Test
test_values = [-5, -1, 0, 1, 5]
for val in test_values:
result = step_function(val)
decision = "YES" if result == 1 else "NO"
print(f"step({val:3d}) = {result} ({decision})")
Output:
step( -5) = 0 (NO)
step( -1) = 0 (NO)
step( 0) = 1 (YES)
step( 1) = 1 (YES)
step( 5) = 1 (YES)
Key Properties:
- Hard threshold: Sharp transition at z = 0
- Binary output: Only 0 or 1
- No middle ground: Can't express uncertainty
Making Initial Predictions
w = 1.0 # Weight
b = -2.5 # Bias
predictions = []
for x in X:
z = w * x + b
pred = step_function(z)
predictions.append(pred)
print(f"Hours: {x}, z = {z:.1f}, Prediction: {pred}")
print(f"\nPredictions: {predictions}")
print(f"Actual: {y}")
print(f"Correct: {np.array(predictions) == y}")
Code Explanation:
- Calculate score:
z = w * x + b - Apply step function:
pred = step_function(z) - Compare with actual labels
Decision Boundary
The decision boundary is the point where the perceptron switches decisions:
# Calculate boundary
boundary = (-b / w) if w != 0 else 0
print(f"Decision boundary: x = {boundary:.2f}")
print(f"Students with x < {boundary:.2f} β FAIL")
print(f"Students with x β₯ {boundary:.2f} β PASS")
Visual Representation:
Pass/Fail
1 β β β
β βββββΌβββ (Decision Boundary)
0 β β β
βββββββββββββββββββββββββ
1 2 3 4 5 Hours
Training the Perceptron
The perceptron updates weights when it makes a mistake:
# Initialize
w = 0.0
b = 0.0
lr = 0.1 # Learning rate
for epoch in range(10):
epoch_errors = 0
for i in range(len(X)):
# Forward pass
z = w * X[i] + b
y_pred = step_function(z)
# Calculate error
error = y[i] - y_pred
# Update weights (only if wrong)
if error != 0:
w += lr * error * X[i]
b += lr * error
epoch_errors += 1
# Calculate accuracy
correct = sum(1 for i in range(len(X))
if step_function(w * X[i] + b) == y[i])
accuracy = correct / len(X) * 100
print(f"Epoch {epoch:2d}: w={w:6.2f}, b={b:6.2f}, "
f"Errors={epoch_errors}, Accuracy={accuracy:.0f}%")
Code Explanation:
- Perceptron Learning Rule: Only update when prediction is wrong
-
Update Formula:
-
w += lr * error * X[i]: Adjust weight based on error and input -
b += lr * error: Adjust bias based on error
-
- When actual = 1 but predicted = 0: Increase weight and bias
- When actual = 0 but predicted = 1: Decrease weight and bias
- When prediction is correct: No change needed
Expected Output:
Epoch 0: w= 0.30, b= 0.20, Errors=2, Accuracy=50%
Epoch 1: w= 0.50, b= 0.30, Errors=1, Accuracy=75%
Epoch 2: w= 0.50, b= 0.30, Errors=0, Accuracy=100%
...
Key Insight: Perceptron stops updating when all predictions are correct!
Limitations of the Perceptron
What Perceptron Can Do:
β
Linearly separable problems: Can solve if a straight line can separate classes
What Perceptron Cannot Do:
β Non-linearly separable problems: Cannot solve XOR problem
The XOR Problem:
xβ | xβ | Output
---|----|-------
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0
Visual Representation:
xβ
1 β β β
β
β β β
0 ββββββββββββββββββ xβ
0 1
Problem: No single straight line can separate the classes!
This limitation led to the development of:
- Multi-layer perceptrons (hidden layers)
- Neural networks (multiple layers)
- Deep learning (many layers)
Key Takeaways from Step 2
β
Perceptron: Makes binary decisions (YES/NO)
β
Step Function: Converts scores to decisions
β
Decision Boundary: Line that separates classes
β
Learning Rule: How perceptron updates weights
β
Training Process: Iterative learning from mistakes
β
Limitations: Can only solve linearly separable problems
Step 3: Logistic Regression
The Big Idea
The perceptron says:
- YES (1) or NO (0)
But real AI should say:
"I am 85% confident"
That's exactly what Logistic Regression does.
From Perceptron to Logistic Regression
Same core equation:
z = x Β· w + b
But instead of a step function, we use Sigmoid.
Sigmoid Function (Probability Maker)
The sigmoid function converts any number into a value between 0 and 1:
import numpy as np
def sigmoid(z):
"""
Sigmoid activation function
Parameters:
z: input score (any number)
Returns:
Probability between 0 and 1
"""
return 1 / (1 + np.exp(-z))
# Test
test_values = [-5, -2, 0, 2, 5]
for val in test_values:
prob = sigmoid(val)
print(f"sigmoid({val:3d}) = {prob:.4f}")
Output:
sigmoid( -5) = 0.0067
sigmoid( -2) = 0.1192
sigmoid( 0) = 0.5000
sigmoid( 2) = 0.8808
sigmoid( 5) = 0.9933
Code Explanation:
-
np.exp(-z): Exponential function (e raised to -z power) -
1 / (1 + np.exp(-z)): Sigmoid formula -
Why this works:
- When z is large positive: exp(-z) β 0, so result β 1
- When z is large negative: exp(-z) β β, so result β 0
- When z = 0: exp(0) = 1, so result = 0.5
- Output is always between 0 and 1 (perfect for probabilities!)
Visual Representation:
Sigmoid Function:
Probability
1 β ββββββββββββββββ
β β±
0.5 βββββ±
β β±
0 βββ±
ββββββββββββββββββββββββ z (score)
-5 -3 -1 0 1 3 5
Key Properties:
- Smooth: No sharp transitions
- Bounded: Always between 0 and 1
- Symmetric: Around z = 0
Forward Pass (Prediction)
# Dataset
X = np.array([1, 2, 3, 4], dtype=float)
y = np.array([0, 0, 1, 1], dtype=float)
# Initialize
w = 0.0
b = 0.0
# Forward pass
z = w * X + b
y_pred = sigmoid(z)
print("Probabilities:", y_pred)
Code Explanation:
- Same calculation as before:
z = w * X + b - But now apply sigmoid:
y_pred = sigmoid(z) - Outputs are probabilities (0 to 1), not just 0 or 1!
Loss Function (Binary Cross-Entropy)
We use Binary Cross-Entropy Loss instead of MSE:
def binary_cross_entropy(y, y_pred):
"""
Binary cross-entropy loss function
Parameters:
y: actual labels (0 or 1)
y_pred: predicted probabilities (0 to 1)
Returns:
Loss value (lower is better)
"""
return -np.mean(y * np.log(y_pred + 1e-9) +
(1 - y) * np.log(1 - y_pred + 1e-9))
loss = binary_cross_entropy(y, y_pred)
print("Loss:", loss)
Code Explanation:
-
y * np.log(y_pred + 1e-9): Loss when actual = 1- If actual = 1: This term contributes
- If actual = 0: This term is 0
-
(1 - y) * np.log(1 - y_pred + 1e-9): Loss when actual = 0- If actual = 0: This term contributes
- If actual = 1: This term is 0
-
+ 1e-9: Small number to avoid log(0) = -infinity -
np.mean(...): Average across all data points -
-: Negate (because we want to maximize log-likelihood)
Intuition:
- Confident & correct β small loss (log of number close to 1)
- Confident & wrong β BIG loss (log of number close to 0)
- Uncertain (0.5) β medium loss
Training with Gradient Descent
lr = 0.1
w = 0.0
b = 0.0
losses = []
for epoch in range(1000):
# Forward pass
z = w * X + b
y_pred = sigmoid(z)
# Calculate gradients
dw = np.mean((y_pred - y) * X) # Same as linear regression!
db = np.mean(y_pred - y)
# Update weights
w -= lr * dw
b -= lr * db
# Track loss
loss = binary_cross_entropy(y, y_pred)
losses.append(loss)
print(f"Final w: {w:.4f}")
print(f"Final b: {b:.4f}")
print(f"Final loss: {losses[-1]:.4f}")
Code Explanation:
- Same gradient formula as linear regression!
- But
y_predis now a probability (from sigmoid) - Gradient descent works the same way
- Loss decreases over time
From Probability to Decision
After getting probabilities, we can make decisions:
threshold = 0.5
decisions = (final_probs >= threshold).astype(int)
print("Probabilities:", final_probs)
print("Decisions:", decisions)
Code Explanation:
-
final_probs >= threshold: Boolean array (True/False) -
.astype(int): Convert to 0/1 - Important: Decision comes after probability
ROC Curve (Model Evaluation)
The ROC curve shows model performance across different thresholds:
def calculate_roc_curve(y_true, y_scores):
"""Calculate ROC curve points"""
# Sort by scores (descending)
sorted_indices = np.argsort(y_scores)[::-1]
y_true_sorted = y_true[sorted_indices]
y_scores_sorted = y_scores[sorted_indices]
# Calculate TPR and FPR for each threshold
thresholds = np.unique(y_scores_sorted)
thresholds = np.append(thresholds, [1.0, 0.0])
thresholds = np.sort(thresholds)[::-1]
tpr = [] # True Positive Rate
fpr = [] # False Positive Rate
for threshold in thresholds:
y_pred = (y_scores_sorted >= threshold).astype(int)
tp = np.sum((y_pred == 1) & (y_true_sorted == 1))
fp = np.sum((y_pred == 1) & (y_true_sorted == 0))
tn = np.sum((y_pred == 0) & (y_true_sorted == 0))
fn = np.sum((y_pred == 0) & (y_true_sorted == 1))
tpr_val = tp / (tp + fn) if (tp + fn) > 0 else 0
fpr_val = fp / (fp + tn) if (fp + tn) > 0 else 0
tpr.append(tpr_val)
fpr.append(fpr_val)
# Calculate AUC using trapezoidal rule
auc = np.trapz(tpr, fpr)
return np.array(fpr), np.array(tpr), auc
# Use it
final_probs_all = sigmoid(w * X + b)
fpr, tpr, auc_score = calculate_roc_curve(y, final_probs_all)
print(f"AUC Score: {auc_score:.3f}")
print(" AUC = 1.0: Perfect classifier")
print(" AUC = 0.5: Random classifier")
print(" AUC > 0.7: Good classifier")
Code Explanation:
- TPR (True Positive Rate): How often we correctly predict positive
- FPR (False Positive Rate): How often we incorrectly predict positive
-
AUC (Area Under Curve): Single number summarizing performance
- AUC = 1.0: Perfect classifier
- AUC = 0.5: Random classifier
- AUC > 0.7: Good classifier
Why Logistic Regression is Better Than Perceptron
β
Smooth learning: No sharp transitions
β
Probability output: Expresses confidence
β
Stable training: Gradients are well-behaved
β Still only straight-line separation: Same limitation as perceptron
Key Takeaways from Step 3
β
Logistic Regression: Makes decisions with confidence (probabilities)
β
Sigmoid Function: Converts scores to probabilities
β
Binary Cross-Entropy: Loss function for probabilities
β
ROC Curve: Evaluates model performance
β
Probability β Decision: Convert probabilities to binary decisions
Step 4: Multiple Neurons
The Big Idea
So far, we used one neuron. But real AI works like a team of neurons:
- Each neuron looks at the data differently
- Together they make better decisions
A Neural Network = Many neurons + Math + Repetition
From One Neuron to Many
Single neuron:
z = x Β· w + b
Multiple neurons (layer):
Z = X Β· W + b
Where:
- X = input matrix (many samples)
- W = weight matrix (many neurons)
- b = bias vector
Dataset Example
# Features: Math score, Science score
# Target: Pass (1) or Fail (0)
X = np.array([
[80, 70], # Student 1
[60, 65], # Student 2
[90, 95], # Student 3
[50, 45] # Student 4
], dtype=float)
y = np.array([[1], [0], [1], [0]])
print("X shape:", X.shape) # (4, 2) = 4 students, 2 features
print("y shape:", y.shape) # (4, 1) = 4 students, 1 output
Code Explanation:
-
X: 2D matrix (4 students Γ 2 features) - Each row is one student:
[math_score, science_score] -
y: Target labels (2D array for matrix compatibility)
Weight Matrix (Many Neurons)
Let's use 3 neurons in one layer:
W = np.random.randn(2, 3) # 2 features β 3 neurons
b = np.zeros((1, 3))
print("W shape:", W.shape) # (2, 3)
print("b shape:", b.shape) # (1, 3)
Code Explanation:
-
W = np.random.randn(2, 3): Weight matrix- Shape
(2, 3): 2 input features β 3 neurons - Each column represents weights for one neuron
- Shape
-
b = np.zeros((1, 3)): Bias vector- Shape
(1, 3): One bias per neuron (3 biases total)
- Shape
Forward Pass (Matrix Multiplication)
def sigmoid(z):
return 1 / (1 + np.exp(-z))
# Forward pass
Z = X @ W + b
A = sigmoid(Z)
print("Z shape:", Z.shape) # (4, 3) = 4 students, 3 neuron scores
print("A (outputs):\n", A)
Code Explanation:
-
Z = X @ W + b: Matrix multiplication-
X @ W: (4Γ2) @ (2Γ3) = (4Γ3) - Each row of X Γ weight matrix = scores for 3 neurons
- Result: 4 students Γ 3 neuron scores
-
+ b: Broadcasting adds bias to each neuron
-
-
A = sigmoid(Z): Apply activation function- Converts scores to activations (probabilities)
- Applied element-wise to entire matrix
-
Ashape:(4, 3)= 4 students, 3 neuron activations
Key insight: Each column in A = output of one neuron across all students.
Single Output Neuron (Combining the Layer)
Now we combine the neuron layer into one output neuron:
W_out = np.random.randn(3, 1) # 3 hidden neurons β 1 output
b_out = np.zeros((1, 1))
Z_out = A @ W_out + b_out
y_pred = sigmoid(Z_out)
print("Final output probabilities:\n", y_pred)
Code Explanation:
-
W_out = np.random.randn(3, 1): Output layer weights- Shape
(3, 1): 3 hidden neurons β 1 output neuron
- Shape
-
Z_out = A @ W_out + b_out: Calculate final scores-
A @ W_out: (4Γ3) @ (3Γ1) = (4Γ1) - Each student's 3 neuron activations β 1 final score
-
-
y_pred = sigmoid(Z_out): Convert to probabilities- Final predictions as probabilities (0 to 1)
Network flow: Input (4Γ2) β Hidden (4Γ3) β Output (4Γ1)
Training the Network
lr = 0.1
losses = []
for epoch in range(1000):
# Forward
Z = X @ W + b
A = sigmoid(Z)
Z_out = A @ W_out + b_out
y_pred = sigmoid(Z_out)
# Loss
loss = binary_cross_entropy(y, y_pred)
losses.append(loss)
# Backpropagation (simplified)
dZ_out = y_pred - y
dW_out = A.T @ dZ_out
db_out = np.mean(dZ_out, axis=0)
dA = dZ_out @ W_out.T
dZ = dA * A * (1 - A) # Sigmoid derivative
dW = X.T @ dZ
db = np.mean(dZ, axis=0)
# Update
W_out -= lr * dW_out
b_out -= lr * db_out
W -= lr * dW
b -= lr * db
Code Explanation:
- Forward Pass: Calculate predictions through all layers
-
Backpropagation: Calculate gradients for all layers
- Start from output layer and work backwards
- Use chain rule from calculus
- Update: Adjust all weights and biases
- Key: Matrix operations make this efficient!
Why This Matters
β
First real neural network: Multiple neurons working together
β
Multiple neurons learn different patterns: Each neuron specializes
β
Matrix math = speed + power: Process all data at once
β Still limited to simple patterns: Need hidden layers for complex problems
Key Takeaways from Step 4
β
Neural Network Layer: Multiple neurons working together
β
Matrix Operations: Efficient processing of multiple samples
β
Forward Pass: Data flows through network
β
Backpropagation: Gradients flow backwards
β
Multi-layer Networks: Combine layers for more power
Step 5: Hidden Layers & XOR
The Big Idea (The WOW Moment)
Some problems cannot be solved with a single straight line.
This famous problem is called XOR.
If perceptron fails β and a deeper network succeeds β ,
then depth really matters.
The XOR Problem
XOR truth table:
| xβ | xβ | XOR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
X = np.array([
[0, 0],
[0, 1],
[1, 0],
[1, 1]
], dtype=float)
y = np.array([[0], [1], [1], [0]])
Visual Representation:
xβ
1 β β β
β
β β β
0 ββββββββββββββββββ xβ
0 1
Problem: No single straight line can separate the classes!
Try a Single-Layer Model (It Will Fail)
W = np.random.randn(2, 1)
b = np.zeros((1,))
lr = 0.1
losses = []
for epoch in range(2000):
z = X @ W + b
y_pred = sigmoid(z)
loss = binary_cross_entropy(y, y_pred)
losses.append(loss)
dW = X.T @ (y_pred - y)
db = np.mean(y_pred - y)
W -= lr * dW
b -= lr * db
print("Single-layer predictions:", (y_pred >= 0.5).astype(int))
β The model cannot solve XOR. Loss never reaches zero.
Adding a Hidden Layer (Deep Learning)
Now we add one hidden layer with non-linearity.
Architecture:
- Input layer (2 neurons)
- Hidden layer (4 neurons)
- Output layer (1 neuron)
Initialize the Network
W1 = np.random.randn(2, 4) # Input β Hidden
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1) # Hidden β Output
b2 = np.zeros((1, 1))
lr = 0.1
losses = []
Code Explanation:
-
W1 = np.random.randn(2, 4): First layer weights (input β hidden)- Shape
(2, 4): 2 input features β 4 hidden neurons
- Shape
-
b1 = np.zeros((1, 4)): First layer biases- Shape
(1, 4): One bias per hidden neuron
- Shape
-
W2 = np.random.randn(4, 1): Second layer weights (hidden β output)- Shape
(4, 1): 4 hidden neurons β 1 output neuron
- Shape
-
b2 = np.zeros((1, 1)): Second layer bias- Shape
(1, 1): Single bias for output neuron
- Shape
Architecture: Input(2) β Hidden(4) β Output(1)
Forward Pass
def forward(X):
Z1 = X @ W1 + b1
A1 = sigmoid(Z1) # Hidden layer activation
Z2 = A1 @ W2 + b2
A2 = sigmoid(Z2) # Output layer activation
return A1, A2
Code Explanation:
-
First Layer:
Z1 = X @ W1 + b1, thenA1 = sigmoid(Z1)- Input (4Γ2) β Hidden (4Γ4)
-
Second Layer:
Z2 = A1 @ W2 + b2, thenA2 = sigmoid(Z2)- Hidden (4Γ4) β Output (4Γ1)
- Non-linearity: Sigmoid in hidden layer is crucial!
Training with Backpropagation
for epoch in range(5000):
A1, y_pred = forward(X)
loss = binary_cross_entropy(y, y_pred)
losses.append(loss)
# Backpropagation
# Output layer gradients
dZ2 = y_pred - y
dW2 = A1.T @ dZ2
db2 = np.mean(dZ2, axis=0)
# Hidden layer gradients
dA1 = dZ2 @ W2.T
dZ1 = dA1 * A1 * (1 - A1) # Sigmoid derivative
dW1 = X.T @ dZ1
db1 = np.mean(dZ1, axis=0)
# Update weights
W2 -= lr * dW2
b2 -= lr * db2
W1 -= lr * dW1
b1 -= lr * db1
Code Explanation:
-
Backpropagation: Calculate gradients backwards through network
- Start from output layer:
dZ2 = y_pred - y - Propagate to hidden layer:
dA1 = dZ2 @ W2.T - Apply sigmoid derivative:
dZ1 = dA1 * A1 * (1 - A1)
- Start from output layer:
- Update: Adjust all weights and biases
- Key: Hidden layer allows network to learn non-linear patterns!
Final Predictions (SUCCESS π)
_, final_probs = forward(X)
final_preds = (final_probs >= 0.5).astype(int)
print("Final probabilities:\n", final_probs)
print("Final predictions:\n", final_preds)
print("Actual values:\n", y)
β The network solves XOR!
Expected Output:
Final probabilities:
[[0.01]
[0.99]
[0.99]
[0.01]]
Final predictions:
[[0]
[1]
[1]
[0]]
Actual values:
[[0]
[1]
[1]
[0]]
Perfect match! π
What Just Happened?
- Hidden neurons learned intermediate patterns: Each hidden neuron detects a different feature
- Network combined them to solve XOR: Output neuron combines hidden neuron outputs
- This is the foundation of deep learning: Depth creates new feature spaces
π§ Key insight:
Depth creates new feature spaces. Hidden layers transform the input into a space where the problem becomes linearly separable.
Understanding Overfitting
Overfitting occurs when a model learns training data too well and fails to generalize:
# Signs of overfitting:
# - Training loss continues decreasing
# - Validation loss decreases then increases
# - Model memorizes training data
# Solutions:
# 1. Early stopping: Stop when validation loss stops improving
# 2. Regularization: Add penalty for large weights (L1/L2)
# 3. Dropout: Randomly disable neurons during training
# 4. More data: Collect more training examples
# 5. Simpler model: Reduce model complexity
Why Step 5 Is Critical
β
Explains why deep learning exists
β
Shows limits of shallow models
β
Makes hidden layers intuitive
β
Creates a lasting "Aha!" moment
Key Takeaways from Step 5
β
XOR Problem: Cannot be solved with single layer
β
Hidden Layers: Enable non-linear pattern learning
β
Backpropagation: Gradients flow backwards through network
β
Deep Learning: Depth creates new feature spaces
β
Overfitting: Model can memorize instead of generalize
Step 6: PyTorch
The Big Idea
So far, you built everything by hand:
- Weights
- Forward pass
- Backpropagation
- Updates
Real AI engineers use frameworks like PyTorch to:
- Write less code
- Avoid bugs
- Train large models faster
π§ Important:
PyTorch does NOT replace understandingβit automates math you already know.
Installing PyTorch
pip install torch torchvision torchaudio
First PyTorch Tensor
A tensor is like a NumPy array, but smarter:
import torch
x = torch.tensor([1.0, 2.0, 3.0])
print(x)
print(type(x)) # <class 'torch.Tensor'>
Tensor vs NumPy:
| NumPy | PyTorch |
|---|---|
| array | tensor |
| CPU only | CPU / GPU |
| manual gradients | automatic gradients |
Autograd (Automatic Gradients)
PyTorch can calculate gradients automatically:
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x
y.backward()
print("dy/dx =", x.grad) # dy/dx = 7.0
Code Explanation:
-
requires_grad=True: Tell PyTorch to track gradients -
y.backward(): Calculate gradients automatically -
x.grad: Access the gradient - This replaces manual derivative calculations!
Dataset Example (XOR Again)
We reuse XOR to prove PyTorch works:
X = torch.tensor([[0., 0.],
[0., 1.],
[1., 0.],
[1., 1.]])
y = torch.tensor([[0.],
[1.],
[1.],
[0.]])
Defining a Neural Network
import torch.nn as nn
model = nn.Sequential(
nn.Linear(2, 4), # input β hidden (2 features β 4 neurons)
nn.ReLU(), # activation function
nn.Linear(4, 1), # hidden β output (4 neurons β 1 output)
nn.Sigmoid() # probability output
)
print(model)
Code Explanation:
-
nn.Sequential: Layers stacked sequentially -
nn.Linear(2, 4): Linear layer (weights + bias)- 2 input features β 4 output neurons
- Equivalent to:
W @ X + b
-
nn.ReLU(): Rectified Linear Unit activation- ReLU(x) = max(0, x)
- Faster than sigmoid, helps with deep networks
-
nn.Linear(4, 1): Output layer- 4 hidden neurons β 1 output
-
nn.Sigmoid(): Convert to probability
π§ Connection to previous steps:
- Linear = weights + bias (from Steps 0-5)
- ReLU/Sigmoid = activation (from Steps 3-5)
- Layers = matrices (from Step 4)
Loss Function & Optimizer
loss_fn = nn.BCELoss() # Binary Cross-Entropy Loss
optimizer = torch.optim.SGD(
model.parameters(), # All weights and biases
lr=0.1 # Learning rate
)
Code Explanation:
-
nn.BCELoss(): Binary Cross-Entropy Loss (from Step 3) -
torch.optim.SGD: Stochastic Gradient Descent optimizer-
model.parameters(): All weights and biases in the model -
lr=0.1: Learning rate
-
| Concept | PyTorch |
|---|---|
| Loss | nn.BCELoss() |
| Gradient Descent | optim.SGD |
| Weights | model.parameters() |
Training Loop (Very Important)
losses = []
for epoch in range(3000):
# Forward pass
y_pred = model(X)
loss = loss_fn(y_pred, y)
losses.append(loss.item())
# Backward pass
optimizer.zero_grad() # Clear previous gradients
loss.backward() # Calculate gradients
optimizer.step() # Update weights
print(f"Final loss: {losses[-1]:.4f}")
Code Explanation:
-
Forward Pass:
y_pred = model(X)- Data flows through all layers automatically
-
Loss Calculation:
loss = loss_fn(y_pred, y)- Compare predictions with actual values
-
Backward Pass:
-
optimizer.zero_grad(): Clear gradients from previous iteration -
loss.backward(): Calculate gradients automatically (autograd!) -
optimizer.step(): Update all weights and biases
-
- This loop replaces everything you coded manually before!
Learning Curve
import matplotlib.pyplot as plt
plt.plot(losses)
plt.title("Training Loss (PyTorch)")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.grid(True, alpha=0.3)
plt.show()
β Loss decreases β model is learning.
Final Predictions
with torch.no_grad(): # Disable gradient computation (faster)
probs = model(X)
preds = (probs >= 0.5).int()
print("Probabilities:\n", probs)
print("Predictions:\n", preds)
print("Actual:\n", y.int())
Code Explanation:
-
torch.no_grad(): Disable gradient tracking (faster for inference) -
model(X): Make predictions -
(probs >= 0.5).int(): Convert probabilities to binary decisions
π PyTorch solved XOR successfully!
Model Saving and Loading
Saving a Model:
model_path = "xor_model.pth"
torch.save({
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': losses[-1],
}, model_path)
print(f"β
Model saved to {model_path}")
Code Explanation:
-
model.state_dict(): Dictionary containing all model weights -
optimizer.state_dict(): Optimizer state (learning rate, momentum, etc.) -
torch.save(): Save to file
Loading a Model:
# Create a new model instance (untrained)
new_model = nn.Sequential(
nn.Linear(2, 4),
nn.ReLU(),
nn.Linear(4, 1),
nn.Sigmoid()
)
# Load the saved model
checkpoint = torch.load(model_path)
new_model.load_state_dict(checkpoint['model_state_dict'])
print("β
Model loaded successfully!")
# Test loaded model
with torch.no_grad():
loaded_probs = new_model(X)
loaded_preds = (loaded_probs >= 0.5).int()
loaded_accuracy = (loaded_preds == y.int()).float().mean().item()
print(f"Loaded model accuracy: {loaded_accuracy:.2%}")
Code Explanation:
-
torch.load(): Load checkpoint from file -
checkpoint['model_state_dict']: Extract model weights -
new_model.load_state_dict(): Load weights into model - Model now has same weights as saved model!
Comparing Scratch vs PyTorch
| From Scratch | PyTorch |
|---|---|
| Manual gradients | Automatic |
| More code | Less code |
| Easy to make mistakes | Safer |
| Best for learning | Best for real projects |
π§ Rule:
Learn from scratch β build with PyTorch.
Key Takeaways from Step 6
β
PyTorch: Professional AI framework
β
Tensors: Like NumPy arrays but with gradients
β
Autograd: Automatic gradient calculation
β
nn.Module: Define neural networks easily
β
Training Loop: Forward β Loss β Backward β Update
β
Model Saving: Save and load trained models
The Complete Picture
The Journey So Far
- Step 0: Math foundations (vectors, weights, dot product, bias)
- Step 1: Linear Regression (predicting numbers, gradient descent)
- Step 2: Perceptron (binary decisions, step function)
- Step 3: Logistic Regression (probabilistic decisions, sigmoid)
- Step 4: Multiple Neurons (neural network layers, matrix operations)
- Step 5: Hidden Layers (deep learning, XOR problem, backpropagation)
- Step 6: PyTorch (professional framework, autograd, training loops)
How Everything Connects
Step 0: Math Foundations
β
Step 1: Linear Regression (predict numbers)
β
Step 2: Perceptron (make decisions)
β
Step 3: Logistic Regression (decisions with confidence)
β
Step 4: Multiple Neurons (neural network layers)
β
Step 5: Hidden Layers (deep learning, solve XOR)
β
Step 6: PyTorch (professional tools)
Core Concepts You've Mastered
-
Mathematical Foundations
- Vectors, matrices, dot products
- Weights, bias, neurons
- Matrix operations
-
Learning Algorithms
- Gradient descent
- Backpropagation
- Weight updates
-
Neural Networks
- Single neurons
- Multiple neurons (layers)
- Hidden layers (deep networks)
- Forward and backward passes
-
Loss Functions
- Mean Squared Error (MSE)
- Binary Cross-Entropy
- When to use each
-
Activation Functions
- Step function (perceptron)
- Sigmoid (logistic regression)
- ReLU (deep networks)
-
Professional Tools
- PyTorch framework
- Automatic gradients
- Model saving/loading
What You Can Build Now
β
Linear Regression Models: Predict numbers from data
β
Binary Classifiers: Make YES/NO decisions
β
Probabilistic Classifiers: Express confidence in decisions
β
Neural Networks: Multi-layer networks with hidden layers
β
Deep Learning Models: Solve non-linear problems like XOR
β
PyTorch Models: Professional AI applications
Key Takeaways
Mathematical Understanding
You now understand:
- How AI represents data (vectors, matrices)
- How AI combines information (dot product, matrix multiplication)
- How AI makes adjustments (bias, weights)
- How AI learns (gradient descent, backpropagation)
Coding Skills
You can now:
- Build neural networks from scratch
- Use PyTorch for professional development
- Understand and modify existing AI code
- Debug training issues
- Save and load models
Conceptual Mastery
You understand:
- Why deep learning exists (XOR problem)
- How neural networks learn (gradient descent)
- When to use different activation functions
- How to evaluate models (loss, accuracy, ROC curves)
- The difference between training and inference
Next Steps
After completing Steps 0-6, you're ready for:
- Step 7: Recurrent Neural Networks (RNNs) for sequences
- Step 8: Convolutional Neural Networks (CNNs) for images
- Advanced Topics: Transformers, GANs, Reinforcement Learning
- Real Projects: Build your own AI applications
Conclusion
Congratulations! You've completed a comprehensive journey from basic mathematical concepts to building neural networks with PyTorch.
What makes this journey special:
- No black boxes: You understand every concept from first principles
- Progressive learning: Each step builds naturally on the previous
- Code + Concepts: Both implementation and theory are covered
- Real understanding: You know how AI works, not just how to use it
Remember:
- Every expert was once a beginner
- Understanding > memorization
- Practice makes perfect
- Keep building projects!
π You are now ready to build real AI applications!
This article covers Steps 0-6 of the AI Bootcamp. For advanced topics (RNNs, CNNs, Transformers, etc.), continue with Steps 7-13.
Top comments (0)