From Zero to AI: A Complete Journey

📚 Table of Contents

  1. Introduction
  2. Step 0: Math Foundations for AI
  3. Step 1: Linear Regression
  4. Step 2: Perceptron
  5. Step 3: Logistic Regression
  6. Step 4: Multiple Neurons
  7. Step 5: Hidden Layers & XOR
  8. Step 6: PyTorch
  9. The Complete Picture
  10. Key Takeaways

Introduction

This article takes you on a complete journey from understanding basic mathematical concepts to building neural networks with PyTorch. Each step builds on the previous one, creating a solid foundation for understanding how AI really works.

What makes this journey special:

  • No black boxes: Every concept is explained from first principles
  • Code + Concepts: Both the "what" and "why" are covered
  • Progressive learning: Each step naturally leads to the next
  • Real understanding: You'll know how AI works, not just how to use it

Step 0: Math Foundations for AI

The Big Idea

AI doesn't think like humans; it performs mathematical operations. At its core, AI is just math with data.

The Fundamental Equation

Every neuron in AI follows this simple equation:

z = x · w + b

Breaking it down:

  • x = inputs (features) - What you feed into the AI
  • w = weights (importance) - How much each feature matters
  • b = bias (starting push) - A constant adjustment
  • z = score (before making a decision) - The final calculation

Real-World Analogy

Imagine deciding if a student passes a class:

import numpy as np

# Student's scores
math_score = 80      # x₁
science_score = 70   # x₂
english_score = 75   # x₃

# How important each subject is
math_weight = 0.6    # w₁ (most important)
science_weight = 0.3 # w₂
english_weight = 0.1 # w₃

# Starting adjustment
bias = -50           # b

# Calculate final score
final_score = (math_score * math_weight) + (science_score * science_weight) + (english_score * english_weight) + bias
# = (80 * 0.6) + (70 * 0.3) + (75 * 0.1) - 50
# = 48 + 21 + 7.5 - 50
# = 26.5
print("Final score:", final_score)

Code Explanation:

  • import numpy as np: Imports NumPy for numerical operations
  • Each feature (math, science, english) has a weight
  • The dot product multiplies each feature by its weight
  • Bias adjusts the final score
  • If final_score ≥ 60, the student passes

Key Concepts

1. Vectors

A vector is a collection of numbers arranged in order:

# Single student's scores
x = np.array([80, 70, 75])
print("Student vector:", x)
print("Shape:", x.shape)  # (3,) = 3 elements

Code Explanation:

  • np.array([80, 70, 75]): Creates a NumPy array (vector)
  • Each number represents one feature
  • Shape (3,) means 1D array with 3 elements

2. Weights

Weights determine how important each feature is:

# Weights (how important each subject is)
weights = np.array([0.6, 0.3, 0.1])
print("Weights:", weights)

Key Properties:

  • Higher weight = more important feature
  • Weights are learned during training (we'll see this later)
  • Weights can be positive or negative

3. Dot Product

The dot product multiplies corresponding elements and sums them:

# Student scores
x = np.array([80, 70, 75])

# Weights
w = np.array([0.6, 0.3, 0.1])

# Dot product
z = np.dot(x, w)
# Equivalent to: (80*0.6) + (70*0.3) + (75*0.1) = 76.5
print("Dot product:", z)

Code Explanation:

  • np.dot(x, w): Calculates dot product
  • Alternative syntax: x @ w (matrix multiplication operator)
  • Result: Single number representing weighted sum

4. Bias

Bias is a constant adjustment to the score:

b = -50  # Negative bias (makes it harder to pass)
z_with_bias = z + b
print("Score with bias:", z_with_bias)

Understanding Bias:

  • Positive bias: Makes it easier to get a high score
  • Negative bias: Makes it harder to get a high score
  • Zero bias: No adjustment

5. Building a Mini Neuron

A neuron is the basic building block of AI:

def neuron(x, w, b):
    """
    Simple neuron function

    Parameters:
    x: input features (vector)
    w: weights (vector)
    b: bias (scalar)

    Returns:
    z: output score
    """
    return np.dot(x, w) + b

# Test the neuron
x = np.array([80, 70, 75])
w = np.array([0.6, 0.3, 0.1])
b = -50

result = neuron(x, w, b)
print(f"Neuron output: {result}")

Code Explanation:

  • def neuron(x, w, b):: Defines a reusable function
  • return np.dot(x, w) + b: The core calculation
  • This is the foundation of all AI!

Matrix Operations for Multiple Students

When dealing with multiple students, use matrix multiplication:

# Multiple students (each row is a student)
X = np.array([
    [90, 85, 70],  # Student 1
    [40, 50, 60],  # Student 2
    [75, 70, 80],  # Student 3
])

# Weights (same for all students)
w = np.array([0.6, 0.3, 0.1])
b = -50

# Matrix multiplication: X @ w
scores = X @ w + b
print("Scores for all students:", scores)

Code Explanation:

  • X: 2D matrix (3 students × 3 features)
  • X @ w: Matrix multiplication calculates the dot product for each row
  • + b: Broadcasting adds the bias to each score
  • Much faster than loops! (see the quick timing sketch below)
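
To back up the "much faster than loops" point, here is a minimal timing sketch; the array size is made up for illustration and exact timings vary by machine:

import time
import numpy as np

# Larger random data so the difference is visible (hypothetical size)
rng = np.random.default_rng(0)
X_big = rng.random((100_000, 3))
w = np.array([0.6, 0.3, 0.1])
b = -50

# Python loop: one dot product per row
start = time.perf_counter()
scores_loop = np.array([row @ w + b for row in X_big])
loop_time = time.perf_counter() - start

# Vectorized: one matrix multiplication for all rows
start = time.perf_counter()
scores_vec = X_big @ w + b
vec_time = time.perf_counter() - start

print(f"Loop:       {loop_time:.4f} s")
print(f"Vectorized: {vec_time:.4f} s")
print("Same results:", np.allclose(scores_loop, scores_vec))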

Key Takeaways from Step 0

✅ Features: Real-world measurements used as inputs

✅ Vectors: Collections of features

✅ Weights: Importance of each feature

✅ Dot Product: Efficient way to combine features and weights

✅ Bias: Constant adjustment to the score

✅ Neuron: Basic building block that calculates z = x·w + b

✅ Matrix Operations: Processing multiple examples efficiently


Step 1: Linear Regression

The Big Idea

Linear Regression answers: How can we predict a number as accurately as possible?

Real-World Examples:

  • Study hours → Exam score
  • Years of experience → Salary
  • House size → Price

The Linear Model

The linear model is beautifully simple:

y = w·x + b

Breaking it down:

  • x = input (study hours)
  • w = weight (slope of the line)
  • b = bias (y-intercept, starting value)
  • y = predicted output (exam score)

Dataset Example

import numpy as np
import matplotlib.pyplot as plt

# Input features (study hours)
X = np.array([1, 2, 3, 4], dtype=float)

# Target values (exam scores)
y = np.array([50, 60, 70, 80], dtype=float)

print("Study Hours:", X)
print("Exam Scores:", y)

Code Explanation:

  • X: Input features (what we know)
  • y: Target values (what we want to predict)
  • Pattern: Each hour adds 10 points (perfect linear relationship)

First Bad Guess

Before learning, the AI starts with arbitrary values (here, zero):

w = 0.0  # Weight (slope)
b = 0.0  # Bias (y-intercept)

# Make predictions
y_pred = w * X + b
print("Bad predictions:", y_pred)  # [0. 0. 0. 0.]

Problem: The AI predicts 0 for everything, which is completely wrong!

Error Calculation

We use Mean Squared Error (MSE) to measure how wrong we are:

error = np.mean((y_pred - y) ** 2)
print(f"Mean Squared Error: {error:.2f}")

Code Explanation:

  • (y_pred - y): Prediction errors
  • ** 2: Square each error (always positive, penalizes large errors)
  • np.mean(...): Average across all data points
  • Lower MSE = better predictions

Why Squared?

  1. Always positive (no cancellation)
  2. Penalizes large errors more (see the quick comparison below)
  3. Smooth function (easier to optimize)
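
A quick numeric comparison (values chosen for illustration) shows how squaring punishes one large error far more than several small ones:

import numpy as np

errors = np.array([1.0, 2.0, 10.0])   # one large error among small ones

print("Absolute errors:", np.abs(errors))   # [ 1.  2. 10.]
print("Squared errors: ", errors ** 2)      # [  1.   4. 100.]
# The 10-point error contributes 100 to the MSE, far more than the others combined,
# so gradient descent concentrates on fixing the worst predictions first.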

Gradient Descent (How AI Learns)

Think of error as a hill:

  • Top of hill: High error (bad predictions)
  • Bottom of valley: Low error (good predictions)
  • Goal: Find the lowest point

The gradient tells us which direction to move:

# Initialize
w = 0.0
b = 0.0
lr = 0.01  # Learning rate (how big steps to take)
errors = []

# Training loop
for epoch in range(10000):  # enough iterations for lr = 0.01 to fully converge
    # Forward pass: make predictions
    y_pred = w * X + b

    # Calculate gradients
    dw = np.mean((y_pred - y) * X)  # Gradient for weight
    db = np.mean(y_pred - y)        # Gradient for bias

    # Update weights and bias
    w -= lr * dw  # Move in opposite direction of gradient
    b -= lr * db

    # Calculate and store error
    error = np.mean((y_pred - y) ** 2)
    errors.append(error)

print(f"Final w (weight): {w:.4f}")
print(f"Final b (bias): {b:.4f}")
print(f"Final error: {errors[-1]:.4f}")

Code Explanation:

  • Forward Pass: y_pred = w * X + b - Makes predictions
  • Gradient Calculation:
    • dw = np.mean((y_pred - y) * X): How error changes with weight
    • db = np.mean(y_pred - y): How error changes with bias
  • Weight Updates:
    • w -= lr * dw: Move weight in direction that reduces error
    • b -= lr * db: Move bias in direction that reduces error
  • Learning Rate (lr): Controls step size
    • Too high: Overshoots optimal value
    • Too low: Takes too long to converge

Expected Output:

Final w (weight): 10.0000
Final b (bias): 40.0000
Final error: 0.0000

Interpretation:

  • w = 10: For each additional hour, score increases by 10 points
  • b = 40: Starting score (when hours = 0)
  • Error ≈ 0: Perfect fit! (Data is perfectly linear)

Learning Curve

The learning curve shows how error decreases over time:

plt.plot(errors)
plt.title("Learning Curve")
plt.xlabel("Epoch")
plt.ylabel("Error (MSE)")
plt.grid(True, alpha=0.3)
plt.show()

Key Patterns:

  1. Rapid decrease (early epochs): AI learns quickly
  2. Slower decrease (middle epochs): Fine-tuning
  3. Plateau (late epochs): Converged to optimal solution

Making Predictions

After training, we can predict for new data:

# New student: studied 5 hours
study_hours = 5
predicted_score = w * study_hours + b
print(f"Predicted score for {study_hours} hours: {predicted_score:.2f}")
# Output: Predicted score for 5 hours: 90.00

Code Explanation:

  • Use learned w and b values
  • Apply to new input (not in training set)
  • This tests if the model can generalize

Key Takeaways from Step 1

βœ… Linear Regression: Predicts numbers using a straight line

βœ… Error Measurement: MSE quantifies prediction quality

βœ… Gradient Descent: How AI learns by reducing error

βœ… Training Process: Iterative weight updates

βœ… Learning Curves: Visualize training progress

βœ… Making Predictions: Use learned model for new data

Limitations:

  • ❌ Only works for linear relationships
  • ❌ Cannot handle non-linear patterns (a short demonstration follows)
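
To see the limitation in action, here is a small sketch that reuses the Step 1 training loop on made-up quadratic data; no straight line fits a curve, so the error plateaus well above zero:

import numpy as np

# Non-linear relationship: y = x^2
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = X ** 2

w, b, lr = 0.0, 0.0, 0.01
for epoch in range(5000):
    y_pred = w * X + b
    dw = np.mean((y_pred - y) * X)
    db = np.mean(y_pred - y)
    w -= lr * dw
    b -= lr * db

print(f"Best line: y = {w:.2f}x + {b:.2f}")
print("Predictions:", np.round(w * X + b, 1))
print("Targets:    ", y)
print(f"Final MSE: {np.mean((w * X + b - y) ** 2):.2f}  # stays far from 0")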

Step 2: Perceptron

The Big Idea

Linear Regression predicts numbers:

  • "This student will score 85 points"

Perceptron makes decisions:

  • "This student will PASS" or "This student will FAIL"

The Perceptron Model

The perceptron uses the same equation as before:

z = x · w + b

But then applies a decision rule:

If z ≥ 0 → output = 1 (YES/PASS)
If z < 0  → output = 0 (NO/FAIL)

Dataset Example

# Input features (study hours)
X = np.array([1, 2, 3, 4])

# Target labels (0 = Fail, 1 = Pass)
y = np.array([0, 0, 1, 1])

print("Study Hours:", X)
print("Pass (0=Fail, 1=Pass):", y)

Pattern: Students who study 3+ hours pass, others fail.

Step Function (Decision Maker)

The step function converts any number into a binary decision:

def step_function(z):
    """
    Converts a score into a decision

    Parameters:
    z: calculated score

    Returns:
    1 if z >= 0 (YES/PASS)
    0 if z < 0  (NO/FAIL)
    """
    return 1 if z >= 0 else 0

# Test
test_values = [-5, -1, 0, 1, 5]
for val in test_values:
    result = step_function(val)
    decision = "YES" if result == 1 else "NO"
    print(f"step({val:3d}) = {result} ({decision})")

Output:

step( -5) = 0 (NO)
step( -1) = 0 (NO)
step(  0) = 1 (YES)
step(  1) = 1 (YES)
step(  5) = 1 (YES)

Key Properties:

  • Hard threshold: Sharp transition at z = 0
  • Binary output: Only 0 or 1
  • No middle ground: Can't express uncertainty

Making Initial Predictions

w = 1.0   # Weight
b = -2.5  # Bias

predictions = []
for x in X:
    z = w * x + b
    pred = step_function(z)
    predictions.append(pred)
    print(f"Hours: {x}, z = {z:.1f}, Prediction: {pred}")

print(f"\nPredictions: {predictions}")
print(f"Actual:      {y}")
print(f"Correct:     {np.array(predictions) == y}")

Code Explanation:

  • Calculate score: z = w * x + b
  • Apply step function: pred = step_function(z)
  • Compare with actual labels

Decision Boundary

The decision boundary is the point where the perceptron switches decisions:

# Calculate boundary
boundary = (-b / w) if w != 0 else 0

print(f"Decision boundary: x = {boundary:.2f}")
print(f"Students with x < {boundary:.2f} β†’ FAIL")
print(f"Students with x β‰₯ {boundary:.2f} β†’ PASS")
Enter fullscreen mode Exit fullscreen mode

Visual Representation:

Pass/Fail
   1 │                    ●  ●
     │                ────┼─── (Decision Boundary)
   0 │    ●  ●
     └────────────────────────
       1   2   3   4   5   Hours

Training the Perceptron

The perceptron updates weights when it makes a mistake:

# Initialize
w = 0.0
b = 0.0
lr = 0.1  # Learning rate

for epoch in range(10):
    epoch_errors = 0

    for i in range(len(X)):
        # Forward pass
        z = w * X[i] + b
        y_pred = step_function(z)

        # Calculate error
        error = y[i] - y_pred

        # Update weights (only if wrong)
        if error != 0:
            w += lr * error * X[i]
            b += lr * error
            epoch_errors += 1

    # Calculate accuracy
    correct = sum(1 for i in range(len(X)) 
                  if step_function(w * X[i] + b) == y[i])
    accuracy = correct / len(X) * 100

    print(f"Epoch {epoch:2d}: w={w:6.2f}, b={b:6.2f}, "
          f"Errors={epoch_errors}, Accuracy={accuracy:.0f}%")

Code Explanation:

  • Perceptron Learning Rule: Only update when prediction is wrong
  • Update Formula:
    • w += lr * error * X[i]: Adjust weight based on error and input
    • b += lr * error: Adjust bias based on error
  • When actual = 1 but predicted = 0: Increase weight and bias
  • When actual = 0 but predicted = 1: Decrease weight and bias
  • When prediction is correct: No change needed

Expected Output:

Epoch  0: w=  0.20, b=  0.00, Errors=2, Accuracy=50%
Epoch  1: w=  0.20, b= -0.10, Errors=3, Accuracy=50%
Epoch  2: w=  0.20, b= -0.20, Errors=3, Accuracy=50%
Epoch  3: w=  0.10, b= -0.30, Errors=1, Accuracy=100%
Epoch  4: w=  0.10, b= -0.30, Errors=0, Accuracy=100%
...

Key Insight: Perceptron stops updating when all predictions are correct!
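
To make the learning rule concrete, here is a single hand-traced update for one misclassified student, starting from w = 0 and b = 0:

w, b, lr = 0.0, 0.0, 0.1

x_i, y_i = 1, 0               # studied 1 hour, actually failed
z = w * x_i + b               # z = 0.0
y_pred = 1 if z >= 0 else 0   # step(0) = 1 -> wrongly predicts PASS
error = y_i - y_pred          # 0 - 1 = -1

# Actual = 0 but predicted = 1, so both weight and bias are pushed down
w += lr * error * x_i         # w: 0.0 -> -0.1
b += lr * error               # b: 0.0 -> -0.1
print(f"After one update: w = {w}, b = {b}")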

Limitations of the Perceptron

What Perceptron Can Do:
✅ Linearly separable problems: Can solve if a straight line can separate classes

What Perceptron Cannot Do:
❌ Non-linearly separable problems: Cannot solve the XOR problem

The XOR Problem:

x₁ | x₂ | Output
---|----|-------
 0 |  0 |   0
 0 |  1 |   1
 1 |  0 |   1
 1 |  1 |   0

Visual Representation:

x₂
 1 │    ●      ○
   │
   │    ○      ●
 0 └───────────────── x₁
   0              1

Problem: No single straight line can separate the classes!

This limitation led to the development of:

  • Multi-layer perceptrons (hidden layers)
  • Neural networks (multiple layers)
  • Deep learning (many layers)

Key Takeaways from Step 2

✅ Perceptron: Makes binary decisions (YES/NO)

✅ Step Function: Converts scores to decisions

✅ Decision Boundary: Line that separates classes

✅ Learning Rule: How the perceptron updates weights

✅ Training Process: Iterative learning from mistakes

✅ Limitations: Can only solve linearly separable problems


Step 3: Logistic Regression

The Big Idea

The perceptron says:

  • YES (1) or NO (0)

But real AI should say:

"I am 85% confident"

That's exactly what Logistic Regression does.

From Perceptron to Logistic Regression

Same core equation:

z = x · w + b

But instead of a step function, we use Sigmoid.

Sigmoid Function (Probability Maker)

The sigmoid function converts any number into a value between 0 and 1:

import numpy as np

def sigmoid(z):
    """
    Sigmoid activation function

    Parameters:
    z: input score (any number)

    Returns:
    Probability between 0 and 1
    """
    return 1 / (1 + np.exp(-z))

# Test
test_values = [-5, -2, 0, 2, 5]
for val in test_values:
    prob = sigmoid(val)
    print(f"sigmoid({val:3d}) = {prob:.4f}")

Output:

sigmoid( -5) = 0.0067
sigmoid( -2) = 0.1192
sigmoid(  0) = 0.5000
sigmoid(  2) = 0.8808
sigmoid(  5) = 0.9933

Code Explanation:

  • np.exp(-z): Exponential function (e raised to -z power)
  • 1 / (1 + np.exp(-z)): Sigmoid formula
  • Why this works:
    • When z is large and positive: exp(-z) ≈ 0, so the result ≈ 1
    • When z is large and negative: exp(-z) → ∞, so the result ≈ 0
    • When z = 0: exp(0) = 1, so the result = 0.5
  • Output is always between 0 and 1 (perfect for probabilities!)

Visual Representation:

Sigmoid Function:

Probability
   1 │     ┌───────────────
     │    ╱
 0.5 │───╱
     │  ╱
   0 │─╱
     └──────────────────────→ z (score)
    -5   -3  -1   0   1   3   5

Key Properties:

  • Smooth: No sharp transitions
  • Bounded: Always between 0 and 1
  • Symmetric: Around z = 0

Forward Pass (Prediction)

# Dataset
X = np.array([1, 2, 3, 4], dtype=float)
y = np.array([0, 0, 1, 1], dtype=float)

# Initialize
w = 0.0
b = 0.0

# Forward pass
z = w * X + b
y_pred = sigmoid(z)

print("Probabilities:", y_pred)

Code Explanation:

  • Same calculation as before: z = w * X + b
  • But now apply sigmoid: y_pred = sigmoid(z)
  • Outputs are probabilities (0 to 1), not just 0 or 1!

Loss Function (Binary Cross-Entropy)

We use Binary Cross-Entropy Loss instead of MSE:

def binary_cross_entropy(y, y_pred):
    """
    Binary cross-entropy loss function

    Parameters:
    y: actual labels (0 or 1)
    y_pred: predicted probabilities (0 to 1)

    Returns:
    Loss value (lower is better)
    """
    return -np.mean(y * np.log(y_pred + 1e-9) + 
                    (1 - y) * np.log(1 - y_pred + 1e-9))

loss = binary_cross_entropy(y, y_pred)
print("Loss:", loss)

Code Explanation:

  • y * np.log(y_pred + 1e-9): Loss when actual = 1
    • If actual = 1: This term contributes
    • If actual = 0: This term is 0
  • (1 - y) * np.log(1 - y_pred + 1e-9): Loss when actual = 0
    • If actual = 0: This term contributes
    • If actual = 1: This term is 0
  • + 1e-9: Small number to avoid log(0) = -infinity
  • np.mean(...): Average across all data points
  • The leading -: Negate, because maximizing the log-likelihood is the same as minimizing this loss

Intuition:

  • Confident & correct → small loss (log of a number close to 1)
  • Confident & wrong → BIG loss (log of a number close to 0)
  • Uncertain (0.5) → medium loss (the quick check below makes this concrete)
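
A quick check with the binary_cross_entropy function defined above makes these three cases concrete (the probabilities are made up for illustration):

import numpy as np

y_true = np.array([1.0])  # the actual label is 1 in all three cases

print(binary_cross_entropy(y_true, np.array([0.99])))  # confident & correct -> ~0.01
print(binary_cross_entropy(y_true, np.array([0.50])))  # uncertain           -> ~0.69
print(binary_cross_entropy(y_true, np.array([0.01])))  # confident & wrong   -> ~4.61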

Training with Gradient Descent

lr = 0.1
w = 0.0
b = 0.0
losses = []

for epoch in range(1000):
    # Forward pass
    z = w * X + b
    y_pred = sigmoid(z)

    # Calculate gradients
    dw = np.mean((y_pred - y) * X)  # Same as linear regression!
    db = np.mean(y_pred - y)

    # Update weights
    w -= lr * dw
    b -= lr * db

    # Track loss
    loss = binary_cross_entropy(y, y_pred)
    losses.append(loss)

print(f"Final w: {w:.4f}")
print(f"Final b: {b:.4f}")
print(f"Final loss: {losses[-1]:.4f}")

Code Explanation:

  • Same gradient formula as linear regression!
  • But y_pred is now a probability (from sigmoid)
  • Gradient descent works the same way
  • Loss decreases over time

From Probability to Decision

After getting probabilities, we can make decisions:

threshold = 0.5
final_probs = sigmoid(w * X + b)   # probabilities from the trained model
decisions = (final_probs >= threshold).astype(int)

print("Probabilities:", final_probs)
print("Decisions:", decisions)

Code Explanation:

  • final_probs >= threshold: Boolean array (True/False)
  • .astype(int): Convert to 0/1
  • Important: Decision comes after probability

ROC Curve (Model Evaluation)

The ROC curve shows model performance across different thresholds:

def calculate_roc_curve(y_true, y_scores):
    """Calculate ROC curve points"""
    # Sort by scores (descending)
    sorted_indices = np.argsort(y_scores)[::-1]
    y_true_sorted = y_true[sorted_indices]
    y_scores_sorted = y_scores[sorted_indices]

    # Calculate TPR and FPR for each threshold
    thresholds = np.unique(y_scores_sorted)
    thresholds = np.append(thresholds, [1.0, 0.0])
    thresholds = np.sort(thresholds)[::-1]

    tpr = []  # True Positive Rate
    fpr = []  # False Positive Rate

    for threshold in thresholds:
        y_pred = (y_scores_sorted >= threshold).astype(int)

        tp = np.sum((y_pred == 1) & (y_true_sorted == 1))
        fp = np.sum((y_pred == 1) & (y_true_sorted == 0))
        tn = np.sum((y_pred == 0) & (y_true_sorted == 0))
        fn = np.sum((y_pred == 0) & (y_true_sorted == 1))

        tpr_val = tp / (tp + fn) if (tp + fn) > 0 else 0
        fpr_val = fp / (fp + tn) if (fp + tn) > 0 else 0

        tpr.append(tpr_val)
        fpr.append(fpr_val)

    # Calculate AUC using trapezoidal rule
    auc = np.trapz(tpr, fpr)

    return np.array(fpr), np.array(tpr), auc

# Use it
final_probs_all = sigmoid(w * X + b)
fpr, tpr, auc_score = calculate_roc_curve(y, final_probs_all)

print(f"AUC Score: {auc_score:.3f}")
print("  AUC = 1.0: Perfect classifier")
print("  AUC = 0.5: Random classifier")
print("  AUC > 0.7: Good classifier")

Code Explanation:

  • TPR (True Positive Rate): How often we correctly predict positive
  • FPR (False Positive Rate): How often we incorrectly predict positive
  • AUC (Area Under Curve): Single number summarizing performance (a scikit-learn cross-check is sketched after this list)
    • AUC = 1.0: Perfect classifier
    • AUC = 0.5: Random classifier
    • AUC > 0.7: Good classifier
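
As a sanity check on the hand-rolled curve, the same numbers can be compared against scikit-learn's implementation. This is just a sketch and assumes scikit-learn is installed; it is not used anywhere else in this article:

# pip install scikit-learn
from sklearn.metrics import roc_curve, roc_auc_score

fpr_sk, tpr_sk, _ = roc_curve(y, final_probs_all)
print("scikit-learn AUC:", roc_auc_score(y, final_probs_all))
# This should agree (at least closely) with the trapezoidal AUC computed above.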

Why Logistic Regression is Better Than Perceptron

✅ Smooth learning: No sharp transitions

✅ Probability output: Expresses confidence

✅ Stable training: Gradients are well-behaved

❌ Still only straight-line separation: Same limitation as the perceptron

Key Takeaways from Step 3

✅ Logistic Regression: Makes decisions with confidence (probabilities)

✅ Sigmoid Function: Converts scores to probabilities

✅ Binary Cross-Entropy: Loss function for probabilities

✅ ROC Curve: Evaluates model performance

✅ Probability → Decision: Convert probabilities to binary decisions


Step 4: Multiple Neurons

The Big Idea

So far, we used one neuron. But real AI works like a team of neurons:

  • Each neuron looks at the data differently
  • Together they make better decisions

A Neural Network = Many neurons + Math + Repetition

From One Neuron to Many

Single neuron:

z = x · w + b

Multiple neurons (layer):

Z = X · W + b

Where:

  • X = input matrix (many samples)
  • W = weight matrix (many neurons)
  • b = bias vector

Dataset Example

# Features: Math score, Science score
# Target: Pass (1) or Fail (0)

X = np.array([
    [80, 70],  # Student 1
    [60, 65],  # Student 2
    [90, 95],  # Student 3
    [50, 45]   # Student 4
], dtype=float)

y = np.array([[1], [0], [1], [0]])

print("X shape:", X.shape)  # (4, 2) = 4 students, 2 features
print("y shape:", y.shape)  # (4, 1) = 4 students, 1 output

Code Explanation:

  • X: 2D matrix (4 students × 2 features)
  • Each row is one student: [math_score, science_score]
  • y: Target labels (2D array for matrix compatibility)

Weight Matrix (Many Neurons)

Let's use 3 neurons in one layer:

W = np.random.randn(2, 3)  # 2 features → 3 neurons
b = np.zeros((1, 3))

print("W shape:", W.shape)  # (2, 3)
print("b shape:", b.shape)  # (1, 3)

Code Explanation:

  • W = np.random.randn(2, 3): Weight matrix
    • Shape (2, 3): 2 input features → 3 neurons
    • Each column represents weights for one neuron
  • b = np.zeros((1, 3)): Bias vector
    • Shape (1, 3): One bias per neuron (3 biases total)

Forward Pass (Matrix Multiplication)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward pass
Z = X @ W + b
A = sigmoid(Z)

print("Z shape:", Z.shape)  # (4, 3) = 4 students, 3 neuron scores
print("A (outputs):\n", A)

Code Explanation:

  • Z = X @ W + b: Matrix multiplication
    • X @ W: (4×2) @ (2×3) = (4×3)
    • Each row of X × the weight matrix = scores for 3 neurons
    • Result: 4 students × 3 neuron scores
    • + b: Broadcasting adds bias to each neuron
  • A = sigmoid(Z): Apply activation function
    • Converts scores to activations (probabilities)
    • Applied element-wise to entire matrix
    • A shape: (4, 3) = 4 students, 3 neuron activations

Key insight: Each column in A = output of one neuron across all students.

Single Output Neuron (Combining the Layer)

Now we combine the neuron layer into one output neuron:

W_out = np.random.randn(3, 1)  # 3 hidden neurons → 1 output
b_out = np.zeros((1, 1))

Z_out = A @ W_out + b_out
y_pred = sigmoid(Z_out)

print("Final output probabilities:\n", y_pred)

Code Explanation:

  • W_out = np.random.randn(3, 1): Output layer weights
    • Shape (3, 1): 3 hidden neurons → 1 output neuron
  • Z_out = A @ W_out + b_out: Calculate final scores
    • A @ W_out: (4×3) @ (3×1) = (4×1)
    • Each student's 3 neuron activations → 1 final score
  • y_pred = sigmoid(Z_out): Convert to probabilities
    • Final predictions as probabilities (0 to 1)

Network flow: Input (4×2) → Hidden (4×3) → Output (4×1)

Training the Network

lr = 0.1
losses = []

for epoch in range(1000):
    # Forward
    Z = X @ W + b
    A = sigmoid(Z)

    Z_out = A @ W_out + b_out
    y_pred = sigmoid(Z_out)

    # Loss
    loss = binary_cross_entropy(y, y_pred)
    losses.append(loss)

    # Backpropagation (simplified)
    dZ_out = y_pred - y
    dW_out = A.T @ dZ_out
    db_out = np.mean(dZ_out, axis=0)

    dA = dZ_out @ W_out.T
    dZ = dA * A * (1 - A)  # Sigmoid derivative
    dW = X.T @ dZ
    db = np.mean(dZ, axis=0)

    # Update
    W_out -= lr * dW_out
    b_out -= lr * db_out
    W -= lr * dW
    b -= lr * db

Code Explanation:

  • Forward Pass: Calculate predictions through all layers
  • Backpropagation: Calculate gradients for all layers
    • Start from output layer and work backwards
    • Use chain rule from calculus
  • Update: Adjust all weights and biases
  • Key: Matrix operations make this efficient! (The sigmoid-derivative term A * (1 - A) is checked numerically in the sketch below.)
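
The dZ = dA * A * (1 - A) line relies on the sigmoid derivative σ'(z) = σ(z)(1 - σ(z)). Here is a quick, standalone numerical check of that identity (a side sketch, not part of the training code):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = 0.7       # arbitrary test point
eps = 1e-6

numerical = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # finite difference
analytical = sigmoid(z) * (1 - sigmoid(z))                      # the A * (1 - A) term

print(f"Numerical:  {numerical:.6f}")
print(f"Analytical: {analytical:.6f}")   # the two agree to about 6 decimal places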

Why This Matters

✅ First real neural network: Multiple neurons working together

✅ Multiple neurons learn different patterns: Each neuron specializes

✅ Matrix math = speed + power: Process all data at once

❌ Still limited to simple patterns: Need hidden layers for complex problems

Key Takeaways from Step 4

✅ Neural Network Layer: Multiple neurons working together

✅ Matrix Operations: Efficient processing of multiple samples

✅ Forward Pass: Data flows through the network

✅ Backpropagation: Gradients flow backwards

✅ Multi-layer Networks: Combine layers for more power


Step 5: Hidden Layers & XOR

The Big Idea (The WOW Moment)

Some problems cannot be solved with a single straight line.

This famous problem is called XOR.

If the perceptron fails ❌ and a deeper network succeeds ✅,

then depth really matters.

The XOR Problem

XOR truth table:

x₁ | x₂ | XOR
---|----|----
 0 |  0 |  0
 0 |  1 |  1
 1 |  0 |  1
 1 |  1 |  0

X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
], dtype=float)

y = np.array([[0], [1], [1], [0]])

Visual Representation:

x₂
 1 │    ●      ○
   │
   │    ○      ●
 0 └───────────────── x₁
   0              1

Problem: No single straight line can separate the classes!

Try a Single-Layer Model (It Will Fail)

W = np.random.randn(2, 1)
b = np.zeros((1,))

lr = 0.1
losses = []

for epoch in range(2000):
    z = X @ W + b
    y_pred = sigmoid(z)

    loss = binary_cross_entropy(y, y_pred)
    losses.append(loss)

    dW = X.T @ (y_pred - y)
    db = np.mean(y_pred - y)

    W -= lr * dW
    b -= lr * db

print("Single-layer predictions:", (y_pred >= 0.5).astype(int))

❌ The model cannot solve XOR. Loss never reaches zero.

Adding a Hidden Layer (Deep Learning)

Now we add one hidden layer with non-linearity.

Architecture:

  • Input layer (2 neurons)
  • Hidden layer (4 neurons)
  • Output layer (1 neuron)

Initialize the Network

W1 = np.random.randn(2, 4)  # Input → Hidden
b1 = np.zeros((1, 4))

W2 = np.random.randn(4, 1)  # Hidden → Output
b2 = np.zeros((1, 1))

lr = 0.1
losses = []

Code Explanation:

  • W1 = np.random.randn(2, 4): First layer weights (input → hidden)
    • Shape (2, 4): 2 input features → 4 hidden neurons
  • b1 = np.zeros((1, 4)): First layer biases
    • Shape (1, 4): One bias per hidden neuron
  • W2 = np.random.randn(4, 1): Second layer weights (hidden → output)
    • Shape (4, 1): 4 hidden neurons → 1 output neuron
  • b2 = np.zeros((1, 1)): Second layer bias
    • Shape (1, 1): Single bias for output neuron

Architecture: Input(2) → Hidden(4) → Output(1)

Forward Pass

def forward(X):
    Z1 = X @ W1 + b1
    A1 = sigmoid(Z1)  # Hidden layer activation

    Z2 = A1 @ W2 + b2
    A2 = sigmoid(Z2)  # Output layer activation

    return A1, A2

Code Explanation:

  • First Layer: Z1 = X @ W1 + b1, then A1 = sigmoid(Z1)
    • Input (4×2) → Hidden (4×4)
  • Second Layer: Z2 = A1 @ W2 + b2, then A2 = sigmoid(Z2)
    • Hidden (4×4) → Output (4×1)
  • Non-linearity: Sigmoid in hidden layer is crucial!

Training with Backpropagation

for epoch in range(5000):
    A1, y_pred = forward(X)

    loss = binary_cross_entropy(y, y_pred)
    losses.append(loss)

    # Backpropagation
    # Output layer gradients
    dZ2 = y_pred - y
    dW2 = A1.T @ dZ2
    db2 = np.mean(dZ2, axis=0)

    # Hidden layer gradients
    dA1 = dZ2 @ W2.T
    dZ1 = dA1 * A1 * (1 - A1)  # Sigmoid derivative
    dW1 = X.T @ dZ1
    db1 = np.mean(dZ1, axis=0)

    # Update weights
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1

Code Explanation:

  • Backpropagation: Calculate gradients backwards through network
    • Start from output layer: dZ2 = y_pred - y
    • Propagate to hidden layer: dA1 = dZ2 @ W2.T
    • Apply sigmoid derivative: dZ1 = dA1 * A1 * (1 - A1)
  • Update: Adjust all weights and biases
  • Key: Hidden layer allows network to learn non-linear patterns!

Final Predictions (SUCCESS 🎉)

_, final_probs = forward(X)
final_preds = (final_probs >= 0.5).astype(int)

print("Final probabilities:\n", final_probs)
print("Final predictions:\n", final_preds)
print("Actual values:\n", y)

✅ The network solves XOR!

Expected Output (approximate; exact values depend on the random initialization):

Final probabilities:
[[0.01]
 [0.99]
 [0.99]
 [0.01]]

Final predictions:
[[0]
 [1]
 [1]
 [0]]

Actual values:
[[0]
 [1]
 [1]
 [0]]

Perfect match! 🎉

What Just Happened?

  • Hidden neurons learned intermediate patterns: Each hidden neuron detects a different feature
  • Network combined them to solve XOR: Output neuron combines hidden neuron outputs
  • This is the foundation of deep learning: Depth creates new feature spaces

🧠 Key insight:

Depth creates new feature spaces. Hidden layers transform the input into a space where the problem becomes linearly separable.
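
You can see this directly by inspecting the hidden activations of the trained network. The sketch below assumes the trained W1, b1, W2, b2 and the forward() function from the training loop above are still in scope:

import numpy as np

A1, _ = forward(X)   # hidden-layer activations for the four XOR inputs

print("Hidden activations (one row per input [x1, x2]):")
print(np.round(A1, 2))
# The exact numbers depend on the random initialization, but the two target-1 inputs
# ([0,1] and [1,0]) land on the same side of the output neuron's linear boundary,
# which is why a single linear output layer is now enough.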

Understanding Overfitting

Overfitting occurs when a model learns the training data too well and fails to generalize. Common signs and remedies are listed below, followed by a small early-stopping sketch:

# Signs of overfitting:
# - Training loss continues decreasing
# - Validation loss decreases then increases
# - Model memorizes training data

# Solutions:
# 1. Early stopping: Stop when validation loss stops improving
# 2. Regularization: Add penalty for large weights (L1/L2)
# 3. Dropout: Randomly disable neurons during training
# 4. More data: Collect more training examples
# 5. Simpler model: Reduce model complexity
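
To make early stopping concrete, here is a small, self-contained sketch on made-up noisy pass/fail data; the dataset, the train/validation split, and the patience value are illustrative assumptions, not part of the examples above:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(y_true, p):
    return -np.mean(y_true * np.log(p + 1e-9) + (1 - y_true) * np.log(1 - p + 1e-9))

# Noisy synthetic data: pass/fail depends on study hours plus noise
rng = np.random.default_rng(0)
hours = rng.uniform(0, 6, size=200)
labels = (hours + rng.normal(0, 1.5, size=200) > 3).astype(float)

# Hold out the last 50 examples as a validation set
X_train, y_train = hours[:150], labels[:150]
X_val, y_val = hours[150:], labels[150:]

w, b, lr = 0.0, 0.0, 0.1
best_val, bad_epochs, patience = np.inf, 0, 50

for epoch in range(5000):
    p = sigmoid(w * X_train + b)                 # forward pass on training data
    w -= lr * np.mean((p - y_train) * X_train)   # gradient step, as in Step 3
    b -= lr * np.mean(p - y_train)

    val_loss = bce(y_val, sigmoid(w * X_val + b))
    if val_loss < best_val - 1e-6:
        best_val, bad_epochs = val_loss, 0       # validation improved: keep training
    else:
        bad_epochs += 1                          # no improvement this epoch
        if bad_epochs >= patience:
            print(f"Early stopping at epoch {epoch}, validation loss {val_loss:.4f}")
            break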

Why Step 5 Is Critical

✅ Explains why deep learning exists

✅ Shows the limits of shallow models

✅ Makes hidden layers intuitive

✅ Creates a lasting "Aha!" moment

Key Takeaways from Step 5

✅ XOR Problem: Cannot be solved with a single layer

✅ Hidden Layers: Enable non-linear pattern learning

✅ Backpropagation: Gradients flow backwards through the network

✅ Deep Learning: Depth creates new feature spaces

✅ Overfitting: The model can memorize instead of generalize


Step 6: PyTorch

The Big Idea

So far, you built everything by hand:

  • Weights
  • Forward pass
  • Backpropagation
  • Updates

Real AI engineers use frameworks like PyTorch to:

  • Write less code
  • Avoid bugs
  • Train large models faster

🧠 Important:

PyTorch does NOT replace understanding; it automates math you already know.

Installing PyTorch

pip install torch torchvision torchaudio

First PyTorch Tensor

A tensor is like a NumPy array, but smarter:

import torch

x = torch.tensor([1.0, 2.0, 3.0])
print(x)
print(type(x))  # <class 'torch.Tensor'>

Tensor vs NumPy:

NumPy | PyTorch
------|--------
array | tensor
CPU only | CPU / GPU
manual gradients | automatic gradients

Tensors and NumPy arrays also convert back and forth directly, as the short sketch below shows.
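
A minimal sketch of that conversion (the values are arbitrary):

import numpy as np
import torch

a = np.array([1.0, 2.0, 3.0])

t = torch.from_numpy(a)   # NumPy array -> tensor (shares the same memory)
back = t.numpy()          # tensor -> NumPy array

print(t)                  # tensor([1., 2., 3.], dtype=torch.float64)
print(back)               # [1. 2. 3.]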

Autograd (Automatic Gradients)

PyTorch can calculate gradients automatically:

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x

y.backward()

print("dy/dx =", x.grad)  # dy/dx = 7.0

Code Explanation:

  • requires_grad=True: Tell PyTorch to track gradients
  • y.backward(): Calculate gradients automatically
  • x.grad: Access the gradient
  • This replaces manual derivative calculations! (The sketch below redoes the Step 1 gradient with autograd.)
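
To connect autograd back to Step 1, the sketch below lets PyTorch compute the linear-regression gradients that were previously derived by hand, using the study-hours data from Step 1:

import torch

hours = torch.tensor([1., 2., 3., 4.])
scores = torch.tensor([50., 60., 70., 80.])

w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

loss = torch.mean((w * hours + b - scores) ** 2)   # MSE, as in Step 1
loss.backward()                                    # autograd computes both gradients

print("dw:", w.grad)   # 2 * mean((y_pred - y) * X); Step 1 dropped the factor 2,
print("db:", b.grad)   # which only rescales the learning rate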

Dataset Example (XOR Again)

We reuse XOR to prove PyTorch works:

X = torch.tensor([[0., 0.],
                  [0., 1.],
                  [1., 0.],
                  [1., 1.]])

y = torch.tensor([[0.],
                  [1.],
                  [1.],
                  [0.]])

Defining a Neural Network

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 4),   # input → hidden (2 features → 4 neurons)
    nn.ReLU(),         # activation function
    nn.Linear(4, 1),   # hidden → output (4 neurons → 1 output)
    nn.Sigmoid()       # probability output
)

print(model)

Code Explanation:

  • nn.Sequential: Layers stacked sequentially
  • nn.Linear(2, 4): Linear layer (weights + bias)
    • 2 input features → 4 output neurons
    • Equivalent to the X @ W + b computation from earlier steps
  • nn.ReLU(): Rectified Linear Unit activation
    • ReLU(x) = max(0, x)
    • Faster than sigmoid, helps with deep networks
  • nn.Linear(4, 1): Output layer
    • 4 hidden neurons → 1 output
  • nn.Sigmoid(): Convert to probability

🧠 Connection to previous steps:

  • Linear = weights + bias (from Steps 0-5)
  • ReLU/Sigmoid = activation (from Steps 3-5)
  • Layers = matrices (from Step 4)

Loss Function & Optimizer

loss_fn = nn.BCELoss()  # Binary Cross-Entropy Loss
optimizer = torch.optim.SGD(
    model.parameters(),  # All weights and biases
    lr=0.1                # Learning rate
)

Code Explanation:

  • nn.BCELoss(): Binary Cross-Entropy Loss (from Step 3)
  • torch.optim.SGD: Stochastic Gradient Descent optimizer
    • model.parameters(): All weights and biases in the model
    • lr=0.1: Learning rate

Concept | PyTorch
--------|--------
Loss | nn.BCELoss()
Gradient Descent | optim.SGD
Weights | model.parameters()

Training Loop (Very Important)

losses = []

for epoch in range(3000):
    # Forward pass
    y_pred = model(X)
    loss = loss_fn(y_pred, y)
    losses.append(loss.item())

    # Backward pass
    optimizer.zero_grad()  # Clear previous gradients
    loss.backward()        # Calculate gradients
    optimizer.step()       # Update weights

print(f"Final loss: {losses[-1]:.4f}")

Code Explanation:

  • Forward Pass: y_pred = model(X)
    • Data flows through all layers automatically
  • Loss Calculation: loss = loss_fn(y_pred, y)
    • Compare predictions with actual values
  • Backward Pass:
    • optimizer.zero_grad(): Clear gradients from previous iteration
    • loss.backward(): Calculate gradients automatically (autograd!)
    • optimizer.step(): Update all weights and biases
  • This loop replaces everything you coded manually before!

Learning Curve

import matplotlib.pyplot as plt

plt.plot(losses)
plt.title("Training Loss (PyTorch)")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.grid(True, alpha=0.3)
plt.show()

✅ Loss decreases → the model is learning.

Final Predictions

with torch.no_grad():  # Disable gradient computation (faster)
    probs = model(X)
    preds = (probs >= 0.5).int()

print("Probabilities:\n", probs)
print("Predictions:\n", preds)
print("Actual:\n", y.int())

Code Explanation:

  • torch.no_grad(): Disable gradient tracking (faster for inference)
  • model(X): Make predictions
  • (probs >= 0.5).int(): Convert probabilities to binary decisions

🎉 PyTorch solved XOR successfully!

Model Saving and Loading

Saving a Model:

model_path = "xor_model.pth"
torch.save({
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': losses[-1],
}, model_path)

print(f"βœ… Model saved to {model_path}")
Enter fullscreen mode Exit fullscreen mode

Code Explanation:

  • model.state_dict(): Dictionary containing all model weights
  • optimizer.state_dict(): Optimizer state (learning rate, momentum, etc.)
  • torch.save(): Save to file

Loading a Model:

# Create a new model instance (untrained)
new_model = nn.Sequential(
    nn.Linear(2, 4),
    nn.ReLU(),
    nn.Linear(4, 1),
    nn.Sigmoid()
)

# Load the saved model
checkpoint = torch.load(model_path)
new_model.load_state_dict(checkpoint['model_state_dict'])
print("βœ… Model loaded successfully!")

# Test loaded model
with torch.no_grad():
    loaded_probs = new_model(X)
    loaded_preds = (loaded_probs >= 0.5).int()
    loaded_accuracy = (loaded_preds == y.int()).float().mean().item()

print(f"Loaded model accuracy: {loaded_accuracy:.2%}")

Code Explanation:

  • torch.load(): Load checkpoint from file
  • checkpoint['model_state_dict']: Extract model weights
  • new_model.load_state_dict(): Load weights into model
  • Model now has same weights as saved model!

Comparing Scratch vs PyTorch

From Scratch | PyTorch
-------------|--------
Manual gradients | Automatic
More code | Less code
Easy to make mistakes | Safer
Best for learning | Best for real projects

🧠 Rule:

Learn from scratch → build with PyTorch.

Key Takeaways from Step 6

✅ PyTorch: Professional AI framework

✅ Tensors: Like NumPy arrays but with gradients

✅ Autograd: Automatic gradient calculation

✅ nn.Module: Define neural networks easily

✅ Training Loop: Forward → Loss → Backward → Update

✅ Model Saving: Save and load trained models


The Complete Picture

The Journey So Far

  1. Step 0: Math foundations (vectors, weights, dot product, bias)
  2. Step 1: Linear Regression (predicting numbers, gradient descent)
  3. Step 2: Perceptron (binary decisions, step function)
  4. Step 3: Logistic Regression (probabilistic decisions, sigmoid)
  5. Step 4: Multiple Neurons (neural network layers, matrix operations)
  6. Step 5: Hidden Layers (deep learning, XOR problem, backpropagation)
  7. Step 6: PyTorch (professional framework, autograd, training loops)

How Everything Connects

Step 0: Math Foundations
    ↓
Step 1: Linear Regression (predict numbers)
    ↓
Step 2: Perceptron (make decisions)
    ↓
Step 3: Logistic Regression (decisions with confidence)
    ↓
Step 4: Multiple Neurons (neural network layers)
    ↓
Step 5: Hidden Layers (deep learning, solve XOR)
    ↓
Step 6: PyTorch (professional tools)

Core Concepts You've Mastered

  1. Mathematical Foundations

    • Vectors, matrices, dot products
    • Weights, bias, neurons
    • Matrix operations
  2. Learning Algorithms

    • Gradient descent
    • Backpropagation
    • Weight updates
  3. Neural Networks

    • Single neurons
    • Multiple neurons (layers)
    • Hidden layers (deep networks)
    • Forward and backward passes
  4. Loss Functions

    • Mean Squared Error (MSE)
    • Binary Cross-Entropy
    • When to use each
  5. Activation Functions

    • Step function (perceptron)
    • Sigmoid (logistic regression)
    • ReLU (deep networks)
  6. Professional Tools

    • PyTorch framework
    • Automatic gradients
    • Model saving/loading

What You Can Build Now

✅ Linear Regression Models: Predict numbers from data

✅ Binary Classifiers: Make YES/NO decisions

✅ Probabilistic Classifiers: Express confidence in decisions

✅ Neural Networks: Multi-layer networks with hidden layers

✅ Deep Learning Models: Solve non-linear problems like XOR

✅ PyTorch Models: Professional AI applications


Key Takeaways

Mathematical Understanding

You now understand:

  • How AI represents data (vectors, matrices)
  • How AI combines information (dot product, matrix multiplication)
  • How AI makes adjustments (bias, weights)
  • How AI learns (gradient descent, backpropagation)

Coding Skills

You can now:

  • Build neural networks from scratch
  • Use PyTorch for professional development
  • Understand and modify existing AI code
  • Debug training issues
  • Save and load models

Conceptual Mastery

You understand:

  • Why deep learning exists (XOR problem)
  • How neural networks learn (gradient descent)
  • When to use different activation functions
  • How to evaluate models (loss, accuracy, ROC curves)
  • The difference between training and inference

Next Steps

After completing Steps 0-6, you're ready for:

  • Step 7: Recurrent Neural Networks (RNNs) for sequences
  • Step 8: Convolutional Neural Networks (CNNs) for images
  • Advanced Topics: Transformers, GANs, Reinforcement Learning
  • Real Projects: Build your own AI applications

Conclusion

Congratulations! You've completed a comprehensive journey from basic mathematical concepts to building neural networks with PyTorch.

What makes this journey special:

  • No black boxes: You understand every concept from first principles
  • Progressive learning: Each step builds naturally on the previous
  • Code + Concepts: Both implementation and theory are covered
  • Real understanding: You know how AI works, not just how to use it

Remember:

  • Every expert was once a beginner
  • Understanding > memorization
  • Practice makes perfect
  • Keep building projects!

🚀 You are now ready to build real AI applications!


This article covers Steps 0-6 of the AI Bootcamp. For advanced topics (RNNs, CNNs, Transformers, etc.), continue with Steps 7-13.
