<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: mahraib</title>
    <description>The latest articles on DEV Community by mahraib (@mahraib_fatima).</description>
    <link>https://dev.to/mahraib_fatima</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3072430%2F05eb3870-f200-4570-9244-765805f3fe17.jpg</url>
      <title>DEV Community: mahraib</title>
      <link>https://dev.to/mahraib_fatima</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mahraib_fatima"/>
    <language>en</language>
    <item>
      <title>forward propagation - day 05 of dl</title>
      <dc:creator>mahraib</dc:creator>
      <pubDate>Mon, 19 Jan 2026 17:14:01 +0000</pubDate>
      <link>https://dev.to/mahraib_fatima/forward-propogation-1fn4</link>
      <guid>https://dev.to/mahraib_fatima/forward-propogation-1fn4</guid>
      <description>&lt;p&gt;while reading &amp;amp; learning about forward propagation (FP), i came across many definitions and found this one simple and exact.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Forward propagation is where input data is fed through a network, in a forward direction, to generate an output. The data is accepted by hidden layers and processed, as per the activation function, and moves to the successive layer. The forward flow of data is designed to avoid data moving in a circular motion, which does not generate an output. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://h2o.ai/wiki/forward-propagation/" rel="noopener noreferrer"&gt;reference&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  math representation:
&lt;/h3&gt;

&lt;p&gt;-&amp;gt; for &lt;code&gt;l&lt;/code&gt; layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Z⁽ˡ⁾ = W⁽ˡ⁾ · A⁽ˡ⁻¹⁾ + b⁽ˡ⁾
A⁽ˡ⁾ = f(Z⁽ˡ⁾)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;-&amp;gt; for one neuron forward pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
a = f(z)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;where &lt;code&gt;f&lt;/code&gt; is activation function.&lt;/p&gt;
&lt;/blockquote&gt;
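
&lt;p&gt;a tiny numeric check of the one-neuron formula above (a sketch; the weights, inputs, and bias are made-up values):&lt;/p&gt;

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

w = np.array([0.5, -0.25])    # made-up weights
x = np.array([2.0, 4.0])      # made-up inputs
b = 0.1                       # made-up bias

z = np.dot(w, x) + b          # 0.5*2 - 0.25*4 + 0.1 = 0.1
a = sigmoid(z)                # activation squashes z into (0, 1)
print(z, a)
```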

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqpte03sr1fdsyqyvnpc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqpte03sr1fdsyqyvnpc.png" alt=" " width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;here is simple code for forward propagation, with 2 hidden layers + an output layer&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;maximum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward_propagation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;relu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="c1"&gt;#layer 1
&lt;/span&gt;    &lt;span class="n"&gt;Z1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;A1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Z1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;#layer 2  
&lt;/span&gt;    &lt;span class="n"&gt;Z2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;A1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;A2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Z2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;#output layer
&lt;/span&gt;    &lt;span class="n"&gt;Z3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;A2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;A3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Z3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Z1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Z1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;A1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Z2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Z2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;A2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Z3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Z3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;A3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;A3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
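
&lt;p&gt;to actually call this function, the parameters dict has to be built first. here is a minimal sketch that re-declares the functions so it runs standalone; the layer sizes (4 → 5 → 3 → 1) are made-up:&lt;/p&gt;

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def forward_propagation(X, parameters):
    # layer 1
    Z1 = np.dot(parameters['W1'], X) + parameters['b1']
    A1 = relu(Z1)
    # layer 2
    Z2 = np.dot(parameters['W2'], A1) + parameters['b2']
    A2 = relu(Z2)
    # output layer
    Z3 = np.dot(parameters['W3'], A2) + parameters['b3']
    A3 = sigmoid(Z3)
    return A3

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]     # made-up: 4 inputs, hidden sizes 5 and 3, 1 output
parameters = {}
for l in range(1, len(sizes)):
    parameters[f'W{l}'] = rng.standard_normal((sizes[l], sizes[l - 1])) * 0.01
    parameters[f'b{l}'] = np.zeros((sizes[l], 1))

X = rng.standard_normal((4, 10))   # 10 made-up samples, one per column
A3 = forward_propagation(X, parameters)
print(A3.shape)                    # (1, 10): one probability per sample
```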



&lt;h3&gt;
  
  
  benefits of forward propagation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;easy computation: just matrix multiplications and element-wise operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  why do we need backpropagation?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;no learning: because forward propagation only computes predictions; it doesn't update weights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;no error signal: forward propagation alone doesn't tell us how wrong we are. to determine how bad the network's prediction is, we need to compute a loss function &amp;amp; update the weights.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Vanishing gradient &amp; dying relu - day 04 of dl</title>
      <dc:creator>mahraib</dc:creator>
      <pubDate>Fri, 16 Jan 2026 18:31:09 +0000</pubDate>
      <link>https://dev.to/mahraib_fatima/vanishing-gradient-dying-relu-day-04-of-dl-9j5</link>
      <guid>https://dev.to/mahraib_fatima/vanishing-gradient-dying-relu-day-04-of-dl-9j5</guid>
      <description>&lt;p&gt;yesterday, while learning about activation functions, we came across 2 distinct terms. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;vanishing gradient&lt;/li&gt;
&lt;li&gt;dying relu&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;here is a short summary of these. &lt;/p&gt;

&lt;h3&gt;
  
  
  vanishing gradient
&lt;/h3&gt;

&lt;p&gt;vanishing gradient is a problem that happens during training in deep neural networks, especially those using activation functions like &lt;code&gt;sigmoid&lt;/code&gt; or &lt;code&gt;tanh&lt;/code&gt;.&lt;br&gt;
what happens?&lt;/p&gt;

&lt;p&gt;during &lt;code&gt;backpropagation&lt;/code&gt;, gradients (derivatives) are calculated and passed backward through the network. these gradients tell the model how much to adjust each weight to reduce error.&lt;/p&gt;

&lt;p&gt;with certain activation functions, the gradient can become extremely small (close to zero) as it gets multiplied layer by layer.&lt;/p&gt;

&lt;p&gt;if the gradient becomes too small, the weights in earlier layers receive almost no update, so they stop learning.&lt;/p&gt;



&lt;p&gt;why does it happen?&lt;/p&gt;

&lt;p&gt;for example, the derivative of &lt;code&gt;sigmoid&lt;/code&gt;, written in terms of the sigmoid output &lt;code&gt;sig = sigmoid(x)&lt;/code&gt;, is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sigmoid_derivative&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sig&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;since &lt;code&gt;sigmoid(x)&lt;/code&gt; is between 0 and 1, the derivative is between 0 and 0.25.&lt;br&gt;
if you multiply many small numbers (like &lt;code&gt;0.1*0.1*0.1...&lt;/code&gt;), the result approaches zero very quickly.&lt;/p&gt;
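
&lt;p&gt;this shrinkage is easy to see numerically: even in the sigmoid's best case (derivative = 0.25, reached at x = 0), ten layers scale the gradient by 0.25¹⁰:&lt;/p&gt;

```python
# best case for sigmoid: derivative = 0.25 (reached at x = 0);
# backprop multiplies one such factor per layer
grad = 1.0
for layer in range(10):
    grad *= 0.25          # per-layer scaling, best case
print(grad)               # 9.5367431640625e-07: effectively zero after 10 layers
```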

&lt;p&gt;&lt;code&gt;tanh&lt;/code&gt; has a similar problem: its derivative is between 0 and 1, but for large inputs it also saturates and gives near-zero gradients.&lt;/p&gt;



&lt;p&gt;the result:&lt;/p&gt;

&lt;p&gt;· early layers learn very slowly or not at all. &lt;/p&gt;

&lt;p&gt;· deep networks become hard or impossible to train. &lt;/p&gt;



&lt;p&gt;how is it solved?&lt;/p&gt;

&lt;p&gt;modern activation functions like &lt;code&gt;relu&lt;/code&gt; help because:&lt;/p&gt;

&lt;p&gt;· for x &amp;gt; 0, derivative is exactly 1, so gradients don’t shrink&lt;br&gt;
· no saturation in the positive region&lt;/p&gt;

&lt;p&gt;but &lt;code&gt;relu&lt;/code&gt; introduces its own problem: &lt;code&gt;dying relu&lt;/code&gt;, where neurons can get stuck at zero and also stop learning.&lt;br&gt;
variants like &lt;code&gt;leaky relu&lt;/code&gt;, &lt;code&gt;elu&lt;/code&gt;, and &lt;code&gt;gelu&lt;/code&gt; try to fix this while keeping gradients flowing.&lt;/p&gt;


&lt;h3&gt;
  
  
  dying relu
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;dying relu&lt;/code&gt; is a problem that happens when neurons using the &lt;code&gt;relu&lt;/code&gt; activation function become permanently "dead", meaning they output zero for all inputs and never recover.&lt;/p&gt;



&lt;p&gt;what happens?&lt;/p&gt;

&lt;p&gt;a &lt;code&gt;relu&lt;/code&gt; neuron outputs:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;relu(x) = max(0, x)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;this means:&lt;/p&gt;

&lt;p&gt;· if the weighted sum  x  is positive → output =  x &lt;br&gt;
· if  x  is negative → output = 0&lt;/p&gt;

&lt;p&gt;the derivative is:&lt;/p&gt;

&lt;p&gt;·  x &amp;gt; 0  → 1&lt;br&gt;
·  x &amp;lt; 0  → 0&lt;/p&gt;

&lt;p&gt;(at exactly x = 0 the derivative is undefined; in practice libraries just use 0.)&lt;/p&gt;



&lt;p&gt;how do neurons die?&lt;/p&gt;

&lt;p&gt;during training, if a neuron's weighted sum becomes negative for all training examples, its gradient becomes 0 (because derivative is 0 for negative inputs).&lt;/p&gt;

&lt;p&gt;once the gradient is 0, the weights won’t update → the neuron stays "off" forever → it's dead.&lt;/p&gt;

&lt;p&gt;this is especially common if:&lt;/p&gt;

&lt;p&gt;· learning rate is too high. &lt;/p&gt;

&lt;p&gt;· large weight updates push the neuron into negative territory permanently. &lt;/p&gt;

&lt;p&gt;· bad weight initialization. &lt;/p&gt;
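
&lt;p&gt;a dead neuron is easy to demonstrate: give it a large negative bias (as if a bad update pushed it there) and its pre-activation is negative for every input, so both the output and the relu gradient are zero everywhere. the numbers below are made up:&lt;/p&gt;

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

w = np.array([0.5, -0.3])          # made-up weights
b = -100.0                         # huge negative bias, e.g. after a bad update

X = np.random.default_rng(1).standard_normal((100, 2))  # 100 made-up inputs
z = X @ w + b                      # pre-activation for every sample
a = relu(z)

# z is negative everywhere, so the output and the relu gradient are 0 everywhere
print(bool(np.all(0 > z)), bool(np.all(a == 0)))  # True True
```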



&lt;p&gt;why is it a problem?&lt;/p&gt;

&lt;p&gt;dead neurons don’t contribute to learning, they’re wasted parameters. too many dead neurons can reduce the network’s capacity and slow learning.&lt;/p&gt;



&lt;p&gt;how to fix it?&lt;/p&gt;

&lt;p&gt;use variants of &lt;code&gt;relu&lt;/code&gt; that allow a small gradient for negative inputs:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;leaky relu&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;leaky_relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;maximum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ small slope (alpha) for negatives, so gradient never fully dies.&lt;/p&gt;
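
&lt;p&gt;the difference shows up directly in the gradients. a small sketch (the derivative formulas are written out by hand):&lt;/p&gt;

```python
import numpy as np

# hand-written derivatives (for x != 0)
def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_grad(x))        # [0. 0. 1. 1.]  -> negative inputs get no update at all
print(leaky_relu_grad(x))  # small alpha gradient for negatives -> they still learn
```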

&lt;p&gt;parametric relu (prelu):&lt;br&gt;
like leaky &lt;code&gt;relu&lt;/code&gt;, but alpha is learned.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;elu&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;elu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;smooth for negatives, helps mean activations stay closer to zero.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;if you've read this far.. thanks for reading. ✨&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>activation functions - day 03 of dl</title>
      <dc:creator>mahraib</dc:creator>
      <pubDate>Thu, 15 Jan 2026 18:34:18 +0000</pubDate>
      <link>https://dev.to/mahraib_fatima/activation-functions-day-03-of-dl-41fe</link>
      <guid>https://dev.to/mahraib_fatima/activation-functions-day-03-of-dl-41fe</guid>
      <description>&lt;h1&gt;
  
  
  activation functions
&lt;/h1&gt;

&lt;p&gt;a neural network without an activation function is just a giant linear regression model, no matter how many layers. an activation function is a non-linear transformation applied to a neuron's weighted input sum.&lt;/p&gt;

&lt;h2&gt;
  
  
  sigmoid (logistic function)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;range is &lt;code&gt;(0, 1)&lt;/code&gt;. perfect for binary classification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;flaws:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vanishing gradient.&lt;/li&gt;
&lt;li&gt;computationally expensive (exponential calculation).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;when to use today:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;never in hidden layers.&lt;/li&gt;
&lt;li&gt;use in output layer for classification problems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;side note: don't be surprised by the formula representations or think they're AI generated. i have pretty good experience in latex/math pdf editing.&lt;/p&gt;

&lt;h2&gt;
  
  
  hyperbolic tangent (tanh)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;range is &lt;code&gt;(-1, 1)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;betterment:&lt;/strong&gt; zero-centered, leading to faster convergence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;flaw:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vanishing gradient with inputs of large magnitude.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;when to use:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sometimes in hidden layers of rnns/lstms.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  rectified linear unit (relu)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;maximum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;range is &lt;code&gt;[0, ∞)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;betterment:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;solved vanishing gradient problem: derivative is 1 for &lt;code&gt;x &amp;gt; 0&lt;/code&gt;, so gradient flows freely.&lt;/li&gt;
&lt;li&gt;computation is cheap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;flaw:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dying relu.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;when to use:&lt;/strong&gt; the default for hidden layers (used ~90% of the time). if it works, don't touch it.&lt;/p&gt;

&lt;h2&gt;
  
  
  leaky relu
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;leaky_relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;maximum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;betterment:&lt;/strong&gt; provides a small, non-zero slope for negative inputs, allowing neurons to recover.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;when to use:&lt;/strong&gt; if the "dying relu" problem occurs (check activation stats).&lt;/p&gt;

&lt;h2&gt;
  
  
  exponential linear unit (elu)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;elu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  gaussian error linear unit (gelu)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gelu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.044715&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;betterment:&lt;/strong&gt; the default in sota transformers (e.g. bert, gpt).&lt;/p&gt;

&lt;p&gt;side note: i read my second research paper, but it was the first one i read from a learning perspective, so i'm happy about it. this formula was also copied from a research paper.&lt;/p&gt;

&lt;h2&gt;
  
  
  swish (from google brain)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;swish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;beta&lt;/code&gt; is often fixed at 1, which makes swish the same as the silu function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;when to use:&lt;/strong&gt; a good alternative to relu, often used in cnn tasks.&lt;/p&gt;
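
&lt;p&gt;a quick sanity check of the ranges described above, running the functions on the same inputs (a sketch that re-declares them so it runs standalone):&lt;/p&gt;

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

x = np.linspace(-5.0, 5.0, 101)

print(sigmoid(x).min(), sigmoid(x).max())  # stays strictly inside (0, 1)
print(np.tanh(x).min(), np.tanh(x).max())  # stays strictly inside (-1, 1)
print(relu(x).min(), relu(x).max())        # range [0, inf): never negative
print(swish(x).min())                      # slightly negative: swish dips below zero
```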

&lt;p&gt;here is the link to the github md file which defines these formulas.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>hidden layer - day 02 of dl</title>
      <dc:creator>mahraib</dc:creator>
      <pubDate>Tue, 13 Jan 2026 16:21:23 +0000</pubDate>
      <link>https://dev.to/mahraib_fatima/hidden-layer-3b95</link>
      <guid>https://dev.to/mahraib_fatima/hidden-layer-3b95</guid>
      <description>&lt;p&gt;a &lt;strong&gt;hidden layer&lt;/strong&gt; is an intermediate layer between the input and output layers in a neural network. it's called "hidden" because its outputs are not directly observable as final outputs from the network.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;key points:&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. transformation function:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;each hidden layer performs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;linear transformation&lt;/strong&gt;: &lt;code&gt;z = w·x + b&lt;/code&gt; (weights × inputs + bias)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;matrix representation:&lt;/strong&gt;&lt;br&gt;
        for a hidden layer with &lt;code&gt;m&lt;/code&gt; inputs and &lt;code&gt;n&lt;/code&gt; neurons:&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     hidden layer output = activation(w·x + b)
    where:
      w = weight matrix of shape (n × m)
        x = input vector of shape (m × 1)
          b = bias vector of shape (n × 1)
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;non-linear activation&lt;/strong&gt;: &lt;code&gt;a = f(z)&lt;/code&gt; (relu, sigmoid, tanh, etc.)&lt;br&gt;
impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;sigmoid/tanh&lt;/strong&gt;: used in the early days; suffer from the vanishing gradient problem. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;relu&lt;/strong&gt;: modern default, solves vanishing gradient but has "dying relu" problem. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;leaky relu/elu&lt;/strong&gt;: address dying relu issue. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;swish/mish&lt;/strong&gt;: recent alternatives, often better performance. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
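&lt;p&gt;the shapes above can be checked with a small numpy sketch (relu as the activation; the sizes &lt;code&gt;m = 4&lt;/code&gt;, &lt;code&gt;n = 3&lt;/code&gt; are my own picks, just for illustration):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 4, 3                      # m inputs, n neurons
w = rng.standard_normal((n, m))  # weight matrix of shape (n, m)
x = rng.standard_normal((m, 1))  # input vector of shape (m, 1)
b = np.zeros((n, 1))             # bias vector of shape (n, 1)

z = w @ x + b                    # linear transformation: z = w·x + b
a = np.maximum(0.0, z)           # non-linear activation: relu
print(a.shape)                   # one output per neuron, shape (3, 1)
```

stacking several of these blocks, each feeding its output into the next, is all a feed-forward network is.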

&lt;p&gt;activation functions will be discussed in detail later.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. what happens in a hidden layer:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;feature extraction&lt;/strong&gt;: learns patterns from previous layer's outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;hierarchical learning&lt;/strong&gt;: early layers learn simple features, deeper layers combine them.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. why are hidden layers so important?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;example: cat image classification&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;layer&lt;/th&gt;
&lt;th&gt;what it "sees"&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;input&lt;/td&gt;
&lt;td&gt;raw pixels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hidden 1&lt;/td&gt;
&lt;td&gt;edge detectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hidden 2&lt;/td&gt;
&lt;td&gt;texture patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hidden 3&lt;/td&gt;
&lt;td&gt;object parts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hidden 4&lt;/td&gt;
&lt;td&gt;whole objects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;output&lt;/td&gt;
&lt;td&gt;classification&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;the "deep" in deep learning:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;the term &lt;strong&gt;"deep"&lt;/strong&gt; in deep learning specifically refers to having &lt;strong&gt;multiple hidden layers&lt;/strong&gt;. this depth enables:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;automatic feature engineering&lt;/strong&gt;: no need for manual feature extraction. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;hierarchical understanding&lt;/strong&gt;: from pixels to concepts. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;transfer learning&lt;/strong&gt;: early layers often learn general features transferable between tasks. &lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;the takeaway:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;hidden layers are &lt;strong&gt;learned feature extractors&lt;/strong&gt;. the &lt;strong&gt;depth&lt;/strong&gt; (number of hidden layers) and &lt;strong&gt;architecture&lt;/strong&gt; of these layers determine what kind of patterns the network can learn and how well it can learn them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;without hidden layers, neural networks would be just linear regression. with them, they can learn the complex patterns that power modern ai applications.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>perceptron - day 01 of dl</title>
      <dc:creator>mahraib</dc:creator>
      <pubDate>Mon, 12 Jan 2026 19:36:55 +0000</pubDate>
      <link>https://dev.to/mahraib_fatima/perceptron-day-01-of-dl-4lka</link>
      <guid>https://dev.to/mahraib_fatima/perceptron-day-01-of-dl-4lka</guid>
      <description>&lt;p&gt;while starting learning neural networks, perceptron is the first thing. it's simple and shows how learning from points works.&lt;/p&gt;

&lt;h2&gt;
  
  
  how it works
&lt;/h2&gt;

&lt;p&gt;a perceptron draws a straight line to separate two types of data. it calculates:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;output = w1*x1 + w2*x2 + ... + b&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;if output is positive, it says "class a". &lt;br&gt;
if negative, "class b".&lt;/p&gt;

&lt;p&gt;to learn, it uses this trick:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;start with random weights.&lt;/li&gt;
&lt;li&gt;check one point.&lt;/li&gt;
&lt;li&gt;if wrong, adjust weights toward that point.&lt;/li&gt;
&lt;li&gt;repeat until all points are right.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;the update looks like this:&lt;br&gt;
&lt;code&gt;new weight = old weight + learning rate * (true label - predicted label) * input&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;simple idea: if you're wrong, move the line toward the mistake.&lt;/p&gt;
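&lt;p&gt;the four steps and the update rule fit in a few lines of python (the tiny dataset and the -1/+1 label convention are my own choices for the sketch):&lt;/p&gt;

```python
import numpy as np

# toy linearly separable data: class is +1 only when both inputs are 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])

w = np.zeros(2)   # step 1: start with (here zero) weights
b = 0.0
lr = 0.1

for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):          # step 2: check one point
        pred = 1 if xi @ w + b > 0 else -1
        if pred != yi:                # step 3: if wrong, adjust toward it
            w += lr * (yi - pred) * xi
            b += lr * (yi - pred)
            errors += 1
    if errors == 0:                   # step 4: repeat until all are right
        break

print(w, b)
```

note that it stops at the first line that separates everything, which is exactly the flaw discussed next.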

&lt;h2&gt;
  
  
  the problem
&lt;/h2&gt;

&lt;p&gt;the perceptron stops as soon as all training points are correct. but there are often many possible lines that all work perfectly.&lt;/p&gt;

&lt;p&gt;imagine separating two groups of points. you could draw the line close to one group, close to the other, or in the middle. all would be 100% correct on your training data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfagc2co181440qcrqyp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfagc2co181440qcrqyp.png" alt=" " width="726" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;the perceptron picks whichever line it finds first; it could be line A, B, or C. which one you get depends on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;random starting weights.&lt;/li&gt;
&lt;li&gt;the order of points.&lt;/li&gt;
&lt;li&gt;luck.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;it has one big flaw: &lt;code&gt;it finds any solution that works, not the best one.&lt;/code&gt;&lt;br&gt;
train twice, get two different lines. both work on your training data, but one might be much better than the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  why this matters
&lt;/h2&gt;

&lt;p&gt;a line that just barely separates the data is fragile. real data has noise. new points won't be exactly like your training points. a tight boundary will make mistakes easily.&lt;/p&gt;

&lt;p&gt;what we want is the line in the middle of the gap, farthest from both groups. this is more robust and handles new data better.&lt;/p&gt;

&lt;h2&gt;
  
  
  how loss functions help
&lt;/h2&gt;

&lt;p&gt;loss functions change the question. instead of "is this wrong?" they ask "how wrong is this?" or "how confidently right is this?"&lt;/p&gt;

&lt;p&gt;look at hinge loss:&lt;br&gt;
&lt;code&gt;loss = max(0, 1 - true label * prediction)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;even if a point is correct, there's still loss if the prediction isn't confident enough. this pushes the line away from points, creating a safety margin.&lt;/p&gt;
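&lt;p&gt;plugging a few numbers into the formula shows the margin effect (the prediction values are made-up examples):&lt;/p&gt;

```python
def hinge_loss(true_label, prediction):
    # loss = max(0, 1 - true_label * prediction)
    return max(0.0, 1.0 - true_label * prediction)

print(hinge_loss(+1, 2.5))   # confidently right: loss 0.0
print(hinge_loss(+1, 0.3))   # right but not confident: still penalized
print(hinge_loss(+1, -1.0))  # wrong: loss 2.0
```

the middle case is the interesting one: the point is classified correctly, yet the loss is nonzero, so training keeps pushing the boundary away from it.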

&lt;h2&gt;
  
  
  gradient descent: better learning
&lt;/h2&gt;

&lt;p&gt;with loss functions, we don't update based on single points. we look at all data and find the average error. then we adjust weights to reduce this error most effectively.&lt;/p&gt;

&lt;p&gt;this is gradient descent:&lt;br&gt;
&lt;code&gt;new weight = old weight - learning rate * slope of loss&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;the minus sign is key: we go downhill toward lower error.&lt;/p&gt;
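&lt;p&gt;a one-dimensional sketch of the update rule (the toy loss &lt;code&gt;(w - 3)^2&lt;/code&gt; is made up just to show the downhill motion):&lt;/p&gt;

```python
def loss_slope(w):
    # derivative of the toy loss (w - 3)**2 with respect to w
    return 2 * (w - 3)

w = 0.0
lr = 0.1
for step in range(100):
    w = w - lr * loss_slope(w)   # minus sign: step downhill, toward lower error

print(round(w, 4))  # -> 3.0, the minimum of the toy loss
```

the same rule, applied to every weight at once with slopes from the full dataset, is how real networks train.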

&lt;h2&gt;
  
  
  the takeaway
&lt;/h2&gt;

&lt;p&gt;the perceptron shows the basics of learning. but it sees the world as binary: right or wrong.&lt;/p&gt;

&lt;p&gt;real problems need more nuance. loss functions provide that. they let us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;work with data that can't be perfectly separated.&lt;/li&gt;
&lt;li&gt;measure degrees of wrongness.&lt;/li&gt;
&lt;li&gt;build robust classifiers.&lt;/li&gt;
&lt;li&gt;handle multiple classes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;that's why modern neural networks use loss functions with gradient descent. it turns a simple rule follower into a true learner that handles real world complexity.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>day 0 of deep learning</title>
      <dc:creator>mahraib</dc:creator>
      <pubDate>Sat, 10 Jan 2026 16:53:50 +0000</pubDate>
      <link>https://dev.to/mahraib_fatima/day-0-of-deep-learning-5781</link>
      <guid>https://dev.to/mahraib_fatima/day-0-of-deep-learning-5781</guid>
      <description>&lt;p&gt;My name is Mahraib Fatima. I am a final year student who loves building and learning new things. &lt;br&gt;
I have learned machine learning and basic web dev already in my 3rd year of bachelor, self learning u know. Now, my goal is to learn deep learning in depth. &lt;/p&gt;

&lt;p&gt;Here, I am going to share my journey for the next few months, until I gain enough confidence in my deep learning knowledge. &lt;/p&gt;

&lt;p&gt;This journey will include in-depth study, mini projects, and 3 main projects. &lt;/p&gt;

&lt;p&gt;Thanks for reading. &lt;br&gt;
Let's connect: &lt;a href="http://mahraib.works" rel="noopener noreferrer"&gt;http://mahraib.works&lt;/a&gt;&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
