<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: mahraib</title>
    <description>The latest articles on DEV Community by mahraib (@mahraib_fatima).</description>
    <link>https://dev.to/mahraib_fatima</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3072430%2F05eb3870-f200-4570-9244-765805f3fe17.jpg</url>
      <title>DEV Community: mahraib</title>
      <link>https://dev.to/mahraib_fatima</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mahraib_fatima"/>
    <language>en</language>
    <item>
      <title>forward propagation - day 05 of dl</title>
      <dc:creator>mahraib</dc:creator>
      <pubDate>Mon, 19 Jan 2026 17:14:01 +0000</pubDate>
      <link>https://dev.to/mahraib_fatima/forward-propogation-1fn4</link>
      <guid>https://dev.to/mahraib_fatima/forward-propogation-1fn4</guid>
      <description>&lt;p&gt;while reading &amp;amp; learning about forward propagation (FP), i came across many definitions and found this one simple and exact.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Forward propagation is where input data is fed through a network, in a forward direction, to generate an output. The data is accepted by hidden layers and processed, as per the activation function, and moves to the successive layer. The forward flow of data is designed to avoid data moving in a circular motion, which does not generate an output. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://h2o.ai/wiki/forward-propagation/" rel="noopener noreferrer"&gt;reference&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  math representation:
&lt;/h3&gt;

&lt;p&gt;-&amp;gt; for &lt;code&gt;l&lt;/code&gt; layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Z⁽ˡ⁾ = W⁽ˡ⁾ · A⁽ˡ⁻¹⁾ + b⁽ˡ⁾
A⁽ˡ⁾ = f(Z⁽ˡ⁾)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;-&amp;gt; for one neuron forward pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
a = f(z)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;where &lt;code&gt;f&lt;/code&gt; is activation function.&lt;/p&gt;
&lt;/blockquote&gt;
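
&lt;p&gt;a tiny numeric check of the one-neuron formula above (a sketch; the weights, inputs, and bias are made-up values):&lt;/p&gt;

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

w = np.array([0.5, -0.25])    # made-up weights
x = np.array([2.0, 4.0])      # made-up inputs
b = 0.1                       # made-up bias

z = np.dot(w, x) + b          # 0.5*2 - 0.25*4 + 0.1 = 0.1
a = sigmoid(z)                # activation squashes z into (0, 1)
print(z, a)
```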

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqpte03sr1fdsyqyvnpc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqpte03sr1fdsyqyvnpc.png" alt=" " width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;here is simple code for forward propagation, with 2 hidden layers + an output layer&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;maximum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward_propagation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;relu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="c1"&gt;#layer 1
&lt;/span&gt;    &lt;span class="n"&gt;Z1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;A1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Z1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;#layer 2  
&lt;/span&gt;    &lt;span class="n"&gt;Z2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;A1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;A2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Z2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;#output layer
&lt;/span&gt;    &lt;span class="n"&gt;Z3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;W3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;A2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;A3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Z3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Z1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Z1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;A1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Z2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Z2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;A2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Z3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Z3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;A3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;A3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
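
&lt;p&gt;to actually call this function, the parameters dict has to be built first. here is a minimal sketch that re-declares the functions so it runs standalone; the layer sizes (4 → 5 → 3 → 1) are made-up:&lt;/p&gt;

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def forward_propagation(X, parameters):
    # layer 1
    Z1 = np.dot(parameters['W1'], X) + parameters['b1']
    A1 = relu(Z1)
    # layer 2
    Z2 = np.dot(parameters['W2'], A1) + parameters['b2']
    A2 = relu(Z2)
    # output layer
    Z3 = np.dot(parameters['W3'], A2) + parameters['b3']
    A3 = sigmoid(Z3)
    return A3

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]     # made-up: 4 inputs, hidden sizes 5 and 3, 1 output
parameters = {}
for l in range(1, len(sizes)):
    parameters[f'W{l}'] = rng.standard_normal((sizes[l], sizes[l - 1])) * 0.01
    parameters[f'b{l}'] = np.zeros((sizes[l], 1))

X = rng.standard_normal((4, 10))   # 10 made-up samples, one per column
A3 = forward_propagation(X, parameters)
print(A3.shape)                    # (1, 10): one probability per sample
```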



&lt;h3&gt;
  
  
  benefits of forward propagation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;easy computation: just matrix multiplications and element-wise operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  why do we need backpropagation?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;no learning: because forward propagation only computes predictions; it doesn't update weights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;no error signal: forward propagation alone doesn't tell us how wrong we are. to determine how bad the network's prediction is, we need to compute a loss function &amp;amp; update the weights.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Vanishing gradient &amp; dying relu - day 04 of dl</title>
      <dc:creator>mahraib</dc:creator>
      <pubDate>Fri, 16 Jan 2026 18:31:09 +0000</pubDate>
      <link>https://dev.to/mahraib_fatima/vanishing-gradient-dying-relu-day-04-of-dl-9j5</link>
      <guid>https://dev.to/mahraib_fatima/vanishing-gradient-dying-relu-day-04-of-dl-9j5</guid>
      <description>&lt;p&gt;yesterday, while learning about activation functions, we came across 2 distinct terms. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;vanishing gradient&lt;/li&gt;
&lt;li&gt;dying relu&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;here is a short summary of these. &lt;/p&gt;

&lt;h3&gt;
  
  
  vanishing gradient
&lt;/h3&gt;

&lt;p&gt;vanishing gradient is a problem that happens during training in deep neural networks, especially those using activation functions like &lt;code&gt;sigmoid&lt;/code&gt; or &lt;code&gt;tanh&lt;/code&gt;.&lt;br&gt;
what happens?&lt;/p&gt;

&lt;p&gt;during &lt;code&gt;backpropagation&lt;/code&gt;, gradients (derivatives) are calculated and passed backward through the network. these gradients tell the model how much to adjust each weight to reduce error.&lt;/p&gt;

&lt;p&gt;with certain activation functions, the gradient can become extremely small (close to zero) as it gets multiplied layer by layer.&lt;/p&gt;

&lt;p&gt;if the gradient becomes too small, the weights in earlier layers receive almost no update, so they stop learning.&lt;/p&gt;



&lt;p&gt;why does it happen?&lt;/p&gt;

&lt;p&gt;for example, the derivative of &lt;code&gt;sigmoid&lt;/code&gt;, written in terms of the sigmoid output &lt;code&gt;sig = sigmoid(x)&lt;/code&gt;, is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sigmoid_derivative&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sig&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;since &lt;code&gt;sigmoid(x)&lt;/code&gt; is between 0 and 1, the derivative is between 0 and 0.25.&lt;br&gt;
if you multiply many small numbers (like &lt;code&gt;0.1*0.1*0.1...&lt;/code&gt;), the result approaches zero very quickly.&lt;/p&gt;
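
&lt;p&gt;this shrinkage is easy to see numerically: even in the sigmoid's best case (derivative = 0.25, reached at x = 0), ten layers scale the gradient by 0.25¹⁰:&lt;/p&gt;

```python
# best case for sigmoid: derivative = 0.25 (reached at x = 0);
# backprop multiplies one such factor per layer
grad = 1.0
for layer in range(10):
    grad *= 0.25          # per-layer scaling, best case
print(grad)               # 9.5367431640625e-07: effectively zero after 10 layers
```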

&lt;p&gt;&lt;code&gt;tanh&lt;/code&gt; has a similar problem: its derivative is between 0 and 1, but for large inputs it also saturates and gives near-zero gradients.&lt;/p&gt;



&lt;p&gt;the result:&lt;/p&gt;

&lt;p&gt;· early layers learn very slowly or not at all. &lt;/p&gt;

&lt;p&gt;· deep networks become hard or impossible to train. &lt;/p&gt;



&lt;p&gt;how is it solved?&lt;/p&gt;

&lt;p&gt;modern activation functions like &lt;code&gt;relu&lt;/code&gt; help because:&lt;/p&gt;

&lt;p&gt;· for x &amp;gt; 0, derivative is exactly 1, so gradients don’t shrink&lt;br&gt;
· no saturation in the positive region&lt;/p&gt;

&lt;p&gt;but &lt;code&gt;relu&lt;/code&gt; introduces its own problem: &lt;code&gt;dying relu&lt;/code&gt;, where neurons can get stuck at zero and also stop learning.&lt;br&gt;
variants like &lt;code&gt;leaky relu&lt;/code&gt;, &lt;code&gt;elu&lt;/code&gt;, and &lt;code&gt;gelu&lt;/code&gt; try to fix this while keeping gradients flowing.&lt;/p&gt;


&lt;h3&gt;
  
  
  dying relu
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;dying relu&lt;/code&gt; is a problem that happens when neurons using the &lt;code&gt;relu&lt;/code&gt; activation function become permanently "dead", meaning they output zero for all inputs and never recover.&lt;/p&gt;



&lt;p&gt;what happens?&lt;/p&gt;

&lt;p&gt;a &lt;code&gt;relu&lt;/code&gt; neuron outputs:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;relu(x) = max(0, x)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;this means:&lt;/p&gt;

&lt;p&gt;· if the weighted sum  x  is positive → output =  x &lt;br&gt;
· if  x  is negative → output = 0&lt;/p&gt;

&lt;p&gt;the derivative is:&lt;/p&gt;

&lt;p&gt;·  x &amp;gt; 0  → 1&lt;br&gt;
·  x &amp;lt; 0  → 0&lt;/p&gt;

&lt;p&gt;(at exactly x = 0 the derivative is undefined; in practice libraries just use 0.)&lt;/p&gt;



&lt;p&gt;how do neurons die?&lt;/p&gt;

&lt;p&gt;during training, if a neuron's weighted sum becomes negative for all training examples, its gradient becomes 0 (because derivative is 0 for negative inputs).&lt;/p&gt;

&lt;p&gt;once the gradient is 0, the weights won’t update → the neuron stays "off" forever → it's dead.&lt;/p&gt;

&lt;p&gt;this is especially common if:&lt;/p&gt;

&lt;p&gt;· learning rate is too high. &lt;/p&gt;

&lt;p&gt;· large weight updates push the neuron into negative territory permanently. &lt;/p&gt;

&lt;p&gt;· bad weight initialization. &lt;/p&gt;
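
&lt;p&gt;a dead neuron is easy to demonstrate: give it a large negative bias (as if a bad update pushed it there) and its pre-activation is negative for every input, so both the output and the relu gradient are zero everywhere. the numbers below are made up:&lt;/p&gt;

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

w = np.array([0.5, -0.3])          # made-up weights
b = -100.0                         # huge negative bias, e.g. after a bad update

X = np.random.default_rng(1).standard_normal((100, 2))  # 100 made-up inputs
z = X @ w + b                      # pre-activation for every sample
a = relu(z)

# z is negative everywhere, so the output and the relu gradient are 0 everywhere
print(bool(np.all(0 > z)), bool(np.all(a == 0)))  # True True
```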



&lt;p&gt;why is it a problem?&lt;/p&gt;

&lt;p&gt;dead neurons don’t contribute to learning, they’re wasted parameters. too many dead neurons can reduce the network’s capacity and slow learning.&lt;/p&gt;



&lt;p&gt;how to fix it?&lt;/p&gt;

&lt;p&gt;use variants of &lt;code&gt;relu&lt;/code&gt; that allow a small gradient for negative inputs:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;leaky relu&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;leaky_relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;maximum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ small slope (alpha) for negatives, so gradient never fully dies.&lt;/p&gt;
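
&lt;p&gt;the difference shows up directly in the gradients. a small sketch (the derivative formulas are written out by hand):&lt;/p&gt;

```python
import numpy as np

# hand-written derivatives (for x != 0)
def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_grad(x))        # [0. 0. 1. 1.]  -> negative inputs get no update at all
print(leaky_relu_grad(x))  # small alpha gradient for negatives -> they still learn
```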

&lt;p&gt;parametric relu (prelu):&lt;br&gt;
like leaky &lt;code&gt;relu&lt;/code&gt;, but alpha is learned.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;elu&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;elu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;smooth for negatives, helps mean activations stay closer to zero.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;if you've read this far.. thanks for reading. ✨&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>activation functions - day 03 of dl</title>
      <dc:creator>mahraib</dc:creator>
      <pubDate>Thu, 15 Jan 2026 18:34:18 +0000</pubDate>
      <link>https://dev.to/mahraib_fatima/activation-functions-day-03-of-dl-41fe</link>
      <guid>https://dev.to/mahraib_fatima/activation-functions-day-03-of-dl-41fe</guid>
      <description>&lt;h1&gt;
  
  
  activation functions
&lt;/h1&gt;

&lt;p&gt;a neural network without an activation function is just a giant linear regression model, no matter how many layers. an activation function is a non-linear transformation applied to a neuron's weighted input sum.&lt;/p&gt;

&lt;h2&gt;
  
  
  sigmoid (logistic function)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;range is &lt;code&gt;(0, 1)&lt;/code&gt;. perfect for binary classification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;flaws:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vanishing gradient.&lt;/li&gt;
&lt;li&gt;computationally expensive (exponential calculation).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;when to use today:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;never in hidden layers.&lt;/li&gt;
&lt;li&gt;use in output layer for classification problems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;side note: don't be surprised by the formula representations or think they're AI generated. i have pretty good experience in latex/math pdf editing.&lt;/p&gt;

&lt;h2&gt;
  
  
  hyperbolic tangent (tanh)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;range is &lt;code&gt;(-1, 1)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;betterment:&lt;/strong&gt; zero-centered, leading to faster convergence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;flaw:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vanishing gradient with inputs of large magnitude.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;when to use:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sometimes in hidden layers of rnns/lstms.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  rectified linear unit (relu)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;maximum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;range is &lt;code&gt;[0, ∞)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;betterment:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;solved vanishing gradient problem: derivative is 1 for &lt;code&gt;x &amp;gt; 0&lt;/code&gt;, so gradient flows freely.&lt;/li&gt;
&lt;li&gt;computation is cheap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;flaw:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dying relu.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;when to use:&lt;/strong&gt; the default for hidden layers (used ~90% of the time). if it works, don't touch it.&lt;/p&gt;

&lt;h2&gt;
  
  
  leaky relu
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;leaky_relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;maximum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;betterment:&lt;/strong&gt; provides a small, non-zero slope for negative inputs, allowing neurons to recover.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;when to use:&lt;/strong&gt; if the "dying relu" problem occurs (check activation stats).&lt;/p&gt;

&lt;h2&gt;
  
  
  exponential linear unit (elu)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;elu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  gaussian error linear unit (gelu)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gelu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.044715&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;betterment:&lt;/strong&gt; the default in sota transformers (e.g. bert, gpt).&lt;/p&gt;

&lt;p&gt;side note: i read my second research paper, but it was the first one i read from a learning perspective, so i'm happy about it. this formula was also copied from a research paper.&lt;/p&gt;

&lt;h2&gt;
  
  
  swish (from google brain)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;swish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;beta&lt;/code&gt; is often fixed at 1, which makes swish the same as the silu function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;when to use:&lt;/strong&gt; a good alternative to relu, often used in cnn tasks.&lt;/p&gt;
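
&lt;p&gt;a quick sanity check of the ranges described above, running the functions on the same inputs (a sketch that re-declares them so it runs standalone):&lt;/p&gt;

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

x = np.linspace(-5.0, 5.0, 101)

print(sigmoid(x).min(), sigmoid(x).max())  # stays strictly inside (0, 1)
print(np.tanh(x).min(), np.tanh(x).max())  # stays strictly inside (-1, 1)
print(relu(x).min(), relu(x).max())        # range [0, inf): never negative
print(swish(x).min())                      # slightly negative: swish dips below zero
```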

&lt;p&gt;here is the link to the github md file which defines these formulas.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>hidden layer - day 02 of dl</title>
      <dc:creator>mahraib</dc:creator>
      <pubDate>Tue, 13 Jan 2026 16:21:23 +0000</pubDate>
      <link>https://dev.to/mahraib_fatima/hidden-layer-3b95</link>
      <guid>https://dev.to/mahraib_fatima/hidden-layer-3b95</guid>
      <description>&lt;p&gt;a &lt;strong&gt;hidden layer&lt;/strong&gt; is an intermediate layer between the input and output layers in a neural network. it's called "hidden" because its outputs are not directly observable as final outputs from the network.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;key points:&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. transformation function:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;each hidden layer performs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;linear transformation&lt;/strong&gt;: &lt;code&gt;z = w·x + b&lt;/code&gt; (weights × inputs + bias)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;matrix representation:&lt;/strong&gt;&lt;br&gt;
        for a hidden layer with &lt;code&gt;m&lt;/code&gt; inputs and &lt;code&gt;n&lt;/code&gt; neurons:&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     hidden layer output = activation(w·x + b)
    where:
      w = weight matrix of shape (n × m)
        x = input vector of shape (m × 1)
          b = bias vector of shape (n × 1)
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;non-linear activation&lt;/strong&gt;: &lt;code&gt;a = f(z)&lt;/code&gt; (relu, sigmoid, tanh, etc.)&lt;br&gt;
impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;sigmoid/tanh&lt;/strong&gt;: used in the early days; suffer from the vanishing gradient problem. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;relu&lt;/strong&gt;: modern default, solves vanishing gradient but has "dying relu" problem. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;leaky relu/elu&lt;/strong&gt;: address dying relu issue. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;swish/mish&lt;/strong&gt;: recent alternatives, often better performance. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
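&lt;p&gt;the shapes above can be checked with a small numpy sketch (relu as the activation; the sizes &lt;code&gt;m = 4&lt;/code&gt;, &lt;code&gt;n = 3&lt;/code&gt; are my own picks, just for illustration):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 4, 3                      # m inputs, n neurons
w = rng.standard_normal((n, m))  # weight matrix of shape (n, m)
x = rng.standard_normal((m, 1))  # input vector of shape (m, 1)
b = np.zeros((n, 1))             # bias vector of shape (n, 1)

z = w @ x + b                    # linear transformation: z = w·x + b
a = np.maximum(0.0, z)           # non-linear activation: relu
print(a.shape)                   # one output per neuron, shape (3, 1)
```

stacking several of these blocks, each feeding its output into the next, is all a feed-forward network is.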

&lt;p&gt;activation functions will be discussed in detail later.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. what happens in a hidden layer:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;feature extraction&lt;/strong&gt;: learns patterns from previous layer's outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;hierarchical learning&lt;/strong&gt;: early layers learn simple features, deeper layers combine them.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. why are hidden layers so important?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;example: cat image classification&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;layer&lt;/th&gt;
&lt;th&gt;what it "sees"&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;input&lt;/td&gt;
&lt;td&gt;raw pixels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hidden 1&lt;/td&gt;
&lt;td&gt;edge detectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hidden 2&lt;/td&gt;
&lt;td&gt;texture patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hidden 3&lt;/td&gt;
&lt;td&gt;object parts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hidden 4&lt;/td&gt;
&lt;td&gt;whole objects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;output&lt;/td&gt;
&lt;td&gt;classification&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;the "deep" in deep learning:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;the term &lt;strong&gt;"deep"&lt;/strong&gt; in deep learning specifically refers to having &lt;strong&gt;multiple hidden layers&lt;/strong&gt;. this depth enables:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;automatic feature engineering&lt;/strong&gt;: no need for manual feature extraction. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;hierarchical understanding&lt;/strong&gt;: from pixels to concepts. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;transfer learning&lt;/strong&gt;: early layers often learn general features transferable between tasks. &lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;the takeaway:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;hidden layers are &lt;strong&gt;learned feature extractors&lt;/strong&gt;. the &lt;strong&gt;depth&lt;/strong&gt; (number of hidden layers) and &lt;strong&gt;architecture&lt;/strong&gt; of these layers determine what kind of patterns the network can learn and how well it can learn them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;without hidden layers, neural networks would be just linear regression. with them, they can learn the complex patterns that power modern ai applications.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>perceptron - day 01 of dl</title>
      <dc:creator>mahraib</dc:creator>
      <pubDate>Mon, 12 Jan 2026 19:36:55 +0000</pubDate>
      <link>https://dev.to/mahraib_fatima/perceptron-day-01-of-dl-4lka</link>
      <guid>https://dev.to/mahraib_fatima/perceptron-day-01-of-dl-4lka</guid>
      <description>&lt;p&gt;while starting learning neural networks, perceptron is the first thing. it's simple and shows how learning from points works.&lt;/p&gt;

&lt;h2&gt;
  
  
  how it works
&lt;/h2&gt;

&lt;p&gt;a perceptron draws a straight line to separate two types of data. it calculates:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;output = w1*x1 + w2*x2 + ... + b&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;if output is positive, it says "class a". &lt;br&gt;
if negative, "class b".&lt;/p&gt;

&lt;p&gt;to learn, it uses this trick:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;start with random weights.&lt;/li&gt;
&lt;li&gt;check one point.&lt;/li&gt;
&lt;li&gt;if wrong, adjust weights toward that point.&lt;/li&gt;
&lt;li&gt;repeat until all points are right.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;the update looks like this:&lt;br&gt;
&lt;code&gt;new weight = old weight + learning rate * (true label - predicted label) * input&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;simple idea: if you're wrong, move the line toward the mistake.&lt;/p&gt;
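&lt;p&gt;the four steps and the update rule fit in a few lines of python (the tiny dataset and the -1/+1 label convention are my own choices for the sketch):&lt;/p&gt;

```python
import numpy as np

# toy linearly separable data: class is +1 only when both inputs are 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])

w = np.zeros(2)   # step 1: start with (here zero) weights
b = 0.0
lr = 0.1

for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):          # step 2: check one point
        pred = 1 if xi @ w + b > 0 else -1
        if pred != yi:                # step 3: if wrong, adjust toward it
            w += lr * (yi - pred) * xi
            b += lr * (yi - pred)
            errors += 1
    if errors == 0:                   # step 4: repeat until all are right
        break

print(w, b)
```

note that it stops at the first line that separates everything, which is exactly the flaw discussed next.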

&lt;h2&gt;
  
  
  the problem
&lt;/h2&gt;

&lt;p&gt;the perceptron stops as soon as all training points are correct. but there are often many possible lines that all work perfectly.&lt;/p&gt;

&lt;p&gt;imagine separating two groups of points. you could draw the line close to one group, close to the other, or in the middle. all would be 100% correct on your training data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfagc2co181440qcrqyp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfagc2co181440qcrqyp.png" alt=" " width="726" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;the perceptron picks whichever line it finds first; it could be line A, B, or C. which one you get depends on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;random starting weights.&lt;/li&gt;
&lt;li&gt;the order of points.&lt;/li&gt;
&lt;li&gt;luck.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;it has one big flaw: &lt;code&gt;it finds any solution that works, not the best one.&lt;/code&gt;&lt;br&gt;
train twice, get two different lines. both work on your training data, but one might be much better than the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  why this matters
&lt;/h2&gt;

&lt;p&gt;a line that just barely separates the data is fragile. real data has noise. new points won't be exactly like your training points. a tight boundary will make mistakes easily.&lt;/p&gt;

&lt;p&gt;what we want is the line in the middle of the gap, farthest from both groups. this is more robust and handles new data better.&lt;/p&gt;

&lt;h2&gt;
  
  
  how loss functions help
&lt;/h2&gt;

&lt;p&gt;loss functions change the question. instead of "is this wrong?" they ask "how wrong is this?" or "how confidently right is this?"&lt;/p&gt;

&lt;p&gt;look at hinge loss:&lt;br&gt;
&lt;code&gt;loss = max(0, 1 - true label * prediction)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;even if a point is correct, there's still loss if the prediction isn't confident enough. this pushes the line away from points, creating a safety margin.&lt;/p&gt;
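&lt;p&gt;plugging a few numbers into the formula shows the margin effect (the prediction values are made-up examples):&lt;/p&gt;

```python
def hinge_loss(true_label, prediction):
    # loss = max(0, 1 - true_label * prediction)
    return max(0.0, 1.0 - true_label * prediction)

print(hinge_loss(+1, 2.5))   # confidently right: loss 0.0
print(hinge_loss(+1, 0.3))   # right but not confident: still penalized
print(hinge_loss(+1, -1.0))  # wrong: loss 2.0
```

the middle case is the interesting one: the point is classified correctly, yet the loss is nonzero, so training keeps pushing the boundary away from it.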

&lt;h2&gt;
  
  
  gradient descent: better learning
&lt;/h2&gt;

&lt;p&gt;with loss functions, we don't update based on single points. we look at all data and find the average error. then we adjust weights to reduce this error most effectively.&lt;/p&gt;

&lt;p&gt;this is gradient descent:&lt;br&gt;
&lt;code&gt;new weight = old weight - learning rate * slope of loss&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;the minus sign is key: we go downhill toward lower error.&lt;/p&gt;
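&lt;p&gt;a one-dimensional sketch of the update rule (the toy loss &lt;code&gt;(w - 3)^2&lt;/code&gt; is made up just to show the downhill motion):&lt;/p&gt;

```python
def loss_slope(w):
    # derivative of the toy loss (w - 3)**2 with respect to w
    return 2 * (w - 3)

w = 0.0
lr = 0.1
for step in range(100):
    w = w - lr * loss_slope(w)   # minus sign: step downhill, toward lower error

print(round(w, 4))  # -> 3.0, the minimum of the toy loss
```

the same rule, applied to every weight at once with slopes from the full dataset, is how real networks train.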

&lt;h2&gt;
  
  
  the takeaway
&lt;/h2&gt;

&lt;p&gt;the perceptron shows the basics of learning. but it sees the world as binary: right or wrong.&lt;/p&gt;

&lt;p&gt;real problems need more nuance. loss functions provide that. they let us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;work with data that can't be perfectly separated.&lt;/li&gt;
&lt;li&gt;measure degrees of wrongness.&lt;/li&gt;
&lt;li&gt;build robust classifiers.&lt;/li&gt;
&lt;li&gt;handle multiple classes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;that's why modern neural networks use loss functions with gradient descent. it turns a simple rule follower into a true learner that handles real world complexity.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>day 0 of deep learning</title>
      <dc:creator>mahraib</dc:creator>
      <pubDate>Sat, 10 Jan 2026 16:53:50 +0000</pubDate>
      <link>https://dev.to/mahraib_fatima/day-0-of-deep-learning-5781</link>
      <guid>https://dev.to/mahraib_fatima/day-0-of-deep-learning-5781</guid>
      <description>&lt;p&gt;My name is Mahraib Fatima. I am a final year student who loves building and learning new things. &lt;br&gt;
I have learned machine learning and basic web dev already in my 3rd year of bachelor, self learning u know. Now, my goal is to learn deep learning in depth. &lt;/p&gt;

&lt;p&gt;Here, I am going to share my journey for the next few months, until I gain enough confidence in my deep learning knowledge. &lt;/p&gt;

&lt;p&gt;This journey will include in-depth study, mini projects, and 3 main projects. &lt;/p&gt;

&lt;p&gt;Thanks for reading. &lt;br&gt;
Let's connect: &lt;a href="http://mahraib.works" rel="noopener noreferrer"&gt;http://mahraib.works&lt;/a&gt;&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
