<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Juhi Kushwah</title>
    <description>The latest articles on DEV Community by Juhi Kushwah (@juhikushwah).</description>
    <link>https://dev.to/juhikushwah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F757974%2F13dc9880-a89e-43a6-a850-ae2c04d2ad85.png</url>
      <title>DEV Community: Juhi Kushwah</title>
      <link>https://dev.to/juhikushwah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/juhikushwah"/>
    <language>en</language>
    <item>
      <title>How does a machine actually learn from data?</title>
      <dc:creator>Juhi Kushwah</dc:creator>
      <pubDate>Wed, 14 Jan 2026 07:57:30 +0000</pubDate>
      <link>https://dev.to/juhikushwah/how-does-a-machine-actually-learn-from-data-12be</link>
      <guid>https://dev.to/juhikushwah/how-does-a-machine-actually-learn-from-data-12be</guid>
      <description>&lt;p&gt;&lt;em&gt;I was discussing this with my co-worker (who is also an ML engineer) as to how a beginner like me should approach machine learning? She said now that I've &lt;strong&gt;intentionally mastered NumPy → Pandas → Data Preprocessing conceptually, the next concept should NOT be “more tools”.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;It should be &lt;strong&gt;ML thinking itself&lt;/strong&gt;!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Her suggestion, somehow, did not sit well with me—partly because there are endless tools if you think about it! I had narrowed things down to &lt;strong&gt;NumPy&lt;/strong&gt;, &lt;strong&gt;Pandas&lt;/strong&gt;, &lt;strong&gt;Data Preprocessing&lt;/strong&gt; and &lt;strong&gt;Scikit-learn&lt;/strong&gt; (I haven’t covered this topic yet, for reasons I’ll explain as we dive deeper into this post) based on my own understanding of the subject. However, what she said next made more sense to me, because this is where my perspective as a software engineer comes into play—it’s important to understand the &lt;strong&gt;mental model behind algorithms&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;If you are an iterative learner like me, you're right to pause here and ask why we shouldn't jump into &lt;strong&gt;scikit-learn&lt;/strong&gt; before understanding how learning itself works.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Short answer (the important one): &lt;strong&gt;learn &lt;em&gt;just enough&lt;/em&gt; scikit-learn, but only &lt;em&gt;after&lt;/em&gt; you understand how learning works.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let me elaborate on this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🎯 The Correct Order (Beginner-Optimal)&lt;/strong&gt;&lt;br&gt;
You should &lt;strong&gt;NOT&lt;/strong&gt; fully learn scikit-learn &lt;em&gt;before&lt;/em&gt; understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what a model is&lt;/li&gt;
&lt;li&gt;what loss is&lt;/li&gt;
&lt;li&gt;what training means&lt;/li&gt;
&lt;li&gt;what overfitting is&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Otherwise, scikit-learn becomes a &lt;strong&gt;black box&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧠 Think of scikit-learn like this&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Concepts → &lt;em&gt;why something works&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;scikit-learn → &lt;em&gt;how to apply it quickly&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you reverse this order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = LinearRegression()
model.fit(X, y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can run the code — but you &lt;strong&gt;don’t actually know what happened&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;why it works&lt;/li&gt;
&lt;li&gt;when it fails&lt;/li&gt;
&lt;li&gt;what assumptions it makes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, you (as a beginner) should first learn &lt;strong&gt;learning types + core ML ideas&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ What You SHOULD do instead (Best approach)&lt;/strong&gt;&lt;br&gt;
&lt;u&gt;Step 1️⃣ — Learn learning concepts (NO scikit-learn yet)&lt;/u&gt;&lt;br&gt;
(This is what we are already doing)&lt;/p&gt;

&lt;p&gt;Learn conceptually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supervised learning&lt;/li&gt;
&lt;li&gt;Regression vs classification&lt;/li&gt;
&lt;li&gt;Model = function&lt;/li&gt;
&lt;li&gt;Loss function&lt;/li&gt;
&lt;li&gt;Overfitting vs underfitting&lt;/li&gt;
&lt;li&gt;Train vs test behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 This can be done with &lt;strong&gt;math intuition + NumPy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Step 2️⃣ — Implement Linear Regression from scratch&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NumPy&lt;/li&gt;
&lt;li&gt;A few lines of math&lt;/li&gt;
&lt;li&gt;No ML libraries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How does the model actually learn?”&lt;/p&gt;
&lt;/blockquote&gt;
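&lt;p&gt;A minimal sketch of what Step 2 looks like: gradient descent on a toy dataset in plain NumPy. The data, learning rate and iteration count are my own illustrative choices, not from any library:&lt;/p&gt;

```python
import numpy as np

# Toy data: y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
y = 3 * X + 2 + rng.normal(0, 1, size=50)

# Start from a guess, then repeatedly nudge w and b
# in the direction that reduces the mean squared error
w, b = 0.0, 0.0
lr = 0.01  # learning rate (step size)

for _ in range(2000):
    y_pred = w * X + b               # the model's current predictions
    error = y_pred - y
    grad_w = 2 * np.mean(error * X)  # d(MSE)/dw
    grad_b = 2 * np.mean(error)      # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # ends up close to the true 3 and 2
```

&lt;p&gt;That loop &lt;em&gt;is&lt;/em&gt; the learning: measure how wrong the model is, then adjust the parameters to be slightly less wrong.&lt;/p&gt;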

&lt;p&gt;&lt;u&gt;Step 3️⃣ — THEN introduce scikit-learn (lightly)&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Once the concept clicks, scikit-learn becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean&lt;/li&gt;
&lt;li&gt;Logical&lt;/li&gt;
&lt;li&gt;Easy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You’ll instantly understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;.fit()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.predict()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.score()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
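&lt;p&gt;Once you know what training means, those three methods map directly onto the concepts. A tiny sketch (the toy data is mine; &lt;code&gt;LinearRegression&lt;/code&gt;, &lt;code&gt;fit&lt;/code&gt;, &lt;code&gt;predict&lt;/code&gt; and &lt;code&gt;score&lt;/code&gt; are real scikit-learn API):&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])  # features: 2D, shape (samples, features)
y = np.array([3, 5, 7, 9])          # labels: y = 2x + 1

model = LinearRegression()
model.fit(X, y)              # training: find the best w and b
print(model.predict([[5]]))  # prediction for a new input -> [11.]
print(model.score(X, y))     # R² score; 1.0 for a perfect fit
```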

&lt;p&gt;&lt;strong&gt;❌ What NOT to do (common beginner mistake)&lt;/strong&gt;&lt;br&gt;
❌ Deep dive into scikit-learn API&lt;br&gt;
❌ Memorize classifiers and parameters&lt;br&gt;
❌ Jump to advanced models too early&lt;/p&gt;

&lt;p&gt;This creates fragile understanding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧭 Minimal scikit-learn you may peek at (optional)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s okay to &lt;em&gt;recognize&lt;/em&gt; these, not master them yet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(You already used these in previous posts.)&lt;/p&gt;

&lt;p&gt;But &lt;strong&gt;don’t learn models yet&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🎯 The Next Beginner ML Concept: Supervised Learning Fundamentals&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🔑 Concept 1: Types of Machine Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1️⃣ Supervised Learning (START HERE)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input features (X)&lt;/li&gt;
&lt;li&gt;Correct answers (y)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predict salary → regression&lt;/li&gt;
&lt;li&gt;Predict spam/not spam → classification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is &lt;strong&gt;90% of beginner ML&lt;/strong&gt;.&lt;/p&gt;
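&lt;p&gt;In code, “inputs plus correct answers” is just two arrays. A toy example with made-up numbers:&lt;/p&gt;

```python
import numpy as np

# X: each row is one example (here: years of experience, age)
X = np.array([[1, 22],
              [3, 25],
              [5, 30]])

# y: the correct answer (salary) for each row
y = np.array([30000, 45000, 60000])

# Supervised learning = find a function f such that f(X[i]) ≈ y[i]
print(X.shape, y.shape)  # (3, 2) (3,)
```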




&lt;p&gt;&lt;strong&gt;2️⃣ Unsupervised Learning (later)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No labels.&lt;/li&gt;
&lt;li&gt;Model finds structure itself.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Examples:&lt;br&gt;
Customer segmentation → “Group similar customers”&lt;br&gt;
Clustering → “The method used to form those groups”&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;3️⃣ Reinforcement Learning (much later)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent learns via rewards.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;📌 &lt;strong&gt;For now&lt;/strong&gt;: Focus ONLY on &lt;strong&gt;Supervised Learning&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  🔑 Concept 2: Regression vs Classification
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🟦 Regression&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Predict a &lt;strong&gt;number&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;House price → $250,000
Temperature → 28.5°C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;🟥 Classification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Predict a &lt;strong&gt;category&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Spam / Not Spam
Yes / No
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;🧠 Tiny mental exercise&lt;/strong&gt;&lt;br&gt;
Which is which?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Problem            | Type           |
| ------------------ | -------------- |
| Predict exam score | Regression     |
| Predict pass/fail  | Classification |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🔑 Concept 3: Model, Parameters &amp;amp; Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🧠 What is a model?&lt;/strong&gt;&lt;br&gt;
A &lt;strong&gt;mathematical function&lt;/strong&gt; that maps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X → y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;y = w*x + b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;w&lt;/code&gt; → weight (importance)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;b&lt;/code&gt; → bias (offset)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Learning = finding &lt;strong&gt;best w and b&lt;/strong&gt;.&lt;/p&gt;
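&lt;p&gt;With concrete (hypothetical) numbers for &lt;code&gt;w&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;, the “model” really is just a function:&lt;/p&gt;

```python
w, b = 2.0, 1.0  # pretend these are the learned parameters

def model(x):
    return w * x + b  # the entire model

print(model(3))  # 2*3 + 1 = 7.0
```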




&lt;h2&gt;
  
  
  🔑 Concept 4: Loss Function (VERY IMPORTANT)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🧠 What is loss?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;“How wrong is the model?”&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;True value = 100&lt;/li&gt;
&lt;li&gt;Prediction = 90&lt;/li&gt;
&lt;li&gt;Error = 10&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Loss function &lt;strong&gt;quantifies this error&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Common:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean Squared Error (MSE)&lt;/li&gt;
&lt;/ul&gt;
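&lt;p&gt;MSE is one line of NumPy. Toy numbers of my own choosing:&lt;/p&gt;

```python
import numpy as np

y_true = np.array([100, 50, 80])
y_pred = np.array([90, 55, 80])

# Mean Squared Error: average of the squared differences
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # (10² + 5² + 0²) / 3 ≈ 41.67
```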




&lt;h2&gt;
  
  
  🔑 Concept 5: Training vs Prediction
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Training phase:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model sees data&lt;/li&gt;
&lt;li&gt;Adjusts parameters&lt;/li&gt;
&lt;li&gt;Minimizes loss&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Prediction phase:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model is frozen&lt;/li&gt;
&lt;li&gt;Makes predictions on new data&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔑 Concept 6: Overfitting vs Underfitting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Underfitting:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model too simple&lt;/li&gt;
&lt;li&gt;Misses patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Overfitting:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model memorizes data&lt;/li&gt;
&lt;li&gt;Fails on new data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;📌 This is the heart of ML.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🔑 Concept 7: Evaluation Metrics (Conceptual)
&lt;/h2&gt;

&lt;p&gt;You don’t evaluate a model on its training data; performance there says little about how it handles new data.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regression → MSE, RMSE, R²&lt;/li&gt;
&lt;li&gt;Classification → Accuracy, Precision, Recall&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(You’ll learn these slowly — concept first.)&lt;/p&gt;
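&lt;p&gt;Just to show these metrics are ordinary function calls (the toy numbers are mine; &lt;code&gt;mean_squared_error&lt;/code&gt; and &lt;code&gt;accuracy_score&lt;/code&gt; are real scikit-learn functions):&lt;/p&gt;

```python
import numpy as np
from sklearn.metrics import mean_squared_error, accuracy_score

# Regression: compare true vs predicted numbers
y_true = np.array([100, 50, 80])
y_pred = np.array([90, 55, 80])
print(mean_squared_error(y_true, y_pred))  # ≈ 41.67

# Classification: fraction of labels predicted correctly
labels_true = np.array([1, 0, 1, 1])
labels_pred = np.array([1, 0, 0, 1])
print(accuracy_score(labels_true, labels_pred))  # 0.75
```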




&lt;p&gt;&lt;em&gt;I know I’ve introduced a few advanced terms at a beginner level to give an idea of what the roadmap to understanding machine learning looks like. Don’t worry if they feel unfamiliar right now — I’ll be exploring each of these topics in depth as we go.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can refer to these posts for understanding NumPy, Pandas and Data Preprocessing:&lt;br&gt;
&lt;a href="https://dev.to/juhikushwah/understanding-numpy-in-the-context-of-python-for-machine-learning-43i7"&gt;Understanding NumPy in the context of Python for Machine Learning&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/juhikushwah/the-next-basic-concept-of-machine-learning-after-numpy-pandas-4h4a"&gt;The next basic concept of Machine Learning after NumPy: Pandas&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/juhikushwah/understanding-data-preprocessing-4g6g"&gt;Understanding Data Preprocessing&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/juhikushwah/beginner-friendly-exercises-on-numpy-pandas-and-data-preprocessing-3af5"&gt;Beginner-friendly exercises on NumPy, Pandas and Data Preprocessing&lt;/a&gt;&lt;/p&gt;

</description>
      <category>100daysofcode</category>
      <category>mlbasics</category>
      <category>learningmodels</category>
      <category>scikitlearn</category>
    </item>
    <item>
      <title>Beginner-friendly exercises on NumPy, Pandas and Data Preprocessing</title>
      <dc:creator>Juhi Kushwah</dc:creator>
      <pubDate>Thu, 08 Jan 2026 07:52:45 +0000</pubDate>
      <link>https://dev.to/juhikushwah/beginner-friendly-exercises-on-numpy-pandas-and-data-preprocessing-3af5</link>
      <guid>https://dev.to/juhikushwah/beginner-friendly-exercises-on-numpy-pandas-and-data-preprocessing-3af5</guid>
      <description>&lt;p&gt;&lt;em&gt;Before diving deep into Machine Learning, I would like to share tiny, beginner-friendly code-based exercises based on NumPy, Pandas and Data Preprocessing - &lt;strong&gt;small, focused and ML oriented&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ✅ NumPy Mini Exercises (Level: Very Easy)
&lt;/h2&gt;

&lt;p&gt;Make sure you import NumPy first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 1 — Create a NumPy array&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Create a NumPy array containing these numbers:&lt;br&gt;
[2, 4, 6, 8]&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = np.array([2, 4, 6, 8])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 2 — Create a 2D array&lt;/u&gt;&lt;br&gt;
Create this 2×2 matrix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 2
3 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;m = np.array([[1, 2],
              [3, 4]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 3 — Array shape&lt;/u&gt;&lt;br&gt;
Find the &lt;strong&gt;shape&lt;/strong&gt; of this array:&lt;br&gt;
&lt;code&gt;a = np.array([[10, 20, 30], [40, 50, 60]])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = np.array([[10, 20, 30], [40, 50, 60]])
a.shape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(2, 3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 4 — Element-wise operations&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compute:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; a + b&lt;/li&gt;
&lt;li&gt; a * b&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Solution:&lt;br&gt;
I. Addition&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a + b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([11, 22, 33])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;II. Multiplication&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a * b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([10, 40, 90])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 5 — Slicing&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;code&gt;a = np.array([5, 10, 15, 20, 25])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Extract the middle three values:&lt;br&gt;
&lt;code&gt;[10, 15, 20]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = np.array([5, 10, 15, 20, 25])
middle = a[1:4]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([10, 15, 20])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 6 — Zeros and ones arrays&lt;/u&gt;&lt;br&gt;
Create:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A &lt;strong&gt;3×3 matrix of zeros&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;2×4 matrix of ones&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Solution:&lt;br&gt;
I. 3×3 matrix of zeros&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;np.zeros((3, 3))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;II. 2×4 matrix of ones&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;np.ones((2, 4))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 7 — Random numbers&lt;/u&gt;&lt;br&gt;
Generate a NumPy array of &lt;strong&gt;five random numbers&lt;/strong&gt; between 0 and 1.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r = np.random.rand(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output (example; the values differ on every run):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([0.23, 0.91, 0.49, 0.11, 0.76])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 8 — Matrix multiplication&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compute:&lt;br&gt;
&lt;code&gt;A @ B&lt;/code&gt;&lt;br&gt;
(or &lt;code&gt;np.dot(A, B)&lt;/code&gt;)&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])

A @ B   # or np.dot(A, B)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([[19, 22],
       [43, 50]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 9 — Mean of an array&lt;/u&gt;&lt;br&gt;
Compute the &lt;strong&gt;mean&lt;/strong&gt; of:&lt;br&gt;
&lt;code&gt;x = np.array([4, 8, 12, 16])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = np.array([4, 8, 12, 16])
np.mean(x)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 10 — Reshape&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;code&gt;x = np.array([1, 2, 3, 4, 5, 6])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Reshape it into a &lt;strong&gt;2×3 matrix&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = np.array([1, 2, 3, 4, 5, 6])
x.reshape(2, 3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([[1, 2, 3],
       [4, 5, 6]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📊 Pandas Mini Exercises (Level: Very Easy)
&lt;/h2&gt;

&lt;p&gt;Start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 1 — Create a DataFrame&lt;/u&gt;&lt;br&gt;
Create a DataFrame from this dictionary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = {
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = {
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000]
}

df = pd.DataFrame(data)
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dictionary keys → column names&lt;/li&gt;
&lt;li&gt;Lists → column values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Age    Salary
0   25     50000
1   30     60000
2   35     70000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 2 — View data&lt;/u&gt;&lt;br&gt;
Using the DataFrame from Exercise 1:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Display the &lt;strong&gt;first 2 rows&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Display the &lt;strong&gt;column names&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Display the &lt;strong&gt;shape&lt;/strong&gt; of the DataFrame&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.head(2)
df.columns
df.shape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;head(2) → first 2 rows&lt;/li&gt;
&lt;li&gt;columns → column names&lt;/li&gt;
&lt;li&gt;shape → (rows, columns)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#first 2 rows
   Age  Salary  
0   25   50000
1   30   60000

#column names
Index(['Age', 'Salary'], dtype='object')

#3 rows, 2 columns
(3, 2)   
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 3 — Select a column&lt;/u&gt;&lt;br&gt;
Select only the &lt;strong&gt;Salary&lt;/strong&gt; column.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df["Salary"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single brackets → returns a &lt;strong&gt;Series&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#This is a Series, not a DataFrame

0    50000
1    60000
2    70000
Name: Salary, dtype: int64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 4 — Select multiple columns&lt;/u&gt;&lt;br&gt;
Select &lt;strong&gt;Age&lt;/strong&gt; and &lt;strong&gt;Salary&lt;/strong&gt; together.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df[["Age", "Salary"]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Double brackets → returns a &lt;strong&gt;DataFrame&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Double brackets → DataFrame

   Age  Salary
0   25   50000
1   30   60000
2   35   70000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 5 — Filter rows&lt;/u&gt;&lt;br&gt;
From the DataFrame, select rows where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Age &amp;gt; 28&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df[df["Age"] &amp;gt; 28]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Boolean condition filters rows; it is a &lt;strong&gt;core Pandas skill&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Very common in data cleaning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Age  Salary
1   30   60000
2   35   70000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 6 — Add a new column&lt;/u&gt;&lt;br&gt;
Add a column called &lt;strong&gt;Tax&lt;/strong&gt; which is &lt;strong&gt;10% of Salary&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df["Tax"] = 0.10 * df["Salary"]
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pandas supports vectorized operations&lt;/li&gt;
&lt;li&gt;Applied to entire column at once&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Operations apply row-wise automatically

   Age  Salary     Tax
0   25   50000  5000.0
1   30   60000  6000.0
2   35   70000  7000.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 7 — Basic statistics&lt;/u&gt;&lt;br&gt;
Compute:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Mean Age&lt;/li&gt;
&lt;li&gt;Maximum Salary&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df["Age"].mean()
df["Salary"].max()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pandas has built-in descriptive stats&lt;/li&gt;
&lt;li&gt;Used heavily during EDA&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Mean Age
30.0

#Maximum Salary
70000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 8 — Handle missing values&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = {
    "Age": [25, None, 35],
    "Salary": [50000, 60000, None]
}
df = pd.DataFrame(data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Detect missing values&lt;/li&gt;
&lt;li&gt;Fill missing values with the &lt;strong&gt;column mean&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.isnull()
df_filled = df.fillna(df.mean())
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;isnull()&lt;/code&gt; → detects missing values&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fillna(df.mean())&lt;/code&gt; → fills numeric NaNs with column mean&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output (displaying &lt;code&gt;df&lt;/code&gt; shows it still contains the missing values):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Age   Salary
0  25.0  50000.0
1   NaN  60000.0
2  35.0      NaN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Breaking this down here:&lt;br&gt;
✅ Detect missing values&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.isnull()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     Age  Salary
0  False   False
1   True   False
2  False    True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ Fill missing values with mean&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_filled = df.fillna(df.mean())
df_filled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Age   Salary
0  25.0  50000.0
1  30.0  60000.0
2  35.0  55000.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Means used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Age mean = 30&lt;/li&gt;
&lt;li&gt;Salary mean = 55,000&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Exercise 9 — Sort values&lt;/u&gt;&lt;br&gt;
Sort the filled DataFrame (&lt;code&gt;df_filled&lt;/code&gt;) by &lt;strong&gt;Salary (descending order)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.sort_values(by="Salary", ascending=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sorting helps identify top/bottom values&lt;/li&gt;
&lt;li&gt;Common during analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Age   Salary
1  30.0  60000.0
2  35.0  55000.0
0  25.0  50000.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 10 — Convert to NumPy (ML step)&lt;/u&gt;&lt;br&gt;
Convert:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Features → &lt;code&gt;Age&lt;/code&gt;, &lt;code&gt;Salary&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Target → &lt;code&gt;Tax&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;into NumPy arrays.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = df[["Age", "Salary"]].values
y = df["Tax"].values

X
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.values&lt;/code&gt; converts Pandas → NumPy (&lt;code&gt;.to_numpy()&lt;/code&gt; is the newer equivalent)&lt;/li&gt;
&lt;li&gt;scikit-learn works on NumPy arrays&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([[2.5e+01, 5.0e+04],
       [3.0e+01, 6.0e+04],
       [3.5e+01, 5.5e+04]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ Target&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([50000., 60000., 55000.])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 This is exactly the format ML models expect&lt;/p&gt;
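&lt;p&gt;As a quick sanity check, here is a minimal, self-contained sketch of the shapes scikit-learn expects (the &lt;code&gt;Tax&lt;/code&gt; values here are made up for illustration, since the full DataFrame is defined earlier in the post):&lt;/p&gt;

```python
import pandas as pd

# Illustrative DataFrame; the Tax values are assumed, not from the original exercise
df = pd.DataFrame({
    "Age": [25.0, 30.0, 35.0],
    "Salary": [50000.0, 60000.0, 55000.0],
    "Tax": [5000.0, 6000.0, 5500.0],
})

X = df[["Age", "Salary"]].to_numpy()  # feature matrix, shape (n_samples, n_features)
y = df["Tax"].to_numpy()              # target vector, shape (n_samples,)

print(X.shape)  # (3, 2)
print(y.shape)  # (3,)
```

&lt;p&gt;&lt;code&gt;.to_numpy()&lt;/code&gt; is the newer, recommended spelling of &lt;code&gt;.values&lt;/code&gt;; both return NumPy arrays.&lt;/p&gt;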

&lt;h2&gt;
  
  
  🧪 Data Preprocessing: Code-Based Mini Exercises
&lt;/h2&gt;

&lt;p&gt;Start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 1 — Train/Test Split (Ratio practice)&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = np.array([[1], [2], [3], [4], [5]])
y = np.array([10, 20, 30, 40, 50])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Split the data into &lt;strong&gt;80% training and 20% testing&lt;/strong&gt;.&lt;br&gt;
Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([10, 20, 30, 40, 50])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;test_size=0.2&lt;/code&gt; → 20% test, 80% train&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;random_state&lt;/code&gt; ensures reproducibility&lt;/li&gt;
&lt;li&gt;Model learns from &lt;code&gt;X_train&lt;/code&gt;, evaluated on &lt;code&gt;X_test&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output (one possible split; the exact rows selected depend on the shuffle):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X_train = [[4], [2], [5], [3]]
X_test  = [[1]]

y_train = [40, 20, 50, 30]
y_test  = [10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 2 — Detect missing values&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = {
    "Age": [25, 30, None, 40],
    "Salary": [50000, None, 70000, 80000]
}
df = pd.DataFrame(data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Write code to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Detect missing values&lt;/li&gt;
&lt;li&gt;Count missing values per column&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.isnull()

df.isnull().sum()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;isnull()&lt;/code&gt; → True/False for each cell&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sum()&lt;/code&gt; counts missing values per column&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output of &lt;code&gt;df.isnull()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     Age  Salary
0  False   False
1  False    True
2   True   False
3  False   False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output of &lt;code&gt;df.isnull().sum()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Age       1
Salary    1
dtype: int64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 3 — Fill missing values (Mean)&lt;/u&gt;&lt;br&gt;
Using the same DataFrame above:&lt;br&gt;
👉 Fill missing values using &lt;strong&gt;column mean&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_filled = df.fillna(df.mean())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replaces NaN with column mean&lt;/li&gt;
&lt;li&gt;Common for numerical ML features&lt;/li&gt;
&lt;li&gt;Keeps dataset size intact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output (rounded):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Age   Salary
0  25.0  50000.0
1  30.0  66666.7
2  31.7  70000.0
3  40.0  80000.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(Age filled with the column mean ≈ 31.67; Salary filled with ≈ 66666.67)&lt;/em&gt;&lt;/p&gt;
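&lt;p&gt;When the data contains outliers, the median is usually a safer fill value than the mean; a minimal sketch with the same DataFrame:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, None, 40],
    "Salary": [50000, None, 70000, 80000],
})

# Median is more robust to outliers than the mean
df_filled = df.fillna(df.median())
print(df_filled)
```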

&lt;p&gt;&lt;u&gt;Exercise 4 — One-Hot Encoding (Categorical Data)&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai"]
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Convert &lt;code&gt;City&lt;/code&gt; into numerical columns using &lt;strong&gt;one-hot encoding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;encoded_df = pd.get_dummies(df["City"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OR keep original structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;encoded_df = pd.get_dummies(df, columns=["City"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Converts text categories into binary columns&lt;/li&gt;
&lt;li&gt;Avoids false numeric ordering&lt;/li&gt;
&lt;li&gt;Required before most ML models can consume the data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;City
Delhi
Mumbai
Delhi
Chennai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   City_Chennai  City_Delhi  City_Mumbai
0             0           1             0
1             0           0             1
2             0           1             0
3             1           0             0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
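&lt;p&gt;Note: depending on your pandas version, &lt;code&gt;get_dummies&lt;/code&gt; may return &lt;code&gt;True&lt;/code&gt;/&lt;code&gt;False&lt;/code&gt; booleans rather than 0/1. Assuming you want the integer display shown above, you can cast:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({"City": ["Delhi", "Mumbai", "Delhi", "Chennai"]})

# pandas 2.x returns bool dummy columns by default; cast to int for a 0/1 table
encoded = pd.get_dummies(df, columns=["City"]).astype(int)
print(list(encoded.columns))  # ['City_Chennai', 'City_Delhi', 'City_Mumbai']
```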



&lt;p&gt;&lt;u&gt;Exercise 5 — Feature Scaling (Standardization)&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = np.array([
    [20, 30000],
    [30, 50000],
    [40, 70000]
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Apply &lt;strong&gt;Standard Scaling&lt;/strong&gt; to &lt;code&gt;X&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import StandardScaler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Centers data around mean = 0&lt;/li&gt;
&lt;li&gt;Std deviation = 1&lt;/li&gt;
&lt;li&gt;Essential for distance-based models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output (approx.):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[[-1.2247, -1.2247],
 [ 0.0000,  0.0000],
 [ 1.2247,  1.2247]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
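&lt;p&gt;You can verify what &lt;code&gt;StandardScaler&lt;/code&gt; does by applying the formula &lt;code&gt;(x - mean) / std&lt;/code&gt; by hand, per column:&lt;/p&gt;

```python
import numpy as np

X = np.array([
    [20, 30000],
    [30, 50000],
    [40, 70000],
], dtype=float)

# Standardization by hand, per column (population std, the same as StandardScaler uses)
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(np.round(X_scaled, 4))
```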



&lt;p&gt;&lt;u&gt;Exercise 6 — Feature Selection&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.DataFrame({
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000],
    "EmployeeID": [101, 102, 103]
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Remove the &lt;strong&gt;EmployeeID&lt;/strong&gt; column.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_selected = df.drop("EmployeeID", axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IDs carry no predictive value&lt;/li&gt;
&lt;li&gt;Removing noise improves model learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Input columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['Age', 'Salary', 'EmployeeID']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['Age', 'Salary']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 7 — Outlier Detection (Simple logic)&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;code&gt;ages = np.array([22, 23, 24, 25, 120])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;👉 Write code to &lt;strong&gt;remove values greater than 100&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;filtered_ages = ages[ages &amp;lt;= 100]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple rule-based filtering&lt;/li&gt;
&lt;li&gt;Useful for obvious data errors&lt;/li&gt;
&lt;li&gt;Always inspect before removing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[22, 23, 24, 25]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
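&lt;p&gt;An alternative to dropping outliers is capping them; a minimal sketch using &lt;code&gt;np.clip&lt;/code&gt; (winsorization-style, with 100 as an assumed cap):&lt;/p&gt;

```python
import numpy as np

ages = np.array([22, 23, 24, 25, 120])

# Cap extreme values instead of removing the rows entirely
capped = np.clip(ages, 0, 100)
print(capped)  # [ 22  23  24  25 100]
```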



&lt;p&gt;&lt;u&gt;Exercise 8 — Data Leakage Check (Thinking + Code)&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import StandardScaler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Write the &lt;strong&gt;correct order of code&lt;/strong&gt; to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Split data&lt;/li&gt;
&lt;li&gt;Fit scaler on training data&lt;/li&gt;
&lt;li&gt;Transform both training and test data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;(No need to run it — just write the correct sequence.)&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# 2. Fit scaler ONLY on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Apply same scaler to test data
X_test_scaled = scaler.transform(X_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training data defines statistics&lt;/li&gt;
&lt;li&gt;Test data must remain unseen&lt;/li&gt;
&lt;li&gt;Prevents unrealistically high accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output (conceptual sequence):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Split data
2. Fit scaler on training data
3. Transform training data
4. Transform test data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
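&lt;p&gt;The sequence above can also be made runnable end to end; a minimal sketch with a small made-up dataset (&lt;code&gt;X&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; here are illustrative):&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([10, 20, 30, 40, 50])

# 1. Split first so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler ONLY on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse the training statistics on the test data
X_test_scaled = scaler.transform(X_test)

# The scaler's mean comes from the training rows only
print(scaler.mean_)
```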



&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;If you can do these exercises comfortably, you’re &lt;strong&gt;ML-ready at a foundational level&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you are new to Python, install Python 3.x and play around with these exercises in your IDE of choice; I use Jupyter Notebook.&lt;/p&gt;

&lt;h2&gt;
  
  
  📌Recommendation (if you're a beginner):
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Do NOT learn scikit-learn models yet.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;First learn how a model learns.&lt;/strong&gt;&lt;br&gt;
Then use scikit-learn as a tool, not a teacher.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’ll explore this further in subsequent posts.&lt;/p&gt;

&lt;p&gt;You can refer to these posts for understanding NumPy, Pandas and Data Preprocessing:&lt;br&gt;
&lt;a href="https://dev.to/juhikushwah/understanding-numpy-in-the-context-of-python-for-machine-learning-43i7"&gt;Understanding NumPy in the context of Python for Machine Learning&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/juhikushwah/the-next-basic-concept-of-machine-learning-after-numpy-pandas-4h4a"&gt;The next basic concept of Machine Learning after NumPy: Pandas&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/juhikushwah/understanding-data-preprocessing-4g6g"&gt;Understanding Data Preprocessing&lt;/a&gt;&lt;/p&gt;

</description>
      <category>100daysofcode</category>
      <category>mlbasics</category>
    </item>
    <item>
      <title>Understanding Data Preprocessing</title>
      <dc:creator>Juhi Kushwah</dc:creator>
      <pubDate>Wed, 07 Jan 2026 07:48:11 +0000</pubDate>
      <link>https://dev.to/juhikushwah/understanding-data-preprocessing-4g6g</link>
      <guid>https://dev.to/juhikushwah/understanding-data-preprocessing-4g6g</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;Data Preprocessing&lt;/strong&gt; - this is exactly the right next step after Pandas. Think of Data Preprocessing as the bridge between &lt;strong&gt;raw data&lt;/strong&gt; and &lt;strong&gt;usable ML input&lt;/strong&gt;. You can find more information Pandas here:&lt;/em&gt; &lt;a href="https://dev.to/juhikushwah/the-next-basic-concept-of-machine-learning-after-numpy-pandas-4h4a"&gt;The next basic concept of Machine Learning after NumPy: Pandas&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Data Preprocessing in Machine Learning?
&lt;/h2&gt;

&lt;p&gt;Data preprocessing is the process of &lt;strong&gt;cleaning, transforming, and preparing data&lt;/strong&gt; so that a machine learning model can learn from it effectively.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A model can only be as good as the data you feed it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Why Data Preprocessing is Critical?&lt;/strong&gt;&lt;br&gt;
Raw data usually has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing values&lt;/li&gt;
&lt;li&gt;Different scales (Age vs Salary)&lt;/li&gt;
&lt;li&gt;Categorical text values&lt;/li&gt;
&lt;li&gt;Noise &amp;amp; irrelevant features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ML algorithms &lt;strong&gt;assume clean, numerical, well-scaled data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Data Preprocessing Concepts (Must-Know)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;u&gt;Train–Test Split&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concept&lt;/strong&gt;: We don’t train and evaluate on the same data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training set → learn patterns&lt;/li&gt;
&lt;li&gt;Test set → evaluate performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What does it mean?&lt;/strong&gt;&lt;br&gt;
   We divide data into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training data → teaches the model&lt;/li&gt;
&lt;li&gt;Testing data → checks how well it learned&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typical split:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;80% train / 20% test
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Why 80:20 or 70:30?&lt;/strong&gt;&lt;br&gt;
   Imagine you have 100 exam questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You &lt;strong&gt;practice&lt;/strong&gt; with 80 questions&lt;/li&gt;
&lt;li&gt;You &lt;strong&gt;test yourself&lt;/strong&gt; with 20 new ones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you test on questions you already practiced → false confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common ratios:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;80% train / 20% test&lt;/strong&gt; → most common&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;70% / 30%&lt;/strong&gt; → small datasets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;90% / 10%&lt;/strong&gt; → very large datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If you allocate too much to training:&lt;/strong&gt; the test set becomes too small → unreliable evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you allocate too much to testing:&lt;/strong&gt; the model doesn’t see enough data to learn well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters&lt;/strong&gt;: Prevents &lt;strong&gt;overfitting&lt;/strong&gt; and gives realistic performance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;u&gt;Handling Missing Values&lt;/u&gt;&lt;br&gt;
&lt;strong&gt;Problem&lt;/strong&gt;: ML models cannot work with &lt;code&gt;NaN&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common strategies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove rows/columns (small datasets → risky)&lt;/li&gt;
&lt;li&gt;Replace with:

&lt;ul&gt;
&lt;li&gt;Mean / Median (numerical)&lt;/li&gt;
&lt;li&gt;Mode (categorical)
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; df.fillna(df.mean())
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;median&lt;/strong&gt; if data has outliers&lt;/li&gt;
&lt;li&gt;Never fill test data using test statistics (data leakage!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What is an outlier?&lt;/strong&gt;&lt;br&gt;
 A value that is &lt;strong&gt;very different&lt;/strong&gt; from the rest.&lt;br&gt;
 Example:&lt;br&gt;
 Salaries in a company:&lt;br&gt;
 &lt;code&gt;[45k, 48k, 50k, 52k, 49k, 2000k]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That &lt;strong&gt;2000k (2 million)&lt;/strong&gt; salary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skews the average&lt;/li&gt;
&lt;li&gt;Confuses the model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why it’s bad?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean salary becomes unrealistic&lt;/li&gt;
&lt;li&gt;Model learns wrong patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What we do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove it&lt;/li&gt;
&lt;li&gt;Cap it&lt;/li&gt;
&lt;li&gt;Use median instead of mean&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;u&gt;Encoding Categorical Variables&lt;/u&gt;&lt;br&gt;
&lt;strong&gt;Problem&lt;/strong&gt;: ML models only understand numbers, not text.&lt;br&gt;
&lt;strong&gt;Types:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Label Encoding&lt;/strong&gt; → ordered categories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-Hot Encoding&lt;/strong&gt; → unordered categories (most common)
&lt;code&gt;pd.get_dummies(df["City"])&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;City = ["Delhi", "New York City", "Delhi"]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;Wrong way:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Delhi = 1
New York City = 2
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;Model thinks New York City &amp;gt; Delhi ❌ (no meaning!)&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Correct way: One-Hot Encoding&lt;/strong&gt;&lt;br&gt;
Create separate columns:&lt;br&gt;
&lt;strong&gt;[Delhi]&lt;/strong&gt; = [1, 0, 1]&lt;br&gt;
&lt;strong&gt;[New York City]&lt;/strong&gt; = [0, 1, 0]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No false ordering&lt;/li&gt;
&lt;li&gt;Model understands categories correctly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key idea:&lt;/strong&gt;&lt;br&gt;
Never give false numeric meaning to categories.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;u&gt;Feature Scaling&lt;/u&gt;&lt;br&gt;
&lt;strong&gt;Problem&lt;/strong&gt;: Different features come in very different ranges.&lt;br&gt;
Example 1:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Age → 0–100&lt;/li&gt;
&lt;li&gt;Salary → 0–100000&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This breaks distance-based models (KNN, SVM).&lt;/p&gt;

&lt;p&gt;Example 2:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Age → 18–60&lt;/li&gt;
&lt;li&gt;Salary → 20,000–200,000&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Model pays more attention to Salary just because numbers are bigger ❌&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Solution: Scaling&lt;/strong&gt;&lt;br&gt;
Bring all values to similar ranges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two common methods:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standardization&lt;/strong&gt; → most used&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalization&lt;/strong&gt; → 0 to 1 range&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔹 Standardization (most used) = &lt;code&gt;(x − mean) / std&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   from sklearn.preprocessing import StandardScaler
   scaler = StandardScaler()
   X_scaled = scaler.fit_transform(X)
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;🔹 Normalization = &lt;code&gt;(x − min) / (max − min)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fit scaler on &lt;strong&gt;training data only&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Apply the same transformation to test data&lt;/li&gt;
&lt;/ul&gt;
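&lt;p&gt;Normalization can be sketched the same way, assuming the same &lt;code&gt;X&lt;/code&gt; as the standardization example and scikit-learn's &lt;code&gt;MinMaxScaler&lt;/code&gt;:&lt;/p&gt;

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([
    [20, 30000],
    [30, 50000],
    [40, 70000],
], dtype=float)

# Normalization: (x - min) / (max - min), per column, giving values in [0, 1]
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)
print(X_norm)
```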


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;u&gt;Feature Selection&lt;/u&gt;&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Keep only useful features and remove useless ones.&lt;br&gt;
 Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduces noise&lt;/li&gt;
&lt;li&gt;Improves performance&lt;/li&gt;
&lt;li&gt;Avoids overfitting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove constant columns&lt;/li&gt;
&lt;li&gt;Remove highly correlated features&lt;/li&gt;
&lt;li&gt;Domain knowledge–based selection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
 Predicting house price:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Size&lt;/li&gt;
&lt;li&gt;✅ Location&lt;/li&gt;
&lt;li&gt;❌ Owner name&lt;/li&gt;
&lt;li&gt;❌ Phone number&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less noise&lt;/li&gt;
&lt;li&gt;Faster training&lt;/li&gt;
&lt;li&gt;Better accuracy&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;u&gt;Outlier Handling&lt;/u&gt;&lt;br&gt;
&lt;strong&gt;Outliers&lt;/strong&gt; distort learning.&lt;br&gt;
Common approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove extreme values&lt;/li&gt;
&lt;li&gt;Cap values (winsorization)&lt;/li&gt;
&lt;li&gt;Use robust scalers&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Models like tree-based algorithms are less sensitive.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Outliers are not always wrong!&lt;/strong&gt;&lt;br&gt;
Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Billionaires exist&lt;/li&gt;
&lt;li&gt;Olympic athletes exist&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Options:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove (if error)&lt;/li&gt;
&lt;li&gt;Cap (limit max/min)&lt;/li&gt;
&lt;li&gt;Keep (if meaningful)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Models affected:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linear models → very sensitive&lt;/li&gt;
&lt;li&gt;Tree models → less sensitive&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;u&gt;Data Leakage (CRITICAL CONCEPT)&lt;/u&gt;&lt;br&gt;
&lt;strong&gt;What is it?&lt;/strong&gt;&lt;br&gt;
 Using information during training that wouldn’t be available in real life, i.e. letting &lt;strong&gt;future or test information&lt;/strong&gt; leak into training.&lt;br&gt;
🚫 Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scaling before train-test split&lt;/li&gt;
&lt;li&gt;Filling missing values using entire dataset&lt;/li&gt;
&lt;li&gt;Using future data to predict past&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt;  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All preprocessing decisions must be learned from &lt;strong&gt;training data only&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;❌ Bad example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scaling entire dataset before split&lt;/li&gt;
&lt;li&gt;Finding mean using full data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Model secretly sees test data ❌&lt;/p&gt;

&lt;p&gt;✅ Correct way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Split data&lt;/li&gt;
&lt;li&gt;Learn statistics from training&lt;/li&gt;
&lt;li&gt;Apply to test&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Final Mental Model (Remember this):&lt;br&gt;
&lt;code&gt;Clean data → Fair split → Honest training → Reliable model&lt;/code&gt;&lt;/p&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Typical ML Preprocessing Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7p8lc1qsc5296tequt76.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7p8lc1qsc5296tequt76.png" alt="ML Preprocessing Pipeline" width="641" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To summarize:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Data preprocessing is where ML models are made or broken — it’s more important than the algorithm itself.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>100daysofcode</category>
      <category>mlbasics</category>
      <category>datapreprocessing</category>
    </item>
    <item>
      <title>The next basic concept of Machine Learning after NumPy: Pandas</title>
      <dc:creator>Juhi Kushwah</dc:creator>
      <pubDate>Mon, 05 Jan 2026 07:46:20 +0000</pubDate>
      <link>https://dev.to/juhikushwah/the-next-basic-concept-of-machine-learning-after-numpy-pandas-4h4a</link>
      <guid>https://dev.to/juhikushwah/the-next-basic-concept-of-machine-learning-after-numpy-pandas-4h4a</guid>
      <description>&lt;p&gt;&lt;em&gt;The emphasis on NumPy in the heading, despite this post focusing on the Pandas library, reflects my intent to document my iterative learning journey on this platform as part of the #100DaysOfCode challenge. Additional information on NumPy can be found here:&lt;/em&gt; &lt;a href="https://dev.to/juhikushwah/understanding-numpy-in-the-context-of-python-for-machine-learning-43i7"&gt;Understanding NumPy in the context of Python for Machine Learning&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After &lt;strong&gt;NumPy&lt;/strong&gt;, the next basic concept for Machine Learning is &lt;strong&gt;Pandas&lt;/strong&gt;, followed closely by &lt;strong&gt;data preprocessing concepts&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let me explain this as a clear learning path, not just a list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Recap NumPy&lt;/u&gt;&lt;/strong&gt;&lt;br&gt;
We learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Arrays &amp;amp; matrices&lt;/li&gt;
&lt;li&gt;Vectorized operations&lt;/li&gt;
&lt;li&gt;Basic linear algebra&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the &lt;strong&gt;math engine&lt;/strong&gt; of ML.&lt;/p&gt;
&lt;h2&gt;
  
  
  Next Core Concept: Pandas
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is Pandas?&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Pandas&lt;/strong&gt; is a Python library for &lt;strong&gt;data handling and analysis&lt;/strong&gt;.&lt;br&gt;
While NumPy handles numbers, &lt;strong&gt;Pandas handles real-world datasets&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In ML, &lt;strong&gt;most of your time (~70%) is spent on data&lt;/strong&gt;, not modeling.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Why Pandas Comes Next in ML?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;u&gt;Real ML data is messy&lt;/u&gt;&lt;br&gt;
Datasets usually come as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CSV / Excel / JSON files&lt;/li&gt;
&lt;li&gt;Missing values&lt;/li&gt;
&lt;li&gt;Mixed data types (numbers + text)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pandas makes this easier:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
df = pd.read_csv("data.csv")
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;u&gt;Data cleaning &amp;amp; preprocessing (CRUCIAL for ML)&lt;/u&gt;&lt;br&gt;
This is where ML actually begins.&lt;br&gt;
Common tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handling missing values&lt;/li&gt;
&lt;li&gt;Encoding categorical variables&lt;/li&gt;
&lt;li&gt;Feature selection&lt;/li&gt;
&lt;li&gt;Filtering rows/columns
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; df.isnull()
 df.dropna()
 df.fillna(df.mean())
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;u&gt;Bridge between raw data and ML models&lt;/u&gt;&lt;br&gt;
ML libraries (scikit-learn) expect NumPy arrays.&lt;/p&gt;

&lt;p&gt;Pandas makes conversion seamless:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt; X = df[['Age', 'Salary']].values
 y = df['Purchased'].values
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;u&gt;Tabular data representation (DataFrames)&lt;/u&gt;&lt;br&gt;
Pandas introduces the DataFrame (like an Excel table):  &lt;/p&gt;&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kgazkh4tleshl6ljz4j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kgazkh4tleshl6ljz4j.png" alt="Sample Excel data" width="645" height="117"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     df.head()
     df.columns
     df.shape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
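&lt;p&gt;A DataFrame can also be built directly from a Python dict; a minimal sketch (the column names are illustrative):&lt;/p&gt;

```python
import pandas as pd

# A DataFrame is a labeled table: columns have names, rows have an index
df = pd.DataFrame({
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000],
})

print(df.shape)          # (3, 2)
print(list(df.columns))  # ['Age', 'Salary']
```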



&lt;p&gt;&lt;strong&gt;One-line takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;After NumPy, learn Pandas — because Machine Learning starts with data, not models.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>100daysofcode</category>
      <category>mlbasics</category>
      <category>pandas</category>
    </item>
    <item>
      <title>Understanding NumPy in the context of Python for Machine Learning</title>
      <dc:creator>Juhi Kushwah</dc:creator>
      <pubDate>Sun, 04 Jan 2026 07:53:10 +0000</pubDate>
      <link>https://dev.to/juhikushwah/understanding-numpy-in-the-context-of-python-for-machine-learning-43i7</link>
      <guid>https://dev.to/juhikushwah/understanding-numpy-in-the-context-of-python-for-machine-learning-43i7</guid>
      <description>&lt;p&gt;&lt;em&gt;As a coder, I’ve always felt there’s a lot of chaos around AI and ML, even among those who use these abbreviations interchangeably while understanding them conceptually. I’m restarting my journey in the field of machine learning and plan to log my learning as part of the #100DaysOfCode challenge on this platform. Please feel free to share your insights and correct me if needed.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is NumPy (Numerical Python)?
&lt;/h2&gt;

&lt;p&gt;NumPy is a core Python library used for &lt;strong&gt;fast numerical computing&lt;/strong&gt;. It provides a powerful object called the &lt;strong&gt;ndarray&lt;/strong&gt;, which is essentially a highly optimized array for mathematical operations.&lt;/p&gt;

&lt;p&gt;NumPy is foundational for ML: almost every ML library, such as &lt;em&gt;TensorFlow, PyTorch, scikit-learn, and Pandas&lt;/em&gt;, depends on it under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is NumPy essential for Machine Learning?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Efficient numerical operations&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NumPy is much faster than plain Python lists for numerical work.&lt;/li&gt;
&lt;li&gt;Vectorized operations are supported (performing operations on entire arrays at once)&lt;/li&gt;
&lt;li&gt;Example:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; import numpy as np
 a = np.array([1, 2, 3])
 b = np.array([4, 5, 6])
 a + b           # element-wise sum
 a * b           # element-wise multiplication
 np.dot(a, b)    # dot product (matrix multiplication for 2-D arrays)
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Powerful support for Linear Algebra&lt;br&gt;
ML algorithms rely heavily on operations like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Matrix multiplication&lt;/li&gt;
&lt;li&gt;Matrix inverse&lt;/li&gt;
&lt;li&gt;Norms&lt;/li&gt;
&lt;li&gt;Eigenvalues&lt;/li&gt;
&lt;li&gt;Dot products&lt;/li&gt;
&lt;li&gt;NumPy provides fast implementations via functions such as:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; np.dot()
 np.linalg.inv()
 np.linalg.eig()
 np.linalg.norm()
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Foundation for data structures in ML&lt;br&gt;
Training data is usually represented as NumPy arrays:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Features matrix (X)&lt;/strong&gt;: shape = &lt;code&gt;(n_samples, n_features)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Labels vector (y)&lt;/strong&gt;: shape = &lt;code&gt;(n_samples, )&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Example:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; X = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2): 3 samples, 2 features
 y = np.array([0, 1, 0])                  # shape (3,): one label per sample
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Bridge between ML libraries&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Libraries like &lt;strong&gt;scikit-learn&lt;/strong&gt;, &lt;strong&gt;TensorFlow&lt;/strong&gt;, and &lt;strong&gt;Pandas&lt;/strong&gt; internally convert data to NumPy arrays.&lt;/li&gt;
&lt;li&gt;Example:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; import pandas as pd
 df = pd.read_csv("data.csv")
 X = df.values   # becomes a NumPy array (df.to_numpy() is the preferred modern call)
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Random number generation &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This is crucial in ML for &lt;strong&gt;weight initialization, shuffling data, and train/test splits&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Example:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; np.random.seed(42)
 weights = np.random.randn(3,3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;
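&lt;p&gt;&lt;em&gt;To make the linear-algebra points above concrete, here is a minimal, self-contained sketch using the &lt;code&gt;np.linalg&lt;/code&gt; functions listed earlier (the matrix values are made up for illustration):&lt;/em&gt;&lt;/p&gt;

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])            # a small invertible matrix
v = np.array([1.0, 2.0])

product = np.dot(A, v)                # matrix-vector product -> [4., 7.]
A_inv = np.linalg.inv(A)              # matrix inverse
eigvals, eigvecs = np.linalg.eig(A)   # eigenvalues and eigenvectors
norm_v = np.linalg.norm(v)            # Euclidean norm of v -> sqrt(5)

# A @ A_inv should recover the identity (up to floating-point error)
print(np.allclose(A @ A_inv, np.eye(2)))  # True
```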

&lt;h2&gt;
  
  
  Where do you use NumPy in ML?
&lt;/h2&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;How NumPy helps&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data preprocessing&lt;/td&gt;
&lt;td&gt;Slicing, shaping, normalization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Implementing ML algorithms from scratch&lt;/td&gt;
&lt;td&gt;Vectorized math for speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Train-test splits&lt;/td&gt;
&lt;td&gt;Shuffling and indexing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model evaluation&lt;/td&gt;
&lt;td&gt;Vectorized loss calculation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
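&lt;p&gt;&lt;em&gt;As a small illustration of the train-test split task, here is one way to shuffle and split a toy dataset with pure NumPy (the data and the 80/20 ratio are made up for this sketch):&lt;/em&gt;&lt;/p&gt;

```python
import numpy as np

# Toy dataset: 10 samples, 2 features (values made up for illustration)
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

rng = np.random.default_rng(42)     # seeded generator for reproducibility
idx = rng.permutation(len(X))       # shuffled row indices

split = int(0.8 * len(X))           # 80/20 split
train_idx, test_idx = idx[:split], idx[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```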

&lt;p&gt;&lt;em&gt;To summarize, NumPy is the mathematical backbone of Python Machine Learning—providing fast arrays, linear algebra tools, random generators, and vectorized operations that all ML workflows rely on.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>100daysofcode</category>
      <category>mlbasics</category>
      <category>numpy</category>
    </item>
  </channel>
</rss>
