<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Juhi Kushwah</title>
    <description>The latest articles on DEV Community by Juhi Kushwah (@juhikushwah).</description>
    <link>https://dev.to/juhikushwah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F757974%2F13dc9880-a89e-43a6-a850-ae2c04d2ad85.png</url>
      <title>DEV Community: Juhi Kushwah</title>
      <link>https://dev.to/juhikushwah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/juhikushwah"/>
    <language>en</language>
    <item>
      <title>How does a machine actually learn from data?</title>
      <dc:creator>Juhi Kushwah</dc:creator>
      <pubDate>Wed, 14 Jan 2026 07:57:30 +0000</pubDate>
      <link>https://dev.to/juhikushwah/how-does-a-machine-actually-learn-from-data-12be</link>
      <guid>https://dev.to/juhikushwah/how-does-a-machine-actually-learn-from-data-12be</guid>
      <description>&lt;p&gt;&lt;em&gt;I was discussing this with my co-worker (who is also an ML engineer) as to how a beginner like me should approach machine learning? She said now that I've &lt;strong&gt;intentionally mastered NumPy → Pandas → Data Preprocessing conceptually, the next concept should NOT be “more tools”.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;It should be &lt;strong&gt;ML thinking itself&lt;/strong&gt;!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Her suggestion, somehow, did not sit well with me—partly because there are endless tools if you think about it! I had narrowed things down to &lt;strong&gt;NumPy&lt;/strong&gt;, &lt;strong&gt;Pandas&lt;/strong&gt;, &lt;strong&gt;Data Preprocessing&lt;/strong&gt; and &lt;strong&gt;Scikit-learn&lt;/strong&gt; (I haven’t covered this topic yet, for reasons I’ll explain as we dive deeper into this post) based on my own understanding of the subject. However, what she said next made more sense to me, because this is where my perspective as a software engineer comes into play—it’s important to understand the &lt;strong&gt;mental model behind algorithms&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;If you are an iterative learner like me, you're right to pause here and ask why we shouldn't jump into &lt;strong&gt;scikit-learn&lt;/strong&gt; before understanding how learning itself works.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Short answer (the important one): &lt;strong&gt;learn &lt;em&gt;just enough&lt;/em&gt; scikit-learn, but only &lt;em&gt;after&lt;/em&gt; you understand how learning works.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let me elaborate on this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🎯 The Correct Order (Beginner-Optimal)&lt;/strong&gt;&lt;br&gt;
You should &lt;strong&gt;NOT&lt;/strong&gt; fully learn scikit-learn &lt;em&gt;before&lt;/em&gt; understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what a model is&lt;/li&gt;
&lt;li&gt;what loss is&lt;/li&gt;
&lt;li&gt;what training means&lt;/li&gt;
&lt;li&gt;what overfitting is&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Otherwise, scikit-learn becomes a &lt;strong&gt;black box&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧠 Think of scikit-learn like this&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Concepts → &lt;em&gt;why something works&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;scikit-learn → &lt;em&gt;how to apply it quickly&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you reverse this order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = LinearRegression()
model.fit(X, y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can run the code — but you &lt;strong&gt;don’t actually know what happened&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;why it works&lt;/li&gt;
&lt;li&gt;when it fails&lt;/li&gt;
&lt;li&gt;what assumptions it makes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, you (as a beginner) should first learn &lt;strong&gt;learning types + core ML ideas&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ What You SHOULD do instead (Best approach)&lt;/strong&gt;&lt;br&gt;
&lt;u&gt;Step 1️⃣ — Learn learning concepts (NO scikit-learn yet)&lt;/u&gt;&lt;br&gt;
(This is what we are already doing)&lt;/p&gt;

&lt;p&gt;Learn conceptually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supervised learning&lt;/li&gt;
&lt;li&gt;Regression vs classification&lt;/li&gt;
&lt;li&gt;Model = function&lt;/li&gt;
&lt;li&gt;Loss function&lt;/li&gt;
&lt;li&gt;Overfitting vs underfitting&lt;/li&gt;
&lt;li&gt;Train vs test behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 This can be done with &lt;strong&gt;math intuition + NumPy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Step 2️⃣ — Implement Linear Regression from scratch&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NumPy&lt;/li&gt;
&lt;li&gt;A few lines of math&lt;/li&gt;
&lt;li&gt;No ML libraries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How does the model actually learn?”&lt;/p&gt;
&lt;/blockquote&gt;
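&lt;p&gt;A minimal sketch of what Step 2 looks like: gradient descent on a toy dataset in plain NumPy. The data, learning rate and iteration count are my own illustrative choices, not from any library:&lt;/p&gt;

```python
import numpy as np

# Toy data: y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
y = 3 * X + 2 + rng.normal(0, 1, size=50)

# Start from a guess, then repeatedly nudge w and b
# in the direction that reduces the mean squared error
w, b = 0.0, 0.0
lr = 0.01  # learning rate (step size)

for _ in range(2000):
    y_pred = w * X + b               # the model's current predictions
    error = y_pred - y
    grad_w = 2 * np.mean(error * X)  # d(MSE)/dw
    grad_b = 2 * np.mean(error)      # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # ends up close to the true 3 and 2
```

&lt;p&gt;That loop &lt;em&gt;is&lt;/em&gt; the learning: measure how wrong the model is, then adjust the parameters to be slightly less wrong.&lt;/p&gt;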

&lt;p&gt;&lt;u&gt;Step 3️⃣ — THEN introduce scikit-learn (lightly)&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Once the concept clicks, scikit-learn becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean&lt;/li&gt;
&lt;li&gt;Logical&lt;/li&gt;
&lt;li&gt;Easy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You’ll instantly understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;.fit()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.predict()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.score()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
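&lt;p&gt;Once you know what training means, those three methods map directly onto the concepts. A tiny sketch (the toy data is mine; &lt;code&gt;LinearRegression&lt;/code&gt;, &lt;code&gt;fit&lt;/code&gt;, &lt;code&gt;predict&lt;/code&gt; and &lt;code&gt;score&lt;/code&gt; are real scikit-learn API):&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])  # features: 2D, shape (samples, features)
y = np.array([3, 5, 7, 9])          # labels: y = 2x + 1

model = LinearRegression()
model.fit(X, y)              # training: find the best w and b
print(model.predict([[5]]))  # prediction for a new input -> [11.]
print(model.score(X, y))     # R² score; 1.0 for a perfect fit
```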

&lt;p&gt;&lt;strong&gt;❌ What NOT to do (common beginner mistake)&lt;/strong&gt;&lt;br&gt;
❌ Deep dive into scikit-learn API&lt;br&gt;
❌ Memorize classifiers and parameters&lt;br&gt;
❌ Jump to advanced models too early&lt;/p&gt;

&lt;p&gt;This creates fragile understanding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧭 Minimal scikit-learn you may peek at (optional)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s okay to &lt;em&gt;recognize&lt;/em&gt; these, not master them yet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(You already used these in previous posts.)&lt;/p&gt;

&lt;p&gt;But &lt;strong&gt;don’t learn models yet&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🎯 The Next Beginner ML Concept: Supervised Learning Fundamentals&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🔑 Concept 1: Types of Machine Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1️⃣ Supervised Learning (START HERE)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input features (X)&lt;/li&gt;
&lt;li&gt;Correct answers (y)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predict salary → regression&lt;/li&gt;
&lt;li&gt;Predict spam/not spam → classification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is &lt;strong&gt;90% of beginner ML&lt;/strong&gt;.&lt;/p&gt;
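&lt;p&gt;In code, “inputs plus correct answers” is just two arrays. A toy example with made-up numbers:&lt;/p&gt;

```python
import numpy as np

# X: each row is one example (here: years of experience, age)
X = np.array([[1, 22],
              [3, 25],
              [5, 30]])

# y: the correct answer (salary) for each row
y = np.array([30000, 45000, 60000])

# Supervised learning = find a function f such that f(X[i]) ≈ y[i]
print(X.shape, y.shape)  # (3, 2) (3,)
```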




&lt;p&gt;&lt;strong&gt;2️⃣ Unsupervised Learning (later)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No labels.&lt;/li&gt;
&lt;li&gt;Model finds structure itself.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Examples:&lt;br&gt;
Customer segmentation → “Group similar customers”&lt;br&gt;
Clustering → “The method used to form those groups”&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;3️⃣ Reinforcement Learning (much later)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent learns via rewards.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;📌 &lt;strong&gt;For now&lt;/strong&gt;: Focus ONLY on &lt;strong&gt;Supervised Learning&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  🔑 Concept 2: Regression vs Classification
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🟦 Regression&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Predict a &lt;strong&gt;number&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;House price → $250,000
Temperature → 28.5°C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;🟥 Classification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Predict a &lt;strong&gt;category&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Spam / Not Spam
Yes / No
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;🧠 Tiny mental exercise&lt;/strong&gt;&lt;br&gt;
Which is which?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Problem            | Type           |
| ------------------ | -------------- |
| Predict exam score | Regression     |
| Predict pass/fail  | Classification |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🔑 Concept 3: Model, Parameters &amp;amp; Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🧠 What is a model?&lt;/strong&gt;&lt;br&gt;
A &lt;strong&gt;mathematical function&lt;/strong&gt; that maps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X → y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;y = w*x + b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;w&lt;/code&gt; → weight (importance)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;b&lt;/code&gt; → bias (offset)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Learning = finding &lt;strong&gt;best w and b&lt;/strong&gt;.&lt;/p&gt;
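&lt;p&gt;With concrete (hypothetical) numbers for &lt;code&gt;w&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;, the “model” really is just a function:&lt;/p&gt;

```python
w, b = 2.0, 1.0  # pretend these are the learned parameters

def model(x):
    return w * x + b  # the entire model

print(model(3))  # 2*3 + 1 = 7.0
```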




&lt;h2&gt;
  
  
  🔑 Concept 4: Loss Function (VERY IMPORTANT)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🧠 What is loss?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;“How wrong is the model?”&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;True value = 100&lt;/li&gt;
&lt;li&gt;Prediction = 90&lt;/li&gt;
&lt;li&gt;Error = 10&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Loss function &lt;strong&gt;quantifies this error&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Common:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean Squared Error (MSE)&lt;/li&gt;
&lt;/ul&gt;
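&lt;p&gt;MSE is one line of NumPy. Toy numbers of my own choosing:&lt;/p&gt;

```python
import numpy as np

y_true = np.array([100, 50, 80])
y_pred = np.array([90, 55, 80])

# Mean Squared Error: average of the squared differences
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # (10² + 5² + 0²) / 3 ≈ 41.67
```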




&lt;h2&gt;
  
  
  🔑 Concept 5: Training vs Prediction
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Training phase:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model sees data&lt;/li&gt;
&lt;li&gt;Adjusts parameters&lt;/li&gt;
&lt;li&gt;Minimizes loss&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Prediction phase:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model is frozen&lt;/li&gt;
&lt;li&gt;Makes predictions on new data&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔑 Concept 6: Overfitting vs Underfitting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Underfitting:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model too simple&lt;/li&gt;
&lt;li&gt;Misses patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Overfitting:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model memorizes data&lt;/li&gt;
&lt;li&gt;Fails on new data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;📌 This is the heart of ML.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🔑 Concept 7: Evaluation Metrics (Conceptual)
&lt;/h2&gt;

&lt;p&gt;You don’t evaluate a model on its training data; performance there says little about how it handles new data.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regression → MSE, RMSE, R²&lt;/li&gt;
&lt;li&gt;Classification → Accuracy, Precision, Recall&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(You’ll learn these slowly — concept first.)&lt;/p&gt;
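&lt;p&gt;Just to show these metrics are ordinary function calls (the toy numbers are mine; &lt;code&gt;mean_squared_error&lt;/code&gt; and &lt;code&gt;accuracy_score&lt;/code&gt; are real scikit-learn functions):&lt;/p&gt;

```python
import numpy as np
from sklearn.metrics import mean_squared_error, accuracy_score

# Regression: compare true vs predicted numbers
y_true = np.array([100, 50, 80])
y_pred = np.array([90, 55, 80])
print(mean_squared_error(y_true, y_pred))  # ≈ 41.67

# Classification: fraction of labels predicted correctly
labels_true = np.array([1, 0, 1, 1])
labels_pred = np.array([1, 0, 0, 1])
print(accuracy_score(labels_true, labels_pred))  # 0.75
```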




&lt;p&gt;&lt;em&gt;I know I’ve introduced a few advanced terms at a beginner level to give an idea of what the roadmap to understanding machine learning looks like. Don’t worry if they feel unfamiliar right now — I’ll be exploring each of these topics in depth as we go.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can refer to these posts for understanding NumPy, Pandas and Data Preprocessing:&lt;br&gt;
&lt;a href="https://dev.to/juhikushwah/understanding-numpy-in-the-context-of-python-for-machine-learning-43i7"&gt;Understanding NumPy in the context of Python for Machine Learning&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/juhikushwah/the-next-basic-concept-of-machine-learning-after-numpy-pandas-4h4a"&gt;The next basic concept of Machine Learning after NumPy: Pandas&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/juhikushwah/understanding-data-preprocessing-4g6g"&gt;Understanding Data Preprocessing&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/juhikushwah/beginner-friendly-exercises-on-numpy-pandas-and-data-preprocessing-3af5"&gt;Beginner-friendly exercises on NumPy, Pandas and Data Preprocessing&lt;/a&gt;&lt;/p&gt;

</description>
      <category>100daysofcode</category>
      <category>mlbasics</category>
      <category>learningmodels</category>
      <category>scikitlearn</category>
    </item>
    <item>
      <title>Beginner-friendly exercises on NumPy, Pandas and Data Preprocessing</title>
      <dc:creator>Juhi Kushwah</dc:creator>
      <pubDate>Thu, 08 Jan 2026 07:52:45 +0000</pubDate>
      <link>https://dev.to/juhikushwah/beginner-friendly-exercises-on-numpy-pandas-and-data-preprocessing-3af5</link>
      <guid>https://dev.to/juhikushwah/beginner-friendly-exercises-on-numpy-pandas-and-data-preprocessing-3af5</guid>
      <description>&lt;p&gt;&lt;em&gt;Before diving deep into Machine Learning, I would like to share tiny, beginner-friendly code-based exercises based on NumPy, Pandas and Data Preprocessing - &lt;strong&gt;small, focused and ML oriented&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ✅ NumPy Mini Exercises (Level: Very Easy)
&lt;/h2&gt;

&lt;p&gt;Make sure you import NumPy first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 1 — Create a NumPy array&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Create a NumPy array containing these numbers:&lt;br&gt;
[2, 4, 6, 8]&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = np.array([2, 4, 6, 8])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 2 — Create a 2D array&lt;/u&gt;&lt;br&gt;
Create this 2×2 matrix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 2
3 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;m = np.array([[1, 2],
              [3, 4]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 3 — Array shape&lt;/u&gt;&lt;br&gt;
Find the &lt;strong&gt;shape&lt;/strong&gt; of this array:&lt;br&gt;
&lt;code&gt;a = np.array([[10, 20, 30], [40, 50, 60]])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = np.array([[10, 20, 30], [40, 50, 60]])
a.shape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(2, 3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 4 — Element-wise operations&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compute:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; a + b&lt;/li&gt;
&lt;li&gt; a * b&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Solution:&lt;br&gt;
I. Addition&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a + b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([11, 22, 33])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;II. Multiplication&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a * b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([10, 40, 90])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 5 — Slicing&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;code&gt;a = np.array([5, 10, 15, 20, 25])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Extract the middle three values:&lt;br&gt;
&lt;code&gt;[10, 15, 20]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = np.array([5, 10, 15, 20, 25])
middle = a[1:4]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([10, 15, 20])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 6 — Zeros and ones arrays&lt;/u&gt;&lt;br&gt;
Create:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A &lt;strong&gt;3×3 matrix of zeros&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;2×4 matrix of ones&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Solution:&lt;br&gt;
I. 3×3 matrix of zeros&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;np.zeros((3, 3))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;II. 2×4 matrix of ones&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;np.ones((2, 4))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 7 — Random numbers&lt;/u&gt;&lt;br&gt;
Generate a NumPy array of &lt;strong&gt;five random numbers&lt;/strong&gt; between 0 and 1.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r = np.random.rand(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output (example; the values differ on every run):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([0.23, 0.91, 0.49, 0.11, 0.76])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 8 — Matrix multiplication&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compute:&lt;br&gt;
&lt;code&gt;A @ B&lt;/code&gt;&lt;br&gt;
(or &lt;code&gt;np.dot(A, B)&lt;/code&gt;)&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])

A @ B   # or np.dot(A, B)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([[19, 22],
       [43, 50]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 9 — Mean of an array&lt;/u&gt;&lt;br&gt;
Compute the &lt;strong&gt;mean&lt;/strong&gt; of:&lt;br&gt;
&lt;code&gt;x = np.array([4, 8, 12, 16])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = np.array([4, 8, 12, 16])
np.mean(x)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 10 — Reshape&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;code&gt;x = np.array([1, 2, 3, 4, 5, 6])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Reshape it into a &lt;strong&gt;2×3 matrix&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = np.array([1, 2, 3, 4, 5, 6])
x.reshape(2, 3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([[1, 2, 3],
       [4, 5, 6]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📊 Pandas Mini Exercises (Level: Very Easy)
&lt;/h2&gt;

&lt;p&gt;Start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 1 — Create a DataFrame&lt;/u&gt;&lt;br&gt;
Create a DataFrame from this dictionary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = {
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = {
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000]
}

df = pd.DataFrame(data)
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dictionary keys → column names&lt;/li&gt;
&lt;li&gt;Lists → column values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Age    Salary
0   25     50000
1   30     60000
2   35     70000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 2 — View data&lt;/u&gt;&lt;br&gt;
Using the DataFrame from Exercise 1:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Display the &lt;strong&gt;first 2 rows&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Display the &lt;strong&gt;column names&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Display the &lt;strong&gt;shape&lt;/strong&gt; of the DataFrame&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.head(2)
df.columns
df.shape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;head(2) → first 2 rows&lt;/li&gt;
&lt;li&gt;columns → column names&lt;/li&gt;
&lt;li&gt;shape → (rows, columns)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#first 2 rows
   Age  Salary  
0   25   50000
1   30   60000

#column names
Index(['Age', 'Salary'], dtype='object')

#3 rows, 2 columns
(3, 2)   
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 3 — Select a column&lt;/u&gt;&lt;br&gt;
Select only the &lt;strong&gt;Salary&lt;/strong&gt; column.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df["Salary"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single brackets → returns a &lt;strong&gt;Series&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#This is a Series, not a DataFrame

0    50000
1    60000
2    70000
Name: Salary, dtype: int64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 4 — Select multiple columns&lt;/u&gt;&lt;br&gt;
Select &lt;strong&gt;Age&lt;/strong&gt; and &lt;strong&gt;Salary&lt;/strong&gt; together.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df[["Age", "Salary"]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Double brackets → returns a &lt;strong&gt;DataFrame&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Double brackets → DataFrame

   Age  Salary
0   25   50000
1   30   60000
2   35   70000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 5 — Filter rows&lt;/u&gt;&lt;br&gt;
From the DataFrame, select rows where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Age &amp;gt; 28&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df[df["Age"] &amp;gt; 28]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Boolean condition filters rows; it is a &lt;strong&gt;core Pandas skill&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Very common in data cleaning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Age  Salary
1   30   60000
2   35   70000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 6 — Add a new column&lt;/u&gt;&lt;br&gt;
Add a column called &lt;strong&gt;Tax&lt;/strong&gt; which is &lt;strong&gt;10% of Salary&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df["Tax"] = 0.10 * df["Salary"]
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pandas supports vectorized operations&lt;/li&gt;
&lt;li&gt;Applied to entire column at once&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Operations apply row-wise automatically

   Age  Salary     Tax
0   25   50000  5000.0
1   30   60000  6000.0
2   35   70000  7000.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 7 — Basic statistics&lt;/u&gt;&lt;br&gt;
Compute:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Mean Age&lt;/li&gt;
&lt;li&gt;Maximum Salary&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df["Age"].mean()
df["Salary"].max()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pandas has built-in descriptive stats&lt;/li&gt;
&lt;li&gt;Used heavily during EDA&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Mean Age
30.0

#Maximum Salary
70000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 8 — Handle missing values&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = {
    "Age": [25, None, 35],
    "Salary": [50000, 60000, None]
}
df = pd.DataFrame(data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Detect missing values&lt;/li&gt;
&lt;li&gt;Fill missing values with the &lt;strong&gt;column mean&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.isnull()
df_filled = df.fillna(df.mean())
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;isnull()&lt;/code&gt; → detects missing values&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fillna(df.mean())&lt;/code&gt; → fills numeric NaNs with column mean&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output (displaying &lt;code&gt;df&lt;/code&gt; shows it still contains the missing values):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Age   Salary
0  25.0  50000.0
1   NaN  60000.0
2  35.0      NaN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Breaking this down here:&lt;br&gt;
✅ Detect missing values&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.isnull()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     Age  Salary
0  False   False
1   True   False
2  False    True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ Fill missing values with mean&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_filled = df.fillna(df.mean())
df_filled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Age   Salary
0  25.0  50000.0
1  30.0  60000.0
2  35.0  55000.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Means used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Age mean = 30&lt;/li&gt;
&lt;li&gt;Salary mean = 55,000&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Exercise 9 — Sort values&lt;/u&gt;&lt;br&gt;
Sort the filled DataFrame (&lt;code&gt;df_filled&lt;/code&gt;) by &lt;strong&gt;Salary (descending order)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.sort_values(by="Salary", ascending=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sorting helps identify top/bottom values&lt;/li&gt;
&lt;li&gt;Common during analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Age   Salary
1  30.0  60000.0
2  35.0  55000.0
0  25.0  50000.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 10 — Convert to NumPy (ML step)&lt;/u&gt;&lt;br&gt;
Convert:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Features → &lt;code&gt;Age&lt;/code&gt;, &lt;code&gt;Salary&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Target → &lt;code&gt;Tax&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;into NumPy arrays.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = df[["Age", "Salary"]].values
y = df["Tax"].values

X
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.values&lt;/code&gt; converts Pandas → NumPy (&lt;code&gt;.to_numpy()&lt;/code&gt; is the newer equivalent)&lt;/li&gt;
&lt;li&gt;scikit-learn works on NumPy arrays&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([[2.5e+01, 5.0e+04],
       [3.0e+01, 6.0e+04],
       [3.5e+01, 5.5e+04]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ Target&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([50000., 60000., 55000.])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 This is exactly the format ML models expect&lt;/p&gt;
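&lt;p&gt;As a quick sanity check, here is a minimal, self-contained sketch of the shapes scikit-learn expects (the &lt;code&gt;Tax&lt;/code&gt; values here are made up for illustration, since the full DataFrame is defined earlier in the post):&lt;/p&gt;

```python
import pandas as pd

# Illustrative DataFrame; the Tax values are assumed, not from the original exercise
df = pd.DataFrame({
    "Age": [25.0, 30.0, 35.0],
    "Salary": [50000.0, 60000.0, 55000.0],
    "Tax": [5000.0, 6000.0, 5500.0],
})

X = df[["Age", "Salary"]].to_numpy()  # feature matrix, shape (n_samples, n_features)
y = df["Tax"].to_numpy()              # target vector, shape (n_samples,)

print(X.shape)  # (3, 2)
print(y.shape)  # (3,)
```

&lt;p&gt;&lt;code&gt;.to_numpy()&lt;/code&gt; is the newer, recommended spelling of &lt;code&gt;.values&lt;/code&gt;; both return NumPy arrays.&lt;/p&gt;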

&lt;h2&gt;
  
  
  🧪 Data Preprocessing: Code-Based Mini Exercises
&lt;/h2&gt;

&lt;p&gt;Start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 1 — Train/Test Split (Ratio practice)&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = np.array([[1], [2], [3], [4], [5]])
y = np.array([10, 20, 30, 40, 50])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Split the data into &lt;strong&gt;80% training and 20% testing&lt;/strong&gt;.&lt;br&gt;
Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([10, 20, 30, 40, 50])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;test_size=0.2&lt;/code&gt; → 20% test, 80% train&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;random_state&lt;/code&gt; ensures reproducibility&lt;/li&gt;
&lt;li&gt;Model learns from &lt;code&gt;X_train&lt;/code&gt;, evaluated on &lt;code&gt;X_test&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output (one possible split; the exact rows selected depend on the shuffle):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X_train = [[4], [2], [5], [3]]
X_test  = [[1]]

y_train = [40, 20, 50, 30]
y_test  = [10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 2 — Detect missing values&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = {
    "Age": [25, 30, None, 40],
    "Salary": [50000, None, 70000, 80000]
}
df = pd.DataFrame(data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Write code to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Detect missing values&lt;/li&gt;
&lt;li&gt;Count missing values per column&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.isnull()

df.isnull().sum()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;isnull()&lt;/code&gt; → True/False for each cell&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sum()&lt;/code&gt; counts missing values per column&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output of &lt;code&gt;df.isnull()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     Age  Salary
0  False   False
1  False    True
2   True   False
3  False   False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output of &lt;code&gt;df.isnull().sum()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Age       1
Salary    1
dtype: int64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 3 — Fill missing values (Mean)&lt;/u&gt;&lt;br&gt;
Using the same DataFrame above:&lt;br&gt;
👉 Fill missing values using &lt;strong&gt;column mean&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_filled = df.fillna(df.mean())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replaces NaN with column mean&lt;/li&gt;
&lt;li&gt;Common for numerical ML features&lt;/li&gt;
&lt;li&gt;Keeps dataset size intact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output (rounded):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Age   Salary
0  25.0  50000.0
1  30.0  66666.7
2  31.7  70000.0
3  40.0  80000.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(Age filled with the column mean ≈ 31.67; Salary filled with ≈ 66666.67)&lt;/em&gt;&lt;/p&gt;
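&lt;p&gt;When the data contains outliers, the median is usually a safer fill value than the mean; a minimal sketch with the same DataFrame:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, None, 40],
    "Salary": [50000, None, 70000, 80000],
})

# Median is more robust to outliers than the mean
df_filled = df.fillna(df.median())
print(df_filled)
```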

&lt;p&gt;&lt;u&gt;Exercise 4 — One-Hot Encoding (Categorical Data)&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai"]
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Convert &lt;code&gt;City&lt;/code&gt; into numerical columns using &lt;strong&gt;one-hot encoding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;encoded_df = pd.get_dummies(df["City"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OR keep original structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;encoded_df = pd.get_dummies(df, columns=["City"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Converts text categories into binary columns&lt;/li&gt;
&lt;li&gt;Avoids false numeric ordering&lt;/li&gt;
&lt;li&gt;Required before most ML models can consume the data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;City
Delhi
Mumbai
Delhi
Chennai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   City_Chennai  City_Delhi  City_Mumbai
0             0           1             0
1             0           0             1
2             0           1             0
3             1           0             0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
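&lt;p&gt;Note: depending on your pandas version, &lt;code&gt;get_dummies&lt;/code&gt; may return &lt;code&gt;True&lt;/code&gt;/&lt;code&gt;False&lt;/code&gt; booleans rather than 0/1. Assuming you want the integer display shown above, you can cast:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({"City": ["Delhi", "Mumbai", "Delhi", "Chennai"]})

# pandas 2.x returns bool dummy columns by default; cast to int for a 0/1 table
encoded = pd.get_dummies(df, columns=["City"]).astype(int)
print(list(encoded.columns))  # ['City_Chennai', 'City_Delhi', 'City_Mumbai']
```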



&lt;p&gt;&lt;u&gt;Exercise 5 — Feature Scaling (Standardization)&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = np.array([
    [20, 30000],
    [30, 50000],
    [40, 70000]
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Apply &lt;strong&gt;Standard Scaling&lt;/strong&gt; to &lt;code&gt;X&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import StandardScaler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Centers data around mean = 0&lt;/li&gt;
&lt;li&gt;Std deviation = 1&lt;/li&gt;
&lt;li&gt;Essential for distance-based models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output (approx.):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[[-1.2247, -1.2247],
 [ 0.0000,  0.0000],
 [ 1.2247,  1.2247]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
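&lt;p&gt;You can verify what &lt;code&gt;StandardScaler&lt;/code&gt; does by applying the formula &lt;code&gt;(x - mean) / std&lt;/code&gt; by hand, per column:&lt;/p&gt;

```python
import numpy as np

X = np.array([
    [20, 30000],
    [30, 50000],
    [40, 70000],
], dtype=float)

# Standardization by hand, per column (population std, the same as StandardScaler uses)
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(np.round(X_scaled, 4))
```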



&lt;p&gt;&lt;u&gt;Exercise 6 — Feature Selection&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.DataFrame({
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000],
    "EmployeeID": [101, 102, 103]
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Remove the &lt;strong&gt;EmployeeID&lt;/strong&gt; column.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_selected = df.drop("EmployeeID", axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IDs carry no predictive value&lt;/li&gt;
&lt;li&gt;Removing noise improves model learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Input columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['Age', 'Salary', 'EmployeeID']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['Age', 'Salary']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Exercise 7 — Outlier Detection (Simple logic)&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;code&gt;ages = np.array([22, 23, 24, 25, 120])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;👉 Write code to &lt;strong&gt;remove values greater than 100&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;filtered_ages = ages[ages &amp;lt;= 100]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple rule-based filtering&lt;/li&gt;
&lt;li&gt;Useful for obvious data errors&lt;/li&gt;
&lt;li&gt;Always inspect before removing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[22, 23, 24, 25]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
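&lt;p&gt;An alternative to dropping outliers is capping them; a minimal sketch using &lt;code&gt;np.clip&lt;/code&gt; (winsorization-style, with 100 as an assumed cap):&lt;/p&gt;

```python
import numpy as np

ages = np.array([22, 23, 24, 25, 120])

# Cap extreme values instead of removing the rows entirely
capped = np.clip(ages, 0, 100)
print(capped)  # [ 22  23  24  25 100]
```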



&lt;p&gt;&lt;u&gt;Exercise 8 — Data Leakage Check (Thinking + Code)&lt;/u&gt;&lt;br&gt;
Given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import StandardScaler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Write the &lt;strong&gt;correct order of code&lt;/strong&gt; to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Split data&lt;/li&gt;
&lt;li&gt;Fit scaler on training data&lt;/li&gt;
&lt;li&gt;Transform both training and test data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;(No need to run it — just write the correct sequence.)&lt;/p&gt;

&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# 2. Fit scaler ONLY on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Apply same scaler to test data
X_test_scaled = scaler.transform(X_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠 Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training data defines statistics&lt;/li&gt;
&lt;li&gt;Test data must remain unseen&lt;/li&gt;
&lt;li&gt;Prevents unrealistically high accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output (conceptual sequence):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Split data
2. Fit scaler on training data
3. Transform training data
4. Transform test data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
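&lt;p&gt;The sequence above can also be made runnable end to end; a minimal sketch with a small made-up dataset (&lt;code&gt;X&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; here are illustrative):&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([10, 20, 30, 40, 50])

# 1. Split first so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler ONLY on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse the training statistics on the test data
X_test_scaled = scaler.transform(X_test)

# The scaler's mean comes from the training rows only
print(scaler.mean_)
```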



&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;If you can do these exercises comfortably, you’re &lt;strong&gt;ML-ready at a foundational level&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you are new to Python, install Python 3.x and play around with these exercises in your IDE of choice; I use Jupyter Notebook.&lt;/p&gt;

&lt;h2&gt;
  
  
  📌Recommendation (if you're a beginner):
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Do NOT learn scikit-learn models yet.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;First learn how a model learns.&lt;/strong&gt;&lt;br&gt;
Then use scikit-learn as a tool, not a teacher.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’ll explore this further in subsequent posts.&lt;/p&gt;

&lt;p&gt;You can refer to these posts for understanding NumPy, Pandas and Data Preprocessing:&lt;br&gt;
&lt;a href="https://dev.to/juhikushwah/understanding-numpy-in-the-context-of-python-for-machine-learning-43i7"&gt;Understanding NumPy in the context of Python for Machine Learning&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/juhikushwah/the-next-basic-concept-of-machine-learning-after-numpy-pandas-4h4a"&gt;The next basic concept of Machine Learning after NumPy: Pandas&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/juhikushwah/understanding-data-preprocessing-4g6g"&gt;Understanding Data Preprocessing&lt;/a&gt;&lt;/p&gt;

</description>
      <category>100daysofcode</category>
      <category>mlbasics</category>
    </item>
    <item>
      <title>Understanding Data Preprocessing</title>
      <dc:creator>Juhi Kushwah</dc:creator>
      <pubDate>Wed, 07 Jan 2026 07:48:11 +0000</pubDate>
      <link>https://dev.to/juhikushwah/understanding-data-preprocessing-4g6g</link>
      <guid>https://dev.to/juhikushwah/understanding-data-preprocessing-4g6g</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;Data Preprocessing&lt;/strong&gt; - this is exactly the right next step after Pandas. Think of Data Preprocessing as the bridge between &lt;strong&gt;raw data&lt;/strong&gt; and &lt;strong&gt;usable ML input&lt;/strong&gt;. You can find more information Pandas here:&lt;/em&gt; &lt;a href="https://dev.to/juhikushwah/the-next-basic-concept-of-machine-learning-after-numpy-pandas-4h4a"&gt;The next basic concept of Machine Learning after NumPy: Pandas&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Data Preprocessing in Machine Learning?
&lt;/h2&gt;

&lt;p&gt;Data preprocessing is the process of &lt;strong&gt;cleaning, transforming, and preparing data&lt;/strong&gt; so that a machine learning model can learn from it effectively.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A model can only be as good as the data you feed it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Why Data Preprocessing is Critical?&lt;/strong&gt;&lt;br&gt;
Raw data usually has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing values&lt;/li&gt;
&lt;li&gt;Different scales (Age vs Salary)&lt;/li&gt;
&lt;li&gt;Categorical text values&lt;/li&gt;
&lt;li&gt;Noise &amp;amp; irrelevant features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ML algorithms &lt;strong&gt;assume clean, numerical, well-scaled data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Data Preprocessing Concepts (Must-Know)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;u&gt;Train–Test Split&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concept&lt;/strong&gt;: We don’t train and evaluate on the same data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training set → learn patterns&lt;/li&gt;
&lt;li&gt;Test set → evaluate performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What does it mean?&lt;/strong&gt;&lt;br&gt;
   We divide data into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training data → teaches the model&lt;/li&gt;
&lt;li&gt;Testing data → checks how well it learned&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typical split:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;80% train / 20% test
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Why 80:20 or 70:30?&lt;/strong&gt;&lt;br&gt;
   Imagine you have 100 exam questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You &lt;strong&gt;practice&lt;/strong&gt; with 80 questions&lt;/li&gt;
&lt;li&gt;You &lt;strong&gt;test yourself&lt;/strong&gt; with 20 new ones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you test on questions you already practiced → false confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common ratios:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;80% train / 20% test&lt;/strong&gt; → most common&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;70% / 30%&lt;/strong&gt; → small datasets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;90% / 10%&lt;/strong&gt; → very large datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If you allocate too much to training:&lt;/strong&gt; the test set becomes too small → unreliable evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you allocate too much to testing:&lt;/strong&gt; the model doesn’t see enough data to learn well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters&lt;/strong&gt;: Prevents &lt;strong&gt;overfitting&lt;/strong&gt; and gives realistic performance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;u&gt;Handling Missing Values&lt;/u&gt;&lt;br&gt;
&lt;strong&gt;Problem&lt;/strong&gt;: ML models cannot work with &lt;code&gt;NaN&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common strategies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove rows/columns (small datasets → risky)&lt;/li&gt;
&lt;li&gt;Replace with:

&lt;ul&gt;
&lt;li&gt;Mean / Median (numerical)&lt;/li&gt;
&lt;li&gt;Mode (categorical)
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; df.fillna(df.mean())
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;median&lt;/strong&gt; if data has outliers&lt;/li&gt;
&lt;li&gt;Never fill test data using test statistics (data leakage!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What is an outlier?&lt;/strong&gt;&lt;br&gt;
 A value that is &lt;strong&gt;very different&lt;/strong&gt; from the rest.&lt;br&gt;
 Example:&lt;br&gt;
 Salaries in a company:&lt;br&gt;
 &lt;code&gt;[45k, 48k, 50k, 52k, 49k, 2000k]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That &lt;strong&gt;2000k (2 million)&lt;/strong&gt; salary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skews the average&lt;/li&gt;
&lt;li&gt;Confuses the model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why it’s bad?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean salary becomes unrealistic&lt;/li&gt;
&lt;li&gt;Model learns wrong patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What we do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove it&lt;/li&gt;
&lt;li&gt;Cap it&lt;/li&gt;
&lt;li&gt;Use median instead of mean&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;u&gt;Encoding Categorical Variables&lt;/u&gt;&lt;br&gt;
&lt;strong&gt;Problem&lt;/strong&gt;: ML models only understand numbers, not text.&lt;br&gt;
&lt;strong&gt;Types:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Label Encoding&lt;/strong&gt; → ordered categories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-Hot Encoding&lt;/strong&gt; → unordered categories (most common)
&lt;code&gt;pd.get_dummies(df["City"])&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;City = ["Delhi", "New York City", "Delhi"]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;Wrong way:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Delhi = 1
New York City = 2
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;Model thinks New York City &amp;gt; Delhi ❌ (no meaning!)&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Correct way: One-Hot Encoding&lt;/strong&gt;&lt;br&gt;
Create separate columns:&lt;br&gt;
&lt;strong&gt;[Delhi]&lt;/strong&gt; = [1, 0, 1]&lt;br&gt;
&lt;strong&gt;[New York City]&lt;/strong&gt; = [0, 1, 0]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No false ordering&lt;/li&gt;
&lt;li&gt;Model understands categories correctly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key idea:&lt;/strong&gt;&lt;br&gt;
Never give false numeric meaning to categories.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;u&gt;Feature Scaling&lt;/u&gt;&lt;br&gt;
&lt;strong&gt;Problem&lt;/strong&gt;: Different features come in very different ranges.&lt;br&gt;
Example 1:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Age → 0–100&lt;/li&gt;
&lt;li&gt;Salary → 0–100000&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This breaks distance-based models (KNN, SVM).&lt;/p&gt;

&lt;p&gt;Example 2:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Age → 18–60&lt;/li&gt;
&lt;li&gt;Salary → 20,000–200,000&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Model pays more attention to Salary just because numbers are bigger ❌&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Solution: Scaling&lt;/strong&gt;&lt;br&gt;
Bring all values to similar ranges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two common methods:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standardization&lt;/strong&gt; → most used&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalization&lt;/strong&gt; → 0 to 1 range&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔹 Standardization (most used) = &lt;code&gt;(x − mean) / std&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   from sklearn.preprocessing import StandardScaler
   scaler = StandardScaler()
   X_scaled = scaler.fit_transform(X)
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;🔹 Normalization = &lt;code&gt;(x − min) / (max − min)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fit scaler on &lt;strong&gt;training data only&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Apply the same transformation to test data&lt;/li&gt;
&lt;/ul&gt;
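&lt;p&gt;Normalization can be sketched the same way, assuming the same &lt;code&gt;X&lt;/code&gt; as the standardization example and scikit-learn's &lt;code&gt;MinMaxScaler&lt;/code&gt;:&lt;/p&gt;

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([
    [20, 30000],
    [30, 50000],
    [40, 70000],
], dtype=float)

# Normalization: (x - min) / (max - min), per column, giving values in [0, 1]
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)
print(X_norm)
```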


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;u&gt;Feature Selection&lt;/u&gt;&lt;br&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Keep only useful features and remove useless ones.&lt;br&gt;
 Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduces noise&lt;/li&gt;
&lt;li&gt;Improves performance&lt;/li&gt;
&lt;li&gt;Avoids overfitting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove constant columns&lt;/li&gt;
&lt;li&gt;Remove highly correlated features&lt;/li&gt;
&lt;li&gt;Domain knowledge–based selection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
 Predicting house price:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Size&lt;/li&gt;
&lt;li&gt;✅ Location&lt;/li&gt;
&lt;li&gt;❌ Owner name&lt;/li&gt;
&lt;li&gt;❌ Phone number&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less noise&lt;/li&gt;
&lt;li&gt;Faster training&lt;/li&gt;
&lt;li&gt;Better accuracy&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;u&gt;Outlier Handling&lt;/u&gt;&lt;br&gt;
&lt;strong&gt;Outliers&lt;/strong&gt; distort learning.&lt;br&gt;
Common approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove extreme values&lt;/li&gt;
&lt;li&gt;Cap values (winsorization)&lt;/li&gt;
&lt;li&gt;Use robust scalers&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Models like tree-based algorithms are less sensitive.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Outliers are not always wrong!&lt;/strong&gt;&lt;br&gt;
Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Billionaires exist&lt;/li&gt;
&lt;li&gt;Olympic athletes exist&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Options:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove (if error)&lt;/li&gt;
&lt;li&gt;Cap (limit max/min)&lt;/li&gt;
&lt;li&gt;Keep (if meaningful)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Models affected:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linear models → very sensitive&lt;/li&gt;
&lt;li&gt;Tree models → less sensitive&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;u&gt;Data Leakage (CRITICAL CONCEPT)&lt;/u&gt;&lt;br&gt;
&lt;strong&gt;What is it?&lt;/strong&gt;&lt;br&gt;
 Using information during training that wouldn’t be available in real life, i.e. letting &lt;strong&gt;future or test information&lt;/strong&gt; leak into training.&lt;br&gt;
🚫 Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scaling before train-test split&lt;/li&gt;
&lt;li&gt;Filling missing values using entire dataset&lt;/li&gt;
&lt;li&gt;Using future data to predict past&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt;  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All preprocessing decisions must be learned from &lt;strong&gt;training data only&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;❌ Bad example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scaling entire dataset before split&lt;/li&gt;
&lt;li&gt;Finding mean using full data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Model secretly sees test data ❌&lt;/p&gt;

&lt;p&gt;✅ Correct way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Split data&lt;/li&gt;
&lt;li&gt;Learn statistics from training&lt;/li&gt;
&lt;li&gt;Apply to test&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Final Mental Model (Remember this):&lt;br&gt;
&lt;code&gt;Clean data → Fair split → Honest training → Reliable model&lt;/code&gt;&lt;/p&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Typical ML Preprocessing Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7p8lc1qsc5296tequt76.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7p8lc1qsc5296tequt76.png" alt="ML Preprocessing Pipeline" width="641" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To summarize:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Data preprocessing is where ML models are made or broken — it’s more important than the algorithm itself.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>100daysofcode</category>
      <category>mlbasics</category>
      <category>datapreprocessing</category>
    </item>
    <item>
      <title>The next basic concept of Machine Learning after NumPy: Pandas</title>
      <dc:creator>Juhi Kushwah</dc:creator>
      <pubDate>Mon, 05 Jan 2026 07:46:20 +0000</pubDate>
      <link>https://dev.to/juhikushwah/the-next-basic-concept-of-machine-learning-after-numpy-pandas-4h4a</link>
      <guid>https://dev.to/juhikushwah/the-next-basic-concept-of-machine-learning-after-numpy-pandas-4h4a</guid>
      <description>&lt;p&gt;&lt;em&gt;The emphasis on NumPy in the heading, despite this post focusing on the Pandas library, reflects my intent to document my iterative learning journey on this platform as part of the #100DaysOfCode challenge. Additional information on NumPy can be found here:&lt;/em&gt; &lt;a href="https://dev.to/juhikushwah/understanding-numpy-in-the-context-of-python-for-machine-learning-43i7"&gt;Understanding NumPy in the context of Python for Machine Learning&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After &lt;strong&gt;NumPy&lt;/strong&gt;, the next basic concept for Machine Learning is &lt;strong&gt;Pandas&lt;/strong&gt;, followed closely by &lt;strong&gt;data preprocessing concepts&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let me explain this as a clear learning path, not just a list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Recap NumPy&lt;/u&gt;&lt;/strong&gt;&lt;br&gt;
We learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Arrays &amp;amp; matrices&lt;/li&gt;
&lt;li&gt;Vectorized operations&lt;/li&gt;
&lt;li&gt;Basic linear algebra&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the &lt;strong&gt;math engine&lt;/strong&gt; of ML.&lt;/p&gt;
&lt;h2&gt;
  
  
  Next Core Concept: Pandas
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is Pandas?&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Pandas&lt;/strong&gt; is a Python library for &lt;strong&gt;data handling and analysis&lt;/strong&gt;.&lt;br&gt;
While NumPy handles numbers, &lt;strong&gt;Pandas handles real-world datasets&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In ML, &lt;strong&gt;most of your time (~70%) is spent on data&lt;/strong&gt;, not modeling.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Why Pandas Comes Next in ML?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;u&gt;Real ML data is messy&lt;/u&gt;&lt;br&gt;
Datasets usually come as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CSV / Excel / JSON files&lt;/li&gt;
&lt;li&gt;Missing values&lt;/li&gt;
&lt;li&gt;Mixed data types (numbers + text)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pandas makes this easier:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
df = pd.read_csv("data.csv")
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;u&gt;Data cleaning &amp;amp; preprocessing (CRUCIAL for ML)&lt;/u&gt;&lt;br&gt;
This is where ML actually begins.&lt;br&gt;
Common tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handling missing values&lt;/li&gt;
&lt;li&gt;Encoding categorical variables&lt;/li&gt;
&lt;li&gt;Feature selection&lt;/li&gt;
&lt;li&gt;Filtering rows/columns
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; df.isnull()
 df.dropna()
 df.fillna(df.mean())
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;u&gt;Bridge between raw data and ML models&lt;/u&gt;&lt;br&gt;
ML libraries (scikit-learn) expect NumPy arrays.&lt;/p&gt;

&lt;p&gt;Pandas makes conversion seamless:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt; X = df[['Age', 'Salary']].values
 y = df['Purchased'].values
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;u&gt;Tabular data representation (DataFrames)&lt;/u&gt;&lt;br&gt;
Pandas introduces the DataFrame (like an Excel table):  &lt;/p&gt;&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kgazkh4tleshl6ljz4j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kgazkh4tleshl6ljz4j.png" alt="Sample Excel data" width="645" height="117"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     df.head()
     df.columns
     df.shape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
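&lt;p&gt;A DataFrame can also be built directly from a Python dict; a minimal sketch (the column names are illustrative):&lt;/p&gt;

```python
import pandas as pd

# A DataFrame is a labeled table: columns have names, rows have an index
df = pd.DataFrame({
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000],
})

print(df.shape)          # (3, 2)
print(list(df.columns))  # ['Age', 'Salary']
```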



&lt;p&gt;&lt;strong&gt;One-line takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;After NumPy, learn Pandas — because Machine Learning starts with data, not models.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>100daysofcode</category>
      <category>mlbasics</category>
      <category>pandas</category>
    </item>
    <item>
      <title>Understanding NumPy in the context of Python for Machine Learning</title>
      <dc:creator>Juhi Kushwah</dc:creator>
      <pubDate>Sun, 04 Jan 2026 07:53:10 +0000</pubDate>
      <link>https://dev.to/juhikushwah/understanding-numpy-in-the-context-of-python-for-machine-learning-43i7</link>
      <guid>https://dev.to/juhikushwah/understanding-numpy-in-the-context-of-python-for-machine-learning-43i7</guid>
      <description>&lt;p&gt;&lt;em&gt;As a coder, I’ve always felt there’s a lot of chaos around AI and ML, even among those who use these abbreviations interchangeably while understanding them conceptually. I’m restarting my journey in the field of machine learning and plan to log my learning as part of the #100DaysOfCode challenge on this platform. Please feel free to share your insights and correct me if needed.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is NumPy (Numerical Python)?
&lt;/h2&gt;

&lt;p&gt;NumPy is a core Python library used for &lt;strong&gt;fast numerical computing&lt;/strong&gt;. It provides a powerful object called the &lt;strong&gt;ndarray&lt;/strong&gt;, which is essentially a highly optimized array for mathematical operations.&lt;/p&gt;

&lt;p&gt;NumPy is foundational for ML: almost every ML library, such as &lt;em&gt;TensorFlow, PyTorch, scikit-learn, and Pandas&lt;/em&gt;, depends on it under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is NumPy essential for Machine Learning?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Efficient numerical operations&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NumPy is much faster than plain Python lists for numerical work.&lt;/li&gt;
&lt;li&gt;Vectorized operations are supported (performing operations on entire arrays at once)&lt;/li&gt;
&lt;li&gt;Example:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; import numpy as np
 a = np.array([1, 2, 3])
 b = np.array([4, 5, 6])
 a + b           # element-wise sum
 a * b           # element-wise multiplication
 np.dot(a, b)    # dot product (matrix multiplication for 2-D arrays)
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Powerful support for Linear Algebra&lt;br&gt;
ML algorithms rely heavily on operations like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Matrix multiplication&lt;/li&gt;
&lt;li&gt;Matrix inverse&lt;/li&gt;
&lt;li&gt;Norms&lt;/li&gt;
&lt;li&gt;Eigenvalues&lt;/li&gt;
&lt;li&gt;Dot products&lt;/li&gt;
&lt;li&gt;NumPy provides fast implementations via functions such as:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; np.dot()
 np.linalg.inv()
 np.linalg.eig()
 np.linalg.norm()
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Foundation for data structures in ML&lt;br&gt;
Training data is usually represented as NumPy arrays:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Features matrix (X)&lt;/strong&gt;: shape = &lt;code&gt;(n_samples, n_features)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Labels vector (y)&lt;/strong&gt;: shape = &lt;code&gt;(n_samples, )&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Example:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; X = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2): 3 samples, 2 features
 y = np.array([0, 1, 0])                  # shape (3,): one label per sample
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Bridge between ML libraries&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Libraries like &lt;strong&gt;scikit-learn&lt;/strong&gt;, &lt;strong&gt;TensorFlow&lt;/strong&gt;, and &lt;strong&gt;Pandas&lt;/strong&gt; internally convert data to NumPy arrays.&lt;/li&gt;
&lt;li&gt;Example:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; import pandas as pd
 df = pd.read_csv("data.csv")
 X = df.values   # becomes a NumPy array (df.to_numpy() is the preferred modern call)
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Random number generation &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This is crucial in ML for &lt;strong&gt;weight initialization, shuffling data, and train/test splits&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Example:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; np.random.seed(42)
 weights = np.random.randn(3,3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;
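&lt;p&gt;&lt;em&gt;To make the linear-algebra points above concrete, here is a minimal, self-contained sketch using the &lt;code&gt;np.linalg&lt;/code&gt; functions listed earlier (the matrix values are made up for illustration):&lt;/em&gt;&lt;/p&gt;

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])            # a small invertible matrix
v = np.array([1.0, 2.0])

product = np.dot(A, v)                # matrix-vector product -> [4., 7.]
A_inv = np.linalg.inv(A)              # matrix inverse
eigvals, eigvecs = np.linalg.eig(A)   # eigenvalues and eigenvectors
norm_v = np.linalg.norm(v)            # Euclidean norm of v -> sqrt(5)

# A @ A_inv should recover the identity (up to floating-point error)
print(np.allclose(A @ A_inv, np.eye(2)))  # True
```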

&lt;h2&gt;
  
  
  Where do you use NumPy in ML?
&lt;/h2&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;How NumPy helps&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data preprocessing&lt;/td&gt;
&lt;td&gt;Slicing, shaping, normalization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Implementing ML algorithms from scratch&lt;/td&gt;
&lt;td&gt;Vectorized math for speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Train-test splits&lt;/td&gt;
&lt;td&gt;Shuffling and indexing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model evaluation&lt;/td&gt;
&lt;td&gt;Vectorized loss calculation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
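&lt;p&gt;&lt;em&gt;As a small illustration of the train-test split task, here is one way to shuffle and split a toy dataset with pure NumPy (the data and the 80/20 ratio are made up for this sketch):&lt;/em&gt;&lt;/p&gt;

```python
import numpy as np

# Toy dataset: 10 samples, 2 features (values made up for illustration)
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

rng = np.random.default_rng(42)     # seeded generator for reproducibility
idx = rng.permutation(len(X))       # shuffled row indices

split = int(0.8 * len(X))           # 80/20 split
train_idx, test_idx = idx[:split], idx[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```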

&lt;p&gt;&lt;em&gt;To summarize, NumPy is the mathematical backbone of Python Machine Learning—providing fast arrays, linear algebra tools, random generators, and vectorized operations that all ML workflows rely on.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>100daysofcode</category>
      <category>mlbasics</category>
      <category>numpy</category>
    </item>
  </channel>
</rss>
