DEV Community

Juhi Kushwah

Beginner-friendly exercises on NumPy, Pandas and Data Preprocessing

Before diving deep into Machine Learning, I would like to share some tiny, beginner-friendly, code-based exercises on NumPy, Pandas and Data Preprocessing: small, focused and ML-oriented.

✅ NumPy Mini Exercises (Level: Very Easy)

Make sure you import NumPy first:

import numpy as np

Exercise 1 - Create a NumPy array

Create a NumPy array containing these numbers:
[2, 4, 6, 8]

Solution:

a = np.array([2, 4, 6, 8])

Exercise 2 - Create a 2D array
Create this 2×2 matrix:

1 2
3 4

Solution:

m = np.array([[1, 2],
              [3, 4]])

Exercise 3 - Array shape
Find the shape of this array:
a = np.array([[10, 20, 30], [40, 50, 60]])

Solution:

a = np.array([[10, 20, 30], [40, 50, 60]])
a.shape

Output:

(2, 3)

Exercise 4 - Element-wise operations
Given:

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

Compute:

  1. a + b
  2. a * b

Solution:
I. Addition

a + b

Output:

array([11, 22, 33])

II. Multiplication

a * b

Output:

array([10, 40, 90])

Exercise 5 - Slicing
Given:
a = np.array([5, 10, 15, 20, 25])

Extract the middle three values:
[10, 15, 20]

Solution:

a = np.array([5, 10, 15, 20, 25])
middle = a[1:4]

Output:

array([10, 15, 20])

Exercise 6 - Zeros and ones arrays
Create:

  1. A 3×3 matrix of zeros
  2. A 2×4 matrix of ones

Solution:
I. 3×3 matrix of zeros

np.zeros((3, 3))

II. 2×4 matrix of ones

np.ones((2, 4))

Exercise 7 - Random numbers
Generate a NumPy array of five random numbers between 0 and 1.

Solution:

r = np.random.rand(5)

Output (example only; your values will differ on each run):

array([0.23, 0.91, 0.49, 0.11, 0.76])
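
Because the numbers are random, every run prints something different. If you want repeatable results (handy when debugging ML code), one option is NumPy's Generator API with a fixed seed. This is only a small sketch; the seed value 42 is an arbitrary choice and not part of the exercise:

```python
import numpy as np

# A seeded generator: the same seed reproduces the same sequence
rng = np.random.default_rng(42)   # 42 is an arbitrary, illustrative seed
r = rng.random(5)                 # five floats in the half-open interval [0, 1)

# A second generator with the same seed produces identical values
r_again = np.random.default_rng(42).random(5)

print(r.shape)                      # (5,)
print(np.array_equal(r, r_again))   # True
```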

Exercise 8 - Matrix multiplication
Given:

A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])

Compute:
A @ B
(or np.dot(A, B))

Solution:

A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])

A @ B   # or np.dot(A, B)

Output:

array([[19, 22],
       [43, 50]])
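
It is easy to mix up `@` with `*`, which was used for element-wise multiplication in Exercise 4. The short comparison below uses the same A and B to show that the two operators give different results (the variable names `elementwise` and `matmul` are just illustrative):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

elementwise = A * B   # multiplies values at matching positions
matmul = A @ B        # rows of A combined with columns of B

print(elementwise)    # [[ 5 12]
                      #  [21 32]]
print(matmul)         # [[19 22]
                      #  [43 50]]
```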

Exercise 9 - Mean of an array
Compute the mean of:
x = np.array([4, 8, 12, 16])

Solution:

x = np.array([4, 8, 12, 16])
np.mean(x)

Output:

10.0

Exercise 10 - Reshape
Given:
x = np.array([1, 2, 3, 4, 5, 6])

Reshape it into a 2×3 matrix.

Solution:

x = np.array([1, 2, 3, 4, 5, 6])
x.reshape(2, 3)

Output:

array([[1, 2, 3],
       [4, 5, 6]])
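
A related trick that shows up a lot in ML code is passing -1 for one dimension, which asks NumPy to work out that size for you. The sketch below is an optional extra, not part of the exercise; `reshape(-1, 1)` in particular is a common way to turn a 1D array into the single-column 2D shape that scikit-learn estimators generally expect for features:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])

two_rows = x.reshape(2, -1)   # NumPy infers 3 columns from the 6 elements
column = x.reshape(-1, 1)     # one column, NumPy infers 6 rows

print(two_rows.shape)   # (2, 3)
print(column.shape)     # (6, 1)
```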

📊 Pandas Mini Exercises (Level: Very Easy)

Start with:

import pandas as pd

Exercise 1 - Create a DataFrame
Create a DataFrame from this dictionary:

data = {
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000]
}

Solution:

data = {
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000]
}

df = pd.DataFrame(data)
df

🧠 Explanation

  • Dictionary keys → column names
  • Lists → column values

Output:

    Age    Salary
0   25     50000
1   30     60000
2   35     70000

Exercise 2 - View data
Using the DataFrame from Exercise 1:

  1. Display the first 2 rows
  2. Display the column names
  3. Display the shape of the DataFrame

Solution:

df.head(2)
df.columns
df.shape

🧠 Explanation

  • head(2) → first 2 rows
  • columns → column names
  • shape → (rows, columns)

Output:

#first 2 rows
   Age  Salary  
0   25   50000
1   30   60000

#column names
Index(['Age', 'Salary'], dtype='object')

#3 rows, 2 columns
(3, 2)   

Exercise 3 - Select a column
Select only the Salary column.

Solution:

df["Salary"]

🧠 Explanation

  • Single brackets → returns a Series

Output:

#This is a Series, not a DataFrame

0    50000
1    60000
2    70000
Name: Salary, dtype: int64

Exercise 4 - Select multiple columns
Select Age and Salary together.

Solution:

df[["Age", "Salary"]]

🧠 Explanation

  • Double brackets → returns a DataFrame

Output:

#Double brackets → DataFrame

   Age  Salary
0   25   50000
1   30   60000
2   35   70000
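
The single-versus-double bracket difference from Exercises 3 and 4 is easy to check for yourself. The sketch below selects the same column both ways and prints the resulting type and shape (the names `single` and `double` are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 35], "Salary": [50000, 60000, 70000]})

single = df["Salary"]     # single brackets: a Series
double = df[["Salary"]]   # double brackets: a DataFrame with one column

print(type(single).__name__, single.shape)   # Series (3,)
print(type(double).__name__, double.shape)   # DataFrame (3, 1)
```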

Exercise 5 - Filter rows
From the DataFrame, select rows where:

  • Age > 28

Solution:

df[df["Age"] > 28]

🧠 Explanation

  • Boolean condition filters rows; it is a core Pandas skill
  • Very common in data cleaning

Output:

   Age  Salary
1   30   60000
2   35   70000
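
Once a single condition feels comfortable, a natural next step is combining two of them. This is an optional sketch that goes slightly beyond the exercise; the thresholds are made up for illustration. Note that each condition needs its own parentheses, and that Pandas uses `&` and `|` rather than the Python keywords `and` and `or`:

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 35], "Salary": [50000, 60000, 70000]})

# Rows where BOTH conditions hold
both = df[(df["Age"] > 28) & (df["Salary"] < 70000)]

# Rows where AT LEAST ONE condition holds
either = df[(df["Age"] < 28) | (df["Salary"] > 65000)]

print(both)     # only the row with Age 30
print(either)   # the rows with Age 25 and Age 35
```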

Exercise 6 - Add a new column
Add a column called Tax which is 10% of Salary.

Solution:

df["Tax"] = 0.10 * df["Salary"]
df

🧠 Explanation

  • Pandas supports vectorized operations
  • Applied to entire column at once

Output:

# Operations apply row-wise automatically

   Age  Salary     Tax
0   25   50000  5000.0
1   30   60000  6000.0
2   35   70000  7000.0

Exercise 7 - Basic statistics
Compute:

  1. Mean Age
  2. Maximum Salary

Solution:

df["Age"].mean()
df["Salary"].max()

🧠 Explanation

  • Pandas has built-in descriptive stats
  • Used heavily during EDA

Output:

# Mean Age
30.0

#Maximum Salary
70000

Exercise 8 - Handle missing values
Given:

data = {
    "Age": [25, None, 35],
    "Salary": [50000, 60000, None]
}
df = pd.DataFrame(data)

Tasks:

  1. Detect missing values
  2. Fill missing values with the column mean

Solution:

df.isnull()
df_filled = df.fillna(df.mean())
df

🧠 Explanation

  • isnull() → detects missing values
  • fillna(df.mean()) → fills numeric NaNs with column mean

Output (df itself is unchanged, because fillna returns a new DataFrame):

    Age   Salary
0  25.0  50000.0
1   NaN  60000.0
2  35.0      NaN

Breaking this down here:
✅ Detect missing values

df.isnull()

Output:

     Age  Salary
0  False   False
1   True   False
2  False    True

✅ Fill missing values with mean

df_filled = df.fillna(df.mean())
df_filled

Output:

    Age   Salary
0  25.0  50000.0
1  30.0  60000.0
2  35.0  55000.0

🧠 Means used:

  • Age mean = 30
  • Salary mean = 55,000

Exercise 9 - Sort values
Sort the filled DataFrame from Exercise 8 (df_filled) by Salary in descending order.

Solution:

df_filled.sort_values(by="Salary", ascending=False)

🧠 Explanation

  • Sorting helps identify top/bottom values
  • Common during analysis

Output:

    Age   Salary
1  30.0  60000.0
2  35.0  55000.0
0  25.0  50000.0

Exercise 10 - Convert to NumPy (ML step)
Convert:

  • Features → Age, Salary
  • Target → Tax

into NumPy arrays.

Solution:

# df_filled (from Exercise 8) has no Tax column yet, so recreate it first
df_filled["Tax"] = 0.10 * df_filled["Salary"]

X = df_filled[["Age", "Salary"]].values
y = df_filled["Tax"].values

X

🧠 Explanation

  • .values converts Pandas → NumPy (newer Pandas code often uses .to_numpy(), which does the same job)
  • scikit-learn works with NumPy arrays, so this is a common last step before training

Output:

array([[2.5e+01, 5.0e+04],
       [3.0e+01, 6.0e+04],
       [3.5e+01, 5.5e+04]])

✅ Target

y

Output:

array([5000., 6000., 5500.])

🧠 This is the format ML models expect: a 2D feature array X and a 1D target array y

🧪 Data Preprocessing: Code-Based Mini Exercises

Start with:

import pandas as pd
import numpy as np

Exercise 1 - Train/Test Split (Ratio practice)
Given:

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([10, 20, 30, 40, 50])

👉 Split the data into 80% training and 20% testing.
Use:

from sklearn.model_selection import train_test_split
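
If you get stuck, here is one possible way to approach it, as a minimal sketch rather than the official solution. The `random_state=0` argument is an assumption added here so the shuffle is repeatable; the exercise itself does not ask for it:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([10, 20, 30, 40, 50])

# test_size=0.2 holds back 20% of the rows for testing;
# random_state=0 is an arbitrary seed that makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)   # (4, 1) (1, 1)
```

With only five rows, 20% works out to a single test row, so four rows remain for training.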

Exercise 2 - Detect missing values
Given:

data = {
    "Age": [25, 30, None, 40],
    "Salary": [50000, None, 70000, 80000]
}
df = pd.DataFrame(data)

👉 Write code to:

  1. Detect missing values
  2. Count missing values per column

Exercise 3 - Fill missing values (Mean)
Using the same DataFrame above:
👉 Fill missing values using column mean.
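
Exercises 2 and 3 reuse the calls from Pandas Exercise 8, so one possible approach looks like the sketch below. The variable names `missing_mask`, `missing_counts` and `df_filled` are illustrative choices, not names required by the exercise:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, None, 40],
    "Salary": [50000, None, 70000, 80000],
})

missing_mask = df.isnull()            # True wherever a value is missing
missing_counts = df.isnull().sum()    # number of missing values per column

# fillna returns a new DataFrame; df itself is left unchanged
df_filled = df.fillna(df.mean())

print(missing_counts)
print(df_filled)
```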

Exercise 4 - One-Hot Encoding (Categorical Data)
Given:

df = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai"]
})

👉 Convert City into numerical columns using one-hot encoding.
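
One way to do this in Pandas is `pd.get_dummies`, sketched below. Choosing `get_dummies` (rather than, say, scikit-learn's encoder) and passing `dtype=int` to get 0/1 values are assumptions made for this example:

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai"]
})

# One new column per distinct city; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(df, columns=["City"], dtype=int)

print(encoded)
```

Each row ends up with exactly one 1, in the column that matches its city.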

Exercise 5 - Feature Scaling (Standardization)
Given:

X = np.array([
    [20, 30000],
    [30, 50000],
    [40, 70000]
])

👉 Apply Standard Scaling to X.

Use:

from sklearn.preprocessing import StandardScaler
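
A minimal sketch of one possible answer is shown below. It scales the whole array in one go, which is fine for practice on a tiny array; Exercise 8 covers why real projects fit the scaler on training data only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([
    [20, 30000],
    [30, 50000],
    [40, 70000]
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # per column: (value - mean) / std

print(X_scaled)
print(X_scaled.mean(axis=0))   # approximately 0 for each column
print(X_scaled.std(axis=0))    # approximately 1 for each column
```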

Exercise 6 - Feature Selection
Given:

df = pd.DataFrame({
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000],
    "EmployeeID": [101, 102, 103]
})

👉 Remove the EmployeeID column.
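
One possible approach uses `DataFrame.drop`, as in the sketch below. The name `features` is an illustrative choice; you could just as well assign the result back to `df`:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000],
    "EmployeeID": [101, 102, 103]
})

# drop returns a new DataFrame without the listed columns
features = df.drop(columns=["EmployeeID"])

print(features.columns.tolist())   # ['Age', 'Salary']
```

An ID column is unique per row and carries no pattern a model could learn from, which is why it is usually removed before training.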

Exercise 7 - Outlier Detection (Simple logic)
Given:
ages = np.array([22, 23, 24, 25, 120])

👉 Write code to remove values greater than 100.
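
This can be done with the same Boolean filtering idea used on DataFrames earlier, applied directly to a NumPy array. A small sketch, with `cleaned` as an illustrative variable name:

```python
import numpy as np

ages = np.array([22, 23, 24, 25, 120])

# Keep only the values that are NOT greater than 100
cleaned = ages[ages <= 100]

print(cleaned)   # [22 23 24 25]
```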

Exercise 8 - Data Leakage Check (Thinking + Code)
Given:

from sklearn.preprocessing import StandardScaler

👉 Write the correct order of code to:

  1. Split data
  2. Fit scaler on training data
  3. Transform both training and test data

(No need to run it; just write the correct sequence.)
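
For reference, the three steps could be written roughly as follows. The X and y arrays here are hypothetical data made up for this sketch, and `random_state=0` is an arbitrary seed; only the order of the calls is the point:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data, invented for illustration only
X = np.array([[20, 30000], [30, 50000], [40, 70000], [50, 90000], [60, 110000]])
y = np.array([0, 0, 1, 1, 1])

# 1. Split first, so the test rows stay unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Fit the scaler on the training data only
scaler = StandardScaler()
scaler.fit(X_train)

# 3. Transform both sets using the training mean and std
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler before splitting would let the test rows influence the mean and standard deviation, which is the leakage this exercise is about.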

If you can do these exercises comfortably, you’re ML-ready at a foundational level.

I will be adding solutions for the Data Preprocessing exercises in a subsequent post. If you are new to Python, you can install Python 3.x and try these exercises in your IDE. I use Jupyter Notebook.

You can refer to these posts for understanding NumPy, Pandas and Data Preprocessing:
Understanding NumPy in the context of Python for Machine Learning
The next basic concept of Machine Learning after NumPy: Pandas
Understanding Data Preprocessing
