The Reality of Real‑World Data
Over the past few days, we have been working with perfect, pristine datasets. I built those datasets specifically so we could focus on learning methods like filter() and groupby() without any errors.
However, out in the real world, data is incredibly messy. Humans make typos when entering data, sensors go offline and miss readings, and database migrations often corrupt text. When Pandas encounters an empty cell in a CSV file, it fills it with a special marker called NaN (Not a Number).
If you try to run mathematical operations on a column containing NaNs, your analysis may crash or, even worse, silently return misleading results that could lead to terrible business decisions. Today, I am going to teach you how to identify and clean this messy data professionally.
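To see why NaN is so dangerous, here is a quick sketch (using NumPy's nan value directly, not one of this lesson's datasets). Note that plain arithmetic with NaN propagates, while Pandas aggregations skip NaN by default, which can quietly change what your numbers mean:

```python
import numpy as np
import pandas as pd

salaries = pd.Series([65000, np.nan, 72000])

# Plain arithmetic with NaN propagates: any sum involving NaN is NaN
print(65000 + np.nan)               # nan

# Pandas skips NaN by default, so this mean is computed over only
# 2 of the 3 rows -- easy to misread as a complete average
print(salaries.mean())              # 68500.0

# Force NaN to propagate instead of being skipped
print(salaries.mean(skipna=False))  # nan
```

Neither behavior is a "crash", which is exactly the problem: the wrong number looks just like a right one.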
Note: Because our interactive workspace acts just like a real Jupyter Notebook, we only need to load our data in the very first cell. The remaining cells will remember the variables!
Step 1: Diagnosing the Mess (Finding Missing Data)
Before you can clean a house, you need to know where the dirt is. You cannot manually scroll through 500,000 rows looking for empty cells. Instead, we use Pandas' diagnostic tools.
The combination of isnull() and sum() is the gold standard for diagnosing missing data. It will instantly tell you exactly how many missing values exist in every single column of your DataFrame.
import pandas as pd
import io
# Let's create a highly realistic, messy dataset
# Notice the blank spaces indicating missing data
messy_csv = """Emp_ID,Name,Department,Salary,Rating
101,Ali,Sales,65000,4.5
102,Sara,IT,,4.8
103,,Marketing,72000,3.9
104,Zoya,IT,88000,
105,Mike,,61000,4.1"""
df = pd.read_csv(io.StringIO(messy_csv))
# Diagnosing missing values
missing_data_report = df.isnull().sum()
print("--- Missing Values Report ---")
print(missing_data_report)
Look at that output. In less than a second, Pandas told us that we are missing 1 Name, 1 Department, 1 Salary, and 1 Rating. Now that we know where the problems are, we can fix them.
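A small extension of the same idea (not required for this lesson, but handy on large datasets): because isnull() returns a DataFrame of True/False values, you can chain mean() to get the *percentage* missing per column, or a second sum() to get the grand total of missing cells:

```python
import io
import pandas as pd

messy_csv = """Emp_ID,Name,Department,Salary,Rating
101,Ali,Sales,65000,4.5
102,Sara,IT,,4.8
103,,Marketing,72000,3.9
104,Zoya,IT,88000,
105,Mike,,61000,4.1"""
df = pd.read_csv(io.StringIO(messy_csv))

# mean() of booleans = fraction of True values; x100 gives a percentage
missing_pct = df.isnull().mean() * 100
print(missing_pct)             # 20% for Name, Department, Salary, Rating

# Grand total of missing cells across the entire DataFrame
print(df.isnull().sum().sum())  # 4
```

Percentages are often more actionable than raw counts: a column that is 2% missing and a column that is 80% missing call for very different cleaning strategies.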
Step 2: Dropping Missing Data (The Nuclear Option)
The easiest and fastest way to deal with missing data is simply to delete any row that contains an empty value. In Pandas, we do this using the dropna() method.
However, this is the 'nuclear option'. If you drop rows blindly, you might lose valuable data. For example, if a row is missing a 'Rating' but has the employee's 'Salary', dropping the entire row deletes that perfectly good salary data too. Let's see what happens to our df.
# Dropping any row that contains at least one NaN value
clean_df_dropped = df.dropna()
print("--- Data After dropna() ---")
print(clean_df_dropped)
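Notice how much data that nuked: only Ali's row survives, because every other row is missing exactly one value. If you want something gentler, dropna() accepts parameters that let you target the drop. A short sketch, assuming you only truly need Salary to be present (the thresh example is an extra option, not part of the lesson's dataset requirements):

```python
import io
import pandas as pd

messy_csv = """Emp_ID,Name,Department,Salary,Rating
101,Ali,Sales,65000,4.5
102,Sara,IT,,4.8
103,,Marketing,72000,3.9
104,Zoya,IT,88000,
105,Mike,,61000,4.1"""
df = pd.read_csv(io.StringIO(messy_csv))

# Only drop rows missing Salary; rows missing Name, Department,
# or Rating are kept, so Zoya's salary is not thrown away
salary_complete = df.dropna(subset=["Salary"])
print(len(salary_complete))   # 4 -- only Sara's row (102) is dropped

# Or require a minimum number of non-null values per row:
# thresh=4 keeps any row with at least 4 of the 5 columns filled,
# which here keeps all 5 rows since each is missing at most one value
mostly_complete = df.dropna(thresh=4)
print(len(mostly_complete))   # 5
```

The subset approach is usually the right compromise: drop rows only when the column your analysis actually depends on is empty.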