DEV Community

Cover image for Pandas for Data Cleaning: A Practical Guide for Beginners
joseph mwangi
joseph mwangi

Posted on

Pandas for Data Cleaning: A Practical Guide for Beginners

If you've just started your journey in data analytics, this guide walks you through how to use Pandas, Python's most popular data manipulation library, to clean real-world datasets. Every example here comes from projects I've actually worked on, including a Kenyan hospital operations dataset and a Nairobi housing statistics dataset.

What is Pandas?

Pandas is an open-source Python library built for data manipulation and analysis. Think of it as a supercharged Excel, but inside Python; you can load, filter, clean, transform, and summarize data all in code.
It gives you two core data structures:

  • Series a single column of data:
import pandas as pd

ages = pd.Series([23, 25, 30, 28])
print(ages)
Enter fullscreen mode Exit fullscreen mode
  • DataFrame a full table with rows and columns:
students = {
    "Name": ["Brian", "Caren", "Daisy"],
    "Course": ["Data Analytics", "Data Engineering", "Data Science"],
    "Age": [20, 23, 24]
}

df = pd.DataFrame(students)
df
Enter fullscreen mode Exit fullscreen mode

In real projects, you'll almost always be working with DataFrames.

1. Loading Your Data

The first step in any project is getting your data into Pandas.

From a CSV file

df = pd.read_csv("patients.csv")
Enter fullscreen mode Exit fullscreen mode

From an Excel file (single or multiple sheets)

# Single sheet
df = pd.read_excel("hospital_data.xlsx")

# Specific sheet by name
df = pd.read_excel("hospital_operations_powerbi_sql_dataset.xlsx", sheet_name="Patients")
Enter fullscreen mode Exit fullscreen mode

Loading multiple sheets at once

In my hospital operations project, I needed to load nine sheets from a single Excel workbook. Instead of repeating read_excel nine times, I used a loop:

import pandas as pd

patients     = pd.read_excel("hospital_operations_powerbi_sql_dataset.xlsx", sheet_name="Patients")
doctors      = pd.read_excel("hospital_operations_powerbi_sql_dataset.xlsx", sheet_name="Doctors")
appointments = pd.read_excel("hospital_operations_powerbi_sql_dataset.xlsx", sheet_name="Appointments")
billing      = pd.read_excel("hospital_operations_powerbi_sql_dataset.xlsx", sheet_name="Billing")
# ... and so on
Enter fullscreen mode Exit fullscreen mode

Then I exported all of them to CSV in one go:

csv_tables = {
    "patients": patients,
    "doctors": doctors,
    "appointments": appointments,
    "billing": billing,
    # add the rest...
}

for name, data in csv_tables.items():
    data.to_csv(f"{name}.csv", index=False)
Enter fullscreen mode Exit fullscreen mode

From JSON or SQL

# JSON
df = pd.read_json("data.json")

# SQL (using SQLAlchemy)
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://username:password@host:port/dbname")
df = pd.read_sql("SELECT * FROM patients", con=engine)
Enter fullscreen mode Exit fullscreen mode

2. Exploring Your Data First

Before cleaning anything, understand what you're working with.

df.head()        # First 5 rows
df.tail()        # Last 5 rows
df.shape         # (rows, columns)
df.columns       # Column names
df.info()        # Data types + null counts
df.describe()    # Summary stats for numbers
df.describe(include='object')  # Summary stats for text columns
Enter fullscreen mode Exit fullscreen mode

Note this df.info() is your best friend when you first open a dataset, it tells you data types AND which columns have missing values at a glance.

3. Handling Missing Values

Missing values are one of the most common data quality problems. Here's how to approach them systematically.

Check what's missing

df.isnull().sum()
# or
df.isna().sum()
Enter fullscreen mode Exit fullscreen mode

Decision guide based on how much is missing

% Missing Recommended Action
< 5% Drop the rows
5% – 40% Fill with median (numbers) or mode (categories)
> 40% Consider dropping the column

To calculate the percentage of missing cells in each row using pandas, use the code


df.isna().mean(axis=1) * 100

Enter fullscreen mode Exit fullscreen mode

Fill missing values

# Numerical-fill with median
df["Age"] = df["Age"].fillna(df["Age"].median())

# Numerical-fill with mean
mean_charge = df["service_charge_kes"].mean()
df["service_charge_kes"] = df["service_charge_kes"].fillna(mean_charge)

# Categorical-fill with a default
df["Gender"] = df["Gender"].fillna("Unknown")

# Fill with zero (e.g., a score that's simply missing)
df["satisfaction_score"] = df["satisfaction_score"].fillna(0)
Enter fullscreen mode Exit fullscreen mode

Drop missing values

# Drop rows where any value is missing
df = df.dropna()

# Drop rows with missing values in specific columns only
df = df.dropna(subset=["service_charge_kes"])
df = df.dropna(subset=["service_charge_kes", "age_of_building_years"])
Enter fullscreen mode Exit fullscreen mode

4. Removing Duplicates

Duplicate records skew your analysis and inflate counts.

# Check how many duplicates exist
df.duplicated().sum()

# Remove them
df = df.drop_duplicates()
Enter fullscreen mode Exit fullscreen mode

In healthcare datasets, duplicate patient records can cause serious reporting errors, so this step is non-negotiable.

5. Fixing Data Types

Pandas sometimes reads columns as the wrong type — for example, dates loaded as strings or numbers loaded as objects.

# Check all data types
df.dtypes
Enter fullscreen mode Exit fullscreen mode

Convert dates

df["Column_date"] = pd.to_datetime(df["Column_date"])
Enter fullscreen mode Exit fullscreen mode

If your data has messy or inconsistent date formats, use errors='coerce' to turn bad values into NaT instead of crashing your script:

df["order_date"] = pd.to_datetime(df["order_date"], errors='coerce')
Enter fullscreen mode Exit fullscreen mode

What does errors=['coerce','raise','ignore'] mean?

  • errors='coerce':Invalid values become NaT/NaN (safe)
  • errors='raise': Script stops immediately on bad data (default)
  • errors='ignore':Bad data is left unchanged

Convert multiple date columns at once

date_columns = ["order_date", "delivery_date", "last_login", "subscription_start"]
#what i did here is put 
#the date columns in a list, then
#apply for loop
for col in date_columns:
    df[col] = pd.to_datetime(df[col], errors='coerce')
Enter fullscreen mode Exit fullscreen mode

Convert to numeric

df["Age"] = pd.to_numeric(df["Age"])
Enter fullscreen mode Exit fullscreen mode

6. Cleaning Text Data

Text columns are messy — inconsistent casing, extra spaces, and different formats are all common.

# Standardize casing
df["County"] = df["County"].str.upper()
df["property_type"] = df["property_type"].str.lower()
df["property_type"] = df["property_type"].str.capitalize()

# Remove leading/trailing spaces
df["estate"] = df["estate"].str.strip()

# Replace values
df["Phone"] = df["Phone"].str.replace("254", "0")

# Check if a column contains a keyword
df[df["estate"].str.contains("Kasarani")]
Enter fullscreen mode Exit fullscreen mode

7. Renaming Columns

Clean, consistent column names make your code easier to read.

df = df.rename(columns={
    "Full Name": "full_name",
    "Patient ID": "patient_id",
    "Phone Number": "phone_number"
})
Enter fullscreen mode Exit fullscreen mode

8. Filtering Data

Pandas makes it easy to slice and dice your data to focus on what matters.

Selecting columns

# One column
estates = df["estate"]

# Multiple columns
subset = df[["distance_to_cbd_km", "age_of_building_years", "security_rating"]]
Enter fullscreen mode Exit fullscreen mode

Selecting rows with loc and iloc

# loc — by label/index value
df.loc[[30, 50, 70]]            # Specific rows
df.loc[30, "furnishing"]        # Specific cell

# iloc — by integer position
df.iloc[0]                      # First row
df.iloc[[0, 1, 2]]              # First three rows
Enter fullscreen mode Exit fullscreen mode

Filtering by condition

# Single condition
df[df["Age"] > 60]
df[df["County"] == "NAIROBI"]

# Multiple conditions
df[(df["Age"] > 18) & (df["Gender"] == "Female")]
df[df["bedrooms"] > 2]
Enter fullscreen mode Exit fullscreen mode

9. Sorting Data

# Ascending (smallest to largest)
df.sort_values("bedrooms", ascending=True)

# Descending (largest to smallest)
df.sort_values("parking_slots", ascending=False)
Enter fullscreen mode Exit fullscreen mode

10. Working with Dates

Once your dates are properly formatted, you can extract useful features from them.

Extract year, month, and day

df["order_year"]  = df["order_date"].dt.year
df["order_month"] = df["order_date"].dt.month_name()
df["order_day"]   = df["order_date"].dt.day_name()
Enter fullscreen mode Exit fullscreen mode

Extract quarter

df["order_quarter"] = df["order_date"].dt.quarter
Enter fullscreen mode Exit fullscreen mode

Extract from multiple columns at once

date_cols = ["order_date", "delivery_date", "last_login", "subscription_start"]

for col in date_cols:
    df[f"{col}_month"] = df[col].dt.month
    df[f"{col}_day"]   = df[col].dt.day_name()
Enter fullscreen mode Exit fullscreen mode

Calculate days between dates

df["delivery_days"] = (df["delivery_date"] - df["order_date"]).dt.days
Enter fullscreen mode Exit fullscreen mode

11. Aggregating and Grouping Data

After cleaning, you can summarize your data to generate insights.

# Count patients per county
df.groupby("County")["Patient_ID"].count()

# Average age by gender
df.groupby("Gender")["Age"].mean()

# Median rent by furnishing type
df.groupby("furnishing")["monthly_rent_kes"].median()

# Median rent by property type
df.groupby("property_type")["monthly_rent_kes"].median()

# Quick frequency count
df["furnishing"].value_counts()
df["estate"].value_counts()
Enter fullscreen mode Exit fullscreen mode

12. Feature Engineering

Feature engineering is creating new columns from existing data to unlock better insights.

Age categories

df["Age_Group"] = pd.cut(
    df["Age"],
    bins=[0, 18, 35, 60, 100],
    labels=["Child", "Youth", "Adult", "Senior"]
)
Enter fullscreen mode Exit fullscreen mode

Delivery time

df["delivery_days"] = (df["delivery_date"] - df["order_date"]).dt.days
Enter fullscreen mode Exit fullscreen mode

13. Joining / Merging Datasets

Real-world data is rarely in one table. Pandas lets you join datasets like SQL.

students = pd.DataFrame({
    "student_id": [101, 102, 103, 104, 105],
    "student_name": ["Brian", "Caren", "Daisy", "Eric", "Faith"],
    "course": ["Data Analytics", "Data Engineering", "Data Science", "Data Analytics", "Data Engineering"]
})

payments = pd.DataFrame({
    "student_id": [101, 102, 104, 106, 107],
    "amount_paid": [7500, 15000, 7500, 30000, 10500],
    "payment_status": ["Partial", "Partial", "Partial", "Full", "Partial"]
})

# Inner join — only matching records
inner = pd.merge(students, payments, on="student_id", how="inner")

# Left join — all students, matched payments where available
left = pd.merge(students, payments, on="student_id", how="left")

# Right join — all payments, matched students where available
right = pd.merge(students, payments, on="student_id", how="right")
Enter fullscreen mode Exit fullscreen mode

14. Quick Visualization

Pandas integrates directly with Matplotlib for fast plots.

import matplotlib.pyplot as plt

# Bar chart-patients per county
df["County"].value_counts().plot(kind="bar", title="Patients by County")
plt.show()


# Histogram - age distribution
df["Age"].plot(kind="hist", bins=10, title="Age Distribution")
plt.show()

# Line plot-monthly sales
df.groupby("Month")["Sales"].sum().plot(title="Monthly Sales")
plt.show()
Enter fullscreen mode Exit fullscreen mode

Common Python Errors to Know

When you're just starting out, you'll hit these three types of errors:

SyntaxError — Python can't even read your code. Usually a missing colon, bracket, or indentation issue. The script won't run at all.

RuntimeError — Your code is valid Python, but hits an impossible instruction at runtime, like dividing by zero (ZeroDivisionError) or using a variable that doesn't exist (NameError).

LogicalError — The script runs without crashing, but gives the wrong answer because of a flaw in your logic. These are the hardest to spot:

# Intending to get the average of 4 and 6. Expected: 5
average = 4 + 6 / 2  # Returns 7.0, not 5.0 — order of operations!

# Correct way:
average = (4 + 6) / 2  # Returns 5.0
Enter fullscreen mode Exit fullscreen mode

Full Project References

The code in this article is drawn from real datasets I worked with:

  • Hospital Operations Dataset — Patients, MaritalStatus, Sex, NutritionalStatus, and more. View on GitHub
  • Nairobi Housing Statistics Dataset — Property types, estates, rent prices, and features across Nairobi. View on GitHub

Wrapping Up

Data cleaning is the unglamorous but essential foundation of every good data project. Here's a checklist to keep handy:

  • Load and inspect your data (head, info, describe)
  • Check and handle missing values
  • Remove duplicates
  • Fix data types (especially dates)
  • Clean text columns (casing, spaces, formatting)
  • Filter, sort, and aggregate as needed
  • Engineer new features where useful
  • Join datasets when needed

Pandas makes all of this doable in a few lines of code. The more projects you work on, the more intuitive it becomes.

If you found this useful, feel free to drop a comment or connect happy to answer questions as you work through your first Pandas projects.

Top comments (0)