joseph mwangi

Posted on Jun 14

Pandas for Data Cleaning: A Practical Guide for Beginners

#python #datascience #beginners #tutorial

If you've just started your journey in data analytics, this guide walks you through how to use Pandas, Python's most popular data manipulation library, to clean real-world datasets. Every example here comes from projects I've actually worked on, including a Kenyan hospital operations dataset and a Nairobi housing statistics dataset.

What is Pandas?

Pandas is an open-source Python library built for data manipulation and analysis. Think of it as a supercharged Excel, but inside Python; you can load, filter, clean, transform, and summarize data all in code.
It gives you two core data structures:

Series a single column of data:

import pandas as pd

ages = pd.Series([23, 25, 30, 28])
print(ages)

DataFrame a full table with rows and columns:

students = {
    "Name": ["Brian", "Caren", "Daisy"],
    "Course": ["Data Analytics", "Data Engineering", "Data Science"],
    "Age": [20, 23, 24]
}

df = pd.DataFrame(students)
df

In real projects, you'll almost always be working with DataFrames.

1. Loading Your Data

The first step in any project is getting your data into Pandas.

From a CSV file

df = pd.read_csv("patients.csv")

From an Excel file (single or multiple sheets)

# Single sheet
df = pd.read_excel("hospital_data.xlsx")

# Specific sheet by name
df = pd.read_excel("hospital_operations_powerbi_sql_dataset.xlsx", sheet_name="Patients")

Loading multiple sheets at once

In my hospital operations project, I needed to load nine sheets from a single Excel workbook. Instead of repeating read_excel nine times, I used a loop:

import pandas as pd

patients     = pd.read_excel("hospital_operations_powerbi_sql_dataset.xlsx", sheet_name="Patients")
doctors      = pd.read_excel("hospital_operations_powerbi_sql_dataset.xlsx", sheet_name="Doctors")
appointments = pd.read_excel("hospital_operations_powerbi_sql_dataset.xlsx", sheet_name="Appointments")
billing      = pd.read_excel("hospital_operations_powerbi_sql_dataset.xlsx", sheet_name="Billing")
# ... and so on

Then I exported all of them to CSV in one go:

csv_tables = {
    "patients": patients,
    "doctors": doctors,
    "appointments": appointments,
    "billing": billing,
    # add the rest...
}

for name, data in csv_tables.items():
    data.to_csv(f"{name}.csv", index=False)

From JSON or SQL

# JSON
df = pd.read_json("data.json")

# SQL (using SQLAlchemy)
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://username:password@host:port/dbname")
df = pd.read_sql("SELECT * FROM patients", con=engine)

2. Exploring Your Data First

Before cleaning anything, understand what you're working with.

df.head()        # First 5 rows
df.tail()        # Last 5 rows
df.shape         # (rows, columns)
df.columns       # Column names
df.info()        # Data types + null counts
df.describe()    # Summary stats for numbers
df.describe(include='object')  # Summary stats for text columns

Note this df.info() is your best friend when you first open a dataset, it tells you data types AND which columns have missing values at a glance.

3. Handling Missing Values

Missing values are one of the most common data quality problems. Here's how to approach them systematically.

Check what's missing

df.isnull().sum()
# or
df.isna().sum()

Decision guide based on how much is missing

% Missing	Recommended Action
< 5%	Drop the rows
5% – 40%	Fill with median (numbers) or mode (categories)
> 40%	Consider dropping the column

To calculate the percentage of missing cells in each row using pandas, use the code


df.isna().mean(axis=1) * 100

Fill missing values

# Numerical-fill with median
df["Age"] = df["Age"].fillna(df["Age"].median())

# Numerical-fill with mean
mean_charge = df["service_charge_kes"].mean()
df["service_charge_kes"] = df["service_charge_kes"].fillna(mean_charge)

# Categorical-fill with a default
df["Gender"] = df["Gender"].fillna("Unknown")

# Fill with zero (e.g., a score that's simply missing)
df["satisfaction_score"] = df["satisfaction_score"].fillna(0)

Drop missing values

# Drop rows where any value is missing
df = df.dropna()

# Drop rows with missing values in specific columns only
df = df.dropna(subset=["service_charge_kes"])
df = df.dropna(subset=["service_charge_kes", "age_of_building_years"])

4. Removing Duplicates

Duplicate records skew your analysis and inflate counts.

# Check how many duplicates exist
df.duplicated().sum()

# Remove them
df = df.drop_duplicates()

In healthcare datasets, duplicate patient records can cause serious reporting errors, so this step is non-negotiable.

5. Fixing Data Types

Pandas sometimes reads columns as the wrong type — for example, dates loaded as strings or numbers loaded as objects.

# Check all data types
df.dtypes

Convert dates

df["Column_date"] = pd.to_datetime(df["Column_date"])

If your data has messy or inconsistent date formats, use errors='coerce' to turn bad values into NaT instead of crashing your script:

df["order_date"] = pd.to_datetime(df["order_date"], errors='coerce')

What does errors=['coerce','raise','ignore'] mean?

errors='coerce':Invalid values become NaT/NaN (safe)

errors='raise': Script stops immediately on bad data (default)

errors='ignore':Bad data is left unchanged

Convert multiple date columns at once

date_columns = ["order_date", "delivery_date", "last_login", "subscription_start"]
#what i did here is put 
#the date columns in a list, then
#apply for loop
for col in date_columns:
    df[col] = pd.to_datetime(df[col], errors='coerce')

Convert to numeric

df["Age"] = pd.to_numeric(df["Age"])

6. Cleaning Text Data

Text columns are messy — inconsistent casing, extra spaces, and different formats are all common.

# Standardize casing
df["County"] = df["County"].str.upper()
df["property_type"] = df["property_type"].str.lower()
df["property_type"] = df["property_type"].str.capitalize()

# Remove leading/trailing spaces
df["estate"] = df["estate"].str.strip()

# Replace values
df["Phone"] = df["Phone"].str.replace("254", "0")

# Check if a column contains a keyword
df[df["estate"].str.contains("Kasarani")]

7. Renaming Columns

Clean, consistent column names make your code easier to read.

df = df.rename(columns={
    "Full Name": "full_name",
    "Patient ID": "patient_id",
    "Phone Number": "phone_number"
})

8. Filtering Data

Pandas makes it easy to slice and dice your data to focus on what matters.

Selecting columns

# One column
estates = df["estate"]

# Multiple columns
subset = df[["distance_to_cbd_km", "age_of_building_years", "security_rating"]]

Selecting rows with `loc` and `iloc`

# loc — by label/index value
df.loc[[30, 50, 70]]            # Specific rows
df.loc[30, "furnishing"]        # Specific cell

# iloc — by integer position
df.iloc[0]                      # First row
df.iloc[[0, 1, 2]]              # First three rows

Filtering by condition

# Single condition
df[df["Age"] > 60]
df[df["County"] == "NAIROBI"]

# Multiple conditions
df[(df["Age"] > 18) & (df["Gender"] == "Female")]
df[df["bedrooms"] > 2]

9. Sorting Data

# Ascending (smallest to largest)
df.sort_values("bedrooms", ascending=True)

# Descending (largest to smallest)
df.sort_values("parking_slots", ascending=False)

10. Working with Dates

Once your dates are properly formatted, you can extract useful features from them.

Extract year, month, and day

df["order_year"]  = df["order_date"].dt.year
df["order_month"] = df["order_date"].dt.month_name()
df["order_day"]   = df["order_date"].dt.day_name()

Extract quarter

df["order_quarter"] = df["order_date"].dt.quarter

Extract from multiple columns at once

date_cols = ["order_date", "delivery_date", "last_login", "subscription_start"]

for col in date_cols:
    df[f"{col}_month"] = df[col].dt.month
    df[f"{col}_day"]   = df[col].dt.day_name()

Calculate days between dates

df["delivery_days"] = (df["delivery_date"] - df["order_date"]).dt.days

11. Aggregating and Grouping Data

After cleaning, you can summarize your data to generate insights.

# Count patients per county
df.groupby("County")["Patient_ID"].count()

# Average age by gender
df.groupby("Gender")["Age"].mean()

# Median rent by furnishing type
df.groupby("furnishing")["monthly_rent_kes"].median()

# Median rent by property type
df.groupby("property_type")["monthly_rent_kes"].median()

# Quick frequency count
df["furnishing"].value_counts()
df["estate"].value_counts()

12. Feature Engineering

Feature engineering is creating new columns from existing data to unlock better insights.

Age categories

df["Age_Group"] = pd.cut(
    df["Age"],
    bins=[0, 18, 35, 60, 100],
    labels=["Child", "Youth", "Adult", "Senior"]
)

Delivery time

df["delivery_days"] = (df["delivery_date"] - df["order_date"]).dt.days

13. Joining / Merging Datasets

Real-world data is rarely in one table. Pandas lets you join datasets like SQL.

students = pd.DataFrame({
    "student_id": [101, 102, 103, 104, 105],
    "student_name": ["Brian", "Caren", "Daisy", "Eric", "Faith"],
    "course": ["Data Analytics", "Data Engineering", "Data Science", "Data Analytics", "Data Engineering"]
})

payments = pd.DataFrame({
    "student_id": [101, 102, 104, 106, 107],
    "amount_paid": [7500, 15000, 7500, 30000, 10500],
    "payment_status": ["Partial", "Partial", "Partial", "Full", "Partial"]
})

# Inner join — only matching records
inner = pd.merge(students, payments, on="student_id", how="inner")

# Left join — all students, matched payments where available
left = pd.merge(students, payments, on="student_id", how="left")

# Right join — all payments, matched students where available
right = pd.merge(students, payments, on="student_id", how="right")

14. Quick Visualization

Pandas integrates directly with Matplotlib for fast plots.

import matplotlib.pyplot as plt

# Bar chart-patients per county
df["County"].value_counts().plot(kind="bar", title="Patients by County")
plt.show()


# Histogram - age distribution
df["Age"].plot(kind="hist", bins=10, title="Age Distribution")
plt.show()

# Line plot-monthly sales
df.groupby("Month")["Sales"].sum().plot(title="Monthly Sales")
plt.show()

Common Python Errors to Know

When you're just starting out, you'll hit these three types of errors:

SyntaxError — Python can't even read your code. Usually a missing colon, bracket, or indentation issue. The script won't run at all.

RuntimeError — Your code is valid Python, but hits an impossible instruction at runtime, like dividing by zero (ZeroDivisionError) or using a variable that doesn't exist (NameError).

LogicalError — The script runs without crashing, but gives the wrong answer because of a flaw in your logic. These are the hardest to spot:

# Intending to get the average of 4 and 6. Expected: 5
average = 4 + 6 / 2  # Returns 7.0, not 5.0 — order of operations!

# Correct way:
average = (4 + 6) / 2  # Returns 5.0

Full Project References

The code in this article is drawn from real datasets I worked with:

Hospital Operations Dataset — Patients, MaritalStatus, Sex, NutritionalStatus, and more. View on GitHub
Nairobi Housing Statistics Dataset — Property types, estates, rent prices, and features across Nairobi. View on GitHub

Wrapping Up

Data cleaning is the unglamorous but essential foundation of every good data project. Here's a checklist to keep handy:

Load and inspect your data (head, info, describe)
Check and handle missing values
Remove duplicates
Fix data types (especially dates)
Clean text columns (casing, spaces, formatting)
Filter, sort, and aggregate as needed
Engineer new features where useful
Join datasets when needed

Pandas makes all of this doable in a few lines of code. The more projects you work on, the more intuitive it becomes.

If you found this useful, feel free to drop a comment or connect happy to answer questions as you work through your first Pandas projects.