If you've just started your journey in data analytics, this guide walks you through how to use Pandas, Python's most popular data manipulation library, to clean real-world datasets. Every example here comes from projects I've actually worked on, including a Kenyan hospital operations dataset and a Nairobi housing statistics dataset.
What is Pandas?
Pandas is an open-source Python library built for data manipulation and analysis. Think of it as a supercharged Excel, but inside Python; you can load, filter, clean, transform, and summarize data all in code.
It gives you two core data structures:
- Series a single column of data:
import pandas as pd
ages = pd.Series([23, 25, 30, 28])
print(ages)
- DataFrame a full table with rows and columns:
students = {
"Name": ["Brian", "Caren", "Daisy"],
"Course": ["Data Analytics", "Data Engineering", "Data Science"],
"Age": [20, 23, 24]
}
df = pd.DataFrame(students)
df
In real projects, you'll almost always be working with DataFrames.
1. Loading Your Data
The first step in any project is getting your data into Pandas.
From a CSV file
df = pd.read_csv("patients.csv")
From an Excel file (single or multiple sheets)
# Single sheet
df = pd.read_excel("hospital_data.xlsx")
# Specific sheet by name
df = pd.read_excel("hospital_operations_powerbi_sql_dataset.xlsx", sheet_name="Patients")
Loading multiple sheets at once
In my hospital operations project, I needed to load nine sheets from a single Excel workbook. Instead of repeating read_excel nine times, I used a loop:
import pandas as pd
patients = pd.read_excel("hospital_operations_powerbi_sql_dataset.xlsx", sheet_name="Patients")
doctors = pd.read_excel("hospital_operations_powerbi_sql_dataset.xlsx", sheet_name="Doctors")
appointments = pd.read_excel("hospital_operations_powerbi_sql_dataset.xlsx", sheet_name="Appointments")
billing = pd.read_excel("hospital_operations_powerbi_sql_dataset.xlsx", sheet_name="Billing")
# ... and so on
Then I exported all of them to CSV in one go:
csv_tables = {
"patients": patients,
"doctors": doctors,
"appointments": appointments,
"billing": billing,
# add the rest...
}
for name, data in csv_tables.items():
data.to_csv(f"{name}.csv", index=False)
From JSON or SQL
# JSON
df = pd.read_json("data.json")
# SQL (using SQLAlchemy)
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://username:password@host:port/dbname")
df = pd.read_sql("SELECT * FROM patients", con=engine)
2. Exploring Your Data First
Before cleaning anything, understand what you're working with.
df.head() # First 5 rows
df.tail() # Last 5 rows
df.shape # (rows, columns)
df.columns # Column names
df.info() # Data types + null counts
df.describe() # Summary stats for numbers
df.describe(include='object') # Summary stats for text columns
Note this
df.info()is your best friend when you first open a dataset, it tells you data types AND which columns have missing values at a glance.
3. Handling Missing Values
Missing values are one of the most common data quality problems. Here's how to approach them systematically.
Check what's missing
df.isnull().sum()
# or
df.isna().sum()
Decision guide based on how much is missing
| % Missing | Recommended Action |
|---|---|
| < 5% | Drop the rows |
| 5% – 40% | Fill with median (numbers) or mode (categories) |
| > 40% | Consider dropping the column |
To calculate the percentage of missing cells in each row using pandas, use the code
df.isna().mean(axis=1) * 100
Fill missing values
# Numerical-fill with median
df["Age"] = df["Age"].fillna(df["Age"].median())
# Numerical-fill with mean
mean_charge = df["service_charge_kes"].mean()
df["service_charge_kes"] = df["service_charge_kes"].fillna(mean_charge)
# Categorical-fill with a default
df["Gender"] = df["Gender"].fillna("Unknown")
# Fill with zero (e.g., a score that's simply missing)
df["satisfaction_score"] = df["satisfaction_score"].fillna(0)
Drop missing values
# Drop rows where any value is missing
df = df.dropna()
# Drop rows with missing values in specific columns only
df = df.dropna(subset=["service_charge_kes"])
df = df.dropna(subset=["service_charge_kes", "age_of_building_years"])
4. Removing Duplicates
Duplicate records skew your analysis and inflate counts.
# Check how many duplicates exist
df.duplicated().sum()
# Remove them
df = df.drop_duplicates()
In healthcare datasets, duplicate patient records can cause serious reporting errors, so this step is non-negotiable.
5. Fixing Data Types
Pandas sometimes reads columns as the wrong type — for example, dates loaded as strings or numbers loaded as objects.
# Check all data types
df.dtypes
Convert dates
df["Column_date"] = pd.to_datetime(df["Column_date"])
If your data has messy or inconsistent date formats, use errors='coerce' to turn bad values into NaT instead of crashing your script:
df["order_date"] = pd.to_datetime(df["order_date"], errors='coerce')
What does
errors=['coerce','raise','ignore'] mean?
errors='coerce':Invalid values becomeNaT/NaN(safe)errors='raise': Script stops immediately on bad data (default)errors='ignore':Bad data is left unchanged
Convert multiple date columns at once
date_columns = ["order_date", "delivery_date", "last_login", "subscription_start"]
#what i did here is put
#the date columns in a list, then
#apply for loop
for col in date_columns:
df[col] = pd.to_datetime(df[col], errors='coerce')
Convert to numeric
df["Age"] = pd.to_numeric(df["Age"])
6. Cleaning Text Data
Text columns are messy — inconsistent casing, extra spaces, and different formats are all common.
# Standardize casing
df["County"] = df["County"].str.upper()
df["property_type"] = df["property_type"].str.lower()
df["property_type"] = df["property_type"].str.capitalize()
# Remove leading/trailing spaces
df["estate"] = df["estate"].str.strip()
# Replace values
df["Phone"] = df["Phone"].str.replace("254", "0")
# Check if a column contains a keyword
df[df["estate"].str.contains("Kasarani")]
7. Renaming Columns
Clean, consistent column names make your code easier to read.
df = df.rename(columns={
"Full Name": "full_name",
"Patient ID": "patient_id",
"Phone Number": "phone_number"
})
8. Filtering Data
Pandas makes it easy to slice and dice your data to focus on what matters.
Selecting columns
# One column
estates = df["estate"]
# Multiple columns
subset = df[["distance_to_cbd_km", "age_of_building_years", "security_rating"]]
Selecting rows with loc and iloc
# loc — by label/index value
df.loc[[30, 50, 70]] # Specific rows
df.loc[30, "furnishing"] # Specific cell
# iloc — by integer position
df.iloc[0] # First row
df.iloc[[0, 1, 2]] # First three rows
Filtering by condition
# Single condition
df[df["Age"] > 60]
df[df["County"] == "NAIROBI"]
# Multiple conditions
df[(df["Age"] > 18) & (df["Gender"] == "Female")]
df[df["bedrooms"] > 2]
9. Sorting Data
# Ascending (smallest to largest)
df.sort_values("bedrooms", ascending=True)
# Descending (largest to smallest)
df.sort_values("parking_slots", ascending=False)
10. Working with Dates
Once your dates are properly formatted, you can extract useful features from them.
Extract year, month, and day
df["order_year"] = df["order_date"].dt.year
df["order_month"] = df["order_date"].dt.month_name()
df["order_day"] = df["order_date"].dt.day_name()
Extract quarter
df["order_quarter"] = df["order_date"].dt.quarter
Extract from multiple columns at once
date_cols = ["order_date", "delivery_date", "last_login", "subscription_start"]
for col in date_cols:
df[f"{col}_month"] = df[col].dt.month
df[f"{col}_day"] = df[col].dt.day_name()
Calculate days between dates
df["delivery_days"] = (df["delivery_date"] - df["order_date"]).dt.days
11. Aggregating and Grouping Data
After cleaning, you can summarize your data to generate insights.
# Count patients per county
df.groupby("County")["Patient_ID"].count()
# Average age by gender
df.groupby("Gender")["Age"].mean()
# Median rent by furnishing type
df.groupby("furnishing")["monthly_rent_kes"].median()
# Median rent by property type
df.groupby("property_type")["monthly_rent_kes"].median()
# Quick frequency count
df["furnishing"].value_counts()
df["estate"].value_counts()
12. Feature Engineering
Feature engineering is creating new columns from existing data to unlock better insights.
Age categories
df["Age_Group"] = pd.cut(
df["Age"],
bins=[0, 18, 35, 60, 100],
labels=["Child", "Youth", "Adult", "Senior"]
)
Delivery time
df["delivery_days"] = (df["delivery_date"] - df["order_date"]).dt.days
13. Joining / Merging Datasets
Real-world data is rarely in one table. Pandas lets you join datasets like SQL.
students = pd.DataFrame({
"student_id": [101, 102, 103, 104, 105],
"student_name": ["Brian", "Caren", "Daisy", "Eric", "Faith"],
"course": ["Data Analytics", "Data Engineering", "Data Science", "Data Analytics", "Data Engineering"]
})
payments = pd.DataFrame({
"student_id": [101, 102, 104, 106, 107],
"amount_paid": [7500, 15000, 7500, 30000, 10500],
"payment_status": ["Partial", "Partial", "Partial", "Full", "Partial"]
})
# Inner join — only matching records
inner = pd.merge(students, payments, on="student_id", how="inner")
# Left join — all students, matched payments where available
left = pd.merge(students, payments, on="student_id", how="left")
# Right join — all payments, matched students where available
right = pd.merge(students, payments, on="student_id", how="right")
14. Quick Visualization
Pandas integrates directly with Matplotlib for fast plots.
import matplotlib.pyplot as plt
# Bar chart-patients per county
df["County"].value_counts().plot(kind="bar", title="Patients by County")
plt.show()
# Histogram - age distribution
df["Age"].plot(kind="hist", bins=10, title="Age Distribution")
plt.show()
# Line plot-monthly sales
df.groupby("Month")["Sales"].sum().plot(title="Monthly Sales")
plt.show()
Common Python Errors to Know
When you're just starting out, you'll hit these three types of errors:
SyntaxError — Python can't even read your code. Usually a missing colon, bracket, or indentation issue. The script won't run at all.
RuntimeError — Your code is valid Python, but hits an impossible instruction at runtime, like dividing by zero (ZeroDivisionError) or using a variable that doesn't exist (NameError).
LogicalError — The script runs without crashing, but gives the wrong answer because of a flaw in your logic. These are the hardest to spot:
# Intending to get the average of 4 and 6. Expected: 5
average = 4 + 6 / 2 # Returns 7.0, not 5.0 — order of operations!
# Correct way:
average = (4 + 6) / 2 # Returns 5.0
Full Project References
The code in this article is drawn from real datasets I worked with:
- Hospital Operations Dataset — Patients, MaritalStatus, Sex, NutritionalStatus, and more. View on GitHub
- Nairobi Housing Statistics Dataset — Property types, estates, rent prices, and features across Nairobi. View on GitHub
Wrapping Up
Data cleaning is the unglamorous but essential foundation of every good data project. Here's a checklist to keep handy:
- Load and inspect your data (
head,info,describe) - Check and handle missing values
- Remove duplicates
- Fix data types (especially dates)
- Clean text columns (casing, spaces, formatting)
- Filter, sort, and aggregate as needed
- Engineer new features where useful
- Join datasets when needed
Pandas makes all of this doable in a few lines of code. The more projects you work on, the more intuitive it becomes.
If you found this useful, feel free to drop a comment or connect happy to answer questions as you work through your first Pandas projects.
Top comments (0)