Hussein Mahdi
Mastering Pandas — Part 3: Data Cleaning, Merging & Joining

Pandas for Data Science Series — Article #3


Real Data Is Never Clean

In Part 2, you learned how to group and slice a DataFrame with precision. But in real projects, the data you receive rarely arrives ready to analyze. Column names have spaces. Ages are negative. Phone numbers mix letters with digits. Duplicate rows sneak in. And your data often lives across multiple files that need to be combined before you can do anything useful.

This article covers the two skills that bridge raw data and real analysis:

  • Data Cleaning — detecting and fixing everything wrong with your data before it corrupts your results
  • Merging & Joining — combining separate DataFrames into a single, unified dataset

These topics flow naturally into each other. You clean each source first, then you combine them. That's the order every data analyst follows in practice — and it's the order we'll follow here.

We'll use this sample dataset throughout the cleaning section:

import pandas as pd
import numpy as np

data = {
    "Name":             ["Alice", "bob", "Carol", "Alice", "  dave  ", "Eve"],
    "Age":              [25, -5, 200, 25, 30, None],
    "Phone":            ["055-123-4567", None, "55123x", "055-123-4567", "055-987-0000", "055abc"],
    "Email":            ["alice@mail.com", "bob@mail.com", None, "alice@mail.com", "dave@mail.com", "not-an-email"],
    "Paying_Customer":  ["Y", "N", "Yes", "Y", "N", "y"],
    "Score":            [85, 110, 72, 85, -10, 90],
}
df = pd.DataFrame(data)

Step 1 — Inspect Before You Touch Anything

The most important rule in data cleaning: look before you act. Jumping straight into fixes without understanding the full picture leads to mistakes you won't catch until much later.

These are the inspection tools you should run on every new dataset:

df.shape          # (rows, columns) — how big is this?
df.info()         # column names, data types, non-null counts
df.head()         # first 5 rows
df.tail()         # last 5 rows
df.describe()     # min, max, mean, std — spot impossible values here

For missing values specifically:

df.isna().sum()           # count NaN per column
df.isnull().sum()         # same as isna() — both work
df[df["Phone"].notna()]   # rows that actually have a phone number

For duplicates and inconsistent categories:

df.duplicated()                         # True/False per row
df[df.duplicated()]                     # show only the duplicate rows
df["Paying_Customer"].value_counts()    # reveals: Y, N, Yes, y — all meaning the same thing

Always run describe() on numeric columns — the min and max alone will tell you immediately if something is wrong. A minimum age of -5 or a score of 110 out of 100 is a red flag you can't miss.
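A quick sketch of that habit, rebuilding just the two numeric columns of the sample DataFrame above:

```python
import pandas as pd

# The numeric columns from the sample dataset at the top of the article
df = pd.DataFrame({
    "Age":   [25, -5, 200, 25, 30, None],
    "Score": [85, 110, 72, 85, -10, 90],
})

stats = df.describe()
print(stats.loc[["min", "max"]])
# Age min is -5 and Score max is 110 — both impossible, both caught before any fix
```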


Step 2 — Fix Column Names First

Messy column names make everything harder. Clean them before anything else, so every subsequent operation uses predictable, consistent names.

Rename specific columns:

df = df.rename(columns={"Old Name": "New_Name", "phone number": "Phone"})  # rename() returns a new DataFrame

Clean all column names at once — the standard one-liner:

df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

This strips surrounding spaces, converts to lowercase, and replaces spaces with underscores in a single chain. After this, "First Name" becomes "first_name", " Age " becomes "age", and so on.

Before:              After:
"First Name"    →    "first_name"
"  Age "        →    "age"
"Phone Number"  →    "phone_number"
"ZIP Code"      →    "zip_code"
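Here is the one-liner applied to a hypothetical frame with exactly those messy headers:

```python
import pandas as pd

# Hypothetical messy headers matching the Before column above
df = pd.DataFrame(columns=["First Name", "  Age ", "Phone Number", "ZIP Code"])

# Strip edges, lowercase, replace spaces — one chain
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print(list(df.columns))
# ['first_name', 'age', 'phone_number', 'zip_code']
```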

Step 3 — Handle Missing Values

Missing values appear as NaN in Pandas. Before deciding what to do with them, you need to understand why they're missing — because the right fix depends on the cause.

Fill missing values with a fixed value:

df["Phone"] = df["Phone"].fillna("000-000-0000")      # placeholder for missing phone
df["Age"]   = df["Age"].fillna(0)                     # fill numeric with 0
df["Name"]  = df["Name"].fillna("Unknown")            # fill text with label
df["Score"] = df["Score"].fillna(df["Score"].mean())  # fill with column average

Fill missing numeric values by estimating from neighbors:

df["Sales"] = df["Sales"].interpolate()

interpolate() fills NaN by linear interpolation — it draws a straight line between the known values on either side and reads off the missing points. For a single NaN between two neighbors, that's simply their midpoint. This is ideal for time series or sensor data where values change gradually — it would be misleading for unordered, random datasets.

Before:     After interpolate():
100         100
NaN    →    150   ← estimated midpoint
200         200
NaN    →    225   ← estimated midpoint
250         250
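The before/after column above reproduces exactly:

```python
import pandas as pd

s = pd.Series([100, None, 200, None, 250])
print(s.interpolate().tolist())
# [100.0, 150.0, 200.0, 225.0, 250.0]
```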

Remove rows with missing values:

df.dropna()                            # drop any row with at least one NaN
df.dropna(subset=["Phone"])            # drop only if Phone is NaN
df.dropna(thresh=3)                    # drop rows that have fewer than 3 non-NaN values

inplace rule: use either inplace=True or variable assignment — never both. They do the same thing, and combining them produces a bug where df becomes None.

df.dropna(subset=["Phone"], inplace=True)   # ✅ modifies df directly
df = df.dropna(subset=["Phone"])            # ✅ returns new df
df = df.dropna(subset=["Phone"], inplace=True)  # ❌ df becomes None

Step 4 — Remove Duplicates

Duplicate rows silently inflate counts, averages, and totals. Always check for them.

df.drop_duplicates()                         # remove fully identical rows
df.drop_duplicates(subset=["Name"])          # remove rows with duplicate Name only
df.drop_duplicates(subset=["Name", "Email"]) # remove rows where both columns match
df.drop_duplicates(keep="last")              # keep the last occurrence instead of the first

Before:                         After drop_duplicates():
Name   Email                    Name   Email
Alice  alice@mail.com           Alice  alice@mail.com
Bob    bob@mail.com             Bob    bob@mail.com
Alice  alice@mail.com  ← dup    Carol  carol@mail.com
Carol  carol@mail.com
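A small made-up example showing how subset= and keep= interact — the two Alice rows share a Name but have different emails:

```python
import pandas as pd

df = pd.DataFrame({
    "Name":  ["Alice", "Bob", "Alice"],
    "Email": ["alice@mail.com", "bob@mail.com", "alice@new.com"],
})

# No row is fully identical, so the full-row check drops nothing
print(len(df.drop_duplicates()))  # 3

# With subset=["Name"], the first Alice wins by default; keep="last" flips that
print(df.drop_duplicates(subset=["Name"])["Email"].tolist())
# ['alice@mail.com', 'bob@mail.com']
print(df.drop_duplicates(subset=["Name"], keep="last")["Email"].tolist())
# ['bob@mail.com', 'alice@new.com']
```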

Step 5 — Fix Data Types

Pandas sometimes reads numeric columns as strings, or dates as plain text. Operations on wrong types either crash or silently produce wrong results.

Change column type directly:

df["Zip_Code"] = df["Zip_Code"].astype(str)     # treat as text, not a number
df["Age"] = df["Age"].astype(int)               # convert to integer — fails if the column contains NaN
df["Price"] = df["Price"].astype(float)         # convert to decimal

Handle messy numbers that contain symbols or text:

df["Price"] = pd.to_numeric(df["Price"], errors="coerce")

errors="coerce" converts anything that can't be parsed into NaN instead of crashing. This is the safe way to clean columns like "$1,200" or "N/A" that can't be directly cast with astype().

Before:           After pd.to_numeric(..., errors="coerce"):
"1200"    →       1200.0
"$850"    →       NaN     ← couldn't parse
"700.5"   →       700.5
"N/A"     →       NaN     ← couldn't parse
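Note that coercion alone throws away values that are really numbers wearing symbols, like "$850" above. A common pattern is to strip the symbols with .str.replace() first, then coerce — here sketched on a hypothetical price column:

```python
import pandas as pd

prices = pd.Series(["1200", "$850", "700.5", "N/A"])

# Naive coercion loses "$850"
print(pd.to_numeric(prices, errors="coerce").tolist())
# [1200.0, nan, 700.5, nan]

# Strip currency symbols and thousands separators first, then coerce
cleaned = pd.to_numeric(prices.str.replace(r"[$,]", "", regex=True), errors="coerce")
print(cleaned.tolist())
# [1200.0, 850.0, 700.5, nan]  — only the genuinely unparseable "N/A" becomes NaN
```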

Convert to datetime:

df["Date"] = pd.to_datetime(df["Date"])
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")  # specify format if needed

Once a column is proper datetime, you unlock sorting by date, filtering by year/month, calculating time differences, and much more.


Step 6 — Clean Strings

String columns are almost always the messiest part of a real dataset. Pandas provides a full .str accessor with everything you need.

Strip unwanted characters from edges:

df["Name"] = df["Name"].str.strip()           # remove spaces from both sides
df["Name"] = df["Name"].str.strip(" ./-")     # remove spaces, dots, slashes, dashes
df["Name"] = df["Name"].str.lstrip()          # left side only
df["Name"] = df["Name"].str.rstrip()          # right side only

Replace characters anywhere in the string:

df["Phone"] = df["Phone"].str.replace(r"[a-zA-Z]", "", regex=True)   # remove all letters
df["Phone"] = df["Phone"].str.replace(r"\D", "", regex=True)          # keep digits only
df["Name"]  = df["Name"].str.replace(r"[./]", "", regex=True)         # remove dots and slashes

Standardize text case:

df["Name"] = df["Name"].str.lower()     # all lowercase → "alice"
df["Name"] = df["Name"].str.upper()     # all uppercase → "ALICE"
df["Name"] = df["Name"].str.title()     # title case    → "Alice"

Check text content:

df[df["Email"].str.contains("@", na=False)]       # filter valid-looking emails; na=False treats NaN as no match
df[df["Phone"].str.len() != 12]                   # find rows where phone length is wrong
df[df["Name"].str.startswith("A", na=False)]      # names starting with A
df[df["Name"].str.endswith("e", na=False)]        # names ending with e

Extract a pattern using regex:

df["Zip"] = df["Address"].str.extract(r"(\d{5})")        # pull 5-digit zip code
df["Area_Code"] = df["Phone"].str.extract(r"^(\d{3})")   # pull first 3 digits

Split one column into multiple columns:

df[["Street", "State", "Zip"]] = (
    df["Address"]
    .str.split(",", n=2, expand=True)
)

n=2 limits the operation to at most 2 splits, so you get at most 3 parts no matter how many commas the string contains; with expand=True, those parts spread into separate columns (strings with fewer commas are padded with None).

Before:                              After:
"123 Main St, Texas, 75001"   →   Street="123 Main St"  State="Texas"  Zip="75001"
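One practical wrinkle: splitting on "," leaves a leading space on each later part, so strip each piece after the split. A sketch on the address above:

```python
import pandas as pd

df = pd.DataFrame({"Address": ["123 Main St, Texas, 75001"]})

parts = df["Address"].str.split(",", n=2, expand=True)
df["Street"] = parts[0].str.strip()   # strip the leading space left by the comma
df["State"]  = parts[1].str.strip()
df["Zip"]    = parts[2].str.strip()

print(df.loc[0, ["Street", "State", "Zip"]].tolist())
# ['123 Main St', 'Texas', '75001']
```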

Step 7 — Replace and Standardize Values

Raw data often has the same value written multiple ways: "Y", "Yes", "yes", "y" — all meaning the same thing. Standardize them.

Replace whole cell values using a dictionary:

df["Status"] = df["Status"].replace({"Y": "Yes", "N": "No", "y": "Yes"})

# Apply to multiple columns at once
df[["Paying_Customer", "Do_Not_Contact"]] = (
    df[["Paying_Customer", "Do_Not_Contact"]]
    .replace({"Y": "Yes", "N": "No", "y": "Yes", "n": "No"})
)

Overwrite invalid values using loc[]:

df.loc[df["Age"] < 0,   "Age"] = None     # negative age → invalid
df.loc[df["Age"] > 120, "Age"] = None     # age over 120 → invalid
df.loc[df["Price"] < 0, "Price"] = None   # negative price → invalid

Keep valid values, replace everything else with NaN:

df["Salary"] = df["Salary"].where(df["Salary"] > 0)
# Rows where Salary <= 0 become NaN; the rest are untouched

Step 8 — Handle Outliers

An outlier here is a value far outside the plausible range — a score of 110/100, an age of 200, a price of -50. clip() is the clean way to enforce valid bounds.

Cap values within a valid range:

df["Score"] = df["Score"].clip(0, 100)      # 110 → 100, -10 → 0
df["Age"]   = df["Age"].clip(0, 120)        # 200 → 120, -5  → 0
df["Age"]   = np.clip(df["Age"], 0, 120)    # NumPy version — same result

Before:   After clip(0, 100):
85        85
110   →   100   ← capped at max
-10   →   0     ← capped at min
72        72

clip() is gentler than dropping rows — it preserves the row but brings the outlier within the valid boundary. Use it when the row has useful data in other columns that you don't want to lose.


Step 9 — Drop Rows and Columns

Sometimes the cleanest fix is simply to remove what you don't need.

Drop rows by index:

df.drop(0)              # remove row at index 0
df.drop([0, 2, 5])      # remove multiple rows by index

Drop columns:

df.drop(columns=["Age", "Phone"])      # preferred style
df.drop("Age", axis=1)                 # axis=1 means "columns"
df.drop(["Age", "Phone"], axis=1)      # multiple columns

Step 10 — Create New Columns

Cleaning isn't only about removing bad data — it's also about building the right structure for analysis.

Fixed value:

df["Source"] = "CSV Import"    # tag every row with its data source

Calculated from another column:

df["Price_With_Tax"] = df["Price"] * 1.15
df["Name_Length"]    = df["Name"].str.len()

Based on a condition:

df["Status"] = np.where(df["Age"] >= 18, "Adult", "Minor")

Based on multiple conditions:

conditions = [df["Score"] >= 90, df["Score"] >= 70, df["Score"] >= 50]
choices    = ["A", "B", "C"]
df["Grade"] = np.select(conditions, choices, default="F")

Custom logic with apply():

df["Name"] = df["Name"].apply(lambda x: x.title() if isinstance(x, str) else x)

def categorize_price(price):
    if price > 1000:
        return "Expensive"
    elif price > 500:
        return "Medium"
    else:
        return "Cheap"

df["Category"] = df["Price"].apply(categorize_price)

Step 11 — Add New Rows

Add a single row using concat():

new_row = pd.DataFrame([{"Name": "Frank", "Age": 28, "Phone": "055-111-2222"}])
df = pd.concat([df, new_row], ignore_index=True)

Add multiple rows at once:

new_rows = pd.DataFrame([
    {"Name": "Grace", "Age": 31},
    {"Name": "Hank",  "Age": 24},
])
df = pd.concat([df, new_rows], ignore_index=True)

ignore_index=True resets the index after concatenation so there are no gaps or repeated numbers:

Without ignore_index:    With ignore_index=True:
0  Alice                 0  Alice
1  Bob          →        1  Bob
0  Grace                 2  Grace   ← clean
1  Hank                  3  Hank    ← clean

Step 12 — Reset the Index

After dropping rows, the index has gaps. Reset it before analysis or before passing data to another function.

df = df.reset_index(drop=True)

drop=True discards the old index numbers entirely. drop=False keeps the old index as a new column — useful if you need to trace where a row originally came from.


Putting It All Together — A Real Cleaning Pipeline

Here's what a complete cleaning workflow looks like from start to finish:

import pandas as pd
import numpy as np

df = pd.read_csv("data.csv")

# 1. Inspect
print(df.info())
print(df.isna().sum())
print(df.describe())

# 2. Fix column names
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# 3. Drop useless columns and duplicates
df.drop(columns=["not_useful"], inplace=True)
df.drop_duplicates(inplace=True)

# 4. Handle missing values
df.dropna(subset=["phone_number"], inplace=True)
df["email"] = df["email"].fillna("unknown@unknown.com")  # assign back — inplace=True on a column selection is unreliable
df["sales"] = df["sales"].interpolate()

# 5. Fix data types
df["date"]     = pd.to_datetime(df["date"])
df["price"]    = pd.to_numeric(df["price"], errors="coerce")
df["zip_code"] = df["zip_code"].astype(str)

# 6. Clean strings
df["name"] = df["name"].str.strip().str.title()
df["phone_number"] = (
    df["phone_number"]
    .str.replace(r"\D", "", regex=True)
    .str.replace(r"(\d{3})(\d{3})(\d{4})", r"\1-\2-\3", regex=True)
    .replace("", "000-000-0000")
)

# 7. Fix invalid values and outliers
df.loc[df["age"] < 0, "age"] = None
df["score"] = df["score"].clip(0, 100)

# 8. Standardize categories
df["status"] = df["status"].replace({"Y": "Yes", "N": "No", "y": "Yes"})

# 9. Filter rows
df = df[df["do_not_contact"] != "Yes"]

# 10. Reset index
df = df.reset_index(drop=True)

print(df.info())
print(df.head())

Now Combine — Merging & Joining DataFrames

Once each source is clean, the next challenge is combining them. In the real world, data almost never lives in a single file. You might have orders in one table, customers in another, and products in a third. Merging is how you bring those pieces together.

Pandas provides three main tools for this:

Tool       What it does
merge()    Joins two DataFrames by matching on a shared column
concat()   Stacks DataFrames vertically or side by side
join()     Joins on the index instead of a column
We'll use these two DataFrames for all the examples:

customers = pd.DataFrame({
    "ID":        [1, 2, 3, 4],
    "Name":      ["Alice", "Bob", "Carol", "Dave"],
    "Country":   ["USA", "UK", "India", "USA"],
})

orders = pd.DataFrame({
    "ID":        [1, 2, 3, 5],
    "Product":   ["Laptop", "Phone", "Tablet", "Monitor"],
    "Price":     [1200, 800, 600, 400],
})

Notice that ID=4 (Dave) exists in customers but not in orders, and ID=5 exists in orders but not in customers. This is exactly the kind of mismatch you encounter in practice — and why the how= parameter matters so much.
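You can see the effect of how= purely in row counts. Rebuilding the two frames (with just the key and one payload column each):

```python
import pandas as pd

customers = pd.DataFrame({"ID": [1, 2, 3, 4], "Name": ["Alice", "Bob", "Carol", "Dave"]})
orders    = pd.DataFrame({"ID": [1, 2, 3, 5], "Product": ["Laptop", "Phone", "Tablet", "Monitor"]})

inner = customers.merge(orders, how="inner", on="ID")
left  = customers.merge(orders, how="left",  on="ID")
outer = customers.merge(orders, how="outer", on="ID")

print(len(inner), len(left), len(outer))
# 3 4 5 — only matches; all customers (Dave gets NaN); every ID from both sides
```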


1️⃣ merge() — The Main Tool

merge() works like a SQL JOIN. It aligns two DataFrames by finding rows where a shared column has the same value.

Basic syntax:

customers.merge(orders, how="inner", on="ID")

The how= Parameter — All Four Types

# INNER → only rows that match in both DataFrames (default)
customers.merge(orders, how="inner", on="ID")
# Result: IDs 1, 2, 3 only — Dave (4) and Monitor (5) are excluded

# LEFT → all rows from the left + matched rows from the right
customers.merge(orders, how="left", on="ID")
# Result: all 4 customers — Dave gets NaN for Product and Price

# RIGHT → all rows from the right + matched rows from the left
customers.merge(orders, how="right", on="ID")
# Result: all 4 orders — Monitor (ID=5) gets NaN for Name and Country

# OUTER → all rows from both, NaN where no match
customers.merge(orders, how="outer", on="ID")
# Result: all 5 unique IDs — gaps filled with NaN on both sides

how=    Keeps from Left   Keeps from Right   NaN Possible
inner   Matched only      Matched only       No
left    All rows          Matched only       Right side
right   Matched only      All rows           Left side
outer   All rows          All rows           Both sides

CROSS merge — every row with every row:

customers.merge(orders, how="cross")
# Produces 4 × 4 = 16 rows — every customer paired with every order
# No on= parameter — it will raise an error if you include it

The on= Parameter — What to Match On

# Match on a single column
customers.merge(orders, how="inner", on="ID")

# Match on multiple columns — every listed column must match at once
# (each must exist in both DataFrames; orders above has no Country, so this is illustrative)
customers.merge(orders, how="inner", on=["ID", "Country"])

# Auto-detect — Pandas finds all shared column names automatically
customers.merge(orders)

Be careful with auto-detect. If both DataFrames happen to share a column that you don't intend to join on (like a generic "Name" column), Pandas will include it in the match condition and produce unexpected results. Explicit is safer.
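To see that failure mode concretely, here is a hypothetical pair of frames that share an incidental Name column holding different spellings:

```python
import pandas as pd

left  = pd.DataFrame({"ID": [1, 2], "Name": ["Alice", "Bob"]})
right = pd.DataFrame({"ID": [1, 2], "Name": ["A. Smith", "B. Jones"]})

auto     = left.merge(right)            # auto-joins on ["ID", "Name"] — no pair matches
explicit = left.merge(right, on="ID")   # joins on ID only; keeps Name_x / Name_y

print(len(auto), len(explicit))
# 0 2 — the auto-detected merge silently returns an empty result
```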


Handling Shared Column Names — _x and _y

If both DataFrames have a column with the same name, but you only join on ID, Pandas keeps both copies and renames them automatically:

# Both DataFrames have a "Name" column, but you only merge on "ID"
customers.merge(orders, how="inner", on="ID")
# Result: Name_x (from customers), Name_y (from orders)

Three ways to fix this:

# Option 1 — merge on all shared columns so no duplicates appear
customers.merge(orders, how="inner", on=["ID", "Name"])

# Option 2 — use custom suffixes instead of _x and _y
customers.merge(orders, how="inner", on="ID", suffixes=("_customer", "_order"))

# Option 3 — drop the shared column from one DataFrame before merging
orders_clean = orders.drop(columns=["Name"])
customers.merge(orders_clean, how="inner", on="ID")

Other Useful Parameters

# left_on / right_on — when the key column has different names in each DataFrame
customers.merge(orders, how="inner", left_on="CustomerID", right_on="OrderID")

# indicator — adds a column showing where each row came from
customers.merge(orders, how="outer", on="ID", indicator=True)
# _merge column values: "left_only", "right_only", "both"

The indicator=True parameter is especially useful for auditing: after a merge, you can immediately see which rows matched, which came only from the left, and which came only from the right.
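A sketch of that audit on the customers/orders pair from above:

```python
import pandas as pd

customers = pd.DataFrame({"ID": [1, 2, 3, 4], "Name": ["Alice", "Bob", "Carol", "Dave"]})
orders    = pd.DataFrame({"ID": [1, 2, 3, 5], "Product": ["Laptop", "Phone", "Tablet", "Monitor"]})

audit = customers.merge(orders, how="outer", on="ID", indicator=True)

# Customers with no order, and orders with no customer
print(audit.loc[audit["_merge"] == "left_only", "Name"].tolist())      # ['Dave']
print(audit.loc[audit["_merge"] == "right_only", "Product"].tolist())  # ['Monitor']
```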


2️⃣ concat() — Stacking DataFrames

While merge() combines DataFrames horizontally by matching values, concat() simply stacks them — either rows on top of rows, or columns side by side.

pd.concat([df1, df2])        # stack vertically (default)
pd.concat([df1, df2], axis=1)  # stack horizontally

axis= Parameter

# axis=0 → stack rows vertically (append one DataFrame below another)
monthly_data = pd.concat([january, february, march], axis=0)

# axis=1 → stack columns side by side
combined = pd.concat([names_df, scores_df], axis=1)

join= Parameter

# outer (default) → keep all columns, NaN where a column doesn't exist in one source
pd.concat([df1, df2], join="outer")

# inner → keep only columns that exist in all DataFrames
pd.concat([df1, df2], join="inner")

ignore_index= Parameter

# Without ignore_index — original indices are preserved, which can repeat
pd.concat([df1, df2])
# Index: 0, 1, 2, 0, 1, 2  ← 0, 1, 2 repeats

# With ignore_index=True — fresh sequential index
pd.concat([df1, df2], ignore_index=True)
# Index: 0, 1, 2, 3, 4, 5  ← clean

Always use ignore_index=True after stacking unless you specifically need the original indices.

keys= Parameter

# Label each source so you can identify where each row came from
combined = pd.concat([df1, df2], keys=["source_a", "source_b"])

This creates a MultiIndex where the first level is your label and the second is the original row index — useful when you need to trace data back to its source after combining.
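A minimal sketch with two tiny made-up frames — the label becomes the outer index level, so one .loc recovers a single source:

```python
import pandas as pd

df1 = pd.DataFrame({"Sales": [100, 200]})
df2 = pd.DataFrame({"Sales": [300]})

combined = pd.concat([df1, df2], keys=["source_a", "source_b"])
print(combined.index[0])                           # ('source_a', 0) — label + original row index
print(combined.loc["source_a"]["Sales"].tolist())  # [100, 200] — just the first source
```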

verify_integrity= Parameter

# Raises an error if any duplicate index values are found after concatenation
pd.concat([df1, df2], verify_integrity=True)

merge() vs concat() — When to Use Which

                            merge()                    concat()
Direction                   Horizontal (by matching)   Vertical or horizontal (by stacking)
Needs a shared key column   Yes                        No
Best for                    Joining related tables     Combining same-structure data
Equivalent to               SQL JOIN                   SQL UNION

A simple rule: if you're combining columns from different tables, use merge(). If you're stacking more rows of the same structure, use concat().


3️⃣ join() — Index-Based Joining

join() works like merge(), but it joins on the index instead of a regular column. You need to set the index first.

customers_indexed = customers.set_index("ID")
orders_indexed    = orders.set_index("ID")

customers_indexed.join(orders_indexed, how="inner")

The how= options are the same as merge(): inner, left, right, outer.

Why merge() Is Almost Always Better

                              merge()    join()
Join on any column            ✅ Yes     ❌ No — index only
Join on index                 ✅ Yes     ✅ Yes
Requires set_index() first    ❌ No      ✅ Yes
Simpler syntax                ✅ Yes     ⚠️ Extra steps

Use join() only when your data is already indexed — for example, after a groupby() that produced a meaningful index. Otherwise, merge() is cleaner and more flexible.
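In fact, an index join is expressible directly in merge() via left_index=/right_index=, with no prior set_index() dance needed. A sketch checking that both routes agree on the customers/orders example:

```python
import pandas as pd

customers = pd.DataFrame({"ID": [1, 2, 3, 4], "Name": ["Alice", "Bob", "Carol", "Dave"]}).set_index("ID")
orders    = pd.DataFrame({"ID": [1, 2, 3, 5], "Product": ["Laptop", "Phone", "Tablet", "Monitor"]}).set_index("ID")

via_join  = customers.join(orders, how="inner")
via_merge = customers.merge(orders, left_index=True, right_index=True, how="inner")

print(via_join.equals(via_merge))  # True — same result, merge just skips the ceremony
```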


Common Patterns

Clean first, then merge

The most important workflow: clean each source independently before combining them.

# Clean customers
customers["Name"] = customers["Name"].str.strip().str.title()
customers = customers.drop_duplicates(subset=["ID"])

# Clean orders
orders["Price"] = pd.to_numeric(orders["Price"], errors="coerce")
orders = orders.dropna(subset=["Price"])

# Now merge
result = customers.merge(orders, how="left", on="ID")

Merge and immediately filter

# Keep only rows that matched in both DataFrames (inner), then filter by price
result = customers.merge(orders, how="inner", on="ID")
result = result[result["Price"] > 500]

Combine monthly files with concat, then analyze

import glob

all_files = glob.glob("sales_*.csv")
monthly_dfs = [pd.read_csv(f) for f in all_files]
full_year = pd.concat(monthly_dfs, ignore_index=True)

# Now clean and analyze the full year
full_year["Date"] = pd.to_datetime(full_year["Date"])
full_year.drop_duplicates(inplace=True)

Complete Summary Table

Function              Category        Purpose
df.info()             Inspection      Column types and null counts
df.describe()         Inspection      Statistical summary
df.isna().sum()       Inspection      Count missing values per column
df.value_counts()     Inspection      Frequency of each unique value
df.duplicated()       Inspection      Identify duplicate rows
df.fillna()           Missing Values  Fill NaN with a value
df.dropna()           Missing Values  Remove rows with NaN
df.interpolate()      Missing Values  Estimate NaN from neighbors
df.drop_duplicates()  Duplicates      Remove duplicate rows
df.drop()             Structure       Remove rows or columns
df.rename()           Structure       Rename specific columns
df.columns.str.*      Structure       Clean all column names at once
df.astype()           Types           Change column data type
pd.to_numeric()       Types           Convert to number, coerce errors
pd.to_datetime()      Types           Convert to datetime
str.strip()           Strings         Remove characters from edges
str.replace()         Strings         Replace characters in text
str.split()           Strings         Split into multiple columns
str.extract()         Strings         Extract regex pattern
str.contains()        Strings         Filter rows by text match
str.len()             Strings         Get text length per row
df.replace()          Values          Replace whole cell values
df.loc[cond, col]     Values          Overwrite values matching condition
df.where()            Values          Keep valid values, NaN the rest
df.clip()             Outliers        Cap values within a valid range
np.where()            New Columns     Column based on one condition
np.select()           New Columns     Column based on multiple conditions
df.apply()            New Columns     Custom function per row or column
df.reset_index()      Structure       Reset index after dropping rows
merge()               Combining       Join two DataFrames by matching column
concat()              Combining       Stack DataFrames vertically or horizontally
join()                Combining       Join two DataFrames by index

This is Part 3 of the Pandas for Data Science series. Next up: Part 4 — Data Visualization with Matplotlib & Seaborn

