Pandas for Data Science Series — Article #3
Real Data Is Never Clean
In Part 2, you learned how to group and slice a DataFrame with precision. But in real projects, the data you receive rarely arrives ready to analyze. Column names have spaces. Ages are negative. Phone numbers mix letters with digits. Duplicate rows sneak in. And your data often lives across multiple files that need to be combined before you can do anything useful.
This article covers the two skills that bridge raw data and real analysis:
- Data Cleaning — detecting and fixing everything wrong with your data before it corrupts your results
- Merging & Joining — combining separate DataFrames into a single, unified dataset
These topics flow naturally into each other. You clean each source first, then you combine them. That's the order every data analyst follows in practice — and it's the order we'll follow here.
We'll use this sample dataset throughout the cleaning section:
import pandas as pd
import numpy as np
data = {
"Name": ["Alice", "bob", "Carol", "Alice", " dave ", "Eve"],
"Age": [25, -5, 200, 25, 30, None],
"Phone": ["055-123-4567", None, "55123x", "055-123-4567", "055-987-0000", "055abc"],
"Email": ["alice@mail.com", "bob@mail.com", None, "alice@mail.com", "dave@mail.com", "not-an-email"],
"Paying_Customer": ["Y", "N", "Yes", "Y", "N", "y"],
"Score": [85, 110, 72, 85, -10, 90],
}
df = pd.DataFrame(data)
Step 1 — Inspect Before You Touch Anything
The most important rule in data cleaning: look before you act. Jumping straight into fixes without understanding the full picture leads to mistakes you won't catch until much later.
These are the inspection tools you should run on every new dataset:
df.shape # (rows, columns) — how big is this?
df.info() # column names, data types, non-null counts
df.head() # first 5 rows
df.tail() # last 5 rows
df.describe() # min, max, mean, std — spot impossible values here
For missing values specifically:
df.isna().sum() # count NaN per column
df.isnull().sum() # same as isna() — both work
df[df["Phone"].notna()] # rows that actually have a phone number
For duplicates and inconsistent categories:
df.duplicated() # True/False per row
df[df.duplicated()] # show only the duplicate rows
df["Paying_Customer"].value_counts() # reveals: Y, N, Yes, y — all meaning the same thing
Always run describe() on numeric columns — the min and max alone will tell you immediately if something is wrong. A minimum age of -5 or a score of 110 out of 100 is a red flag you can't miss.
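To make this concrete, here is that check run on a trimmed copy of the sample dataset (Name, Age, and Score only):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "bob", "Carol", "Alice", " dave ", "Eve"],
    "Age": [25, -5, 200, 25, 30, None],
    "Score": [85, 110, 72, 85, -10, 90],
})

# describe() exposes impossible values instantly via min/max
stats = df[["Age", "Score"]].describe()
print(stats.loc["min"])  # Age -5.0, Score -10.0: both impossible
print(stats.loc["max"])  # Age 200.0, Score 110.0: both out of range

# missing-value count per column
print(df.isna().sum())   # Age has 1 NaN
```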
Step 2 — Fix Column Names First
Messy column names make everything harder. Clean them before anything else, so every subsequent operation uses predictable, consistent names.
Rename specific columns:
df.rename(columns={"Old Name": "New_Name", "phone number": "Phone"})
Clean all column names at once — the standard one-liner:
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
This strips surrounding spaces, converts to lowercase, and replaces spaces with underscores in a single chain. After this, "First Name" becomes "first_name", " Age " becomes "age", and so on.
Before: After:
"First Name" → "first_name"
" Age " → "age"
"Phone Number" → "phone_number"
"ZIP Code" → "zip_code"
Step 3 — Handle Missing Values
Missing values appear as NaN in Pandas. Before deciding what to do with them, you need to understand why they're missing — because the right fix depends on the cause.
Fill missing values with a fixed value:
df["Phone"].fillna("000-000-0000") # placeholder for missing phone
df["Age"].fillna(0) # fill numeric with 0
df["Name"].fillna("Unknown") # fill text with label
df["Score"].fillna(df["Score"].mean()) # fill with column average
Fill missing numeric values by estimating from neighbors:
df["Sales"] = df["Sales"].interpolate()
interpolate() fills each NaN by drawing a straight line between the known values around it — for a single gap, that is the midpoint of its two neighbors. This is ideal for time series or sensor data where values change gradually — it would be wrong to use for random, unordered datasets.
Before: After interpolate():
100 100
NaN → 150 ← estimated midpoint
200 200
NaN → 225 ← estimated midpoint
250 250
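The same before/after, runnable:

```python
import pandas as pd
import numpy as np

s = pd.Series([100, np.nan, 200, np.nan, 250])
filled = s.interpolate()  # linear interpolation between known neighbors
print(filled.tolist())    # [100.0, 150.0, 200.0, 225.0, 250.0]
```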
Remove rows with missing values:
df.dropna() # drop any row with at least one NaN
df.dropna(subset=["Phone"]) # drop only if Phone is NaN
df.dropna(thresh=3) # drop rows that have fewer than 3 non-NaN values
inplace rule: use either inplace=True or variable assignment — never both. They do the same thing, and combining them produces a bug where df becomes None.

df.dropna(subset=["Phone"], inplace=True)       # ✅ modifies df directly
df = df.dropna(subset=["Phone"])                # ✅ returns a new df
df = df.dropna(subset=["Phone"], inplace=True)  # ❌ df becomes None
Step 4 — Remove Duplicates
Duplicate rows silently inflate counts, averages, and totals. Always check for them.
df.drop_duplicates() # remove fully identical rows
df.drop_duplicates(subset=["Name"]) # remove rows with duplicate Name only
df.drop_duplicates(subset=["Name", "Email"]) # remove rows where both columns match
df.drop_duplicates(keep="last") # keep the last occurrence instead of the first
Before: After drop_duplicates():
Name Email Name Email
Alice alice@mail.com Alice alice@mail.com
Bob bob@mail.com → Bob bob@mail.com
Alice alice@mail.com ← dup Carol carol@mail.com
Carol carol@mail.com
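A minimal demonstration with a small frame containing one exact duplicate:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Carol"],
    "Email": ["alice@mail.com", "bob@mail.com", "alice@mail.com", "carol@mail.com"],
})
deduped = df.drop_duplicates()  # the second Alice row is removed
print(len(df), "->", len(deduped))  # 4 -> 3
```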
Step 5 — Fix Data Types
Pandas sometimes reads numeric columns as strings, or dates as plain text. Operations on wrong types either crash or silently produce wrong results.
Change column type directly:
df["Zip_Code"] = df["Zip_Code"].astype(str) # treat as text, not a number
df["Age"] = df["Age"].astype(int) # convert to integer
df["Price"] = df["Price"].astype(float) # convert to decimal
Handle messy numbers that contain symbols or text:
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")
errors="coerce" converts anything that can't be parsed into NaN instead of crashing. This is the safe way to clean columns like "$1,200" or "N/A" that can't be directly cast with astype().
Before: After pd.to_numeric(..., errors="coerce"):
"1200" → 1200.0
"$850" → NaN ← couldn't parse
"700.5" → 700.5
"N/A" → NaN ← couldn't parse
Convert to datetime:
df["Date"] = pd.to_datetime(df["Date"])
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y") # specify format if needed
Once a column is proper datetime, you unlock sorting by date, filtering by year/month, calculating time differences, and much more.
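A small sketch of what a proper datetime column unlocks — the .dt accessor and date arithmetic:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2024-01-15", "2024-06-30"]))
print(dates.dt.year.tolist())            # [2024, 2024]
print(dates.dt.month.tolist())           # [1, 6]
print((dates.max() - dates.min()).days)  # 167 days between the two dates
```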
Step 6 — Clean Strings
String columns are almost always the messiest part of a real dataset. Pandas provides a full .str accessor with everything you need.
Strip unwanted characters from edges:
df["Name"] = df["Name"].str.strip() # remove spaces from both sides
df["Name"] = df["Name"].str.strip(" ./-") # remove spaces, dots, slashes, dashes
df["Name"] = df["Name"].str.lstrip() # left side only
df["Name"] = df["Name"].str.rstrip() # right side only
Replace characters anywhere in the string:
df["Phone"] = df["Phone"].str.replace(r"[a-zA-Z]", "", regex=True) # remove all letters
df["Phone"] = df["Phone"].str.replace(r"\D", "", regex=True) # keep digits only
df["Name"] = df["Name"].str.replace(r"[./]", "", regex=True) # remove dots and slashes
Standardize text case:
df["Name"] = df["Name"].str.lower() # all lowercase → "alice"
df["Name"] = df["Name"].str.upper() # all uppercase → "ALICE"
df["Name"] = df["Name"].str.title() # title case → "Alice"
Check text content:
df[df["Email"].str.contains("@")] # filter valid-looking emails
df[df["Phone"].str.len() != 12] # find rows where phone length is wrong
df[df["Name"].str.startswith("A")] # names starting with A
df[df["Name"].str.endswith("e")] # names ending with e
Extract a pattern using regex:
df["Zip"] = df["Address"].str.extract(r"(\d{5})") # pull 5-digit zip code
df["Area_Code"] = df["Phone"].str.extract(r"^(\d{3})") # pull first 3 digits
Split one column into multiple columns:
df[["Street", "State", "Zip"]] = (
df["Address"]
.str.split(",", n=2, expand=True)
)
n=2 limits the operation to at most 2 splits, producing at most 3 parts no matter how many commas the string contains; rows with fewer separators get None in the trailing columns. expand=True spreads those parts into separate columns.
Before: After:
"123 Main St, Texas, 75001" → Street="123 Main St" State="Texas" Zip="75001"
Step 7 — Replace and Standardize Values
Raw data often has the same value written multiple ways: "Y", "Yes", "yes", "y" — all meaning the same thing. Standardize them.
Replace whole cell values using a dictionary:
df["Status"] = df["Status"].replace({"Y": "Yes", "N": "No", "y": "Yes"})
# Apply to multiple columns at once
df[["Paying_Customer", "Do_Not_Contact"]] = (
df[["Paying_Customer", "Do_Not_Contact"]]
.replace({"Y": "Yes", "N": "No", "y": "Yes", "n": "No"})
)
Overwrite invalid values using loc[]:
df.loc[df["Age"] < 0, "Age"] = None # negative age → invalid
df.loc[df["Age"] > 120, "Age"] = None # age over 120 → invalid
df.loc[df["Price"] < 0, "Price"] = None # negative price → invalid
Keep valid values, replace everything else with NaN:
df["Salary"] = df["Salary"].where(df["Salary"] > 0)
# Rows where Salary <= 0 become NaN; the rest are untouched
Step 8 — Handle Outliers
An outlier is a value that exists but shouldn't — a score of 110/100, an age of 200, a price of -50. clip() is the clean way to enforce valid ranges.
Cap values within a valid range:
df["Score"] = df["Score"].clip(0, 100) # 110 → 100, -10 → 0
df["Age"] = df["Age"].clip(0, 120) # 200 → 120, -5 → 0
df["Age"] = np.clip(df["Age"], 0, 120) # NumPy version — same result
Before: After clip(0, 100):
85 85
110 → 100 ← capped at max
-10 → 0 ← capped at min
72 72
clip() is gentler than dropping rows — it preserves the row but brings the outlier within the valid boundary. Use it when the row has useful data in other columns that you don't want to lose.
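The before/after above, runnable:

```python
import pandas as pd

scores = pd.Series([85, 110, -10, 72])
capped = scores.clip(0, 100)  # values below 0 become 0, above 100 become 100
print(capped.tolist())  # [85, 100, 0, 72]
```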
Step 9 — Drop Rows and Columns
Sometimes the cleanest fix is simply to remove what you don't need.
Drop rows by index:
df.drop(0) # remove row at index 0
df.drop([0, 2, 5]) # remove multiple rows by index
Drop columns:
df.drop(columns=["Age", "Phone"]) # preferred style
df.drop("Age", axis=1) # axis=1 means "columns"
df.drop(["Age", "Phone"], axis=1) # multiple columns
Step 10 — Create New Columns
Cleaning isn't only about removing bad data — it's also about building the right structure for analysis.
Fixed value:
df["Source"] = "CSV Import" # tag every row with its data source
Calculated from another column:
df["Price_With_Tax"] = df["Price"] * 1.15
df["Name_Length"] = df["Name"].str.len()
Based on a condition:
df["Status"] = np.where(df["Age"] >= 18, "Adult", "Minor")
Based on multiple conditions:
conditions = [df["Score"] >= 90, df["Score"] >= 70, df["Score"] >= 50]
choices = ["A", "B", "C"]
df["Grade"] = np.select(conditions, choices, default="F")
Custom logic with apply():
df["Name"] = df["Name"].apply(lambda x: x.title() if isinstance(x, str) else x)
def categorize_price(price):
if price > 1000: return "Expensive"
elif price > 500: return "Medium"
else: return "Cheap"
df["Category"] = df["Price"].apply(categorize_price)
Step 11 — Add New Rows
Add a single row using concat():
new_row = pd.DataFrame([{"Name": "Frank", "Age": 28, "Phone": "055-111-2222"}])
df = pd.concat([df, new_row], ignore_index=True)
Add multiple rows at once:
new_rows = pd.DataFrame([
{"Name": "Grace", "Age": 31},
{"Name": "Hank", "Age": 24},
])
df = pd.concat([df, new_rows], ignore_index=True)
ignore_index=True resets the index after concatenation so there are no gaps or repeated numbers:
Without ignore_index: With ignore_index=True:
0 Alice 0 Alice
1 Bob → 1 Bob
0 Grace 2 Grace ← clean
1 Hank 3 Hank ← clean
Step 12 — Reset the Index
After dropping rows, the index has gaps. Reset it before analysis or before passing data to another function.
df = df.reset_index(drop=True)
drop=True discards the old index numbers entirely. drop=False keeps the old index as a new column — useful if you need to trace where a row originally came from.
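A minimal sketch of the gap appearing and being repaired:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Carol"]})
df = df.drop(1)                 # index is now [0, 2] — it has a gap
df = df.reset_index(drop=True)  # index back to [0, 1]
print(df.index.tolist())  # [0, 1]
```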
Putting It All Together — A Real Cleaning Pipeline
Here's what a complete cleaning workflow looks like from start to finish:
import pandas as pd
import numpy as np
df = pd.read_csv("data.csv")
# 1. Inspect
print(df.info())
print(df.isna().sum())
print(df.describe())
# 2. Fix column names
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
# 3. Drop useless columns and duplicates
df.drop(columns=["not_useful"], inplace=True)
df.drop_duplicates(inplace=True)
# 4. Handle missing values
df.dropna(subset=["phone_number"], inplace=True)
df["email"].fillna("unknown@unknown.com", inplace=True)
df["sales"] = df["sales"].interpolate()
# 5. Fix data types
df["date"] = pd.to_datetime(df["date"])
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["zip_code"] = df["zip_code"].astype(str)
# 6. Clean strings
df["name"] = df["name"].str.strip().str.title()
df["phone_number"] = (
df["phone_number"]
.str.replace(r"\D", "", regex=True)
.str.replace(r"(\d{3})(\d{3})(\d{4})", r"\1-\2-\3", regex=True)
.replace("", "000-000-0000")
)
# 7. Fix invalid values and outliers
df.loc[df["age"] < 0, "age"] = None
df["score"] = df["score"].clip(0, 100)
# 8. Standardize categories
df["status"] = df["status"].replace({"Y": "Yes", "N": "No", "y": "Yes"})
# 9. Filter rows
df = df[df["do_not_contact"] != "Yes"]
# 10. Reset index
df = df.reset_index(drop=True)
print(df.info())
print(df.head())
Now Combine — Merging & Joining DataFrames
Once each source is clean, the next challenge is combining them. In the real world, data almost never lives in a single file. You might have orders in one table, customers in another, and products in a third. Merging is how you bring those pieces together.
Pandas provides three main tools for this:
| Tool | What it does |
|---|---|
| `merge()` | Joins two DataFrames by matching on a shared column |
| `concat()` | Stacks DataFrames vertically or side by side |
| `join()` | Joins on the index instead of a column |
We'll use these two DataFrames for all the examples:
customers = pd.DataFrame({
"ID": [1, 2, 3, 4],
"Name": ["Alice", "Bob", "Carol", "Dave"],
"Country": ["USA", "UK", "India", "USA"],
})
orders = pd.DataFrame({
"ID": [1, 2, 3, 5],
"Product": ["Laptop", "Phone", "Tablet", "Monitor"],
"Price": [1200, 800, 600, 400],
})
Notice that ID=4 (Dave) exists in customers but not in orders, and ID=5 exists in orders but not in customers. This is exactly the kind of mismatch you encounter in practice — and why the how= parameter matters so much.
1️⃣ merge() — The Main Tool
merge() works like a SQL JOIN. It aligns two DataFrames by finding rows where a shared column has the same value.
Basic syntax:
customers.merge(orders, how="inner", on="ID")
The how= Parameter — All Four Types
# INNER → only rows that match in both DataFrames (default)
customers.merge(orders, how="inner", on="ID")
# Result: IDs 1, 2, 3 only — Dave (4) and Monitor (5) are excluded
# LEFT → all rows from the left + matched rows from the right
customers.merge(orders, how="left", on="ID")
# Result: all 4 customers — Dave gets NaN for Product and Price
# RIGHT → all rows from the right + matched rows from the left
customers.merge(orders, how="right", on="ID")
# Result: all 4 orders — Monitor (ID=5) gets NaN for Name and Country
# OUTER → all rows from both, NaN where no match
customers.merge(orders, how="outer", on="ID")
# Result: all 5 unique IDs — gaps filled with NaN on both sides
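The row counts for all four join types can be verified directly on the sample frames (`sizes` is just an illustrative name):

```python
import pandas as pd

customers = pd.DataFrame({"ID": [1, 2, 3, 4], "Name": ["Alice", "Bob", "Carol", "Dave"]})
orders = pd.DataFrame({"ID": [1, 2, 3, 5], "Product": ["Laptop", "Phone", "Tablet", "Monitor"]})

# count result rows per join type
sizes = {how: len(customers.merge(orders, how=how, on="ID"))
         for how in ["inner", "left", "right", "outer"]}
print(sizes)  # {'inner': 3, 'left': 4, 'right': 4, 'outer': 5}
```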
| `how=` | Keeps from Left | Keeps from Right | NaN Possible |
|---|---|---|---|
| `inner` | Matched only | Matched only | No |
| `left` | All rows | Matched only | Right side |
| `right` | Matched only | All rows | Left side |
| `outer` | All rows | All rows | Both sides |
CROSS merge — every row with every row:
customers.merge(orders, how="cross")
# Produces 4 × 4 = 16 rows — every customer paired with every order
# No on= parameter — it will raise an error if you include it
The on= Parameter — What to Match On
# Match on a single column
customers.merge(orders, how="inner", on="ID")
# Match on multiple columns — every listed column must match simultaneously
# (requires both DataFrames to contain every listed column — the sample orders frame has no Country)
customers.merge(orders, how="inner", on=["ID", "Country"])
# Auto-detect — Pandas finds all shared column names automatically
customers.merge(orders)
Be careful with auto-detect. If both DataFrames happen to share a column that you don't intend to join on (like a generic "Name" column), Pandas will include it in the match condition and produce unexpected results. Explicit is safer.
Handling Shared Column Names — _x and _y
If both DataFrames have a column with the same name, but you only join on ID, Pandas keeps both copies and renames them automatically:
# Both DataFrames have a "Name" column, but you only merge on "ID"
customers.merge(orders, how="inner", on="ID")
# Result: Name_x (from customers), Name_y (from orders)
Three ways to fix this:
# Option 1 — merge on all shared columns so no duplicates appear
customers.merge(orders, how="inner", on=["ID", "Name"])
# Option 2 — use custom suffixes instead of _x and _y
customers.merge(orders, how="inner", on="ID", suffixes=("_customer", "_order"))
# Option 3 — drop the shared column from one DataFrame before merging
orders_clean = orders.drop(columns=["Name"])
customers.merge(orders_clean, how="inner", on="ID")
Other Useful Parameters
# left_on / right_on — when the key column has different names in each DataFrame
# (CustomerID and OrderID are illustrative names, not columns of the sample data)
customers.merge(orders, how="inner", left_on="CustomerID", right_on="OrderID")
# indicator — adds a column showing where each row came from
customers.merge(orders, how="outer", on="ID", indicator=True)
# _merge column values: "left_only", "right_only", "both"
The indicator=True parameter is especially useful for auditing: after a merge, you can immediately see which rows matched, which came only from the left, and which came only from the right.
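A small audit sketch, using trimmed versions of the sample frames so every `_merge` category appears:

```python
import pandas as pd

customers = pd.DataFrame({"ID": [1, 2, 4], "Name": ["Alice", "Bob", "Dave"]})
orders = pd.DataFrame({"ID": [1, 2, 5], "Product": ["Laptop", "Phone", "Monitor"]})

audit = customers.merge(orders, how="outer", on="ID", indicator=True)
counts = audit["_merge"].value_counts().to_dict()
print(counts)  # IDs 1 and 2 matched; 4 came only from the left, 5 only from the right
```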
2️⃣ concat() — Stacking DataFrames
While merge() combines DataFrames horizontally by matching values, concat() simply stacks them — either rows on top of rows, or columns side by side.
pd.concat([df1, df2]) # stack vertically (default)
pd.concat([df1, df2], axis=1) # stack horizontally
axis= Parameter
# axis=0 → stack rows vertically (append one DataFrame below another)
monthly_data = pd.concat([january, february, march], axis=0)
# axis=1 → stack columns side by side
combined = pd.concat([names_df, scores_df], axis=1)
join= Parameter
# outer (default) → keep all columns, NaN where a column doesn't exist in one source
pd.concat([df1, df2], join="outer")
# inner → keep only columns that exist in all DataFrames
pd.concat([df1, df2], join="inner")
ignore_index= Parameter
# Without ignore_index — original indices are preserved, which can repeat
pd.concat([df1, df2])
# Index: 0, 1, 2, 0, 1, 2 ← 0, 1, 2 repeats
# With ignore_index=True — fresh sequential index
pd.concat([df1, df2], ignore_index=True)
# Index: 0, 1, 2, 3, 4, 5 ← clean
Always use ignore_index=True after stacking unless you specifically need the original indices.
keys= Parameter
# Label each source so you can identify where each row came from
combined = pd.concat([df1, df2], keys=["source_a", "source_b"])
This creates a MultiIndex where the first level is your label and the second is the original row index — useful when you need to trace data back to its source after combining.
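A minimal sketch of the resulting MultiIndex:

```python
import pandas as pd

jan = pd.DataFrame({"Sales": [100, 200]})
feb = pd.DataFrame({"Sales": [300]})
combined = pd.concat([jan, feb], keys=["jan", "feb"])

# level 0 is the source label, level 1 the original row index
print(combined.index.get_level_values(0).tolist())  # ['jan', 'jan', 'feb']
print(combined.loc["feb"])  # only February's rows
```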
verify_integrity= Parameter
# Raises an error if any duplicate index values are found after concatenation
pd.concat([df1, df2], verify_integrity=True)
merge() vs concat() — When to Use Which
| | `merge()` | `concat()` |
|---|---|---|
| Direction | Horizontal (by matching) | Vertical or horizontal (by stacking) |
| Needs a shared key column | Yes | No |
| Best for | Joining related tables | Combining same-structure data |
| Equivalent to | SQL JOIN | SQL UNION |
A simple rule: if you're combining columns from different tables, use merge(). If you're stacking more rows of the same structure, use concat().
3️⃣ join() — Index-Based Joining
join() works like merge(), but it joins on the index instead of a regular column. You need to set the index first.
customers_indexed = customers.set_index("ID")
orders_indexed = orders.set_index("ID")
customers_indexed.join(orders_indexed, how="inner")
The how= options are the same as merge(): inner, left, right, outer.
Why merge() Is Almost Always Better
| | `merge()` | `join()` |
|---|---|---|
| Join on any column | ✅ Yes | ❌ No — index only |
| Join on index | ✅ Yes | ✅ Yes |
| Requires `set_index()` first | ❌ No | ✅ Yes |
| Simpler syntax | ✅ Yes | ⚠️ Extra steps |
Use join() only when your data is already indexed — for example, after a groupby() that produced a meaningful index. Otherwise, merge() is cleaner and more flexible.
Common Patterns
Clean first, then merge
The most important workflow: clean each source independently before combining them.
# Clean customers
customers["Name"] = customers["Name"].str.strip().str.title()
customers = customers.drop_duplicates(subset=["ID"])
# Clean orders
orders["Price"] = pd.to_numeric(orders["Price"], errors="coerce")
orders = orders.dropna(subset=["Price"])
# Now merge
result = customers.merge(orders, how="left", on="ID")
Merge and immediately filter
# Keep only rows that matched in both DataFrames (inner), then filter by price
result = customers.merge(orders, how="inner", on="ID")
result = result[result["Price"] > 500]
Combine monthly files with concat, then analyze
import glob
all_files = glob.glob("sales_*.csv")
monthly_dfs = [pd.read_csv(f) for f in all_files]
full_year = pd.concat(monthly_dfs, ignore_index=True)
# Now clean and analyze the full year
full_year["Date"] = pd.to_datetime(full_year["Date"])
full_year.drop_duplicates(inplace=True)
Complete Summary Table
| Function | Category | Purpose |
|---|---|---|
| `df.info()` | Inspection | Column types and null counts |
| `df.describe()` | Inspection | Statistical summary |
| `df.isna().sum()` | Inspection | Count missing values per column |
| `df.value_counts()` | Inspection | Frequency of each unique value |
| `df.duplicated()` | Inspection | Identify duplicate rows |
| `df.fillna()` | Missing Values | Fill NaN with a value |
| `df.dropna()` | Missing Values | Remove rows with NaN |
| `df.interpolate()` | Missing Values | Estimate NaN from neighbors |
| `df.drop_duplicates()` | Duplicates | Remove duplicate rows |
| `df.drop()` | Structure | Remove rows or columns |
| `df.rename()` | Structure | Rename specific columns |
| `df.columns.str.*` | Structure | Clean all column names at once |
| `df.astype()` | Types | Change column data type |
| `pd.to_numeric()` | Types | Convert to number, coerce errors |
| `pd.to_datetime()` | Types | Convert to datetime |
| `str.strip()` | Strings | Remove characters from edges |
| `str.replace()` | Strings | Replace characters in text |
| `str.split()` | Strings | Split into multiple columns |
| `str.extract()` | Strings | Extract regex pattern |
| `str.contains()` | Strings | Filter rows by text match |
| `str.len()` | Strings | Get text length per row |
| `df.replace()` | Values | Replace whole cell values |
| `df.loc[cond, col]` | Values | Overwrite values matching condition |
| `df.where()` | Values | Keep valid values, NaN the rest |
| `df.clip()` | Outliers | Cap values within a valid range |
| `np.where()` | New Columns | Column based on one condition |
| `np.select()` | New Columns | Column based on multiple conditions |
| `df.apply()` | New Columns | Custom function per row or column |
| `df.reset_index()` | Structure | Reset index after dropping rows |
| `merge()` | Combining | Join two DataFrames by matching column |
| `concat()` | Combining | Stack DataFrames vertically or horizontally |
| `join()` | Combining | Join two DataFrames by index |
This is Part 3 of the Pandas for Data Science series. Next up: Part 4 — Data Visualization with Matplotlib & Seaborn
References
- GitHub Repo : https://github.com/Hu8MA/Mastering-Pandas-Reference
- Website : https://pandas.pydata.org/
- GitHub Library : https://github.com/pandas-dev/pandas
- Course : https://youtu.be/Mdq1WWSdUtw