Your orders table has 50,000 rows.
Your customers table has 8,000 rows.
Your products table has 300 rows.
The answer to your question lives across all three: which customers in which cities bought which products at what price last month.
No single table has everything. You need to combine them. That is what this post is about.
Three Tools, Three Situations
Pandas gives you three main ways to combine DataFrames.
merge is for combining tables that share a common column, like a SQL join. This is what you will use 80% of the time.
concat is for stacking DataFrames together, either adding more rows or more columns. Use it when you have data split across multiple files with the same structure.
join is a convenience wrapper around merge that works on the index. Useful sometimes, but merge is more explicit and flexible.
merge: The Core Operation
import pandas as pd
customers = pd.DataFrame({
"customer_id": [1, 2, 3, 4, 5],
"name": ["Alex", "Priya", "Sam", "Jordan", "Lisa"],
"city": ["Mumbai", "Delhi", "Bangalore", "Mumbai", "Chennai"]
})
orders = pd.DataFrame({
"order_id": [101, 102, 103, 104, 105, 106],
"customer_id": [1, 2, 1, 3, 6, 4],
"amount": [2500, 4200, 1800, 3100, 900, 5500],
"product": ["Laptop", "Phone", "Headphones", "Tablet", "Mouse", "Monitor"]
})
print("Customers:")
print(customers)
print("\nOrders:")
print(orders)
Output:
Customers:
customer_id name city
0 1 Alex Mumbai
1 2 Priya Delhi
2 3 Sam Bangalore
3 4 Jordan Mumbai
4 5 Lisa Chennai
Orders:
order_id customer_id amount product
0 101 1 2500 Laptop
1 102 2 4200 Phone
2 103 1 1800 Headphones
3 104 3 3100 Tablet
4 105 6 900 Mouse
5 106 4 5500 Monitor
Notice that order 105 has customer_id = 6, which does not exist in the customers table. And customer 5 (Lisa) has no orders. These edge cases are where join types matter.
Inner Join: Only What Matches
inner = pd.merge(customers, orders, on="customer_id", how="inner")
print(inner)
Output:
customer_id name city order_id amount product
0 1 Alex Mumbai 101 2500 Laptop
1 1 Alex Mumbai 103 1800 Headphones
2 2 Priya Delhi 102 4200 Phone
3 3 Sam Bangalore 104 3100 Tablet
4 4 Jordan Mumbai 106 5500 Monitor
Five rows. Customer 5 (Lisa, no orders) is gone. Order 105 (customer_id 6, no customer) is gone. Only rows that had a match in both tables survived.
Inner join is the default. When in doubt, start here.
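Since inner is the default, leaving how out entirely gives the same result. A quick sketch (rebuilding trimmed-down versions of the toy tables) to confirm:

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "name": ["Alex", "Priya", "Sam", "Jordan", "Lisa"],
})
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105, 106],
    "customer_id": [1, 2, 1, 3, 6, 4],
})

# how="inner" is the default, so these two calls produce identical results
explicit = pd.merge(customers, orders, on="customer_id", how="inner")
implicit = pd.merge(customers, orders, on="customer_id")
print(explicit.equals(implicit))  # True
```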
Left Join: Keep All Rows From the Left
left = pd.merge(customers, orders, on="customer_id", how="left")
print(left)
Output:
customer_id name city order_id amount product
0 1 Alex Mumbai 101.0 2500.0 Laptop
1 1 Alex Mumbai 103.0 1800.0 Headphones
2 2 Priya Delhi 102.0 4200.0 Phone
3 3 Sam Bangalore 104.0 3100.0 Tablet
4 4 Jordan Mumbai 106.0 5500.0 Monitor
5 5 Lisa Chennai NaN NaN NaN
Lisa is back. She has no orders, so her order columns are NaN. Every customer from the left table appears at least once. Left join is what you want when you need a complete list of left-table records regardless of whether they have matches.
The order with customer_id 6 is still gone because it had no match in the left (customers) table.
Right Join: Keep All Rows From the Right
right = pd.merge(customers, orders, on="customer_id", how="right")
print(right)
Output:
customer_id name city order_id amount product
0 1.0 Alex Mumbai 101 2500 Laptop
1 1.0 Alex Mumbai 103 1800 Headphones
2 2.0 Priya Delhi 102 4200 Phone
3 3.0 Sam Bangalore 104 3100 Tablet
4 6.0 NaN NaN 105 900 Mouse
5 4.0 Jordan Mumbai 106 5500 Monitor
All orders appear. Order 105 (mystery customer_id 6) is back with NaN for name and city. Lisa is gone because she is only in the left table and right join keeps all rows from the right.
Right join is just a left join with the tables swapped. Most people swap the tables and use a left join instead. Less confusing.
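A minimal sketch of that equivalence, using two-row toy tables; the row sets match even though the column order differs:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Alex", "Priya"]})
orders = pd.DataFrame({"order_id": [101, 105], "customer_id": [1, 6]})

right = pd.merge(customers, orders, on="customer_id", how="right")
# The same rows, expressed as a left join with the tables swapped
swapped = pd.merge(orders, customers, on="customer_id", how="left")

# Column order differs, so normalize before comparing
right_norm = right[sorted(right.columns)].sort_values("order_id").reset_index(drop=True)
swapped_norm = swapped[sorted(swapped.columns)].sort_values("order_id").reset_index(drop=True)
print(right_norm.equals(swapped_norm))  # True
```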
Outer Join: Keep Everything
outer = pd.merge(customers, orders, on="customer_id", how="outer")
print(outer)
Output:
customer_id name city order_id amount product
0 1.0 Alex Mumbai 101.0 2500.0 Laptop
1 1.0 Alex Mumbai 103.0 1800.0 Headphones
2 2.0 Priya Delhi 102.0 4200.0 Phone
3 3.0 Sam Bangalore 104.0 3100.0 Tablet
4 4.0 Jordan Mumbai 106.0 5500.0 Monitor
5 5.0 Lisa Chennai NaN NaN NaN
6 6.0 NaN NaN 105.0 900.0 Mouse
Everything from both tables. Unmatched rows from either side get NaN in the other side's columns. Lisa is there with NaN orders. The mystery customer is there with NaN customer details.
Use outer join when you want to audit your data. Finding NaNs after an outer join shows you rows that had no match, which often means data quality problems.
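One way to make that audit explicit is merge's indicator parameter, which adds a _merge column recording where each row came from. A small sketch with toy tables:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 5], "name": ["Alex", "Priya", "Lisa"]})
orders = pd.DataFrame({"order_id": [101, 105], "customer_id": [1, 6]})

# indicator=True adds a _merge column: "both", "left_only", or "right_only"
audit = pd.merge(customers, orders, on="customer_id", how="outer", indicator=True)
print(audit[audit["_merge"] != "both"])  # only the unmatched rows
```

Filtering on _merge is faster and less error-prone than eyeballing NaNs.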
Merging on Different Column Names
Tables do not always share column names even when they share data.
products = pd.DataFrame({
"product_name": ["Laptop", "Phone", "Headphones", "Tablet", "Mouse", "Monitor"],
"category": ["Electronics", "Electronics", "Electronics", "Electronics", "Accessories", "Electronics"],
"cost_price": [35000, 20000, 1500, 18000, 500, 12000]
})
orders_with_product = pd.merge(
orders,
products,
left_on="product",
right_on="product_name",
how="left"
)
print(orders_with_product[["order_id", "product", "amount", "category", "cost_price"]])
Output:
order_id product amount category cost_price
0 101 Laptop 2500 Electronics 35000
1 102 Phone 4200 Electronics 20000
2 103 Headphones 1800 Electronics 1500
3 104 Tablet 3100 Electronics 18000
4 105 Mouse 900 Accessories 500
5 106 Monitor 5500 Electronics 12000
left_on and right_on let you specify different column names from each table. The result includes both columns (product and product_name). Clean that up:
orders_with_product = orders_with_product.drop(columns=["product_name"])
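A related wrinkle: when the two tables share non-key column names, merge keeps both copies and disambiguates them with suffixes (_x and _y by default). The suffixes parameter lets you pick clearer names. A sketch with made-up monthly score tables:

```python
import pandas as pd

jan = pd.DataFrame({"name": ["Alex", "Priya"], "score": [88, 92]})
feb = pd.DataFrame({"name": ["Alex", "Priya"], "score": [91, 89]})

# Both tables have a "score" column; suffixes keep the copies apart
scores = pd.merge(jan, feb, on="name", suffixes=("_jan", "_feb"))
print(scores.columns.tolist())  # ['name', 'score_jan', 'score_feb']
```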
Merging on Multiple Columns
Sometimes one key column is not enough to uniquely identify a match.
sales = pd.DataFrame({
"year": [2023, 2023, 2024, 2024],
"quarter": ["Q1", "Q2", "Q1", "Q2"],
"revenue": [150000, 180000, 210000, 195000]
})
targets = pd.DataFrame({
"year": [2023, 2023, 2024, 2024],
"quarter": ["Q1", "Q2", "Q1", "Q2"],
"target": [140000, 170000, 200000, 220000]
})
combined = pd.merge(sales, targets, on=["year", "quarter"])
combined["hit_target"] = combined["revenue"] >= combined["target"]
print(combined)
Output:
year quarter revenue target hit_target
0 2023 Q1 150000 140000 True
1 2023 Q2 180000 170000 True
2 2024 Q1 210000 200000 True
3 2024 Q2 195000 220000 False
Two columns as the key. Year alone is not unique. Quarter alone is not unique. Year plus quarter together uniquely identify each row.
concat: Stacking DataFrames
Use concat when you have multiple DataFrames with the same structure that you want to combine.
Adding more rows (axis=0):
batch1 = pd.DataFrame({
"name": ["Alex", "Priya", "Sam"],
"score": [88, 92, 76],
"month": ["Jan", "Jan", "Jan"]
})
batch2 = pd.DataFrame({
"name": ["Jordan", "Lisa", "Ravi"],
"score": [85, 91, 79],
"month": ["Feb", "Feb", "Feb"]
})
all_scores = pd.concat([batch1, batch2], ignore_index=True)
print(all_scores)
Output:
name score month
0 Alex 88 Jan
1 Priya 92 Jan
2 Sam 76 Jan
3 Jordan 85 Feb
4 Lisa 91 Feb
5 Ravi 79 Feb
ignore_index=True resets the index to 0 through 5. Without it, you get 0, 1, 2, 0, 1, 2, and label-based lookups become ambiguous.
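The duplicate labels are not just cosmetic. A quick sketch of what goes wrong if you skip ignore_index:

```python
import pandas as pd

batch1 = pd.DataFrame({"name": ["Alex", "Priya", "Sam"], "score": [88, 92, 76]})
batch2 = pd.DataFrame({"name": ["Jordan", "Lisa", "Ravi"], "score": [85, 91, 79]})

stacked = pd.concat([batch1, batch2])  # index is 0, 1, 2, 0, 1, 2
print(stacked.loc[0])  # returns TWO rows (Alex and Jordan), not one
```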
Loading twelve months of CSV files into one DataFrame:
import os
monthly_files = ["jan.csv", "feb.csv", "mar.csv"]
frames = []
for file in monthly_files:
if os.path.exists(file):
df = pd.read_csv(file)
frames.append(df)
if frames:
full_year = pd.concat(frames, ignore_index=True)
This pattern is everywhere: load multiple files, append to a list, concat once at the end. Never concat inside the loop; calling concat on every iteration creates a brand-new DataFrame each time and gets very slow on large files.
Adding more columns (axis=1):
names = pd.DataFrame({"name": ["Alex", "Priya", "Sam"]})
scores = pd.DataFrame({"math": [88, 92, 76], "science": [85, 90, 71]})
combined = pd.concat([names, scores], axis=1)
print(combined)
Output:
name math science
0 Alex 88 85
1 Priya 92 90
2 Sam 76 71
axis=1 adds columns side by side. Rows align by index, so make sure both DataFrames have matching indices before using this.
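To see that alignment in action, here is a sketch where the second frame has a shifted index. The result grows to the union of the two indices and fills the gaps with NaN:

```python
import pandas as pd

names = pd.DataFrame({"name": ["Alex", "Priya", "Sam"]})        # index 0, 1, 2
scores = pd.DataFrame({"math": [88, 92, 76]}, index=[1, 2, 3])  # index 1, 2, 3

# Only index labels 1 and 2 line up; 0 and 3 get NaN on one side
misaligned = pd.concat([names, scores], axis=1)
print(misaligned)
```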
Validating Your Merge
Before trusting a merged result, always check the row count.
print(f"Left rows: {len(customers)}")
print(f"Right rows: {len(orders)}")
print(f"Inner merge: {len(inner)}")
print(f"Left merge: {len(left)}")
print(f"Outer merge: {len(outer)}")
Output:
Left rows: 5
Right rows: 6
Inner merge: 5
Left merge: 6
Outer merge: 7
If inner merge gives you more rows than either original table, you have a many-to-many relationship and your key is not unique. That is usually a sign of duplicate keys or wrong join columns. Check with:
print(customers["customer_id"].duplicated().sum())
print(orders["customer_id"].duplicated().sum())
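You can also ask merge to enforce the relationship you expect with the validate parameter; if the keys violate it, pandas raises a MergeError instead of silently multiplying rows. A minimal sketch:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Alex", "Priya"]})
orders = pd.DataFrame({"order_id": [101, 103], "customer_id": [1, 1]})

# Passes: customer_id is unique on the left, may repeat on the right
ok = pd.merge(customers, orders, on="customer_id", validate="one_to_many")

# Fails: customer_id repeats on the right, so one_to_one is violated
try:
    pd.merge(customers, orders, on="customer_id", validate="one_to_one")
except pd.errors.MergeError as e:
    print("validation failed:", e)
```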
Also check for unexpected NaNs after a merge. After a left join, NaNs in the right table's columns mean those left rows had no match. Sometimes that is expected. Sometimes it means your keys do not align the way you thought.
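A quick way to count those unmatched rows after a left join, sketched with trimmed-down toy tables:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 5], "name": ["Alex", "Priya", "Lisa"]})
orders = pd.DataFrame({"order_id": [101, 102], "customer_id": [1, 2]})

merged = pd.merge(customers, orders, on="customer_id", how="left")
# NaN in a right-table column means that left row had no match
unmatched = merged["order_id"].isna().sum()
print(f"{unmatched} customer(s) with no matching order")  # 1 (Lisa)
```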
A Full Three-Table Join
result = (
customers
.merge(orders, on="customer_id", how="left")
.merge(products, left_on="product", right_on="product_name", how="left")
.drop(columns=["product_name"])
)
result["profit"] = result["amount"] - result["cost_price"]
summary = result.groupby("city").agg(
total_revenue=("amount", "sum"),
total_profit=("profit", "sum"),
num_orders=("order_id", "count")
).sort_values("total_revenue", ascending=False)
print(summary)
Output:
total_revenue total_profit num_orders
city
Mumbai 9800.0 -38700.0 3
Delhi 4200.0 -15800.0 1
Bangalore 3100.0 -14900.0 1
Chennai 0.0 0.0 0
Chennai shows zeros because Lisa has no orders: the left join kept her row, and summing her NaN values gives 0. (The profits are deeply negative only because the toy order amounts are much smaller than the cost prices.)
Three tables joined. Profit calculated. Grouped by city. This is a complete business analysis pipeline in about ten lines.
A Blog Worth Reading
The official pandas user guide has a page called "Merge, join, concatenate and compare" that walks through every join type with visual examples and edge cases. Very thorough. Search it by that title.
Shane Lynn wrote a widely shared tutorial on merging and joining DataFrames that covers the gotchas around many-to-many joins and index alignment. His blog at shanelynn.ie has been a go-to pandas reference for years. Search "Shane Lynn pandas merge".
Try This
Create joins_practice.py.
You have three tables. Create them yourself.
employees = pd.DataFrame({
"emp_id": [1, 2, 3, 4, 5, 6],
"name": ["Alex", "Priya", "Sam", "Jordan", "Lisa", "Ravi"],
"dept_id": [10, 20, 10, 30, 20, 10],
"salary": [55000, 82000, 43000, 95000, 67000, 71000]
})
departments = pd.DataFrame({
"dept_id": [10, 20, 30, 40],
"dept_name": ["Engineering", "Marketing", "Sales", "HR"],
"budget": [500000, 300000, 400000, 200000]
})
projects = pd.DataFrame({
"project_id": [101, 102, 103, 104],
"emp_id": [1, 3, 2, 7],
"project": ["AI Platform", "Data Pipeline", "Campaign Tool", "Legacy System"],
"hours": [120, 80, 95, 60]
})
Do all of this:
Inner join employees with departments on dept_id. Which employee is in which department?
Left join the result with projects. Which employees have no projects? (Look for NaN in project column.)
Right join employees with projects. Which projects have no valid employee? (emp_id 7 does not exist.)
Using the full joined table, calculate total salary cost per department.
Find the department that is over budget. Total salary should not exceed the department budget.
Concat employees with a new batch of employees. Add three new people with dept_ids that already exist. Reset the index. Verify the shape.
What's Next
Three posts left in Phase 3. Next is visualization. You have been staring at numbers in tables. Time to turn them into charts, distributions, and visual patterns that make insights obvious in seconds.