Akhilesh


Combining Datasets: Merge, Join, Concat

Your orders table has 50,000 rows.

Your customers table has 8,000 rows.

Your products table has 300 rows.

The answer to your question lives across all three: which customers in which cities bought which products, at what price, last month.

No single table has everything. You need to combine them. That is what this post is about.


Three Tools, Three Situations

Pandas gives you three main ways to combine DataFrames.

merge is for combining tables that share a common column, like a SQL join. This is what you will use 80% of the time.

concat is for stacking DataFrames together, either adding more rows or more columns. Use it when you have data split across multiple files with the same structure.

join is a convenience wrapper around merge that works on the index. Useful sometimes, but merge is more explicit and flexible.
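If you do reach for join, the mental model is "merge on the index". A minimal sketch with toy frames (the data here is made up for illustration):

```python
import pandas as pd

# join aligns on the index, so put the key into the index first
customers = pd.DataFrame({"name": ["Alex", "Priya"]},
                         index=pd.Index([1, 2], name="customer_id"))
orders = pd.DataFrame({"amount": [2500, 4200]},
                      index=pd.Index([2, 1], name="customer_id"))

joined = customers.join(orders)  # left join on the shared index
# the explicit merge equivalent
same = customers.merge(orders, left_index=True, right_index=True, how="left")
print(joined)
```

Note how the amounts line up by index value, not row position.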


merge: The Core Operation

import pandas as pd
import numpy as np

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "name":        ["Alex", "Priya", "Sam", "Jordan", "Lisa"],
    "city":        ["Mumbai", "Delhi", "Bangalore", "Mumbai", "Chennai"]
})

orders = pd.DataFrame({
    "order_id":    [101, 102, 103, 104, 105, 106],
    "customer_id": [1, 2, 1, 3, 6, 4],
    "amount":      [2500, 4200, 1800, 3100, 900, 5500],
    "product":     ["Laptop", "Phone", "Headphones", "Tablet", "Mouse", "Monitor"]
})

print("Customers:")
print(customers)
print("\nOrders:")
print(orders)

Output:

Customers:
   customer_id   name       city
0            1   Alex     Mumbai
1            2  Priya      Delhi
2            3    Sam  Bangalore
3            4 Jordan     Mumbai
4            5   Lisa    Chennai

Orders:
   order_id  customer_id  amount     product
0       101            1    2500      Laptop
1       102            2    4200       Phone
2       103            1    1800  Headphones
3       104            3    3100      Tablet
4       105            6     900       Mouse
5       106            4    5500     Monitor

Notice that order 105 has customer_id = 6, which does not exist in the customers table, and that customer 5 (Lisa) has no orders. These edge cases are where join types matter.


Inner Join: Only What Matches

inner = pd.merge(customers, orders, on="customer_id", how="inner")
print(inner)

Output:

   customer_id   name       city  order_id  amount     product
0            1   Alex     Mumbai       101    2500      Laptop
1            1   Alex     Mumbai       103    1800  Headphones
2            2  Priya      Delhi       102    4200       Phone
3            3    Sam  Bangalore       104    3100      Tablet
4            4 Jordan     Mumbai       106    5500     Monitor

Five rows. Customer 5 (Lisa, no orders) is gone. Order 105 (customer_id 6, no customer) is gone. Only rows that had a match in both tables survived.

Inner join is the default. When in doubt, start here.


Left Join: Keep All Rows From the Left

left = pd.merge(customers, orders, on="customer_id", how="left")
print(left)

Output:

   customer_id   name       city  order_id  amount     product
0            1   Alex     Mumbai     101.0  2500.0      Laptop
1            1   Alex     Mumbai     103.0  1800.0  Headphones
2            2  Priya      Delhi     102.0  4200.0       Phone
3            3    Sam  Bangalore     104.0  3100.0      Tablet
4            4 Jordan     Mumbai     106.0  5500.0     Monitor
5            5   Lisa    Chennai       NaN     NaN         NaN

Lisa is back. She has no orders, so her order columns are NaN. Every customer from the left table appears at least once. Left join is what you want when you need a complete list of left-table records regardless of whether they have matches.

The order with customer_id 6 is still gone because it had no match in the left (customers) table.
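A practical follow-up: after a left join, filtering for NaN in a right-side column gives you exactly the left rows that had no match. A small sketch with cut-down versions of the tables above:

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "name": ["Alex", "Priya", "Sam", "Jordan", "Lisa"],
})
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105, 106],
    "customer_id": [1, 2, 1, 3, 6, 4],
})

left = pd.merge(customers, orders, on="customer_id", how="left")

# NaN in order_id marks customers with no matching order
no_orders = left[left["order_id"].isna()]
print(no_orders["name"].tolist())
```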


Right Join: Keep All Rows From the Right

right = pd.merge(customers, orders, on="customer_id", how="right")
print(right)

Output:

   customer_id    name       city  order_id  amount     product
0            1    Alex     Mumbai       101    2500      Laptop
1            2   Priya      Delhi       102    4200       Phone
2            1    Alex     Mumbai       103    1800  Headphones
3            3     Sam  Bangalore       104    3100      Tablet
4            6     NaN        NaN       105     900       Mouse
5            4  Jordan     Mumbai       106    5500     Monitor

All orders appear. Order 105 (mystery customer_id 6) is back with NaN for name and city. Lisa is gone because she is only in the left table and right join keeps all rows from the right.

A right join is just a left join with the tables swapped. Most people simply swap the tables and use a left join instead, which is less confusing.
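You can verify the equivalence yourself. The sketch below (toy frames; rows and columns are normalised before comparing, since the column order differs between the two calls) checks that a right join and a swapped left join return the same data:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Alex", "Priya"]})
orders = pd.DataFrame({"order_id": [101, 102, 103], "customer_id": [1, 6, 2]})

right = pd.merge(customers, orders, on="customer_id", how="right")
swapped = pd.merge(orders, customers, on="customer_id", how="left")

# same rows either way; normalise column and row order before comparing
a = right[sorted(right.columns)].sort_values("order_id").reset_index(drop=True)
b = swapped[sorted(swapped.columns)].sort_values("order_id").reset_index(drop=True)
print(a.equals(b))
```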


Outer Join: Keep Everything

outer = pd.merge(customers, orders, on="customer_id", how="outer")
print(outer)

Output:

   customer_id   name       city  order_id  amount     product
0          1.0   Alex     Mumbai     101.0  2500.0      Laptop
1          1.0   Alex     Mumbai     103.0  1800.0  Headphones
2          2.0  Priya      Delhi     102.0  4200.0       Phone
3          3.0    Sam  Bangalore     104.0  3100.0      Tablet
4          4.0 Jordan     Mumbai     106.0  5500.0     Monitor
5          5.0   Lisa    Chennai       NaN     NaN         NaN
6          6.0    NaN        NaN     105.0   900.0       Mouse

Everything from both tables. Unmatched rows from either side get NaN in the other side's columns. Lisa is there with NaN orders. The mystery customer is there with NaN customer details.

Use outer join when you want to audit your data. Finding NaNs after an outer join shows you rows that had no match, which often means data quality problems.
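For audits, merge also takes indicator=True, which adds a _merge column recording where each row came from (both, left_only, or right_only). A small sketch with cut-down frames:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 5],
                          "name": ["Alex", "Priya", "Lisa"]})
orders = pd.DataFrame({"customer_id": [1, 2, 6],
                       "order_id": [101, 102, 105]})

audit = pd.merge(customers, orders, on="customer_id", how="outer", indicator=True)
print(audit["_merge"].value_counts())

# anything that is not "both" had no match on the other side
unmatched = audit[audit["_merge"] != "both"]
```

This is often faster than eyeballing NaNs, especially on wide tables.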


Merging on Different Column Names

Tables do not always share column names even when they share data.

products = pd.DataFrame({
    "product_name": ["Laptop", "Phone", "Headphones", "Tablet", "Mouse", "Monitor"],
    "category":     ["Electronics", "Electronics", "Electronics", "Electronics", "Accessories", "Electronics"],
    "cost_price":   [35000, 20000, 1500, 18000, 500, 12000]
})

orders_with_product = pd.merge(
    orders,
    products,
    left_on="product",
    right_on="product_name",
    how="left"
)

print(orders_with_product[["order_id", "product", "amount", "category", "cost_price"]])

Output:

   order_id     product  amount     category  cost_price
0       101      Laptop    2500  Electronics       35000
1       102       Phone    4200  Electronics       20000
2       103  Headphones    1800  Electronics        1500
3       104      Tablet    3100  Electronics       18000
4       105       Mouse     900  Accessories         500
5       106     Monitor    5500  Electronics       12000

left_on and right_on let you specify different column names from each table. The result includes both columns (product and product_name). Clean that up:

orders_with_product = orders_with_product.drop(columns=["product_name"])
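A related gotcha not shown above: when both tables share a non-key column name, merge keeps both copies and suffixes them _x and _y by default. The suffixes parameter gives them readable names. A sketch with made-up price tables:

```python
import pandas as pd

retail = pd.DataFrame({"product": ["Laptop", "Phone"],
                       "price": [60000, 30000]})
wholesale = pd.DataFrame({"product": ["Laptop", "Phone"],
                          "price": [35000, 20000]})

# without suffixes= you would get price_x and price_y
merged = pd.merge(retail, wholesale, on="product",
                  suffixes=("_retail", "_wholesale"))
print(merged.columns.tolist())
```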

Merging on Multiple Columns

Sometimes one key column is not enough to uniquely identify a match.

sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [150000, 180000, 210000, 195000]
})

targets = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "target":  [140000, 170000, 200000, 220000]
})

combined = pd.merge(sales, targets, on=["year", "quarter"])
combined["hit_target"] = combined["revenue"] >= combined["target"]
print(combined)

Output:

   year quarter  revenue  target  hit_target
0  2023      Q1   150000  140000        True
1  2023      Q2   180000  170000        True
2  2024      Q1   210000  200000        True
3  2024      Q2   195000  220000       False

Two columns as the key. Year alone is not unique. Quarter alone is not unique. Year plus quarter together uniquely identify each row.
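To see why the composite key matters, the sketch below merges the same frames on year alone: every quarter row matches every same-year row and the result doubles in size:

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [150000, 180000, 210000, 195000],
})
targets = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "target":  [140000, 170000, 200000, 220000],
})

right_key = pd.merge(sales, targets, on=["year", "quarter"])  # 4 rows
wrong_key = pd.merge(sales, targets, on="year")               # 8 rows: 2x2 per year
print(len(right_key), len(wrong_key))
```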


concat: Stacking DataFrames

Use concat when you have multiple DataFrames with the same structure that you want to combine.

Adding more rows (axis=0):

batch1 = pd.DataFrame({
    "name":   ["Alex", "Priya", "Sam"],
    "score":  [88, 92, 76],
    "month":  ["Jan", "Jan", "Jan"]
})

batch2 = pd.DataFrame({
    "name":   ["Jordan", "Lisa", "Ravi"],
    "score":  [85, 91, 79],
    "month":  ["Feb", "Feb", "Feb"]
})

all_scores = pd.concat([batch1, batch2], ignore_index=True)
print(all_scores)

Output:

     name  score month
0    Alex     88   Jan
1   Priya     92   Jan
2     Sam     76   Jan
3  Jordan     85   Feb
4    Lisa     91   Feb
5    Ravi     79   Feb

ignore_index=True resets the index to 0 through 5. Without it, the result keeps the original indices: 0, 1, 2, 0, 1, 2, which causes confusion later.
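One edge case worth knowing: when the frames do not have identical columns, concat keeps the union of columns and fills the gaps with NaN. A toy sketch:

```python
import pandas as pd

a = pd.DataFrame({"name": ["Alex"], "score": [88]})
b = pd.DataFrame({"name": ["Ravi"], "grade": ["B"]})

# union of columns; missing values become NaN
stacked = pd.concat([a, b], ignore_index=True)
print(stacked)
```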

Loading twelve months of CSV files into one DataFrame:

import os

monthly_files = ["jan.csv", "feb.csv", "mar.csv"]
frames = []

for file in monthly_files:
    if os.path.exists(file):
        df = pd.read_csv(file)
        frames.append(df)

if frames:
    full_year = pd.concat(frames, ignore_index=True)

This pattern is everywhere: load each file, append it to a list, and concat once at the end. Never concat inside the loop; that creates a new DataFrame on every iteration and gets very slow on large files.
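If the monthly files follow a naming pattern, glob can build the list for you. A sketch assuming a hypothetical data/ directory (the path is made up; adjust it to your layout):

```python
import glob

import pandas as pd

# hypothetical layout: data/2024-01.csv, data/2024-02.csv, ...
files = sorted(glob.glob("data/*.csv"))
frames = [pd.read_csv(f) for f in files]

# still one concat at the end, never one per file
full_year = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```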

Adding more columns (axis=1):

names = pd.DataFrame({"name": ["Alex", "Priya", "Sam"]})
scores = pd.DataFrame({"math": [88, 92, 76], "science": [85, 90, 71]})

combined = pd.concat([names, scores], axis=1)
print(combined)

Output:

    name  math  science
0   Alex    88       85
1  Priya    92       90
2    Sam    76       71

axis=1 adds columns side by side. Rows align by index, so make sure both DataFrames have matching indices before using this.
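To see the alignment trap in action, here is a toy sketch where one frame has a shifted index:

```python
import pandas as pd

names = pd.DataFrame({"name": ["Alex", "Priya", "Sam"]})        # index 0, 1, 2
scores = pd.DataFrame({"math": [88, 92, 76]}, index=[1, 2, 3])  # shifted index

misaligned = pd.concat([names, scores], axis=1)
print(misaligned)  # 4 rows, NaN where the indices do not overlap

# reset both indices first if you mean "first row with first row"
fixed = pd.concat([names.reset_index(drop=True),
                   scores.reset_index(drop=True)], axis=1)
```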


Validating Your Merge

Before trusting a merged result, always check the row count.

print(f"Left rows:   {len(customers)}")
print(f"Right rows:  {len(orders)}")
print(f"Inner merge: {len(inner)}")
print(f"Left merge:  {len(left)}")
print(f"Outer merge: {len(outer)}")

Output:

Left rows:   5
Right rows:  6
Inner merge: 5
Left merge:  6
Outer merge: 7

If inner merge gives you more rows than either original table, you have a many-to-many relationship and your key is not unique. That is usually a sign of duplicate keys or wrong join columns. Check with:

print(customers["customer_id"].duplicated().sum())
print(orders["customer_id"].duplicated().sum())
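You can also make merge enforce the relationship you expect with the validate parameter; it raises MergeError instead of silently multiplying rows. A sketch with toy frames:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Alex", "Priya", "Sam"]})
orders = pd.DataFrame({"order_id": [101, 102, 103],
                       "customer_id": [1, 1, 2]})

# "one_to_many": the left side's key must be unique
ok = pd.merge(customers, orders, on="customer_id", how="left",
              validate="one_to_many")

# duplicate a customer row and the same merge now raises
dup_customers = pd.concat([customers, customers.iloc[[0]]], ignore_index=True)
caught = False
try:
    pd.merge(dup_customers, orders, on="customer_id", validate="one_to_many")
except pd.errors.MergeError:
    caught = True
print("duplicate keys detected:", caught)
```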

Also check for unexpected NaNs after a merge. After a left join, NaNs in the right table's columns mean those left rows had no match. Sometimes that is expected. Sometimes it means your keys do not align the way you thought.


A Full Three-Table Join

result = (
    customers
    .merge(orders, on="customer_id", how="left")
    .merge(products, left_on="product", right_on="product_name", how="left")
    .drop(columns=["product_name"])
)

result["profit"] = result["amount"] - result["cost_price"]

summary = result.groupby("city").agg(
    total_revenue=("amount", "sum"),
    total_profit=("profit", "sum"),
    num_orders=("order_id", "count")
).sort_values("total_revenue", ascending=False)

print(summary)

Output:

           total_revenue  total_profit  num_orders
city
Mumbai            9800.0      -38700.0           3
Delhi             4200.0      -15800.0           1
Bangalore         3100.0      -14900.0           1
Chennai              0.0           0.0           0

Three tables joined. Profit calculated. Grouped by city. Chennai shows up with zeros because Lisa has no orders but the left join keeps her, and profit is negative almost everywhere because the sample order amounts sit below cost price. This is a complete business analysis pipeline in about ten lines.


A Blog Worth Reading

Towards AI has a detailed post called "Pandas Merge, Join, Concatenate and Compare" which walks through every join type with visual examples and edge cases. Very thorough. Search it by that title. The official pandas user guide also has a section with nearly the same name, "Merge, join, concatenate and compare", which is worth keeping bookmarked.

Shane Lynn wrote a widely-shared piece called "Pandas Merge Explained" that covers the gotchas around many-to-many joins and index alignment. Shane Lynn's blog at shanelynn.ie has been a go-to Pandas reference for years. Search "Shane Lynn pandas merge explained".


Try This

Create joins_practice.py.

You have three tables. Create them yourself.

employees = pd.DataFrame({
    "emp_id":     [1, 2, 3, 4, 5, 6],
    "name":       ["Alex", "Priya", "Sam", "Jordan", "Lisa", "Ravi"],
    "dept_id":    [10, 20, 10, 30, 20, 10],
    "salary":     [55000, 82000, 43000, 95000, 67000, 71000]
})

departments = pd.DataFrame({
    "dept_id":   [10, 20, 30, 40],
    "dept_name": ["Engineering", "Marketing", "Sales", "HR"],
    "budget":    [500000, 300000, 400000, 200000]
})

projects = pd.DataFrame({
    "project_id": [101, 102, 103, 104],
    "emp_id":     [1, 3, 2, 7],
    "project":    ["AI Platform", "Data Pipeline", "Campaign Tool", "Legacy System"],
    "hours":      [120, 80, 95, 60]
})

Do all of this:

Inner join employees with departments on dept_id. Which employee is in which department?

Left join the result with projects. Which employees have no projects? (Look for NaN in project column.)

Right join employees with projects. Which projects have no valid employee? (emp_id 7 does not exist.)

Using the full joined table, calculate total salary cost per department.

Find the department that is over budget. Total salary should not exceed the department budget.

Concat employees with a new batch of employees. Add three new people with dept_ids that already exist. Reset the index. Verify the shape.


What's Next

Three posts left in Phase 3. Next is visualization. You have been staring at numbers in tables. Time to turn them into charts, distributions, and visual patterns that make insights obvious in seconds.
