DEV Community

Paul
Paul

Posted on

JOINING TABLES WITH PANDAS. WHAT IS IT? WHY DO YOU EVEN NEED IT?

Within this resource, we’re going to explore joining tables using a popular python library called Pandas.

WHY JOIN TABLES?

There are several ways to join datasets, but the bigger question is: why bother joining tables at all?

Why not simply work with the necessary table or data frame and avoid the extra hassle?

Well, let’s imagine it this way,

As a data expert, you’re given two datasets: one containing customer information and the other detailing store transactions.

Your task is to identify customers who spent over $999 bucks in the past week to qualify for a Black Friday discount.

Keep in mind that these are separate, existing datasets.

While working directly with the transactions dataset might help identify transactions over $999 bucks, the real challenge lies in identifying who made these transactions.

What are their names and contact details?

These questions can only be answered once you bring the customers dataset into the mix.

See how it works?

Joining datasets is just as important as putting the puzzle pieces together—without it, you’re limited!

But let’s pause for a moment to tackle the one super question you’ve been waiting for.

_Why am I getting these datasets one by one—like they’re dropping clues for a treasure hunt I didn’t sign up for?

Did the data gods see me toss that banana peel into the wrong bin at the park and decide I must suffer for my sins?

Why not just serve me a combined dataset straight from the source? Is that too much to ask?_

I hear you, my friend, and I’m here to help you understand why.

In most cases, databases are designed to store various datasets with different relationship patterns—much like how the customers dataset connects to the transactions dataset.

Apart from scalability, other reasons for practicing dataset separation include:

  1. Access control for sensitive data
  2. Query efficiency
  3. Logical organization
  4. Data ownership
  5. Compliance and regulation amongst others.

TYPES OF JOINING TABLES

Alright, before we wrap things up, let’s quickly go over a few join types you’re likely to come across.

  1. Inner Join
  2. Left Join
  3. Right Join
  4. Outer Join

Don't panic, they're interesting and fun to work with.

Inner Join is one of the simplest and most popular join types, where datasets are merged based on columns that are common to both datasets.

Let’s revisit our customers and transactions datasets:

Customers Dataset
Customer_ID Name      Contact
101         Alice     alice@example.com
102         Bob       bob@example.com
103         Charlie   charlie@example.com
104         Diana     diana@example.com
105         Eve       eve@example.com
Enter fullscreen mode Exit fullscreen mode
Transactions Dataset
Transaction_ID  Customer_ID Amount  Date
201             101         1200    2024-12-01
202             103         1500    2024-12-02
203             104         999     2024-12-03
204             106         750     2024-12-04
205             107         1800    2024-12-05`
Enter fullscreen mode Exit fullscreen mode

As you can see, both datasets contain various diverse columns ranging from customer names to contact and amount to date of transaction respectively, but…at least one column shares it’s presence on both dataset making it a common factor.

And that’s the Customer_ID column.

This commonality helps us perform joins effectively.

So inner join in this case is simply merging both tables on the Customer_ID column, excluding all rows whose customer_ID have no matching values on both tables.

I’ll show you:

import pandas as pd

inner_join_result = pd.merge(customers_df, transactions_df, on="Customer_ID", how="inner")

Enter fullscreen mode Exit fullscreen mode

As demonstrated in the code snippet, performing an inner join requires importing the pandas library and using its merge() method. This method combines the datasets by specifying the on parameter, which indicates the common column to merge on (in this case, Customer_ID). The how="inner" argument specifies the join type which is "inner".

The results of the above inner join would be:

Customer_ID Name    Contact             Transaction_ID  Amount  Date
101         Alice   alice@example.com   201             1200    2024-12-01
103         Charlie charlie@example.com 202             1500    2024-12-02
104         Diana   diana@example.com   203             999 2024-12-03
Enter fullscreen mode Exit fullscreen mode

Let's analyse the result for better understanding:

  • Alice (101) appears in both datasets, so her information and similar rows such as Charlie(103) and Diana(104) are included.
  • Eve (105) and Bob(102) are in the Customers dataset but not in Transactions, so they are excluded.
  • Customer_ID 106 and 107 from the Transactions dataset are excluded because they do not match any rows in the Customers dataset.

In summary, only rows with matching customer_id is the output of our inner join.

On the other hand, rows whose Customer_ID not found on the other merging dataset are completely excluded form the resulting dataset.

how is this useful and real life application

It’s important

Too much to take in?

I’ll explain.

If you wish to explore other types of join with me such as anti-join, semi-join, self-join and many more, do give a like and read here:

Till we talk again.

ciao.

Now go join those Datasets with conviction

Top comments (0)