DEV Community

Cover image for pandas #4: Merging
Gabriela Trindade
Gabriela Trindade

Posted on • Edited on

pandas #4: Merging

Now that we learned about reading files and basic DataFrame operations from the second post, and aggregation and grouping from the third one, it's time to learn about merging.

But, what is merge? In simple words, merge (a.k.a. "joining") is a database-style join operation by columns or indexes to join two dataframes. Let's understand when to use merge and how it works.

When do we need to use merge operation?

As we know, in real-life data projects is common you get more than one table storing the data. Although you get multiple tables, they can match in some way and give us more details and insights about each observation of a table, and with merge we can do that. But why doesn't all this information come in only one table? There are a few reasons why it's better to have multiples data tables, some of them are: it avoids redundancy of data, it saves some disk space, it makes it easier to manage the data, it allows your query to be faster because the table is smaller, etc.

How does it works?

pandas provides merge operation that allows us to combine DataFrames. It's almost the same as SQL's join operations. So, to get all the information from different tables together in only one table we can use merge.

Here we're going to practice merge operation with the last dataset we used in the last post, but with modifications - so you need to download this new version (restaurant_orders_v2.csv) - and one more new dataset (restaurant_products_price.csv). Then I strongly recommend you to check the section Before we start to download the datasets we are going to work on.

Like in the last post, this one will be more practical so I would appreciate if you code along with me. I try to do these posts independent from each other, here I'm considering that you'll create a new Jupyter Notebook, then I'm going to declare some variables again.

If you have any questions please feel free to use the comments section below.

Note [1]: I made a Jupyter Notebook available on my GitHub with the code used in this post.

Before we start

  1. In order to work with pandas, we're going to import this library in the beginning.

     import pandas as pd
    
  2. Let's download the two datasets we are going to work on.
    Note [2]: This first file is not the same as my previous post. Download this new file version.

     !wget https://raw.githubusercontent.com/gabrielatrindade/blog-posts-pandas-series/master/restaurant_orders_v2.csv
    

    Alt Text

     !wget https://raw.githubusercontent.com/gabrielatrindade/blog-posts-pandas-series/master/restaurant_products_price.csv
    

    Alt Text

  3. Now, let's turn them into dataframes and store each one in a variable.

     column_names_orders = ['order_number', 'order_date', 'item_name', 'quantity']
     orders_v2 = pd.read_csv('restaurant_orders_v2.csv',
                             delimiter=',', names=column_names_orders)
    
     column_names_prices = ['item_name', 'product_price']
     products_price = pd.read_csv('restaurant_products_price.csv',
                                  delimiter=',', names=column_names_prices)
    

    If you would like to know details about this process, you can go through the second post in Reading a File section.

  4. Check these dataframes and try to understand the information that each one has.

     orders_v2.head()
    

    Alt Text

    orders_v2 dataframe is just a new version of the dataframe we used on the last post. The modification in this second version is about a huge number of values that were deleted to reduce our dataset and one column that was removed from the first dataframe version, so now we have a dataframe with 4 columns, which are: order_number, order_date, item_name, and quantity. I believe that the orders_v2 column names are self-explanatory, but if you have any questions about the dataset, let me know through the comments.

     products_price.head()
    

    Alt Text

    As you can see there are just two columns in this products_price dataframe, and they represent the product name (item_name) and the product price (product_price), respectively.

Merging Operation

Now that we discussed the motivation of merge operation, let's going to deep into it, to know the different types and learn how to apply it on the datasets you downloaded.

Overview

Basically, there are 4 types of merge: outer join, inner join, left join and right join. In the image below we can see the purpose of each one.

Alt Text

This information you will pass in how parameter.

Another important parameter to know is right_on and left_on. They allocate the column from the right DataFrame and the one from left DataFrame, respectively, which will match the values. If you don't pass these parameters, pandas will merge through those columns that have the same name.

Note [3]: In case you don't understand, no worries! It's going to be more clear with the code examples.

Inner: intersection of dataframes

An inner join will get only the rows (observations) from the keys that have in both dataframes.

By default merge pandas operation consider inner join. Then you can omit this information and the result will be the same. And because we have columns in each dataframe with the same name, pandas will consider them as the keys and will merge through them. Then you can also omit this information.

Let's merge our two dataframes and take just the purchases that the item_name is registered in both tables. Check it below.

    orders_v2.merge(products_price)

Alt Text
Alt Text

    orders_v2.merge(products_price, how='inner',
                    right_on='item_name', left_on='item_name')

Alt Text
Alt Text

Note [4]: I recommend you use how, right_on, and left_on , in this way your code will be more explicit.

As you can see we have just 25 observations, remember that the inner operation will join observations that are present in the two datasets.

Outer: union of dataframes

An outer join will get all the rows (observations) independently if the keys are present in just one dataframe, regardless of whether there is a match. In case the key is present in one dataframe the columns of another dataframe will be filled by NaN values.

Let's merge our two dataframes and take all the purchases independently if the item_name is registered in both tables. Here you will see that we have more observations (36), but some of them will appear with NaN values.

    orders_v2.merge(products_price, how='outer',
                    right_on='item_name', left_on='item_name')

Alt Text
Alt Text

Right: keys from right dataframe

A right join will get just the rows (observations) from the keys of the right dataframe, regardless of whether there is a match. If the key has no information on the left dataframe, the rest of the columns will be filled by NaN values.

Let's merge our two dataframes and take all the information from the products_price dataframe independently if the item_name was bought in some order on the orders_v2 dataframe. In this case, you will see that some products will appear with their values but there are no orders to them.

    orders_v2.merge(products_price, how='right',
                    right_on='item_name', left_on='item_name')

Alt Text
Alt Text

Left: keys from left dataframe

A left join will get just the rows (observations) from the keys of the left dataframe, regardless of whether there is a match. If the key has no information on the right dataframe, the rest of the columns will be filled by NaN values.

Let's merge our two dataframes and take all the purchases from the orders_v2 dataframe independently if the item_name is registered in the products_price dataframe. In this case, we will have some orders with products that have no prices registered.

    orders_v2.merge(products_price, how='left',
                    right_on='item_name', left_on='item_name')

Alt Text
Alt Text

merge vs join

By default the join method considers the index to join dataframes or a specific column from the dataframe that it's called on (left dataframe). Then it means that the column from the left dataframe doesn't have to be an index, but for the right dataframe, the key must be its index. In general, I would say that the join method is based on the index.

Otherwise, by default, the merge method will look for overlapping columns in which to merge on, unless we attribute True value to right_index or left_index parameters. merge allows controlling over merge keys through right_on and left_on parameters, as we saw in this post. merge is useful when we don’t want to join on the index. Differently from join, it will return a combined dataframe in which the original index will be destroyed.

Wrapping up

This was the fourth post of my pandas series where I could show to you that merge is a join operation that is very useful. We learned about some reasons to use it.

Also, we could see how it works. Through some practical examples, we also saw the basic parameters of merge operation and what are the default ones.

One of these parameters was how that allocates the merge type. We learned that there are 4 types of merge that allows us to get a set of our joining dataframes that makes more sense for us. The 4 merges are inner (by default), outer, right and left.

Other parameters were right_on and left_on that are about the columns that will represent the keys in each dataframe.

Finally, we learned a little bit about the difference between merge and join in pandas. And we saw that the biggest difference is that join is based on indexes while merge is based on columns.

In the next post, we'll see more about Data Wrangling. See you there!

Alt Text

Top comments (0)