Now that we learned about reading files and basic DataFrame operations from the second post, and aggregation and grouping from the third one, it's time to learn about merging.
But, what is merge
? In simple words, merge
(a.k.a. "joining") is a database-style join operation by columns or indexes to join two dataframes. Let's understand when to use merge
and how it works.
When do we need to use merge
operation?
As we know, in real-life data projects is common you get more than one table storing the data. Although you get multiple tables, they can match in some way and give us more details and insights about each observation of a table, and with merge
we can do that. But why doesn't all this information come in only one table? There are a few reasons why it's better to have multiples data tables, some of them are: it avoids redundancy of data, it saves some disk space, it makes it easier to manage the data, it allows your query to be faster because the table is smaller, etc.
How does it works?
pandas provides merge
operation that allows us to combine DataFrames. It's almost the same as SQL's join operations. So, to get all the information from different tables together in only one table we can use merge
.
Here we're going to practice merge
operation with the last dataset we used in the last post, but with modifications - so you need to download this new version (restaurant_orders_v2.csv
) - and one more new dataset (restaurant_products_price.csv
). Then I strongly recommend you to check the section Before we start to download the datasets we are going to work on.
Like in the last post, this one will be more practical so I would appreciate if you code along with me. I try to do these posts independent from each other, here I'm considering that you'll create a new Jupyter Notebook, then I'm going to declare some variables again.
If you have any questions please feel free to use the comments section below.
Note [1]: I made a Jupyter Notebook available on my GitHub with the code used in this post.
Before we start
-
In order to work with pandas, we're going to import this library in the beginning.
import pandas as pd
-
Let's download the two datasets we are going to work on.
Note [2]: This first file is not the same as my previous post. Download this new file version.
!wget https://raw.githubusercontent.com/gabrielatrindade/blog-posts-pandas-series/master/restaurant_orders_v2.csv
!wget https://raw.githubusercontent.com/gabrielatrindade/blog-posts-pandas-series/master/restaurant_products_price.csv
-
Now, let's turn them into dataframes and store each one in a variable.
column_names_orders = ['order_number', 'order_date', 'item_name', 'quantity'] orders_v2 = pd.read_csv('restaurant_orders_v2.csv', delimiter=',', names=column_names_orders) column_names_prices = ['item_name', 'product_price'] products_price = pd.read_csv('restaurant_products_price.csv', delimiter=',', names=column_names_prices)
If you would like to know details about this process, you can go through the second post in Reading a File section.
-
Check these dataframes and try to understand the information that each one has.
orders_v2.head()
orders_v2
dataframe is just a new version of the dataframe we used on the last post. The modification in this second version is about a huge number of values that were deleted to reduce our dataset and one column that was removed from the first dataframe version, so now we have a dataframe with 4 columns, which are:order_number
,order_date
,item_name
, andquantity
. I believe that theorders_v2
column names are self-explanatory, but if you have any questions about the dataset, let me know through the comments.
products_price.head()
As you can see there are just two columns in this
products_price
dataframe, and they represent the product name (item_name
) and the product price (product_price
), respectively.
Merging Operation
Now that we discussed the motivation of merge
operation, let's going to deep into it, to know the different types and learn how to apply it on the datasets you downloaded.
Overview
Basically, there are 4 types of merge: outer join, inner join, left join and right join. In the image below we can see the purpose of each one.
This information you will pass in how
parameter.
Another important parameter to know is right_on
and left_on
. They allocate the column from the right DataFrame and the one from left DataFrame, respectively, which will match the values. If you don't pass these parameters, pandas will merge through those columns that have the same name.
Note [3]: In case you don't understand, no worries! It's going to be more clear with the code examples.
Inner: intersection of dataframes
An inner join will get only the rows (observations) from the keys that have in both dataframes.
By default merge
pandas operation consider inner join. Then you can omit this information and the result will be the same. And because we have columns in each dataframe with the same name, pandas will consider them as the keys and will merge through them. Then you can also omit this information.
Let's merge our two dataframes and take just the purchases that the item_name
is registered in both tables. Check it below.
orders_v2.merge(products_price)
orders_v2.merge(products_price, how='inner',
right_on='item_name', left_on='item_name')
Note [4]: I recommend you use how
, right_on
, and left_on
, in this way your code will be more explicit.
As you can see we have just 25 observations, remember that the inner operation will join observations that are present in the two datasets.
Outer: union of dataframes
An outer join will get all the rows (observations) independently if the keys are present in just one dataframe, regardless of whether there is a match. In case the key is present in one dataframe the columns of another dataframe will be filled by NaN
values.
Let's merge our two dataframes and take all the purchases independently if the item_name
is registered in both tables. Here you will see that we have more observations (36), but some of them will appear with NaN
values.
orders_v2.merge(products_price, how='outer',
right_on='item_name', left_on='item_name')
Right: keys from right dataframe
A right join will get just the rows (observations) from the keys of the right dataframe, regardless of whether there is a match. If the key has no information on the left dataframe, the rest of the columns will be filled by NaN
values.
Let's merge our two dataframes and take all the information from the products_price
dataframe independently if the item_name
was bought in some order on the orders_v2
dataframe. In this case, you will see that some products will appear with their values but there are no orders to them.
orders_v2.merge(products_price, how='right',
right_on='item_name', left_on='item_name')
Left: keys from left dataframe
A left join will get just the rows (observations) from the keys of the left dataframe, regardless of whether there is a match. If the key has no information on the right dataframe, the rest of the columns will be filled by NaN
values.
Let's merge our two dataframes and take all the purchases from the orders_v2
dataframe independently if the item_name
is registered in the products_price
dataframe. In this case, we will have some orders with products that have no prices registered.
orders_v2.merge(products_price, how='left',
right_on='item_name', left_on='item_name')
merge
vs join
By default the join
method considers the index to join dataframes or a specific column from the dataframe that it's called on (left dataframe). Then it means that the column from the left dataframe doesn't have to be an index, but for the right dataframe, the key must be its index. In general, I would say that the join
method is based on the index.
Otherwise, by default, the merge
method will look for overlapping columns in which to merge on, unless we attribute True
value to right_index
or left_index
parameters. merge
allows controlling over merge keys through right_on
and left_on
parameters, as we saw in this post. merge
is useful when we don’t want to join on the index. Differently from join
, it will return a combined dataframe in which the original index will be destroyed.
Wrapping up
This was the fourth post of my pandas series where I could show to you that merge
is a join operation that is very useful. We learned about some reasons to use it.
Also, we could see how it works. Through some practical examples, we also saw the basic parameters of merge
operation and what are the default ones.
One of these parameters was how
that allocates the merge type. We learned that there are 4 types of merge that allows us to get a set of our joining dataframes that makes more sense for us. The 4 merges are inner (by default), outer, right and left.
Other parameters were right_on
and left_on
that are about the columns that will represent the keys in each dataframe.
Finally, we learned a little bit about the difference between merge
and join
in pandas. And we saw that the biggest difference is that join
is based on indexes while merge
is based on columns.
In the next post, we'll see more about Data Wrangling. See you there!
Top comments (0)