import pandas as pd
We start with two data frames like those:
df1 = pd.DataFrame(
{
"id_left": [1, 2, 3],
"first_name": ["Alice", "Bob", "Charlie"],
"last_name": ["Smith", "Jones", "Brown"],
}
)
df2 = pd.DataFrame(
{
"id_right": [1, 2, 3],
"last_name": ["SmithX", "JonesX", "BrownX"],
"age": [25, 30, 35],
}
)
Specifically:
df1
id_left | first_name | last_name | |
---|---|---|---|
0 | 1 | Alice | Smith |
1 | 2 | Bob | Jones |
2 | 3 | Charlie | Brown |
And,
df2
id_right | last_name | age | |
---|---|---|---|
0 | 1 | SmithX | 25 |
1 | 2 | JonesX | 30 |
2 | 3 | BrownX | 35 |
Next, we want to merge these two data frames. But, note that the column last_name
appears in both AND they contain different values. This can happen when the left table is obtained using one processing flow and the right table using another flow. This way or the other, the described situation leads to an ambiguity which in turn leads to a challenge. For example, in Spark
you will need to be very careful when working with the resulting data frame.
Let's merge:
df1.merge(df2, left_on="id_left", right_on="id_right", how="left")
id_left | first_name | last_name_x | id_right | last_name_y | age | |
---|---|---|---|---|---|---|
0 | 1 | Alice | Smith | 1 | SmithX | 25 |
1 | 2 | Bob | Jones | 2 | JonesX | 30 |
2 | 3 | Charlie | Brown | 3 | BrownX | 35 |
Nice! The columns with the identical names were suffixed with _x
and _y
respectively. This is thanks to the default values of the parameter suffixes
. You can find it in the documentation. Personally, I find it rather smart solution that adheres to the Python principal of "explicit is better than implicit".
What do you think?
Top comments (0)