import pandas as pd
We start with two data frames like those:
df1 = pd.DataFrame(
{
"id_left": [1, 2, 3],
"first_name": ["Alice", "Bob", "Charlie"],
"last_name": ["Smith", "Jones", "Brown"],
}
)
df2 = pd.DataFrame(
{
"id_right": [1, 2, 3],
"last_name": ["SmithX", "JonesX", "BrownX"],
"age": [25, 30, 35],
}
)
Specifically:
df1
| id_left | first_name | last_name | |
|---|---|---|---|
| 0 | 1 | Alice | Smith |
| 1 | 2 | Bob | Jones |
| 2 | 3 | Charlie | Brown |
And,
df2
| id_right | last_name | age | |
|---|---|---|---|
| 0 | 1 | SmithX | 25 |
| 1 | 2 | JonesX | 30 |
| 2 | 3 | BrownX | 35 |
Next, we want to merge these two data frames. But, note that the column last_name appears in both AND they contain different values. This can happen when the left table is obtained using one processing flow and the right table using another flow. This way or the other, the described situation leads to an ambiguity which in turn leads to a challenge. For example, in Spark you will need to be very careful when working with the resulting data frame.
Let's merge:
df1.merge(df2, left_on="id_left", right_on="id_right", how="left")
| id_left | first_name | last_name_x | id_right | last_name_y | age | |
|---|---|---|---|---|---|---|
| 0 | 1 | Alice | Smith | 1 | SmithX | 25 |
| 1 | 2 | Bob | Jones | 2 | JonesX | 30 |
| 2 | 3 | Charlie | Brown | 3 | BrownX | 35 |
Nice! The columns with the identical names were suffixed with _x and _y respectively. This is thanks to the default values of the parameter suffixes. You can find it in the documentation. Personally, I find it rather smart solution that adheres to the Python principal of "explicit is better than implicit".
What do you think?
Top comments (0)