DEV Community

Dror Atariah
Dror Atariah

Posted on

The Smart Way Pandas Handles Overlapping Column Names

import pandas as pd
Enter fullscreen mode Exit fullscreen mode

We start with two data frames like those:

df1 = pd.DataFrame(
    {
        "id_left": [1, 2, 3],
        "first_name": ["Alice", "Bob", "Charlie"],
        "last_name": ["Smith", "Jones", "Brown"],
    }
)
df2 = pd.DataFrame(
    {
        "id_right": [1, 2, 3],
        "last_name": ["SmithX", "JonesX", "BrownX"],
        "age": [25, 30, 35],
    }
)
Enter fullscreen mode Exit fullscreen mode

Specifically:

df1
Enter fullscreen mode Exit fullscreen mode
id_left first_name last_name
0 1 Alice Smith
1 2 Bob Jones
2 3 Charlie Brown

And,

df2
Enter fullscreen mode Exit fullscreen mode
id_right last_name age
0 1 SmithX 25
1 2 JonesX 30
2 3 BrownX 35

Next, we want to merge these two data frames. But, note that the column last_name appears in both AND they contain different values. This can happen when the left table is obtained using one processing flow and the right table using another flow. This way or the other, the described situation leads to an ambiguity which in turn leads to a challenge. For example, in Spark you will need to be very careful when working with the resulting data frame.

Let's merge:

df1.merge(df2, left_on="id_left", right_on="id_right", how="left")
Enter fullscreen mode Exit fullscreen mode
id_left first_name last_name_x id_right last_name_y age
0 1 Alice Smith 1 SmithX 25
1 2 Bob Jones 2 JonesX 30
2 3 Charlie Brown 3 BrownX 35

Nice! The columns with the identical names were suffixed with _x and _y respectively. This is thanks to the default values of the parameter suffixes. You can find it in the documentation. Personally, I find it rather smart solution that adheres to the Python principal of "explicit is better than implicit".

What do you think?

Top comments (0)