The Smart Way Pandas Handles Overlapping Column Names

#python #pandas

import pandas as pd

We start with two data frames like those:

df1 = pd.DataFrame(
    {
        "id_left": [1, 2, 3],
        "first_name": ["Alice", "Bob", "Charlie"],
        "last_name": ["Smith", "Jones", "Brown"],
    }
)
df2 = pd.DataFrame(
    {
        "id_right": [1, 2, 3],
        "last_name": ["SmithX", "JonesX", "BrownX"],
        "age": [25, 30, 35],
    }
)

Specifically:

df1

	id_left	first_name	last_name
0	1	Alice	Smith
1	2	Bob	Jones
2	3	Charlie	Brown

And,

df2

	id_right	last_name	age
0	1	SmithX	25
1	2	JonesX	30
2	3	BrownX	35

Next, we want to merge these two data frames. But, note that the column last_name appears in both AND they contain different values. This can happen when the left table is obtained using one processing flow and the right table using another flow. This way or the other, the described situation leads to an ambiguity which in turn leads to a challenge. For example, in Spark you will need to be very careful when working with the resulting data frame.

Let's merge:

df1.merge(df2, left_on="id_left", right_on="id_right", how="left")

	id_left	first_name	last_name_x	id_right	last_name_y	age
0	1	Alice	Smith	1	SmithX	25
1	2	Bob	Jones	2	JonesX	30
2	3	Charlie	Brown	3	BrownX	35

Nice! The columns with the identical names were suffixed with _x and _y respectively. This is thanks to the default values of the parameter suffixes. You can find it in the documentation. Personally, I find it rather smart solution that adheres to the Python principal of "explicit is better than implicit".

What do you think?

DEV Community

The Smart Way Pandas Handles Overlapping Column Names

Top comments (0)