
Nirmal

How I used a simple Python script to compare two huge CSV files using Pandas

Recently I came across a requirement to compare a column of data in one CSV file against another CSV file. The files contain 13 columns and about 65 million rows.

Initially I thought it was a simple task and wrote a basic bash script to process the files line by line. Then I realized it would take hours to finish. I moved on to Node.js and then to Python, but no matter which language I used, the result was the same: it took too long, and the regular line-by-line approach just wasn't worth it. Then I turned to the Pandas library and gave it a try. To my surprise, it finished within 10 seconds.

The requirement is basically to match the emailId column in file1.csv against the MD5 email hash column in file2.csv.
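
To make the matching concrete: "match" here means hashing the plain-text email with MD5 and comparing hex digests. A quick illustration (the address is a made-up example):

import hashlib

# Hypothetical email value, just for illustration
email = "user@example.com"

# MD5 expects bytes, so encode the string first; hexdigest() gives
# the 32-character hex string that file2.csv stores
print(hashlib.md5(email.encode()).hexdigest())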

Here is the snippet I used to split the data into matched and unmatched records.

import pandas as pd
import hashlib

def md5hash(mailId):
    # MD5 works on bytes, so encode the string before hashing;
    # hexdigest() returns the 32-character hex string
    return hashlib.md5(mailId.encode()).hexdigest()

df = pd.read_csv("file1.csv", dtype={"emailId": "string"}, low_memory=False)

# First get the hash of the emailId and append it as a new column
df['md5_hashed'] = df['emailId'].map(md5hash)
df.to_csv("file1_with_hashed_email.csv", index=False)

# Store the data in DataFrames
df1 = pd.read_csv("file1_with_hashed_email.csv", low_memory=False)
df2 = pd.read_csv("file2.csv")
df2_hash = df2["md5_emailId"]

# Rows of file1 whose hashed email appears in file2
matched = df1.loc[df1['md5_hashed'].isin(df2_hash)]
matched.to_csv("matched.csv", index=False)

# Just add a ~ to filter the unmatched rows
unmatched = df1.loc[~df1['md5_hashed'].isin(df2_hash)]
unmatched.to_csv("non-matched.csv", index=False)

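One caveat with files this size: reading 65 million rows into memory at once may not fit on every machine. If memory becomes the bottleneck, pandas can stream the file in chunks. A minimal sketch, assuming the same file and column names as above (the chunk size of one million rows is just a starting point to tune):

import pandas as pd

# Build the lookup set once; a Python set makes isin() membership tests fast
df2_hash = set(pd.read_csv("file2.csv")["md5_emailId"])

# Stream file1 a million rows at a time instead of loading it whole
first = True
for chunk in pd.read_csv("file1_with_hashed_email.csv",
                         low_memory=False, chunksize=1_000_000):
    matched = chunk.loc[chunk['md5_hashed'].isin(df2_hash)]
    # Append each chunk's matches, writing the header only once
    matched.to_csv("matched.csv", mode="w" if first else "a",
                   header=first, index=False)
    first = False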

I had worked with Pandas before, but this was when I really saw its power: it saved me hours compared with parsing in Node.js or database-based solutions.
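
If you also need file2's columns alongside each match, an inner merge on the hash column works too; a minimal sketch, assuming the same column names as above:

import pandas as pd

df1 = pd.read_csv("file1_with_hashed_email.csv", low_memory=False)
df2 = pd.read_csv("file2.csv")

# Inner merge keeps only the rows whose hash appears in both files,
# and carries file2's columns along with each match
joined = df1.merge(df2, left_on="md5_hashed", right_on="md5_emailId", how="inner")
joined.to_csv("matched_with_file2_columns.csv", index=False)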

Hope this helps someone facing a similar scenario.
