DEV Community

Nirmal
Nirmal

Posted on

4 2

How i used a simple python script to compare 2 huge csv file using Pandas

Recently i came across a requirement to compare a column data in a csv file with another csv file. These files contains 13 columns with 65 million of rows.

Initially i thought it's simple one and used basic scripting with bash to process line by lines. Then, i realized it took me hours to get that processed. Then i moved to nodejs and then to python. No matter what language i used the result is same. It took more time and i felt it's not worth to process using the regular approach. Then i turned towards pandas library and i tried. To my surprise it processed within 10 seconds.

The requirement is basically to match the emailId in the file1.csv against the md5 email hash of the file2.csv.

Here the snippet i used to compare the matched dataset.

import pandas as pd

import hashlib 

def md5hash(mailId):
    print(mailId)
    result = hashlib.md5(mailId)
    return result.digest()

df=pd.read_csv("file1.csv",dtype={ "emailId": "string"},low_memory=False)

# First get the has of the emailId and append to the existing row
df['md5_hashed'] = [hashlib.md5(val.encode()).hexdigest() for val in df['emailId']]
df.to_csv("hashed_records.csv",index=False)


#Store the data in a dataframe
df1=pd.read_csv("file1_with_hashed_email.csv",low_memory=False)
df2=pd.read_csv("file2.csv")
df2_hash=df2["md5_emailId"]

matched=df.loc[df1['md5_hashed'].isin(df2_hash)]
matched.to_csv("non-matched.csv",index=False)

#Just add a ~ to filter the unmatched
unmatched=df.loc[~df['md5_hashed'].isin(df2_hash)]
unmatched.to_csv("non-matched.csv",index=False)

Enter fullscreen mode Exit fullscreen mode

I worked with Pandas earlier. But i realized the power of pandas which has saved my hours instead of parsing with nodejs/db based solutions.

Hope this helps someone who faced similar scenarios.

AWS Security LIVE!

Join us for AWS Security LIVE!

Discover the future of cloud security. Tune in live for trends, tips, and solutions from AWS and AWS Partners.

Learn More

Top comments (0)

A Workflow Copilot. Tailored to You.

Pieces.app image

Our desktop app, with its intelligent copilot, streamlines coding by generating snippets, extracting code from screenshots, and accelerating problem-solving.

Read the docs