It's a tale as old as time: your company has multiple instances of the same dataset which should match, but for whatever reason, don't. An obvious example that comes to mind might be discrepancies between environments: your staging environment probably falls out of sync with production often. This is an easy problem to solve because it's clear why these types of systems get out of sync. Solving the problem is simple as syncing one system to the other, usually where one system serves as the "source of truth." But what about discrepancies between systems without a source of truth? What do we do when parallel systems should have comparable data, but don't? Lastly, how can we figure out what went wrong?
Companies are prone to tons of scenarios for this exact problem, and they're usually the kind of situations that are deadly when unresolved. For example, let's say your company tracks new hires by adding them to an HR system (such as Bamboo or Workday). These new hires need access to internal company systems (like Confluence, or anything behind your company's VPN). Your company prides itself on being "scrappy," thus depends on an IT department to provision these users across systems manually. Assuming nobody has made a user error, these systems will have matching user information... until those hires are terminated.
No matter the situation, you're nearly guaranteed to run into this issue at some point. Luckily for us, the solution isn't too complicated.
Setting the Stage
The assumption here is that we’re comparing the rows in our data. We need two datasets which have matching columns, but different entries. Sticking to our employee example, I'm going to use two fake datasets containing employee information as such:
id | first_name | last_name | gender | department | |
---|---|---|---|---|---|
424 | Regine | Aberkirder | raberkirderbr@shop-pro.jp | Female | Marketing |
989 | Jessa | Acome | jacomerg@ustream.tv | Female | Business Development |
318 | Marci | Acres | macres8t@paypal.com | Female | Marketing |
996 | Olympe | Acutt | oacuttrn@flickr.com | Female | Support |
516 | Baxy | Adamides | badamideseb@google.pl | Male | Research and Development |
743 | Melba | Addey | maddeykm@vimeo.com | Female | Accounting |
207 | Willie | Addionisio | waddionisio5q@hhs.gov | Male | Human Resources |
764 | Noby | Agutter | nagutterl7@ow.ly | Male | Accounting |
950 | Avery | Ahlin | aahlinqd@ihg.com | Male | Research and Development |
434 | Richart | Akid | rakidc1@ow.ly | Male | Engineering |
863 | Kalindi | Alebrooke | kalebrookeny@biglobe.ne.jp | Female | Product Management |
250 | Marley | Alen | malen6x@fema.gov | Female | Legal |
481 | Wilfrid | Algie | walgiedc@paypal.com | Male | Product Management |
We'll make two Pandas DataFrames from these similar data sets:
df1 = pd.read_csv('data/employees1.csv')
df2 = pd.read_csv('data/employees2.csv')
Now let's get to work.
Pandas Merge With Indicators
The first piece of magic is as simple as adding a keyword argument to a Pandas "merge." When merging two DataFrames in Pandas, setting indicator=True
adds a column to the merged DataFame where the value of each row can be one of three possible values: left_only , right_only , or both :
id | first_name | last_name | gender | department | _merge | |
---|---|---|---|---|---|---|
990 | Yanaton | Tumioto | ytumiotol@ehow.com | Male | Training | right_only |
962 | Lara | Minichillo | lminichillom@photobucket.com | Female | Accounting | right_only |
976 | Reine | Haldon | rhaldonn@bbc.co.uk | Female | Services | right_only |
961 | Lyman | Marcq | lmarcqp@merriam-webster.com | Male | Engineering | right_only |
221 | Findlay | Danett | fdanettr@ibm.com | Male | Human Resources | both |
675 | Muhammad | Rimbault | mrimbaults@feedburner.com | Male | Research and Development | left_only |
692 | Mindy | Ruggen | mruggent@unc.edu | Female | Marketing | left_only |
669 | Isidoro | Reedie | ireedieu@so-net.ne.jp | Male | Business Development | left_only |
783 | Niels | Strase | nstrasev@utexas.edu | Male | Legal | both |
120 | Jock | Bucktharp | jbucktharpw@stanford.edu | Male | Research and Development | both |
734 | Leanora | Sigert | lsigertx@so-net.ne.jp | Female | Sales | both |
52 | Bea | Behneke | bbehnekey@163.com | Female | Support | both |
563 | Cindee | Moorcroft | cmoorcroftz@hao123.com | Female | Support | both |
172 | Kelvin | Clubb | kclubb10@gmpg.org | Male | Product Management | both |
362 | Stella | Hirtz | shirtz11@hibu.com | Female | Legal | both |
273 | Archibold | Eager | aeager12@macromedia.com | Male | Services | both |
As you might imagine, rows marked with a value of " both" in the _ merge column denotes rows which are common to both DataFrames._ left_only and right_only mark rows which were present in either the left or right DataFrame, respectively.
We already have a big piece of the puzzle solved. Let's build off of this to create a reusable function which returns exactly what we're looking for each time.
Creating a Code Snippet
Let's build a function called dataframe_difference()
which answers any of 4 questions:
- Which rows were only present in the first DataFrame?
- Which rows were only present in the second DataFrame?
- Which rows were present in both DataFrames?
- Which rows were not present in both DataFrames, but present in one of them?
For starters, our function dataframe_difference()
will need to be passed two DataFrames to compare. To make things interesting, let's add an optional keyword argument which allows us to return rows for each of the four scenarios above:
def dataframe_difference(df1, df2, which=None):
"""Find rows which are different between two DataFrames."""
comparison_df = df1.merge(df2,
indicator=True,
how='outer')
if which is None:
diff_df = comparison_df[comparison_df['_merge'] != 'both']
else:
diff_df = comparison_df[comparison_df['_merge'] == which]
diff_df.to_csv('data/diff.csv')
return diff_df
If we call our function without a keyword argument (ie: dataframe_difference(df1, df2)
), our function will answer question #4:
id | first_name | last_name | gender | department | _merge | |
---|---|---|---|---|---|---|
990 | Yanaton | Tumioto | ytumiotol@ehow.com | Male | Training | right_only |
962 | Lara | Minichillo | lminichillom@photobucket.com | Female | Accounting | right_only |
976 | Reine | Haldon | rhaldonn@bbc.co.uk | Female | Services | right_only |
961 | Lyman | Marcq | lmarcqp@merriam-webster.com | Male | Engineering | right_only |
675 | Muhammad | Rimbault | mrimbaults@feedburner.com | Male | Research and Development | left_only |
692 | Mindy | Ruggen | mruggent@unc.edu | Female | Marketing | left_only |
669 | Isidoro | Reedie | ireedieu@so-net.ne.jp | Male | Business Development | left_only |
746 | Rhianon | Sloan | rsloan16@ovh.net | Female | Sales | left_only |
952 | Jennilee | Palle | jpalle19@state.gov | Female | Marketing | right_only |
970 | Nedda | Grieger | ngrieger1c@quantcast.com | Female | Marketing | right_only |
As expected, values in the merge column contain left_only and right_only , but not both. The opposite of this would be answering question #3: which rows were present in both DataFrames? To accomplish this, we can pass which='both'
to our function. dataframe_difference(df1, df2, which='both')
would look something like this:
id | first_name | last_name | gender | department | _merge | |
---|---|---|---|---|---|---|
863 | Marcia | Whalebelly | mwhalebelly1f@jugem.jp | Female | Accounting | both |
470 | Averell | Letteresse | aletteresse1g@slashdot.org | Male | Accounting | both |
125 | Berri | Bullough | bbullough1h@comsenz.com | Female | Product Management | both |
258 | Vassili | Drinkwater | vdrinkwater1i@posterous.com | Male | Legal | both |
701 | Laetitia | Sarah | lsarah1j@nationalgeographic.com | Female | Human Resources | both |
507 | Hendrika | Malec | hmalec1k@ifeng.com | Female | Support | both |
163 | Haskell | Chiplen | hchiplen1l@house.gov | Male | Engineering | both |
90 | Gardiner | Bonhome | gbonhome1m@163.com | Male | Human Resources | both |
705 | Arley | Scapens | ascapens1n@sitemeter.com | Male | Product Management | both |
127 | Dacey | Burrett | dburrett1o@nydailynews.com | Female | Training | both |
Passing which='left_only'
and which='right_only'
do exactly what you'd expect.
Feel free to take this snippet and use it as your own. I've added it to Github as a Gist here.
BONUS ROUND: Uploading Missing Rows to SQL Table
We can easily apply the function we just created to help us sync rows between two database tables. If we want to make sure rows between two SQL tables match, we can do something like this:
df1 = pd.read_sql_table('table1',
con=engine,
index_col='id')
df2 = pd.read_sql_table('table2',
con=engine,
index_col='id')
diff_df = dataframe_difference(df1, df2)
diff_df.to_sql('table2',
con=self.engine,
index=True,
index_label='id',
if_exists='append')
Pandas' to_sql()
method has a nifty keyword argument called if_exists. If we set if_exists='append'
when using to_sql()
, Pandas will append rows from a DataFrame to an existing SQL table. Since we used our dataframe_difference()
function to find which rows were different, we were able to ensure that we only uploaded rows that were different.
Top comments (0)