Introducing Pandas-Log: a package for debugging pandas operations

Eyal Trabelsi — Fri, 25 Oct 2019 05:03:57 +0000

The pandas ecosystem has been invaluable to data science, and today most data science tasks consist of a series of pandas steps that transform raw data into an understandable, usable format.

The accuracy of these steps is crucial, so understanding unexpected results is crucial as well. Unfortunately, the ecosystem lacks tools for understanding those unexpected results.

That’s why I created Pandas-log. It provides metadata on each operation, which makes it possible to pinpoint issues. For example, after .query it reports the number of rows that were filtered.
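To make the idea of per-operation metadata concrete, here is a rough sketch in plain pandas. It is not how Pandas-log is implemented; `query_with_count` is a made-up helper and the toy data is invented purely for illustration.

```python
import pandas as pd


def query_with_count(df: pd.DataFrame, expr: str) -> pd.DataFrame:
    """Hypothetical helper: run df.query and print how many rows it removed."""
    result = df.query(expr)
    removed = len(df) - len(result)
    print(f"query({expr!r}): removed {removed} rows, {len(result)} rows remaining")
    return result


# Invented toy data, only to exercise the helper.
toy = pd.DataFrame({"name": ["A", "B", "C"], "total": [300, 450, 600]})
strong = query_with_count(toy, "total > 400")
```

Doing this by hand for every step of a long chain gets tedious quickly, which is the gap the package aims to fill.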

As always, I believe it’s easier to understand with an example, so I will use the pokemon dataset to find out “who is the weakest non-legendary fire pokemon?”.

So who is the weakest fire pokemon?

(Link to the Notebook code can be found here)
First, we will import relevant packages and read our pokemon dataset.

import pandas as pd
import numpy as np
import pandas_log
df = pd.read_csv("pokemon.csv")
df.head(10)

To answer our question, “who is the weakest non-legendary fire pokemon?”, we will need to:

Filter out legendary pokemon using .query().
Keep only fire pokemon using .query().
Drop the legendary column using .drop().
Keep the weakest pokemon among them using .nsmallest().

In code, it will look something like:

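res = (df.copy()
         .query("legendary==0")                      # filter out legendary pokemon
         .query("type_1=='fire' or type_2=='fire'")  # keep only fire pokemon
         .drop("legendary", axis=1)                  # drop the legendary column
         .nsmallest(1,"total"))                      # keep the row with the lowest total
res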

Oh no! Our code does not work: we got an empty dataframe!
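Without any tooling, one way to hunt for the culprit is to break the chain apart and check the row count after every step. The sketch below does that on a tiny invented dataframe (the names and numbers are made up, only the column names follow the example above).

```python
import pandas as pd

# Invented toy data that mimics the columns used in the chain above.
toy = pd.DataFrame({
    "name": ["mon_a", "mon_b", "mon_c", "mon_d"],
    "type_1": ["Fire", "Water", "Fire", "Grass"],
    "type_2": ["Flying", "Ice", "Rock", "Fire"],
    "legendary": [1, 0, 0, 0],
    "total": [600, 300, 250, 400],
})

# Run each step separately and look at how many rows survive.
step1 = toy.query("legendary==0")
print("after legendary filter:", len(step1))

step2 = step1.query("type_1=='fire' or type_2=='fire'")
print("after fire filter:", len(step2))  # the row count collapses here

step3 = step2.drop("legendary", axis=1)
res_manual = step3.nsmallest(1, "total")
print("final result:", len(res_manual))
```

This works, but it means rewriting the chain into temporary variables every time something looks off.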

If only there were a way to track down such issues! Fortunately, that’s what Pandas-log is for.
By just adding a small context manager to our example, we get relevant information printed to stdout that helps us find the issue.

with pandas_log.enable():
    res = (df.copy()
             .query("legendary==0")
             .query("type_1=='fire' or type_2=='fire'")
             .drop("legendary", axis=1)
             .nsmallest(1,"total"))

After reading the output, it’s clear that the issue is in step 2, as we got 0 rows remaining, so something is wrong with the predicate "type_1=='fire' or type_2=='fire'". Indeed, pokemon types start with a capital letter, so let’s run the fixed code.

res = (df.copy()
         .query("legendary==0")
         .query("type_1=='Fire' or type_2=='Fire'")
         .drop("legendary", axis=1)
         .nsmallest(1,"total"))
res

Voilà, we got Slugma!
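Looking back, a quick way to confirm a capitalization problem like this one is to inspect the distinct values in the column before writing the predicate. The snippet below is a small sketch on invented toy data; the case-insensitive comparison at the end is one possible safeguard, not something the original code uses.

```python
import pandas as pd

# Invented toy data; only the column name follows the example above.
toy = pd.DataFrame({
    "name": ["mon_a", "mon_b", "mon_c"],
    "type_1": ["Fire", "Water", "Fire"],
})

# Looking at the distinct values shows how the types are actually spelled.
print(toy["type_1"].unique())

lowercase_hits = toy.query("type_1 == 'fire'")    # matches nothing
capitalized_hits = toy.query("type_1 == 'Fire'")  # matches the fire rows
print(len(lowercase_hits), len(capitalized_hits))

# Optional safeguard: compare case-insensitively.
case_insensitive_hits = toy[toy["type_1"].str.lower() == "fire"]
```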

A few last words