
Eyal Trabelsi


Introducing Pandas-Log: package for debugging pandas operations

The pandas ecosystem has become invaluable to data science, and today most data science tasks consist of a series of pandas steps that transform raw data into an understandable, usable format.

The accuracy of these steps is crucial, so understanding unexpected results is crucial as well. Unfortunately, the ecosystem lacks tools for investigating those unexpected results.

That’s why I created Pandas-log. It provides metadata on each operation, which makes it possible to pinpoint issues. For example, after .query it reports the number of rows that were filtered out.
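To make the idea concrete, here is a minimal sketch of that kind of per-operation metadata written by hand in plain pandas. It does not use or reproduce the Pandas-log API: `query_with_report` is a hypothetical helper, and the toy rows are made up for illustration.

```python
import pandas as pd


def query_with_report(frame: pd.DataFrame, expr: str) -> pd.DataFrame:
    """Apply DataFrame.query and print how many rows the filter removed.

    Hypothetical helper for illustration; it is not part of pandas-log.
    """
    result = frame.query(expr)
    removed = len(frame) - len(result)
    print(f"query({expr!r}): removed {removed} rows, {len(result)} remaining")
    return result


# Hand-written toy rows standing in for a real dataset.
toy = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d"],
        "legendary": [0, 0, 1, 0],
    }
)

kept = query_with_report(toy, "legendary == 0")
```

Writing a wrapper like this for every pandas method quickly becomes tedious, which is the gap the package aims to fill.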

As always, I believe it’s easier to understand with an example, so I will use the pokemon dataset to find out “who is the weakest non-legendary fire pokemon?”.

So who is the weakest fire pokemon?

(Link to the Notebook code can be found here)
First, we will import relevant packages and read our pokemon dataset.

import pandas as pd
import numpy as np
import pandas_log
df = pd.read_csv("pokemon.csv")
df.head(10)

To answer our question, “who is the weakest non-legendary fire pokemon?”, we will need to:

  • Filter out legendary pokemon using .query().
  • Keep only fire pokemon using .query().
  • Drop the legendary column using .drop().
  • Keep the weakest pokemon among them using .nsmallest().

In code, it will look something like:
res = (df.copy()
         .query("legendary==0")
         .query("type_1=='fire' or type_2=='fire'")
         .drop("legendary", axis=1)
         .nsmallest(1,"total"))
res

OH NOO!!! Our code does not work!! We got an empty dataframe!!

If only there was a way to track those issues!? Fortunately, that’s what Pandas-log is for!
Just by adding a small context manager to our example, we get relevant information printed to stdout that will help us find the issue.

with pandas_log.enable():
    res = (df.copy()
             .query("legendary==0")
             .query("type_1=='fire' or type_2=='fire'")
             .drop("legendary", axis=1)
             .nsmallest(1,"total"))

After reading the output, it’s clear that the issue is in step 2, since 0 rows remain after it, so something is wrong with the predicate "type_1=='fire' or type_2=='fire'". Indeed, the pokemon type values start with a capital letter, so let’s run the fixed code.

res = (df.copy()
         .query("legendary==0")
         .query("type_1=='Fire' or type_2=='Fire'")
         .drop("legendary", axis=1)
         .nsmallest(1,"total"))
res

Voilà, we got Slugma!!!
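A casing mismatch like this one can also be confirmed, or avoided, with plain pandas. Here is a rough sketch, assuming a tiny hand-written frame in place of pokemon.csv: only the column names type_1, type_2, legendary and total come from the code above, while the `name` column and the rows themselves are illustrative assumptions.

```python
import pandas as pd

# Hand-written toy rows standing in for pokemon.csv (illustrative values).
toy = pd.DataFrame(
    {
        "name": ["Charmander", "Slugma", "Squirtle", "Moltres"],
        "type_1": ["Fire", "Fire", "Water", "Fire"],
        "type_2": [None, None, None, "Flying"],
        "legendary": [0, 0, 0, 1],
        "total": [309, 250, 314, 580],
    }
)

# Inspect the actual spellings before writing the predicate.
print(toy["type_1"].unique())

# Compare case-insensitively so 'fire' and 'Fire' both match.
# Missing second types are replaced with "" so .str.lower() is safe.
is_fire = (
    toy["type_1"].fillna("").str.lower().eq("fire")
    | toy["type_2"].fillna("").str.lower().eq("fire")
)

weakest = (
    toy[is_fire]
    .query("legendary == 0")
    .nsmallest(1, "total")
)
print(weakest)
```

Lower-casing both sides trades a little extra code for a predicate that no longer depends on how the dataset happens to capitalise its categories.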

A few last words

The package is still at an early stage, so it might contain a few bugs. Please have a look at the GitHub repository and suggest improvements or extensions to the code. I gladly welcome any kind of constructive feedback, and feel free to contribute to Pandas-log as well! 😉
