Extending the Pandas DataFrame


Pandas offers a lot of convenient utilities for handling tabular data. The pandas DataFrame is versatile, but some operations can't be done in a single command. I'll offer some advice on how you can extend DataFrame to better suit your workflow. (When I say DataFrame, I mean the DataFrame class in the pandas library.)

Suppose you have a DataFrame of students, and you'd like to select all the students who have an "A" in Ms. Frizzle's class.

Here's one way to do this in pandas:

(students.query("grade == 'A'")
         .query("teacher == 'Frizzle'"))

But what if we had a custom method?

students.select_by_grade_and_teacher("A", "Frizzle")

Note: by enclosing the operations in parentheses, you can chain several pandas operations on multiple lines, since they all return DataFrames. This is called method-chaining.
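To make the chained version concrete, here's a minimal sketch you can run. The sample data (names, grades, and the second teacher) is made up for illustration; only the grade and teacher column names come from the queries above.

```python
import pandas as pd

# Hypothetical sample data; the column names match the queries above.
students = pd.DataFrame({
    "name": ["Ana", "Ben", "Cho", "Dev"],
    "grade": ["A", "B", "A", "A"],
    "teacher": ["Frizzle", "Frizzle", "Frizzle", "Smith"],
})

# The parentheses let one chain span several lines.
# Each step returns a DataFrame, so the next step can be called on it.
a_students = (students
              .query("grade == 'A'")
              .query("teacher == 'Frizzle'")
              .sort_values("name"))

print(a_students["name"].tolist())  # ['Ana', 'Cho']
```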

What Extending Means

All I want to do is add custom methods to the standard pandas DataFrame. Custom methods are useful because they:

♻️ simplify repeatable, multi-line logic
⛓ chain with built-in methods in DataFrame

So extending just means adding additional methods to DataFrame; these methods can return another DataFrame, or anything else.

Note: We could write a standalone function instead, but calling it directly breaks up the method chain and tends to produce more lines of code.
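For comparison, here's a rough sketch of the function-based version. The function body and the sample data are my own illustration. Calling the function directly wraps the chain inside-out; pandas does ship a built-in DataFrame.pipe method that passes the frame to a function as its first argument, which gets you partway back to a chain, though the call still reads less naturally than a method and won't show up in code-completion.

```python
import pandas as pd

# A standalone function (illustrative) instead of a custom method.
def select_by_grade_and_teacher(df, grade, teacher):
    return df[(df["grade"] == grade) & (df["teacher"] == teacher)]

# Hypothetical sample data.
students = pd.DataFrame({
    "name": ["Ana", "Ben", "Cho", "Dev"],
    "grade": ["A", "B", "A", "A"],
    "teacher": ["Frizzle", "Frizzle", "Frizzle", "Smith"],
})

# Direct call: the function wraps the earlier steps, so the code reads inside-out.
result_nested = select_by_grade_and_teacher(
    students.sort_values("name"), "A", "Frizzle"
).head(1)

# With DataFrame.pipe the function can sit inside the chain.
result_piped = (students
                .sort_values("name")
                .pipe(select_by_grade_and_teacher, "A", "Frizzle")
                .head(1))

print(result_piped["name"].tolist())  # ['Ana']
```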

How to Extend DataFrame

Since we'd like to add new methods to DataFrame without losing the old ones, we should consider subclassing DataFrame. That is, defining a new class that inherits from the DataFrame class.

This comes with a few caveats. We need to make sure:

🏠 it keeps all of the methods in DataFrame
🤝 when a custom method is called, it returns an instance of our class (the subclass), not a plain DataFrame

The second point ensures we can continue to use custom methods for method-chaining.
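To see why the second point matters, here's a small sketch of what happens with a naive subclass that skips it. NaiveDataFrame and count_rows are hypothetical names, and the behavior shown is what I'd expect from the pandas versions I'm familiar with: a built-in operation hands back a plain DataFrame, so the custom method is gone after the first step of the chain.

```python
import pandas as pd

# Hypothetical subclass that adds a method but does nothing else.
class NaiveDataFrame(pd.DataFrame):
    def count_rows(self):
        return len(self)

students = NaiveDataFrame({
    "name": ["Ana", "Ben", "Cho"],
    "grade": ["A", "B", "A"],
})

# Filtering is a built-in operation, so pandas decides what type to return.
filtered = students[students["grade"] == "A"]

print(type(students).__name__)          # NaiveDataFrame
print(type(filtered).__name__)          # DataFrame
print(hasattr(filtered, "count_rows"))  # False
```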

The following class does just that:

import pandas as pd

class ExtendedDataFrame(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        # use the __init__ method from DataFrame to ensure
        # that we're inheriting the correct behavior
        super().__init__(*args, **kwargs)

    # this property makes pandas operations return an instance
    # of ExtendedDataFrame, instead of a regular DataFrame
    @property
    def _constructor(self):
        return ExtendedDataFrame

    # now define a custom method!
    # note that `self` is a DataFrame
    def select_by_grade_and_teacher(self, grade, teacher):
        return (self
                .query("grade == @grade")
                .query("teacher == @teacher"))

See it in action in this repl.it
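If you'd rather try it locally, here's a short usage sketch. It repeats a trimmed copy of the class so the snippet runs on its own (the explicit __init__ is left out, since it only forwards to DataFrame), and the sample data is made up. The point is that the custom method can be mixed freely with built-in methods, and the result is still an ExtendedDataFrame at every step.

```python
import pandas as pd

class ExtendedDataFrame(pd.DataFrame):
    # Tell pandas to build results as ExtendedDataFrame.
    @property
    def _constructor(self):
        return ExtendedDataFrame

    def select_by_grade_and_teacher(self, grade, teacher):
        return (self
                .query("grade == @grade")
                .query("teacher == @teacher"))

# Hypothetical sample data.
students = ExtendedDataFrame({
    "name": ["Ana", "Ben", "Cho", "Dev"],
    "grade": ["A", "B", "A", "A"],
    "teacher": ["Frizzle", "Frizzle", "Frizzle", "Smith"],
})

# Built-in and custom methods in the same chain.
result = (students
          .sort_values("name", ascending=False)
          .select_by_grade_and_teacher("A", "Frizzle")
          .head(1))

print(type(result).__name__)    # ExtendedDataFrame
print(result["name"].tolist())  # ['Cho']
```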

Note: The pandas documentation has a page on extending pandas, but it's quite advanced and covers many other topics.

When to Extend

This is a useful pattern for data exploration, especially in Jupyter Notebooks, or any environment that has code-completion. You can use it to:

⏲ shorten highly-repeated tasks
👩🏽‍🎓 package a utility with a specific dataset to share with your team, especially if they're not pandas experts like you
📊 construct methods that make plots outside of the standard functionality
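As a sketch of the dataset-specific utility idea, here's a hypothetical StudentsDataFrame. The class, its method names, and the data are all invented for illustration. One method returns another frame, so the chain can keep going; the other returns a plain Series, which ends the chain with a summary.

```python
import pandas as pd

# Hypothetical dataset-specific subclass; all names are illustrative.
class StudentsDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return StudentsDataFrame

    def with_passing_flag(self, passing=("A", "B", "C")):
        # Returns a new frame with an extra boolean column.
        return self.assign(passing=self["grade"].isin(list(passing)))

    def grade_counts(self):
        # Returns a Series rather than a frame: number of students per grade.
        return self["grade"].value_counts().sort_index()

students = StudentsDataFrame({
    "name": ["Ana", "Ben", "Cho", "Dev"],
    "grade": ["A", "B", "A", "F"],
})

flagged = students.with_passing_flag()
counts = flagged.grade_counts()

print(flagged["passing"].tolist())  # [True, True, True, False]
print(counts.to_dict())             # {'A': 2, 'B': 1, 'F': 1}
```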

Final Thoughts

I do not recommend using this pattern in production: the pandas documentation itself advises considering simpler alternatives before subclassing. Furthermore, you run the risk of your method names conflicting with the current or future DataFrame API.
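If you do need something sturdier, one alternative is to register an accessor. This isn't the subclassing pattern from this post; it's a different technique that uses pandas' register_dataframe_accessor decorator to hang custom methods under their own namespace on every regular DataFrame. Here's a minimal sketch, where the "school" namespace and the SchoolAccessor class are names I made up:

```python
import pandas as pd

# "school" is a hypothetical namespace; pick one unlikely to clash with pandas.
@pd.api.extensions.register_dataframe_accessor("school")
class SchoolAccessor:
    def __init__(self, pandas_obj):
        # pandas passes in the DataFrame the accessor was called on.
        self._df = pandas_obj

    def select_by_grade_and_teacher(self, grade, teacher):
        df = self._df
        return df[(df["grade"] == grade) & (df["teacher"] == teacher)]

# Hypothetical sample data in a regular DataFrame; no subclass needed.
students = pd.DataFrame({
    "name": ["Ana", "Ben", "Cho", "Dev"],
    "grade": ["A", "B", "A", "A"],
    "teacher": ["Frizzle", "Frizzle", "Frizzle", "Smith"],
})

result = (students
          .school.select_by_grade_and_teacher("A", "Frizzle")
          .sort_values("name"))

print(result["name"].tolist())  # ['Ana', 'Cho']
```

The trade-off is an extra namespace in every call, but the custom methods can't collide with DataFrame's own method names, and the results stay plain DataFrames.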

I'll leave you with an extended DataFrame that I often use. It has some more advanced features, and all of its methods begin with my initials, pj_, so they're easier to find in code-completion.

Link to my DataFrame

Feel free to ask any questions, and let me know if you find something interesting!