DEV Community

PJ Trainor

Extending the Pandas DataFrame

Pandas offers a lot of convenient utilities for handling tabular data. The pandas DataFrame is versatile, but some operations can't be done in a single command. I'll offer some advice on how you can extend DataFrame to better suit your workflow. (Whenever I say DataFrame, I mean the DataFrame class in the pandas library.)

Suppose you had a collection of students, and you'd like to select all the students that have an "A" in Ms. Frizzle's class.

Here's one way to do this in pandas:

(students.query("grade == 'A'")
         .query("teacher == 'Frizzle'"))

But what if we had a custom method?

students.select_by_grade_and_teacher("A", "Frizzle")

Note: by enclosing the operations in parentheses, you can chain several pandas operations on multiple lines, since they all return DataFrames. This is called method-chaining.
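To make the note concrete, here's a minimal sketch of a parenthesized chain. The students table and its values are made up for illustration, and the extra sort_values and reset_index steps are only there to show that built-in methods keep the chain going.

```python
import pandas as pd

# Hypothetical data, made up for illustration
students = pd.DataFrame({
    "name": ["Wanda", "Arnold", "Keesha", "Ralphie"],
    "grade": ["B", "A", "A", "A"],
    "teacher": ["Frizzle", "Frizzle", "Frizzle", "Smith"],
})

# Each step returns a DataFrame, so the next method can be
# called on its result; the parentheses allow the line breaks
result = (students
          .query("grade == 'A'")
          .query("teacher == 'Frizzle'")
          .sort_values("name")
          .reset_index(drop=True))

print(result)
```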

What Extending Means

All I want to do is add custom methods to the standard pandas DataFrame. Custom methods are useful because they:

  • ♻️ simplify repeatable, multi-line logic
  • ⛓ chain with built-in methods in DataFrame

So extending just means adding additional methods to DataFrame; these methods can return another DataFrame, or anything else.

Note: We could write a plain function instead, but a function call doesn't fit into the method-chaining workflow, so it tends to produce more lines of code.
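For comparison, here's roughly what the plain-function version looks like. The sample data and the dropna and sort_values steps are assumptions added for illustration. Notice that the function call has to wrap part of the chain rather than join it.

```python
import pandas as pd

def select_by_grade_and_teacher(df, grade, teacher):
    # same filtering logic, written as a standalone function
    return df.query("grade == @grade").query("teacher == @teacher")

# Hypothetical data, made up for illustration
students = pd.DataFrame({
    "name": ["Wanda", "Arnold", "Keesha", "Ralphie"],
    "grade": ["B", "A", "A", "A"],
    "teacher": ["Frizzle", "Frizzle", "Frizzle", "Smith"],
})

# The function wraps the first step instead of following it,
# so the chain is split across two statements
selected = select_by_grade_and_teacher(students.dropna(), "A", "Frizzle")
result = selected.sort_values("name").reset_index(drop=True)

print(result)
```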

How to Extend DataFrame

Since we'd like to add new methods to DataFrame without losing the old ones, we should consider subclassing DataFrame. That is, defining a new class that inherits from the DataFrame class.

This comes with a few caveats. We need to make sure:

  • 🏠 it keeps all of the methods in DataFrame
  • 🤝 when a custom method is called, it returns our class (i.e. the subclass), and not a plain DataFrame

The second point ensures we can continue to use custom methods for method-chaining.

The following class does just that:

import pandas as pd

class ExtendedDataFrame(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        # use the __init__ method from DataFrame to ensure
        # that we're inheriting the correct behavior
        super().__init__(*args, **kwargs)

    # this property makes it so our methods return an instance
    # of ExtendedDataFrame, instead of a regular DataFrame
    @property
    def _constructor(self):
        return ExtendedDataFrame

    # now define a custom method!
    # note that `self` is an ExtendedDataFrame (and so also a DataFrame)
    def select_by_grade_and_teacher(self, grade, teacher):
        return (self
                .query("grade == @grade")
                .query("teacher == @teacher"))

See it in action in this repl.it
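If the repl isn't handy, here's a minimal sketch of the class in use. It repeats a trimmed copy of the class so the snippet runs on its own, the student data is made up, and it assumes a reasonably recent pandas. The point to watch is that the custom method still works after a built-in method, because the result stays an ExtendedDataFrame.

```python
import pandas as pd

class ExtendedDataFrame(pd.DataFrame):
    # trimmed copy of the class above, without the explicit __init__
    @property
    def _constructor(self):
        return ExtendedDataFrame

    def select_by_grade_and_teacher(self, grade, teacher):
        return (self
                .query("grade == @grade")
                .query("teacher == @teacher"))

# Hypothetical data, made up for illustration
students = ExtendedDataFrame({
    "name": ["Wanda", "Arnold", "Keesha", "Ralphie"],
    "grade": ["B", "A", "A", "A"],
    "teacher": ["Frizzle", "Frizzle", "Frizzle", "Smith"],
})

# built-in method first, then the custom one, then another built-in
result = (students
          .sort_values("name")
          .select_by_grade_and_teacher("A", "Frizzle")
          .reset_index(drop=True))

print(type(result).__name__)
print(result["name"].tolist())
```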

Note: The pandas documentation has a page on extensions, but it's quite advanced and covers many other topics.

When to Extend

This is a useful pattern for data exploration, especially in Jupyter Notebooks, or any environment that has code-completion. You can use it to:

  • ⏲ shorten highly-repeated tasks
  • 👩🏽‍🎓 package utilities with a specific dataset to share with your team... especially if they're not pandas experts like you
  • 📊 construct methods that make plots outside of the standard functionality
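As one illustration of the first two bullets, here's a sketch of a dataset-specific summary method. The method name grade_counts_by_teacher and the sample data are hypothetical. Because the summary passes through an intermediate Series, the result may come back as a plain DataFrame, so a method like this fits best at the end of a chain.

```python
import pandas as pd

class ExtendedDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return ExtendedDataFrame

    # hypothetical helper: one row per teacher, one column per grade
    def grade_counts_by_teacher(self):
        return (self
                .groupby(["teacher", "grade"])
                .size()
                .unstack(fill_value=0))

# Hypothetical data, made up for illustration
students = ExtendedDataFrame({
    "name": ["Wanda", "Arnold", "Keesha", "Ralphie"],
    "grade": ["B", "A", "A", "A"],
    "teacher": ["Frizzle", "Frizzle", "Frizzle", "Smith"],
})

counts = students.grade_counts_by_teacher()
print(counts)
```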

Final Thoughts

I do not recommend using this pattern in production, since it's not officially endorsed by pandas. Furthermore, you run the risk of conflicting with the current/future DataFrame API.
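One way to reduce the risk of a name clash is to check your custom method names against DataFrame before relying on them. The helper below, shadowed_names, is a hypothetical name and isn't part of pandas, and the two small classes exist only to demonstrate it. Treat it as a rough sketch rather than a complete safeguard, since it can't know about methods pandas adds in the future.

```python
import pandas as pd

def shadowed_names(subclass, parent=pd.DataFrame):
    """List public names defined on `subclass` that already exist on `parent`."""
    return sorted(
        name for name in vars(subclass)
        if not name.startswith("_") and hasattr(parent, name)
    )

class SafeFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return SafeFrame

    # this name does not exist on DataFrame, so nothing is overridden
    def select_by_grade_and_teacher(self, grade, teacher):
        return self.query("grade == @grade").query("teacher == @teacher")

class ClashingFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return ClashingFrame

    # accidentally overrides the built-in DataFrame.query
    def query(self, grade):
        return self[self["grade"] == grade]

print(shadowed_names(SafeFrame))
print(shadowed_names(ClashingFrame))
```

A short prefix on every custom method name makes a clash like the second one much less likely.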

I'll leave you with an extended DataFrame that I often use. It has some more advanced features, and all of the methods begin with my initials, pj_ (so it's easier to find in code-completion).

Link to my DataFrame

Feel free to ask any questions, and let me know if you find something interesting!