Pandas offers a lot of convenient utilities for handling tabular data. The pandas DataFrame is versatile, but some operations can't be done in a single command. I'll offer some advice on how you can extend DataFrame to better suit your workflow. Throughout, when I say DataFrame, I'm referring to the class in the pandas library.
Suppose you had a collection of
students, and you'd like to select all the students that have an "A" in Ms. Frizzle's class.
Here's one way to do this in pandas:
    (students
        .query("grade == 'A'")
        .query("teacher == 'Frizzle'"))
But what if we had a custom method?
Note: by enclosing the operations in parentheses, you can chain several pandas operations on multiple lines, since they all return
DataFrames. This is called method-chaining.
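To try the chained query above, here is a minimal sketch on a small made-up table. The student names and the column names `name`, `grade`, and `teacher` are assumptions for illustration; only the two `query` calls come from the example above.

```python
import pandas as pd

# Hypothetical sample data; column names are assumed for illustration.
students = pd.DataFrame({
    "name": ["Arnold", "Wanda", "Ralphie", "Keesha"],
    "grade": ["A", "B", "A", "A"],
    "teacher": ["Frizzle", "Frizzle", "Frizzle", "Smith"],
})

# The outer parentheses let each method call sit on its own line.
a_students = (
    students
    .query("grade == 'A'")
    .query("teacher == 'Frizzle'")
)

print(a_students["name"].tolist())
```

Each `query` call returns a new DataFrame, which is why the next call can be attached directly to it.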
All I want to do is add custom methods to the standard pandas
DataFrame. Custom methods are useful because they:
- ♻️ simplify repeatable, multi-line logic
- ⛓ chain with built-in methods in DataFrame
So extending just means adding additional methods to
DataFrame; these methods can return another
DataFrame, or anything else.
Note: We could write a plain function instead, but a function call has to wrap the chain rather than sit inside it, so it doesn't fit the method-chaining workflow and produces more lines of code.
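For comparison, here is a rough sketch of the plain-function version. The helper name and the sample data are hypothetical; the point is only to show how the call ends up wrapped around the rest of the chain.

```python
import pandas as pd

# Hypothetical sample data; column names are assumed for illustration.
students = pd.DataFrame({
    "name": ["Wanda", "Ralphie", "Arnold", "Keesha"],
    "grade": ["B", "A", "A", "A"],
    "teacher": ["Frizzle", "Frizzle", "Frizzle", "Smith"],
})


def select_by_grade_and_teacher(df, grade, teacher):
    """Plain-function version of the filter (hypothetical helper)."""
    # `@grade` and `@teacher` refer to the local variables of this function.
    return df.query("grade == @grade").query("teacher == @teacher")


# The function has to wrap the earlier steps, so the pipeline
# no longer reads top to bottom.
result = select_by_grade_and_teacher(
    students.sort_values("name"), "A", "Frizzle"
).reset_index(drop=True)

print(result["name"].tolist())
```

The sorting step ends up inside the function call while the index reset trails after it, which is the readability cost the note refers to.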
Since we'd like to add new methods to
DataFrame without losing the old ones, we should consider subclassing
DataFrame. That is, defining a new class that inherits from DataFrame.
This comes with a few caveats. We need to make sure:
- 🏠 it keeps all of the methods in DataFrame
- 🤝 when a custom method is called, it returns our class (i.e. the subclass), and not a plain DataFrame
The second point ensures we can continue to use custom methods for method-chaining.
The following class does just that:
    import pandas as pd

    class ExtendedDataFrame(pd.DataFrame):
        def __init__(self, *args, **kwargs):
            # use the __init__ method from DataFrame to ensure
            # that we're inheriting the correct behavior
            super(ExtendedDataFrame, self).__init__(*args, **kwargs)

        # this property makes it so our methods return an instance
        # of ExtendedDataFrame, instead of a regular DataFrame
        @property
        def _constructor(self):
            return ExtendedDataFrame

        # now define a custom method!
        # note that `self` is a DataFrame
        def select_by_grade_and_teacher(self, grade, teacher):
            return (self
                    .query("grade == @grade")
                    .query("teacher == @teacher"))
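Here is a short usage sketch. It repeats a trimmed copy of the class so the snippet runs on its own, and the sample data is made up. It mixes built-in methods with the custom one to check that the chain keeps returning the subclass.

```python
import pandas as pd


class ExtendedDataFrame(pd.DataFrame):
    # Trimmed copy of the class above, repeated so this snippet is self-contained.
    @property
    def _constructor(self):
        return ExtendedDataFrame

    def select_by_grade_and_teacher(self, grade, teacher):
        return self.query("grade == @grade").query("teacher == @teacher")


# Hypothetical sample data; column names are assumed for illustration.
students = ExtendedDataFrame({
    "name": ["Wanda", "Ralphie", "Arnold", "Keesha"],
    "grade": ["B", "A", "A", "A"],
    "teacher": ["Frizzle", "Frizzle", "Frizzle", "Smith"],
})

result = (
    students
    .sort_values("name")                          # built-in method
    .select_by_grade_and_teacher("A", "Frizzle")  # custom method
    .reset_index(drop=True)                       # built-in method again
)

print(type(result).__name__)
print(result["name"].tolist())
```

Because `_constructor` points back at `ExtendedDataFrame`, the built-in steps should hand back the subclass, so the custom method stays available at every point in the chain.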
See it in action in this repl.it
Note: The pandas documentation has a page on extensions, but it's quite advanced and covers many other topics.
This is a useful pattern for data exploration, especially in Jupyter Notebooks, or any environment that has code-completion. You can use it to:
- ⏲ shorten highly-repeated tasks
- 👩🏽‍🎓 package utilities with a specific dataset to share with your team, especially if they're not pandas experts like you
- 📊 construct methods that make plots outside of the standard functionality
I do not recommend using this pattern in production, since it's not officially endorsed by pandas. Furthermore, you run the risk of conflicting with the current or future DataFrame API.
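One way to spot a clash with the installed pandas version is to check candidate names against `pd.DataFrame` before defining them. This is only a sketch: `conflicting_names` is a hypothetical helper, and it can't tell you anything about names that future pandas releases might add.

```python
import pandas as pd


def conflicting_names(candidate_names):
    """Return the candidate names that pd.DataFrame already defines."""
    return [name for name in candidate_names if hasattr(pd.DataFrame, name)]


# Hypothetical list of method names we are thinking of adding.
candidates = ["select_by_grade_and_teacher", "query", "plot"]

print(conflicting_names(candidates))
```

Any name that shows up in the result would shadow a built-in attribute if you defined it on the subclass.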
I'll leave you with an extended
DataFrame that I often use. It has some more advanced features, and all of the methods begin with my initials,
pj_ (so it's easier to find in code-completion).
Feel free to ask any questions, and let me know if you find something interesting!