Pandas offers many convenient utilities for handling tabular data. The pandas DataFrame is versatile, but some operations can't be done in a single command. I'll offer some advice on how you can extend DataFrame to better suit your workflow. Throughout, "DataFrame" refers to the class in the pandas library.
Suppose you had a collection of students, and you'd like to select all the students that have an "A" in Ms. Frizzle's class.
Here's one way to do this in pandas:
(students.query("grade == 'A'")
         .query("teacher == 'Frizzle'"))
But what if we had a custom method?
students.select_by_grade_and_teacher("A", "Frizzle")
Note: by enclosing the operations in parentheses, you can chain several pandas operations across multiple lines, since each one returns a DataFrame. This is called method chaining.
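To make the query chain concrete, here's a minimal runnable sketch. The roster is made up, and the column names `name`, `grade`, and `teacher` are assumptions chosen to match the queries above.

```python
import pandas as pd

# Made-up roster; the columns `name`, `grade`, and `teacher` are assumptions.
students = pd.DataFrame({
    "name": ["Arnold", "Wanda", "Ralphie", "Keesha"],
    "grade": ["A", "B", "A", "A"],
    "teacher": ["Frizzle", "Frizzle", "Frizzle", "Smith"],
})

# The parentheses let the chain span several lines.
a_students = (students
              .query("grade == 'A'")
              .query("teacher == 'Frizzle'"))

print(a_students["name"].tolist())  # ['Arnold', 'Ralphie']
```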
What Extending Means
All I want to do is add custom methods to the standard pandas DataFrame. Custom methods are useful because they:
- ♻️ simplify repeatable, multi-line logic
- ⛓ chain with built-in methods in DataFrame
So extending just means adding additional methods to DataFrame; these methods can return another DataFrame, or anything else.
Note: We could write a plain function instead, but that doesn't fit into the method-chaining workflow and tends to produce more lines of code.
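As a rough illustration of that note, here's what the plain-function version might look like. The data and the `sort_values` step are assumptions added only to show where the function call ends up relative to the rest of the chain.

```python
import pandas as pd

# Made-up roster; column names are assumptions.
students = pd.DataFrame({
    "name": ["Wanda", "Arnold", "Ralphie"],
    "grade": ["B", "A", "A"],
    "teacher": ["Frizzle", "Frizzle", "Frizzle"],
})


def select_by_grade_and_teacher(df, grade, teacher):
    # Same logic as the custom method, but the DataFrame is an argument.
    return df.query("grade == @grade").query("teacher == @teacher")


# The function has to wrap everything that came before it,
# so the code no longer reads top to bottom like a chain.
selected = select_by_grade_and_teacher(
    students.sort_values("name"), "A", "Frizzle"
)

print(selected["name"].tolist())  # ['Arnold', 'Ralphie']
```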
How to Extend DataFrame
Since we'd like to add new methods to DataFrame without losing the old ones, we should consider subclassing DataFrame. That is, defining a new class that inherits from the DataFrame class.
This comes with a few caveats. We need to make sure:
- 🏠 it keeps all of the methods in DataFrame
- 🤝 when a custom method is called, it returns our class (i.e. the subclass), and not DataFrame
The second point ensures we can continue to use custom methods for method-chaining.
The following class does just that:
import pandas as pd
class ExtendedDataFrame(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        # use the __init__ method from DataFrame to ensure
        # that we're inheriting the correct behavior
        super(ExtendedDataFrame, self).__init__(*args, **kwargs)
    # this property makes it so our methods return an instance
    # of ExtendedDataFrame, instead of a regular DataFrame
    @property
    def _constructor(self):
        return ExtendedDataFrame
    # now define a custom method!
    # note that `self` is a DataFrame
    def select_by_grade_and_teacher(self, grade, teacher):
        return (self
                .query("grade == @grade")
                .query("teacher == @teacher"))
See it in action in this repl.it
Note: The pandas documentation has a page on extensions, but it's quite advanced and covers many other topics.
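Here's a short usage sketch of the class above, repeated in compact form (without the explicit `__init__`) so the snippet runs on its own. The roster is made up. It mixes the custom method with built-in ones, and checks that the result is still our subclass.

```python
import pandas as pd


class ExtendedDataFrame(pd.DataFrame):
    # Makes pandas operations return ExtendedDataFrame
    # instead of a regular DataFrame.
    @property
    def _constructor(self):
        return ExtendedDataFrame

    def select_by_grade_and_teacher(self, grade, teacher):
        return (self
                .query("grade == @grade")
                .query("teacher == @teacher"))


# Made-up roster; column names are assumptions.
students = ExtendedDataFrame({
    "name": ["Arnold", "Wanda", "Ralphie", "Keesha"],
    "grade": ["A", "B", "A", "A"],
    "teacher": ["Frizzle", "Frizzle", "Frizzle", "Smith"],
})

# Built-in and custom methods in one chain.
result = (students
          .sort_values("name", ascending=False)
          .select_by_grade_and_teacher("A", "Frizzle")
          .head(1))

print(type(result).__name__)    # ExtendedDataFrame
print(result["name"].tolist())  # ['Ralphie']
```

Because `sort_values` hands back an ExtendedDataFrame, the custom method is still available on the next line of the chain.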
When to Extend
This is a useful pattern for data exploration, especially in Jupyter Notebooks, or any environment that has code-completion. You can use it to:
- ⏲ shorten highly-repeated tasks
- 👩🏽‍🎓 package utilities with a specific dataset to share with your team, especially if they're not pandas experts like you
- 📊 construct methods that make plots outside of the standard functionality
Final Thoughts
I do not recommend using this pattern in production, since it's not officially endorsed by pandas. Furthermore, you run the risk of conflicting with the current/future DataFrame API.
I'll leave you with an extended DataFrame that I often use. It has some more advanced features, and all of the methods begin with my initials, pj_ (so it's easier to find in code-completion).
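Here's a minimal sketch of the prefix idea, along with one way you might check for name clashes with the built-in API. The `xt_` prefix, the class name, and the `conflicting_names` helper are all hypothetical, made up for this illustration.

```python
import pandas as pd


class PrefixedDataFrame(pd.DataFrame):
    """Hypothetical subclass whose custom methods share the `xt_` prefix."""

    @property
    def _constructor(self):
        return PrefixedDataFrame

    def xt_select_by_teacher(self, teacher):
        return self.query("teacher == @teacher")


def conflicting_names(subclass, base=pd.DataFrame):
    """List public names defined on `subclass` that already exist on `base`."""
    own = [name for name in vars(subclass) if not name.startswith("_")]
    return sorted(name for name in own if hasattr(base, name))


print(conflicting_names(PrefixedDataFrame))  # []
```

This only catches clashes with the pandas version you have installed, so it reduces the risk rather than removing it.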
Feel free to ask any questions, and let me know if you find something interesting!