When it comes to data science and analysis, being able to prepare and transform our data is a critical component of any successful project
So let's learn how we can leverage the pandas pipe method in Python to abstract complex data transformations into easy-to-read, self-documenting operations!
Table of Contents
- Overview of the .pipe() method
- A concrete example of the pipe method
- The benefits of using the pipe method
- Conclusion
- Additional resources
Overview of the .pipe() method
import pandas as pd
The pipe method allows us to chain Series or DataFrame data transformations together in a semantically continuous pipeline of inputs and outputs
It accomplishes this by leveraging Python's support for higher-order functions - the ability to pass a function as an argument to another function
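To make the higher-order function idea a bit more tangible, here's a minimal sketch. The toy DataFrame and the helper named _add_constant are assumptions made up for illustration. The sketch shows that df.pipe(func) gives the same result as func(df), and that any extra arguments given to pipe are forwarded to the function:

```python
import pandas as pd

# Toy data, made up purely for illustration
df = pd.DataFrame({"a": [1, 2, 3]})


def _add_constant(frame: pd.DataFrame, amount: int = 1) -> pd.DataFrame:
    """Hypothetical helper: return a new DataFrame with `amount` added to every value."""
    return frame + amount


# pipe calls the function with the DataFrame as its first argument...
piped = df.pipe(_add_constant)

# ...so it gives the same result as calling the function directly
direct = _add_constant(df)

# Extra keyword (or positional) arguments are forwarded to the function
piped_ten = df.pipe(_add_constant, amount=10)
```

The practical difference is readability: with several steps, nested calls read inside-out, while piped calls read top to bottom.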
Let's take a look at a simple example (NOTE: assume the functions and DataFrame are pre-defined offscreen):
transformed_df = (
    df
    .pipe(_select_columns)
    .pipe(_multiply_columns_by_two)
    .pipe(_filter_segments)
)
The code snippet above shows each pipe method:
- Inputting the output from the previous pipe
- Performing a transformation (i.e. selecting columns)
- Chaining the output into the input of the next pipe
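Since those functions live offscreen, here's one way they might look. This is only a sketch: the column names (segment, x, y, notes), the toy data, and the function bodies are all assumptions for illustration, not the original implementations:

```python
import pandas as pd

# Made-up data; the column names are assumptions for this sketch
df = pd.DataFrame(
    {
        "segment": ["a", "b", "a", "c"],
        "x": [1, 2, 3, 4],
        "y": [10, 20, 30, 40],
        "notes": ["w", "x", "y", "z"],
    }
)


def _select_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep only the columns the later steps need."""
    return frame[["segment", "x", "y"]]


def _multiply_columns_by_two(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a new DataFrame with the numeric columns doubled."""
    return frame.assign(x=frame["x"] * 2, y=frame["y"] * 2)


def _filter_segments(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep only the rows that belong to segment 'a'."""
    return frame[frame["segment"] == "a"]


# Each step receives the previous step's output as its input
transformed_df = (
    df
    .pipe(_select_columns)
    .pipe(_multiply_columns_by_two)
    .pipe(_filter_segments)
)
```

Each function takes a DataFrame and returns a new one without modifying its input, which is what lets the steps chain cleanly.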
"Wait I still don't understand what any of this means!!! Can we take a look at a more concrete example?!"
No worries! Yeah - let's take a look at a more concrete example in the next section
A concrete example of the pipe method
Let's pretend we have a DataFrame, let's call it town_df, that contains weekly time-series data for how much electricity every single town in the United States consumes
import pandas as pd
town_df = pd.read_csv("time_series_data_for_every_single_town_in_the_united_states.csv")
And let's say we want to perform these specific transformations in this specific order:
- select relevant columns
- filter date range
- approximate missing values
- map town to state
- aggregate up to week and state
- upsample week frequency to daily
- interpolate daily values
Wouldn't it be great if we could implement each of those steps as its own self-contained function and then *pipe* those functions together in an explicitly obvious chain of transformations?...
Well I'm glad you asked (😉)! Check this out:
transformed_df = (
    town_df
    .pipe(_select_relevant_columns)
    .pipe(_filter_date_range)
    .pipe(_approximate_missing_values)
    .pipe(_map_town_to_state)
    .pipe(_aggregate_up_to_week_and_state)
    .pipe(_upsample_week_frequency_to_daily)
    .pipe(_interpolate_daily_values)
)
And that's it!
A clear and concise chain of immediately obvious data transformations - let's talk about some of the benefits of writing our code like this
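If you'd like to see a runnable miniature of that chain, here's a rough sketch covering three of the seven steps. Everything in it is an assumption for illustration: the in-memory data standing in for the CSV, the column names (week, town, kwh), the TOWN_TO_STATE lookup, and the function bodies themselves:

```python
import pandas as pd

# Tiny stand-in for the CSV; columns and values are made up
town_df = pd.DataFrame(
    {
        "week": pd.to_datetime(
            [
                "2020-12-28",
                "2021-01-04", "2021-01-04", "2021-01-04",
                "2021-01-11", "2021-01-11", "2021-01-11",
            ]
        ),
        "town": ["Austin", "Austin", "Dallas", "Boston", "Austin", "Dallas", "Boston"],
        "kwh": [90.0, 100.0, 200.0, 50.0, 110.0, 210.0, 60.0],
    }
)

# Hypothetical lookup table for this sketch
TOWN_TO_STATE = {"Austin": "TX", "Dallas": "TX", "Boston": "MA"}


def _filter_date_range(
    frame: pd.DataFrame, start: pd.Timestamp, end: pd.Timestamp
) -> pd.DataFrame:
    """Keep rows whose week falls between start and end (inclusive)."""
    return frame[frame["week"].between(start, end)]


def _map_town_to_state(frame: pd.DataFrame) -> pd.DataFrame:
    """Add a state column by looking up each town."""
    return frame.assign(state=frame["town"].map(TOWN_TO_STATE))


def _aggregate_up_to_week_and_state(frame: pd.DataFrame) -> pd.DataFrame:
    """Sum consumption across towns for each (week, state) pair."""
    return frame.groupby(["week", "state"], as_index=False)["kwh"].sum()


transformed_df = (
    town_df
    .pipe(
        _filter_date_range,
        start=pd.Timestamp("2021-01-01"),
        end=pd.Timestamp("2021-01-31"),
    )
    .pipe(_map_town_to_state)
    .pipe(_aggregate_up_to_week_and_state)
)
```

Passing start and end through pipe keeps the date range out of the function body, so the same step can be reused with a different window.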
The benefits of using the pipe method
You may have noticed that I did not explicitly reveal any of the implementation details behind any of the piped functions
And yet you probably didn't have a hard time understanding (at least from a top-level view) what transformations were taking place behind the scenes!
I bet you could even show this to someone that has never written a single line of code in their life and even they'd be able to get the overall gist of what's happening to the dataset
While some simple transformations can be accomplished in a single line of code, more complex transformations might take dozens, hundreds, or even thousands of lines before we can move on to the "next" transformation
transformed_df = (
    df
    .pipe(_some_oneliner_transformation)
    .pipe(_some_million_lines_of_code_transformation_but_guess_what_you_dont_have_to_know_how_its_implemented)
)
So being able to abstract the implementation details under a well-defined unit or block of code removes the cognitive overhead of having to read every single line to know what's going on - you can just focus on the big picture
And when something (inevitably) does go wrong you're able to isolate, test, and debug your inputs and outputs because they're already logically isolated into well-defined units
Conclusion
If you want to take this a step further and practice with sample code and data, I've pulled together a full working example for you to explore on GitHub!
Thanks so much for reading and if you liked my content, be sure to check out some of my other work or connect with me on social media or my personal website 😄
Cheers!
Top comments (6)
Nice Work! I've seen people use pipe() method in scikit-learn, pytorch, etc, but it didn't occur to me that pipe() method can also be used in pandas until your post. Thank you Chris! By the way, the design of your website is amazing!
No problem Julie, so glad I could help!
pipe is one of my favorite tools to use in pandas, it can help sooo much with readability/maintainability and I actually only discovered it fairly recently! Def one of my favorite tools to use nowadays when working in Python/pandas
And haha omg thank you so much for checking my site out!! A lot of sweat and tears went into it 😅
Thank you for your introduction. I'll have a try of the pipe method. It really makes the process more clear. I like and admire your website design, the home page game (I may need more time to figure it out haha) and the photo planet, Wow! So Wonderful!
May I ask a question? Is the underscore at the beginning of the function name a must? (for example, _ in _select_relevant_columns or _filter_date_range) Thank you!
Hey Julie, please always ask questions - I love to help! :D
Fantastic question!! It is not a requirement at all, it's more a matter of personal preference (and a little bit of convention) - as long as Python considers it a valid function definition then you can pass it into pipe
Prefixing functions with an underscore indicates to other users that those functions are intended to be private and are more for internal implementation details than for external users to call upon. Python does not enforce this rule; it's just a convention that some developers follow for readability
I personally really like doing it because I'm often working with dozens of files and thousands of lines of Python and it's useful to know when a function is intended for internal use only versus importing into other modules
I hope this answers your question, always feel free to reach out!
Thanks a lot, Chris! Very helpful !
When I read your post again, I find that the pipe() method in pandas is a little different from that in scikit-learn and pytorch. In those libraries the pipe() method is used as pipe(function1, function2). Here in pandas the pipe() method is used as a general interface to control the data flow (df.pipe(function1), df.pipe(function2)).