Recently, I started entering competitions on Kaggle, and one of them had time restrictions that pushed me to look into optimizing my Pandas calculations.
Being a first-year data science student coding in Python, I had limited experience with Pandas and its inner workings. However, one thing I noticed first-hand was how slow it ran when using any sort of "for loop". This was detrimental, as I was used to C++-style coding, which relies on "for loops" pretty frequently. So I started to look at different ways to implement calculations on Pandas data frames.
In this post, I will highlight the various ways to do calculations on Pandas data frames and my picks for optimizing those calculations. I will be using a dataset called df_train with 509762 rows to run the calculations on.
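The competition dataset itself is not included here, so if you want to follow along, a synthetic stand-in with the same shape and columns works fine. The column names 'S' and 'Dir' and the row count come from the timing examples below; the random values are my own placeholders, not the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_train: 509762 rows with the two columns
# used in the timing examples ('S' for speed, 'Dir' for direction in degrees).
# The values are random placeholders, not the competition data.
rng = np.random.default_rng(0)
n = 509762
df_train = pd.DataFrame({
    'S': rng.uniform(0, 10, n),
    'Dir': rng.uniform(0, 360, n),
})
print(df_train.shape)  # (509762, 2)
```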
Method 1: "for loop" over every element of a Pandas series.
This is the worst way to do any calculations, but it can conceptually be the easiest way to implement a solution. I suggest you avoid this method.
%%time
# assumes `import math` was run in an earlier cell
dx = []
for element in range(len(df_train['Dir'])):
    # horizontal component: speed times cosine of direction (degrees -> radians)
    x = df_train['S'][element] * math.cos(df_train['Dir'][element] * math.pi / 180.0)
    dx.append(x)
df_train['dx'] = dx
CPU times: user 10.5 s, sys: 91.2 ms, total: 10.6 s
Wall time: 10.8 s
Method 2: Use "iterrows" for row operations.
If you want to iterate over each row of a dataframe, this is the method to use, as it was built into Pandas for that purpose. Again, I would avoid this method, as I actually got slower computation time: about 5x slower than the plain loop.
%%time
dx = []
for index, row in df_train.iterrows():
    x = row['S'] * math.cos(row['Dir'] * math.pi / 180.0)
    dx.append(x)
df_train['dx'] = dx
CPU times: user 49.3 s, sys: 1.22 s, total: 50.6 s
Wall time: 51.9 s
Method 3: Use the apply function.
This method would be my first choice conceptually, as it is easy to understand and requires the least amount of code, but the downside is that you are not utilizing the vectorization of the underlying NumPy array. It still iterates over each row, though with a number of internal optimizations, such as using iterators in Cython. For my example, however, it was slightly slower than the original loop.
%%time
df_train['dx'] = df_train.apply(lambda row: row.S * math.cos(row.Dir * math.pi / 180.0), axis=1)
CPU times: user 14.2 s, sys: 455 ms, total: 14.7 s
Wall time: 15 s
Method 4: Vectorization over Pandas Series
Vectorization means performing the calculation over the whole NumPy array at once, i.e., computing the entire column in a single operation. This takes full advantage of the NumPy library, and it is the approach we should strive for.
%%time
df_train['dx'] = df_train.S*np.cos((df_train.Dir)*np.pi/180.0)
CPU times: user 13.7 ms, sys: 3.66 ms, total: 17.4 ms
Wall time: 16.3 ms
As we can see, the code is much simpler, and it is hundreds of times faster than the row-iterative methods (16.3 ms vs. 10.8 s is roughly a 660x speedup).
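A quick sanity check (on a small synthetic frame, since df_train is not included here) confirms that the vectorized column matches the element-wise loop:

```python
import math
import numpy as np
import pandas as pd

# Small synthetic frame with the same column names as the post.
df = pd.DataFrame({'S': [1.0, 2.5, 4.2], 'Dir': [0.0, 90.0, 270.0]})

# Loop version (Method 1).
loop_dx = [df['S'][i] * math.cos(df['Dir'][i] * math.pi / 180.0)
           for i in range(len(df))]

# Vectorized version (Method 4).
vec_dx = df.S * np.cos(df.Dir * np.pi / 180.0)

print(np.allclose(loop_dx, vec_dx))  # True
```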
Method 5: Vectorization over Numpy array
The final improvement we can make is to extract the underlying NumPy arrays (via .values) before doing the calculation. It gives a further slight boost in speed.
%%time
df_train['dx'] = df_train.S.values*np.cos((df_train.Dir.values)*np.pi/180.0)
CPU times: user 7.87 ms, sys: 3.03 ms, total: 10.9 ms
Wall time: 12.1 ms
Again, this step is not strictly necessary, but it gives a nice additional speed boost, since working on the raw arrays skips Pandas' index-alignment overhead.
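A minor side note: on recent Pandas versions, .to_numpy() is the documented successor to .values for extracting the underlying array, and it works the same way here (a small sketch with made-up values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'S': [1.0, 2.0], 'Dir': [0.0, 180.0]})

# .to_numpy() is the recommended modern spelling of .values.
dx = df['S'].to_numpy() * np.cos(df['Dir'].to_numpy() * np.pi / 180.0)
print(dx)  # [ 1. -2.]
```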
Conclusion
We want to avoid any row-iterative method for performing calculations on a Pandas dataframe. However, for certain transformations, like string manipulations, I found that .apply would be my go-to, as it is simple and gets the job done.
But if the calculation can be expressed with NumPy-based operators, we should use Method 4 or 5, as they truly utilize the power of the NumPy arrays backing the dataframe.
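To illustrate the string case, here is a hypothetical example (the column and helper are made up for illustration, not part of df_train) where .apply lets a plain Python function do a transformation that does not map cleanly onto a single vectorized operation:

```python
import pandas as pd

# Hypothetical data for illustration.
names = pd.DataFrame({'player': ['Tom Brady', 'Aaron Rodgers', 'Patrick Mahomes']})

def abbreviate(full_name):
    # Build a "LASTNAME, F." style abbreviation.
    first, last = full_name.split(' ', 1)
    return f"{last.upper()}, {first[0]}."

names['abbrev'] = names['player'].apply(abbreviate)
print(names['abbrev'].tolist())  # ['BRADY, T.', 'RODGERS, A.', 'MAHOMES, P.']
```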