You're looping over a DataFrame. It feels natural. It's killing your performance.
# What most tutorials say
for index, row in df.iterrows():
    df.at[index, 'tax'] = row['price'] * 0.17
Here's the progression you should actually know:
# Level 1: Vectorization — 10-100x faster
df['tax'] = df['price'] * 0.17
# Level 2: .apply() when logic is conditional
df['tax'] = df['price'].apply(lambda x: x * 0.17 if x > 0 else 0)
# Level 3: np.where, conditional logic at vectorized speed
import numpy as np
df['tax'] = np.where(df['price'] > 0, df['price'] * 0.17, 0)
| Method | 1M rows |
|---|---|
| .iterrows() | ~480s |
| .apply() | ~3s |
| Vectorized / np.where | ~0.04s |
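If you want to check numbers like these on your own machine, here is a rough timing sketch. It is not the benchmark behind the table: the row count is cut to 10,000 so the loop finishes quickly, the prices are random made-up data, and the helper names (`with_iterrows`, `with_apply`, `with_where`, `timed`) are hypothetical. Absolute timings will vary with hardware and pandas version.

```python
import time

import numpy as np
import pandas as pd

N_ROWS = 10_000  # assumption: far fewer than 1M rows so the loop stays quick
rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.uniform(-10, 100, N_ROWS)})  # made-up prices


def with_iterrows(frame):
    # Row-by-row loop, same conditional logic as the other two versions
    out = frame.copy()
    out["tax"] = 0.0
    for index, row in out.iterrows():
        out.at[index, "tax"] = row["price"] * 0.17 if row["price"] > 0 else 0.0
    return out["tax"].to_numpy()


def with_apply(frame):
    # One Python function call per element
    return frame["price"].apply(lambda x: x * 0.17 if x > 0 else 0.0).to_numpy()


def with_where(frame):
    # One array-level operation over the whole column
    return np.where(frame["price"] > 0, frame["price"] * 0.17, 0.0)


def timed(func, frame):
    # Return the result together with the elapsed wall-clock seconds
    start = time.perf_counter()
    result = func(frame)
    return result, time.perf_counter() - start


results = {}
for name, func in [("iterrows", with_iterrows),
                   ("apply", with_apply),
                   ("np.where", with_where)]:
    results[name], seconds = timed(func, df)
    print(f"{name:>9}: {seconds:.4f}s")
```

All three versions should produce the same tax values, so any speed difference comes from how the work is done, not what is computed.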
Pandas wraps NumPy. NumPy operates on entire arrays at the C level. The moment you loop row by row, you throw that away.
The shift: stop asking "what do I do to each row?" and start asking "what transformation applies to this column?"
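Here's a small sketch of that shift using the same tax rule. The sample prices are made up and `tax_for_price` is a hypothetical helper; the point is only that the row question and the column question give the same answer.

```python
import numpy as np
import pandas as pd

TAX_RATE = 0.17  # rate used in the examples above

df = pd.DataFrame({"price": [100.0, 0.0, -5.0, 20.0]})  # made-up sample data


# Row question: "what do I do to each row?"
def tax_for_price(price):
    return price * TAX_RATE if price > 0 else 0.0


row_wise = df["price"].apply(tax_for_price)

# Column question: "what transformation applies to this column?"
is_taxable = df["price"] > 0  # boolean mask computed for the whole column at once
column_wise = np.where(is_taxable, df["price"] * TAX_RATE, 0.0)

# Both approaches agree on every row
assert np.allclose(row_wise.to_numpy(), column_wise)
```

The `if`/`else` inside the function becomes a boolean mask, and the two branches become the two value arguments to `np.where`.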
That's it. Notebooks that took minutes will now run in seconds.