DEV Community


Iterate over rows in Pandas

Course Probe ・4 min read

In this short tutorial, we cover how to iterate over the rows of a DataFrame in Pandas.

Let’s suppose you have the following Pandas dataframe:

import pandas as pd
inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
df = pd.DataFrame(inp)
print(df)

Output:

   c1   c2
0  10  100
1  11  110
2  12  120

If you want to iterate over the rows so that you can access the values in individual cells, keep reading: the examples below show a few ways to loop through the rows of a Pandas dataframe.

Example 1:

DataFrame.iterrows is a generator which yields both the index and row (as a Series):

import pandas as pd
import numpy as np

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

for index, row in df.iterrows():
    print(row['c1'], row['c2'])

Output:

10 100
11 110
12 120
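Because each row comes back as a single Series, values from columns with different dtypes are coerced to one common dtype. The sketch below illustrates this with a hypothetical variant of the frame above in which c2 holds floats (an assumption made for illustration, not part of the original example):

```python
import pandas as pd

# Hypothetical variant of the frame above: c1 holds ints, c2 holds floats.
df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [1.5, 2.5, 3.5]})

# Take only the first (index, row) pair from the generator.
index, row = next(df.iterrows())

row_type = type(row).__name__        # each row is a pandas Series
frame_dtype = str(df['c1'].dtype)    # integer dtype inside the DataFrame
row_dtype = str(row.dtype)           # single common dtype of the row Series

print(row_type, frame_dtype, row_dtype)
```

The integer column keeps its dtype in the DataFrame, but the value read from the row is a float. If exact dtypes matter, this is one more reason to prefer itertuples().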

Example 2:

While iterrows() is a good option, sometimes itertuples() can be much faster:

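import pandas as pd
from numpy.random import randn, randint

df = pd.DataFrame({'a': randn(1000), 'b': randn(1000), 'N': randint(100, 1000, 1000), 'x': 'x'})

# %timeit is an IPython/Jupyter magic, so run these lines in a notebook or IPython shell
%timeit [row.a * 2 for idx, row in df.iterrows()]
# => 10 loops, best of 3: 50.3 ms per loop

# row[0] is the index, so row[1] is column 'a'
%timeit [row[1] * 2 for row in df.itertuples()]
# => 1000 loops, best of 3: 541 µs per loop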
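%timeit only exists in IPython and Jupyter. In a plain Python script, a comparable measurement could be sketched with the standard timeit module. The helper names with_iterrows and with_itertuples are made up for this illustration, and the timings will differ from the figures quoted above depending on the machine and the pandas version:

```python
import timeit

import pandas as pd
from numpy.random import randn

df = pd.DataFrame({'a': randn(1000), 'b': randn(1000)})

def with_iterrows():
    # Each row is a Series, so the column is looked up by label.
    return [row['a'] * 2 for _, row in df.iterrows()]

def with_itertuples():
    # Each row is a namedtuple, so the column is an attribute.
    return [row.a * 2 for row in df.itertuples()]

# Run each variant a few times and report the total elapsed seconds.
t_rows = timeit.timeit(with_iterrows, number=3)
t_tuples = timeit.timeit(with_itertuples, number=3)

print(f"iterrows:   {t_rows:.4f} s")
print(f"itertuples: {t_tuples:.4f} s")
```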


How to iterate efficiently

If you really have to iterate over a Pandas dataframe, you will probably want to avoid iterrows(). Of the available methods, the usual iterrows() is far from the best: itertuples() can be 100 times faster.

In short:

  • As a general rule, use df.itertuples(name=None), in particular when you have a fixed number of columns and fewer than 255 of them. See point (3).

  • Otherwise, use df.itertuples(), unless your column names contain special characters such as spaces or '-'. See point (2).

  • You can use itertuples() even if your dataframe has unusual column names by following the last example. See point (4).

  • Only use iterrows() if you cannot use any of the previous solutions. See point (1).

Different methods to iterate over rows in a Pandas dataframe:

Generate a random dataframe with a million rows and 4 columns:

import time  # used by the timing snippets below

df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
print(df)

1) The usual iterrows() is convenient, but damn slow:

start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for _, row in df.iterrows():
    result += max(row['B'], row['C'])

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))

2) The default itertuples() is already much faster, but attribute access does not work with column names such as My Col-Name is very Strange. Avoid this method if your column names are repeated or cannot be used directly as Python variable names:

start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for row in df.itertuples(index=False):
    result += max(row.B, row.C)

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
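To see why attribute access breaks for such names, it helps to inspect the fields of the tuples that itertuples() produces. A minimal sketch, using a made-up two-column frame: names that are not valid Python identifiers are replaced by positional names.

```python
import pandas as pd

# Made-up frame: the first column name is not a valid Python identifier.
df = pd.DataFrame({'My Col-Name': [1, 2], 'B': [3, 4]})

first = next(df.itertuples())   # first row as a namedtuple
fields = first._fields          # field names the namedtuple actually uses

print(fields)

# The renamed column is only reachable through its positional replacement.
print(getattr(first, fields[1]), first.B)
```

Here the fields come out as Index, _1 and B, so the original label 'My Col-Name' is no longer available as an attribute.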

3) itertuples() with name=None is even faster, but less convenient, as you have to define a variable per column:

start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for _, col1, col2, col3, col4 in df.itertuples(name=None):
    result += max(col2, col3)

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
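With name=None each row is a plain tuple whose first element is the index, followed by the column values in column order, which is why the loop above unpacks five names for four columns. A small sketch on the three-row frame from the start of the article:

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

# Plain tuples laid out as (index, c1, c2).
rows = list(df.itertuples(name=None))

# With index=False the index is dropped and the layout is (c1, c2).
rows_no_index = list(df.itertuples(index=False, name=None))

print(rows)
print(rows_no_index)
```

Passing index=False removes the need for the throwaway _ variable when the index is not used.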

4) Finally, indexing the named itertuples() by position with df.columns.get_loc() is slower than the previous point, but you do not have to define a variable per column, and it works with column names such as My Col-Name is very Strange:

start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for row in df.itertuples(index=False):
    result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))
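Some of the overhead in point 4 likely comes from calling df.columns.get_loc() twice per row. One possible variation, not part of the original benchmark and therefore without a timing here, resolves the positions once before the loop. The column names and the smaller frame size below are chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Smaller illustrative frame whose column names are not valid identifiers.
df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 2)),
                  columns=['My Col-Name', 'Other Col'])

# Resolve the column positions once, outside the loop.
pos_first = df.columns.get_loc('My Col-Name')
pos_second = df.columns.get_loc('Other Col')

result = 0
for row in df.itertuples(index=False):
    # With index=False, tuple positions match the column positions.
    result += max(row[pos_first], row[pos_second])

print(result)
```

Note that index=False matters here: with the default index=True, every column would be shifted one position to the right.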

Output:

         A   B   C   D
0       41  63  42  23
1       54   9  24  65
2       15  34  10   9
3       39  94  82  97
4        4  88  79  54
...     ..  ..  ..  ..
999995  48  27   4  25
999996  16  51  34  28
999997   1  39  61  14
999998  66  51  27  70
999999  51  53  47  99

[1000000 rows x 4 columns]

1. Iterrows done in 104.96 seconds, result = 66151519
2. Named Itertuples done in 1.26 seconds, result = 66151519
3. Itertuples done in 0.94 seconds, result = 66151519
4. Polyvalent Itertuples working even with special characters in the column name done in 2.9
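The result printed by the loops can also be cross-checked without a Python-level loop. This is not one of the methods compared above; it is a sketch of a column-wise check, shown on a smaller frame with an arbitrary seed (both are assumptions made for illustration):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed so the sketch is repeatable
df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 4)), columns=list('ABCD'))

# Row-by-row version, as in point 3.
looped = 0
for _, a, b, c, d in df.itertuples(name=None):
    looped += max(b, c)

# Column-wise version: row-wise maximum of B and C, then the sum.
vectorized = int(df[['B', 'C']].max(axis=1).sum())

print(looped, vectorized)
```

Both numbers should agree, which gives a quick sanity check on whichever loop you end up using.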

source: https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas

