I'm honestly not sure if that is to be expected or if it's a general issue. It is a problem when doing row comparisons that can't be done with existing row-wise methods like diff.
Consider this example showing three functions doing row-wise differences on a randomly generated dataframe:
import pandas as pd
import random as rd
ITER = 1000
rd.seed()
D = {'A':[]}
for i in range(ITER):
    D['A'].append(rd.randint(0,100))
P = pd.DataFrame.from_dict(D)
def Test1(P):
    K = list(zip(P.A))  # list() so K can also be indexed on Python 3
    S = [0]
    for i in range(ITER-1):
        S.append(K[i+1][0] - K[i][0])
    return pd.merge(P,pd.DataFrame(S),left_index=True,right_index=True)
def Test2(P):
    S = [0]
    for i in range(ITER-1):
        S.append(P.iloc[i+1][0] - P.iloc[i][0])
    return pd.merge(P,pd.DataFrame(S),left_index=True,right_index=True)
def Test3(P):
    return pd.merge(P,P.A.diff().to_frame(),left_index=True,right_index=True)
Now of course Test3 is the way to do it correctly in this special case, but this is just meant as an example.
Here's the output of doing a timeit on all three methods:
%timeit(Test1(P))
1000 loops, best of 3: 2 ms per loop
%timeit(Test2(P))
1 loop, best of 3: 315 ms per loop
%timeit(Test3(P))
1000 loops, best of 3: 1.3 ms per loop
This shows that iterating over iloc (315 ms) is more than 150 times slower than iterating over a list created by calling zip on the dataframe column (2 ms).
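For row comparisons that diff cannot express, one option is to pull the column out of the DataFrame once and loop over plain values, which is essentially what Test1 does. The following is only a minimal sketch of that idea: the helper name `rowwise_diff`, the `delta` column name, and the fixed seed are assumptions made for illustration and are not part of the original report.

```python
import random

import pandas as pd


def rowwise_diff(frame, column="A"):
    """Hypothetical helper: difference between each row and the previous one.

    Assumes the frame has at least one row.
    """
    # Extract the column once; the loop below only touches plain Python ints.
    values = frame[column].tolist()
    # The first row has no predecessor; use 0, as Test1 and Test2 do.
    deltas = [0]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    # assign() returns a new DataFrame with the extra column attached.
    return frame.assign(delta=deltas)


random.seed(0)  # assumption: fixed seed so the run is reproducible
frame = pd.DataFrame({"A": [random.randint(0, 100) for _ in range(1000)]})
result = rowwise_diff(frame)
```

The subtraction inside the loop can be swapped for any comparison of two neighbouring values; the point is only that no DataFrame indexing happens per iteration.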
output of pd.show_versions()
python: 2.7.12.final.0 pandas: 0.18.1
Comment From: jreback
You are using it in about the most inefficient way possible. You are doing multiple operations and creating intermediate Series in the middle. Generally .iloc accepts quite a variety of input for flexibility. Row iterating is never recommended.
In [32]: P.iloc[0]
Out[32]:
A    73
Name: 0, dtype: int64
In [33]: P.iloc[0][0]
Out[33]: 73
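The session above shows that P.iloc[0] first builds a whole Series for the row, and only then does the trailing [0] pick out a scalar. A hedged sketch of two cheaper ways to reach the same scalar, using the existing `.iat` accessor and a one-time conversion to a NumPy array, might look like this. The three-row DataFrame is made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"A": [73, 10, 42]})  # made-up data for illustration

# Two-step access as in the report: an intermediate Series, then a lookup on it.
row = frame.iloc[0]
via_series = row["A"]

# .iat is meant for scalar access by integer position (row, column).
via_iat = frame.iat[0, 0]

# Convert the column to a NumPy array once, then index it as often as needed.
column = frame["A"].values
via_array = column[0]
```

All three lookups return the same value; they differ in how much intermediate work pandas does for each access, which matters when the access sits inside a loop.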