I profiled my Python code with cProfile and found that 'shift()' is one of the speed bottlenecks.

Then I ran the test below:

In [27]: df = DataFrame(np.random.randn(50000, 10))

In [28]: timeit df.shift(1)
10 loops, best of 3: 19.7 ms per loop

In [29]: df = DataFrame(np.random.randn(50000, 100))

In [30]: timeit df.shift(1)
1 loops, best of 3: 208 ms per loop

I found that the time spent on 'df.shift(1)' is proportional to len(df.columns). Can it be improved? Since all columns share the same index, why not shift all columns in one batch instead of column by column?

Comment From: wesm

Calling shift like this results in data being moved into a new object, e.g.:

In [15]: df
Out[15]: 
          0         1         2         3         4
0  1.072132 -0.702515 -0.554493  0.017083  0.868136
1  0.043361  0.682429 -0.064944 -1.588128  0.584704
2  0.040340  0.346544  1.016149 -0.531106 -0.807536
3  2.104821 -1.281473 -0.272546 -0.101870 -0.242820
4 -0.611266  0.078882  0.290995  0.538874  0.412343

In [16]: df.shift(1)
Out[16]: 
          0         1         2         3         4
0       NaN       NaN       NaN       NaN       NaN
1  1.072132 -0.702515 -0.554493  0.017083  0.868136
2  0.043361  0.682429 -0.064944 -1.588128  0.584704
3  0.040340  0.346544  1.016149 -0.531106 -0.807536
4  2.104821 -1.281473 -0.272546 -0.101870 -0.242820

In the second case, the DataFrame contains 10 times as much data, so shift takes 10 times as long to run. You can do shift more cheaply by creating an array view, but then if you do, say, df - df.shift(1), realignment occurs behind the scenes, so you're just deferring the work.
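To illustrate the difference between an array view and a materialized shift, here is a minimal NumPy sketch. It assumes the data is one homogeneous float64 2-D array; the variable names are illustrative only, and this is not how pandas implements shift internally.

```python
import numpy as np

# Assumed stand-in for the values behind a 50000 x 10 all-float DataFrame.
values = np.random.randn(50000, 10)

# Basic slicing returns views: no data is copied here.
lagged = values[:-1]    # the "previous row" for rows 1..n-1
current = values[1:]    # rows 1..n-1
assert np.shares_memory(lagged, values)

# Materializing the shifted result allocates a new array and copies
# every element, so the cost grows with rows * columns.
shifted = np.empty_like(values)
shifted[0] = np.nan     # the first row has no predecessor
shifted[1:] = lagged
```

The view costs almost nothing to create, but any later operation that needs the shifted rows lined up against the original index still has to touch all of the data.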

Comment From: halleygithub

Yes, I am trying to reduce the running time of diff(). Before timing it, I thought most of the time was spent on DataFrame alignment, but 'shift(1)' is actually the major cause. As my application needs to 'diff()' the 50000*200 DataFrame several times in a round, the end user needs to wait 5-6 seconds to see the result. That is the No. 1 bottleneck. Any tips to improve it?

Comment From: wesm

I would suggest a custom Cython function to compute the diff in one pass. It'd be great if you could contribute this back to pandas.
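As a starting point before writing any Cython, the one-pass idea can be sketched in plain NumPy. This is only an illustration under stated assumptions: `fast_diff` is a hypothetical helper (not a pandas API), it assumes every column is numeric, and whether it is actually faster than `df.diff()` depends on the pandas version and the data.

```python
import numpy as np
import pandas as pd


def fast_diff(df: pd.DataFrame) -> pd.DataFrame:
    """First difference along rows, computed on the underlying 2-D array.

    Hypothetical helper; assumes all columns are numeric.
    """
    values = df.to_numpy(dtype=float)
    out = np.empty_like(values)
    out[:1] = np.nan                     # the first row has no predecessor
    out[1:] = values[1:] - values[:-1]   # all columns in one vectorized pass
    # Reattach the original labels, so no index alignment is needed.
    return pd.DataFrame(out, index=df.index, columns=df.columns)


df = pd.DataFrame(np.random.randn(1000, 20))
result = fast_diff(df)
```

Because the subtraction works on positional slices of one array, it skips both the intermediate shifted DataFrame and the index alignment that `df - df.shift(1)` would trigger.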

Comment From: halleygithub

Thanks, I will try.