I cProfiled my Python code and found that 'shift()' is one of the speed bottlenecks.
Then I ran the test below:
In [27]: df = DataFrame(np.random.randn(50000, 10))
In [28]: timeit df.shift(1)
10 loops, best of 3: 19.7 ms per loop
In [29]: df = DataFrame(np.random.randn(50000, 100))
In [30]: timeit df.shift(1)
1 loops, best of 3: 208 ms per loop
I found the time spent in 'df.shift(1)' is proportional to len(df.columns). Can this be improved? Since all columns share the same index, why not shift all columns in one batch instead of one column at a time?
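For illustration, a batch shift over the whole block can be sketched in plain NumPy by slicing the underlying 2-D array once instead of shifting column by column. The function name `shift_values` is hypothetical, and the sketch assumes a homogeneous float DataFrame and a positive `periods`; it is not how pandas implements `shift` internally.

```python
import numpy as np
import pandas as pd

def shift_values(df, periods=1):
    # Hypothetical sketch: shift every column in one batch by slicing
    # the underlying ndarray. Assumes an all-float DataFrame and
    # periods > 0; this is NOT pandas' internal implementation.
    values = np.empty_like(df.values)
    values[:periods] = np.nan            # new leading rows are NaN
    values[periods:] = df.values[:-periods]  # one bulk copy for all columns
    return pd.DataFrame(values, index=df.index, columns=df.columns)

df = pd.DataFrame(np.random.randn(5, 3))
shifted = shift_values(df, 1)  # same result as df.shift(1) for float data
```

The single `values[periods:] = df.values[:-periods]` assignment moves all columns at once, which is the batched behaviour the question asks about.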
Comment From: wesm
Calling shift like this results in data being moved into a new object, e.g.:
In [15]: df
Out[15]:
0 1 2 3 4
0 1.072132 -0.702515 -0.554493 0.017083 0.868136
1 0.043361 0.682429 -0.064944 -1.588128 0.584704
2 0.040340 0.346544 1.016149 -0.531106 -0.807536
3 2.104821 -1.281473 -0.272546 -0.101870 -0.242820
4 -0.611266 0.078882 0.290995 0.538874 0.412343
In [16]: df.shift(1)
Out[16]:
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 1.072132 -0.702515 -0.554493 0.017083 0.868136
2 0.043361 0.682429 -0.064944 -1.588128 0.584704
3 0.040340 0.346544 1.016149 -0.531106 -0.807536
4 2.104821 -1.281473 -0.272546 -0.101870 -0.242820
In the second case, the DataFrame contains 10 times as much data, so shift
takes 10 times as long to run. You can do shift
more cheaply by creating an array view, but if you then do, say, df - df.shift(1),
realignment occurs behind the scenes, so you're just deferring the work
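The point about deferred work can be illustrated directly: a "free" shift that only relabels a view of the data copies nothing, but subtracting it from the original still pays the cost, because pandas aligns the two indexes before computing. A small sketch (illustrative only, assuming a float DataFrame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3))

# A cheap "shift": reuse the same data via an ndarray view and simply
# attach the labels moved down by one. No values are copied here.
view = pd.DataFrame(df.values[:-1], index=df.index[1:], columns=df.columns)

# But the subtraction triggers index realignment behind the scenes,
# so the work avoided by the cheap shift is paid at this step instead.
result = df - view  # same values as df.diff()
```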
Comment From: halleygithub
Yes, I am trying to reduce the running time of diff(). Before timing it, I thought most of the time was spent on DataFrame alignment, but it turns out 'shift(1)' is the major cost. Since my application needs to diff() a 50000*200 DataFrame several times per round, the end user has to wait 5-6 seconds to see the result. That is the No. 1 bottleneck. Any tips to improve it?
Comment From: wesm
I would suggest writing a custom Cython function to compute the diff in one pass. It'd be great if you could contribute this back to pandas.
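As a rough sketch of what such a one-pass routine would compute (here in plain NumPy rather than Cython, with a hypothetical name `fast_diff`; assumes a float DataFrame and periods > 0):

```python
import numpy as np
import pandas as pd

def fast_diff(df, periods=1):
    # NumPy sketch of a one-pass diff: one vectorized subtraction over
    # all columns at once, with no intermediate shifted DataFrame and
    # no index realignment. A Cython version would loop over the same
    # 2-D array in a single pass.
    values = df.values
    out = np.empty_like(values)
    out[:periods] = np.nan
    np.subtract(values[periods:], values[:-periods], out=out[periods:])
    return pd.DataFrame(out, index=df.index, columns=df.columns)

df = pd.DataFrame(np.random.randn(6, 4))
d = fast_diff(df)  # matches df.diff() for float data
```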
Comment From: halleygithub
Thanks, I will try.