from pandas import DataFrame, Timestamp
from numpy import nan
b = DataFrame({u'average_rrc_connected': {Timestamp('2016-11-29 17:30:00', freq='15T'): nan, Timestamp('2016-11-29 16:00:00', freq='15T'): 0.0, Timestamp('2016-11-29 17:15:00', freq='15T'): nan, Timestamp('2016-11-29 17:00:00', freq='15T'): 0.0, Timestamp('2016-11-29 15:30:00', freq='15T'): nan, Timestamp('2016-11-29 15:45:00', freq='15T'): 0.0, Timestamp('2016-11-29 16:15:00', freq='15T'): 0.0, Timestamp('2016-11-29 16:30:00', freq='15T'): 0.0, Timestamp('2016-11-29 16:45:00', freq='15T'): 0.0}})
In [26]: b
Out[26]:
average_rrc_connected
2016-11-29 15:30:00 NaN
2016-11-29 15:45:00 0.0
2016-11-29 16:00:00 0.0
2016-11-29 16:15:00 0.0
2016-11-29 16:30:00 0.0
2016-11-29 16:45:00 0.0
2016-11-29 17:00:00 0.0
2016-11-29 17:15:00 NaN
2016-11-29 17:30:00 NaN
rb = b.rolling(window=2)
Show the rolling window:
def show_x(x):
print x
return x[1]
In [28]: rs_b = rb.agg(show_x)
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
In [29]: rs_b
Out[29]:
average_rrc_connected
2016-11-29 15:30:00 NaN
2016-11-29 15:45:00 NaN
2016-11-29 16:00:00 0.0
2016-11-29 16:15:00 0.0
2016-11-29 16:30:00 0.0
2016-11-29 16:45:00 0.0
2016-11-29 17:00:00 0.0
2016-11-29 17:15:00 NaN
2016-11-29 17:30:00 NaN
Problem description
In this example, I expect the 15:45 slot to show 0.0 instead of NaN. This is to show windows that include NaNs are automatically aggregated as NaN, even if the aggregation function does something different.
What I'm actually trying to achieve is a kind of "rolling reshape", where "historical" data from the beginning of the window (including NaNs) is placed in columns alongside each other. Perhaps there's a better way to achieve this:
def get_index(i):
f = lambda x: x[i]
f.__name__ = str(i)
return f
rs_b2 = rb.agg({col: {i - 1: get_index(i) for i in range(2)} for col in b.columns})
In [37]: rs_b2
Out[37]:
average_rrc_connected
0 -1
2016-11-29 15:30:00 NaN NaN
2016-11-29 15:45:00 NaN NaN
2016-11-29 16:00:00 0.0 0.0
2016-11-29 16:15:00 0.0 0.0
2016-11-29 16:30:00 0.0 0.0
2016-11-29 16:45:00 0.0 0.0
2016-11-29 17:00:00 0.0 0.0
2016-11-29 17:15:00 NaN NaN
2016-11-29 17:30:00 NaN NaN
Where the expected output would be for the 15:45 slot to have -1: NaN, 0: 0.0
.
Output of pd.show_versions()
Comment From: jorisvandenbossche
@ohadle This is not because the function that is applied on a window that includes a NaN returns a NaN per se, but because we say that the full window should contain values before we apply the function on that window. However, this is the default behaviour, can easily be changed with the min_periods
keyword:
In [23]: b.rolling(window=2).agg(show_x)
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
Out[23]:
average_rrc_connected
2016-11-29 15:30:00 NaN
2016-11-29 15:45:00 NaN
2016-11-29 16:00:00 0.0
2016-11-29 16:15:00 0.0
2016-11-29 16:30:00 0.0
2016-11-29 16:45:00 0.0
2016-11-29 17:00:00 0.0
2016-11-29 17:15:00 NaN
2016-11-29 17:30:00 NaN
In [24]: b.rolling(window=2, min_periods=1).agg(show_x)
[ nan 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. nan]
Out[24]:
average_rrc_connected
2016-11-29 15:30:00 NaN
2016-11-29 15:45:00 0.0
2016-11-29 16:00:00 0.0
2016-11-29 16:15:00 0.0
2016-11-29 16:30:00 0.0
2016-11-29 16:45:00 0.0
2016-11-29 17:00:00 0.0
2016-11-29 17:15:00 NaN
2016-11-29 17:30:00 NaN
Comment From: jorisvandenbossche
But I think you are just trying to do a shift
?
Comment From: ohadle
Thanks! min_periods
seems to clarify what's going on.
shift
looks relevant, perhaps I could use it with a merge
or the like to produce several shifted columns.