Pandas rolling.agg ignores NaNs

from pandas import DataFrame, Timestamp
from numpy import nan

b = DataFrame({u'average_rrc_connected': {Timestamp('2016-11-29 17:30:00', freq='15T'): nan, Timestamp('2016-11-29 16:00:00', freq='15T'): 0.0, Timestamp('2016-11-29 17:15:00', freq='15T'): nan, Timestamp('2016-11-29 17:00:00', freq='15T'): 0.0, Timestamp('2016-11-29 15:30:00', freq='15T'): nan, Timestamp('2016-11-29 15:45:00', freq='15T'): 0.0, Timestamp('2016-11-29 16:15:00', freq='15T'): 0.0, Timestamp('2016-11-29 16:30:00', freq='15T'): 0.0, Timestamp('2016-11-29 16:45:00', freq='15T'): 0.0}})

In [26]: b
Out[26]:
                     average_rrc_connected
2016-11-29 15:30:00                    NaN
2016-11-29 15:45:00                    0.0
2016-11-29 16:00:00                    0.0
2016-11-29 16:15:00                    0.0
2016-11-29 16:30:00                    0.0
2016-11-29 16:45:00                    0.0
2016-11-29 17:00:00                    0.0
2016-11-29 17:15:00                    NaN
2016-11-29 17:30:00                    NaN

rb = b.rolling(window=2)

Show the rolling window:

def show_x(x):
    print(x)
    return x[1]

In [28]: rs_b = rb.agg(show_x)
[ 0.  0.]
[ 0.  0.]
[ 0.  0.]
[ 0.  0.]
[ 0.  0.]

In [29]: rs_b
Out[29]:
                     average_rrc_connected
2016-11-29 15:30:00                    NaN
2016-11-29 15:45:00                    NaN
2016-11-29 16:00:00                    0.0
2016-11-29 16:15:00                    0.0
2016-11-29 16:30:00                    0.0
2016-11-29 16:45:00                    0.0
2016-11-29 17:00:00                    0.0
2016-11-29 17:15:00                    NaN
2016-11-29 17:30:00                    NaN

Problem description

In this example, I expect the 15:45 slot to show 0.0 instead of NaN. The point of the example is that windows containing NaNs appear to be aggregated as NaN automatically, even when the aggregation function would return something else.

What I'm actually trying to achieve is a kind of "rolling reshape", where "historical" data from the beginning of the window (including NaNs) is placed in columns alongside each other. Perhaps there's a better way to achieve this:

def get_index(i):
    f = lambda x: x[i]
    f.__name__ = str(i)
    return f

rs_b2 = rb.agg({col: {i - 1: get_index(i) for i in range(2)} for col in b.columns})

In [37]: rs_b2
Out[37]:
                    average_rrc_connected
                                        0   -1
2016-11-29 15:30:00                   NaN  NaN
2016-11-29 15:45:00                   NaN  NaN
2016-11-29 16:00:00                   0.0  0.0
2016-11-29 16:15:00                   0.0  0.0
2016-11-29 16:30:00                   0.0  0.0
2016-11-29 16:45:00                   0.0  0.0
2016-11-29 17:00:00                   0.0  0.0
2016-11-29 17:15:00                   NaN  NaN
2016-11-29 17:30:00                   NaN  NaN

The expected output for the 15:45 slot would be -1: NaN, 0: 0.0.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 16.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 28.6.1
Cython: 0.25.2
numpy: 1.12.0
scipy: 0.18.1
statsmodels: 0.8.0
xarray: 0.9.1
IPython: 5.2.2
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.2
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.5.3
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.5
boto: 2.45.0
pandas_datareader: None

Comment From: jorisvandenbossche

@ohadle This is not because a function applied to a window that includes a NaN returns NaN per se, but because, by default, the full window must contain values before the function is applied to it. That is only the default behaviour, though, and it can easily be changed with the min_periods keyword:

In [23]: b.rolling(window=2).agg(show_x)
[ 0.  0.]
[ 0.  0.]
[ 0.  0.]
[ 0.  0.]
[ 0.  0.]
Out[23]: 
                     average_rrc_connected
2016-11-29 15:30:00                    NaN
2016-11-29 15:45:00                    NaN
2016-11-29 16:00:00                    0.0
2016-11-29 16:15:00                    0.0
2016-11-29 16:30:00                    0.0
2016-11-29 16:45:00                    0.0
2016-11-29 17:00:00                    0.0
2016-11-29 17:15:00                    NaN
2016-11-29 17:30:00                    NaN

In [24]: b.rolling(window=2, min_periods=1).agg(show_x)
[ nan   0.]
[ 0.  0.]
[ 0.  0.]
[ 0.  0.]
[ 0.  0.]
[ 0.  0.]
[  0.  nan]
Out[24]: 
                     average_rrc_connected
2016-11-29 15:30:00                    NaN
2016-11-29 15:45:00                    0.0
2016-11-29 16:00:00                    0.0
2016-11-29 16:15:00                    0.0
2016-11-29 16:30:00                    0.0
2016-11-29 16:45:00                    0.0
2016-11-29 17:00:00                    0.0
2016-11-29 17:15:00                    NaN
2016-11-29 17:30:00                    NaN
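
For readers on a newer setup, the same comparison can be sketched as below. This is only an illustration under assumptions that are not in the original report: Python 3, a pandas version where rolling.apply accepts raw=True, and a hypothetical helper last_value standing in for show_x (it returns x[-1] rather than x[1] so it does not depend on the window length).

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the report (same values, same 15-minute index).
idx = pd.date_range("2016-11-29 15:30", periods=9, freq="15min")
b = pd.DataFrame(
    {"average_rrc_connected": [np.nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, np.nan, np.nan]},
    index=idx,
)


def last_value(x):
    # Hypothetical stand-in for show_x: return the newest value in the window.
    return x[-1]


# Default: min_periods equals the window size, so a window with any NaN is skipped.
strict = b.rolling(window=2).apply(last_value, raw=True)

# min_periods=1: the function runs as soon as one non-NaN value is in the window.
relaxed = b.rolling(window=2, min_periods=1).apply(last_value, raw=True)

print(pd.concat({"strict": strict, "relaxed": relaxed}, axis=1))
```

With min_periods=1 the 15:45 slot is computed from the window [nan, 0.0], while the 17:15 slot is still NaN because the function itself returns the NaN at the end of [0.0, nan].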

Comment From: jorisvandenbossche

But I think you are just trying to do a shift?

Comment From: ohadle

Thanks! min_periods seems to clarify what's going on. shift looks relevant; perhaps I could use it with a merge or the like to produce several shifted columns.
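
One possible way to get several shifted columns side by side is sketched below. It is not taken from the thread: the helper name lagged_columns is hypothetical, it uses pd.concat rather than a merge, and it assumes Python 3 with a reasonably recent pandas. Column labels follow the 0 / -1 convention used in the report (0 is the current row, -1 the previous one).

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the report (same values, same 15-minute index).
idx = pd.date_range("2016-11-29 15:30", periods=9, freq="15min")
b = pd.DataFrame(
    {"average_rrc_connected": [np.nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, np.nan, np.nan]},
    index=idx,
)


def lagged_columns(df, window):
    # Hypothetical helper: for each column, place the current value (label 0)
    # next to its previous values (labels -1, -2, ...) using shift.
    per_column = {
        col: pd.DataFrame({-lag: df[col].shift(lag) for lag in range(window)})
        for col in df.columns
    }
    # The dict keys become the outer level of the resulting column MultiIndex.
    return pd.concat(per_column, axis=1)


rs_b2 = lagged_columns(b, window=2)
print(rs_b2)
```

Because shift does no aggregation, NaNs are carried through unchanged, so the 15:45 row ends up with 0.0 under 0 and NaN under -1.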
