I found another oddity while digging through #13966.
Begin with the initial DataFrame in that issue:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})
Save the grouping:
In [215]: g = df.groupby('A')
Compute the rolling sum:
In [216]: r = g.rolling(4)
In [217]: r.sum()
Out[217]:
A B
A
1 0 NaN NaN
1 NaN NaN
2 NaN NaN
3 4.0 6.0
4 4.0 10.0
5 4.0 14.0
6 4.0 18.0
7 4.0 22.0
8 4.0 26.0
9 4.0 30.0
... ... ...
2 30 8.0 114.0
31 8.0 118.0
3 32 NaN NaN
33 NaN NaN
34 NaN NaN
35 12.0 134.0
36 12.0 138.0
37 12.0 142.0
38 12.0 146.0
39 12.0 150.0
[40 rows x 2 columns]
It maintains the by column (A)! That column should not be in the resulting DataFrame.
It gets weirder if I compute the sum over the entire grouping and then re-do the rolling calculation. Now the by column is gone, as expected:
In [218]: g.sum()
Out[218]:
B
A
1 190
2 306
3 284
In [219]: r.sum()
Out[219]:
B
A
1 0 NaN
1 NaN
2 NaN
3 6.0
4 10.0
5 14.0
6 18.0
7 22.0
8 26.0
9 30.0
... ...
2 30 114.0
31 118.0
3 32 NaN
33 NaN
34 NaN
35 134.0
36 138.0
37 142.0
38 146.0
39 150.0
[40 rows x 1 columns]
So the grouping summation has some sort of side effect.
Comment From: chrisaycock
A little note while digging through more code: _convert_grouper in groupby.py has:
if isinstance(grouper, dict):
...
elif isinstance(grouper, Series):
...
elif isinstance(grouper, (list, Series, Index, np.ndarray)):
...
else:
...
The grouper is compared against Series twice. I will fix this when I clean up everything.
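For reference, a minimal sketch of what the de-duplicated branch order could look like (just an illustration, not the actual pandas source): once the Series case has its own branch, the later isinstance check only needs list, Index, and np.ndarray.
if isinstance(grouper, dict):
    ...
elif isinstance(grouper, Series):
    ...
elif isinstance(grouper, (list, Index, np.ndarray)):
    ...
else:
    ...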
Comment From: chrisaycock
I can fix the issue if I set the group selection:
g._set_group_selection()
I think we need this function at the start of .rolling().
Seems similar to #12839
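For concreteness, a minimal sketch of that fix applied to the example above (a sketch only; _set_group_selection is a private pandas API and may change):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

g = df.groupby('A')
g._set_group_selection()  # private API, used here only to illustrate the proposed fix
r = g.rolling(4)
r.sum()  # expected: only column B, without the grouping column A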
Comment From: jreback
This is defined behavior, in that it is identical to .apply on the groupby.
In [10]: df.groupby('A').rolling(4).sum()
Out[10]:
A B
A
1 0 NaN NaN
1 NaN NaN
2 NaN NaN
3 4.0 6.0
4 4.0 10.0
... ... ...
3 35 12.0 134.0
36 12.0 138.0
37 12.0 142.0
38 12.0 146.0
39 12.0 150.0
[40 rows x 2 columns]
In [11]: df.groupby('A').rolling(4).apply(lambda x: x.sum())
Out[11]:
A B
A
1 0 NaN NaN
1 NaN NaN
2 NaN NaN
3 4.0 6.0
4 4.0 10.0
... ... ...
3 35 12.0 134.0
36 12.0 138.0
37 12.0 142.0
38 12.0 146.0
39 12.0 150.0
[40 rows x 2 columns]
You can look back at the issues; IIRC @jorisvandenbossche and I had a long conversation about this.
Comment From: chrisaycock
Hmm:
In [617]: df.groupby('A').sum()
Out[617]:
B
A
1 190
2 306
3 284
In [618]: df.groupby('A').apply(lambda x: x.sum())
Out[618]:
A B
A
1 20 190
2 24 306
3 24 284
In addition to .rolling() and .apply(), .ohlc() and .expanding() keep the by column following a .groupby().
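A quick way to exercise those two cases against the same example frame (a sketch; the claim about the retained column comes from the comment above, not from output reproduced here):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

# Per the comment above, both of these also retain the grouping column A:
df.groupby('A').expanding().sum()
df.groupby('A').ohlc()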
Comment From: jreback
On reread, this should be consistent, so marking as a bug. We probably should not include the grouping column/level, even though apply does.
Comment From: ohadle
A similar thing happens with index columns.
import pandas
from pandas import Timestamp

c = pandas.DataFrame({u'ul_payload': {('a', Timestamp('2016-11-01 06:15:00')): 5, ('a', Timestamp('2016-11-01 07:45:00')): 8, ('a', Timestamp('2016-11-01 09:00:00')): 9, ('a', Timestamp('2016-11-01 07:15:00')): 6, ('a', Timestamp('2016-11-01 07:30:00')): 7, ('a', Timestamp('2016-11-01 06:00:00')): 4}, u'dl_payload': {('a', Timestamp('2016-11-01 06:15:00')): 15, ('a', Timestamp('2016-11-01 07:45:00')): 18, ('a', Timestamp('2016-11-01 09:00:00')): 19, ('a', Timestamp('2016-11-01 07:15:00')): 16, ('a', Timestamp('2016-11-01 07:30:00')): 17, ('a', Timestamp('2016-11-01 06:00:00')): 14}})
In [27]: c
Out[27]:
dl_payload ul_payload
a 2016-11-01 06:00:00 14 4
2016-11-01 06:15:00 15 5
2016-11-01 07:15:00 16 6
2016-11-01 07:30:00 17 7
2016-11-01 07:45:00 18 8
2016-11-01 09:00:00 19 9
In [29]: c.groupby(level=0).rolling(window=3).agg(np.sum)
Out[29]:
dl_payload ul_payload
a a 2016-11-01 06:00:00 NaN NaN
2016-11-01 06:15:00 NaN NaN
2016-11-01 07:15:00 45.0 15.0
2016-11-01 07:30:00 48.0 18.0
2016-11-01 07:45:00 51.0 21.0
2016-11-01 09:00:00 54.0 24.0
But not with group_keys=False:
In [48]: c.groupby(level=0, group_keys=False).rolling(window=3).agg(np.sum)
Out[48]:
dl_payload ul_payload
a 2016-11-01 06:00:00 NaN NaN
2016-11-01 06:15:00 NaN NaN
2016-11-01 07:15:00 45.0 15.0
2016-11-01 07:30:00 48.0 18.0
2016-11-01 07:45:00 51.0 21.0
2016-11-01 09:00:00 54.0 24.0
Comment From: vss888
Why is the issue closed? The problem is still there (pandas 0.24.2).
Comment From: jreback
This is closed in 0.25, coming soon.
Comment From: chiachong
Still the same problem in 0.25.
Workaround: df.groupby('A').rolling(4).sum().reset_index(level=0, drop=True)
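Applying that workaround to the original example and assigning back by index (a sketch; B_rolling4 is just an illustrative column name):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

rolled = df.groupby('A').rolling(4).sum().reset_index(level=0, drop=True)
df['B_rolling4'] = rolled['B']  # aligns on the restored original index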
Comment From: andreas-vester
The problem still exists in v1.0.1
Comment From: davidgilbertson
Still an issue in v2. So for future travellers: .reset_index() is a good workaround, as mentioned above. You can also use .tolist() if you're just assigning back to a new column of the original data frame and the sort order isn't being changed by groupby.
df["RollingMean"] = df.groupby(["A", "B"]).rolling(2).Value.mean().reset_index(level=[0, 1], drop=True)
# OR, if your data is already sorted by A, B
df["RollingMean"] = df.groupby(["A", "B"]).rolling(2).Value.mean().tolist()
Comment From: asaph-marvin
Still an issue in v2.2.1:
df = pd.DataFrame(data=range(12),
                  index=pd.MultiIndex.from_product(
                      [['one', 'two', 'three'], ['a', 'b', 'c', 'd']]),
                  columns=['vals'])
df
Out[108]:
vals
one a 0
b 1
c 2
d 3
two a 4
b 5
c 6
d 7
three a 8
b 9
c 10
d 11
Regular .sum() output looks fine:
df.groupby(level=0).sum()
Out[103]:
vals
one 6
three 38
two 22
With .rolling(), the groupby index level is duplicated:
df.groupby(level=0).rolling(3).sum()
Out[104]:
vals
one one a NaN
b NaN
c 3.0
d 6.0
three three a NaN
b NaN
c 27.0
d 30.0
two two a NaN
b NaN
c 15.0
d 18.0
Is there a fix planned for this issue?
From the discussion above, it's not clear whether this is intended behavior or a bug. If it's intended, can someone explain why? As a user, my expectation is to get something like this:
df.groupby(level=0).rolling(3).sum().droplevel(0)
Out[110]:
vals
one a NaN
b NaN
c 3.0
d 6.0
three a NaN
b NaN
c 27.0
d 30.0
two a NaN
b NaN
c 15.0
d 18.0
but without having to call .droplevel() manually.
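Until the behavior changes, a small helper at least hides the extra call (a sketch; the function name is hypothetical and it assumes groupby prepends exactly one extra index level, as in the example above):
import pandas as pd

def rolling_sum_by_level(frame, level, window):
    # Drop the duplicated grouping level that groupby(...).rolling(...) prepends.
    return frame.groupby(level=level).rolling(window).sum().droplevel(0)

df = pd.DataFrame(data=range(12),
                  index=pd.MultiIndex.from_product(
                      [['one', 'two', 'three'], ['a', 'b', 'c', 'd']]),
                  columns=['vals'])
rolling_sum_by_level(df, level=0, window=3)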
Comment From: mayurankv
Also running into this