I found another oddity while digging through #13966.
Begin with the initial DataFrame in that issue:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})
Save the grouping:
In [215]: g = df.groupby('A')
Compute the rolling sum:
In [216]: r = g.rolling(4)
In [217]: r.sum()
Out[217]:
A B
A
1 0 NaN NaN
1 NaN NaN
2 NaN NaN
3 4.0 6.0
4 4.0 10.0
5 4.0 14.0
6 4.0 18.0
7 4.0 22.0
8 4.0 26.0
9 4.0 30.0
... ... ...
2 30 8.0 114.0
31 8.0 118.0
3 32 NaN NaN
33 NaN NaN
34 NaN NaN
35 12.0 134.0
36 12.0 138.0
37 12.0 142.0
38 12.0 146.0
39 12.0 150.0
[40 rows x 2 columns]
It maintains the by column (A)! That column should not be in the resulting DataFrame.
It gets weirder if I compute the sum over the entire grouping and then re-do the rolling calculation. Now the by column is gone, as expected:
In [218]: g.sum()
Out[218]:
B
A
1 190
2 306
3 284
In [219]: r.sum()
Out[219]:
B
A
1 0 NaN
1 NaN
2 NaN
3 6.0
4 10.0
5 14.0
6 18.0
7 22.0
8 26.0
9 30.0
... ...
2 30 114.0
31 118.0
3 32 NaN
33 NaN
34 NaN
35 134.0
36 138.0
37 142.0
38 146.0
39 150.0
[40 rows x 1 columns]
So the grouping summation has some sort of side effect.
Comment From: chrisaycock
A little note while digging through more code: _convert_grouper in groupby.py has:
if isinstance(grouper, dict):
...
elif isinstance(grouper, Series):
...
elif isinstance(grouper, (list, Series, Index, np.ndarray)):
...
else:
...
The grouper is compared against Series twice. I will fix this when I clean up everything.
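For reference, a minimal sketch of what the de-duplicated branch order could look like (just an illustration, not the actual pandas source): once the Series case has its own branch, the later isinstance check only needs list, Index, and np.ndarray.
if isinstance(grouper, dict):
    ...
elif isinstance(grouper, Series):
    ...
elif isinstance(grouper, (list, Index, np.ndarray)):
    ...
else:
    ...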
Comment From: chrisaycock
I can fix the issue if I set the group selection:
g._set_group_selection()
I think we need this function at the start of .rolling().
Seems similar to #12839
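For concreteness, a minimal sketch of that fix applied to the example above (a sketch only; _set_group_selection is a private pandas API and may change):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

g = df.groupby('A')
g._set_group_selection()  # private API, used here only to illustrate the proposed fix
r = g.rolling(4)
r.sum()  # expected: only column B, without the grouping column A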
Comment From: jreback
This is defined behavior, in that it is identical to .apply on the groupby.
In [10]: df.groupby('A').rolling(4).sum()
Out[10]:
A B
A
1 0 NaN NaN
1 NaN NaN
2 NaN NaN
3 4.0 6.0
4 4.0 10.0
... ... ...
3 35 12.0 134.0
36 12.0 138.0
37 12.0 142.0
38 12.0 146.0
39 12.0 150.0
[40 rows x 2 columns]
In [11]: df.groupby('A').rolling(4).apply(lambda x: x.sum())
Out[11]:
A B
A
1 0 NaN NaN
1 NaN NaN
2 NaN NaN
3 4.0 6.0
4 4.0 10.0
... ... ...
3 35 12.0 134.0
36 12.0 138.0
37 12.0 142.0
38 12.0 146.0
39 12.0 150.0
[40 rows x 2 columns]
You can look back at the issues; IIRC @jorisvandenbossche and I had a long conversation about this.
Comment From: chrisaycock
Hmm:
In [617]: df.groupby('A').sum()
Out[617]:
B
A
1 190
2 306
3 284
In [618]: df.groupby('A').apply(lambda x: x.sum())
Out[618]:
A B
A
1 20 190
2 24 306
3 24 284
In addition to .rolling() and .apply(), .ohlc() and .expanding() keep the by column following a .groupby().
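A quick way to exercise those two cases against the same example frame (a sketch; the claim about the retained column comes from the comment above, not from output reproduced here):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

# Per the comment above, both of these also retain the grouping column A:
df.groupby('A').expanding().sum()
df.groupby('A').ohlc()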
Comment From: jreback
On reread, this should be consistent, so marking as a bug. We probably should not include the grouping column/level, even though apply does.
Comment From: ohadle
A similar thing happens with index columns.
import pandas
from pandas import Timestamp

c = pandas.DataFrame({u'ul_payload': {('a', Timestamp('2016-11-01 06:15:00')): 5, ('a', Timestamp('2016-11-01 07:45:00')): 8, ('a', Timestamp('2016-11-01 09:00:00')): 9, ('a', Timestamp('2016-11-01 07:15:00')): 6, ('a', Timestamp('2016-11-01 07:30:00')): 7, ('a', Timestamp('2016-11-01 06:00:00')): 4}, u'dl_payload': {('a', Timestamp('2016-11-01 06:15:00')): 15, ('a', Timestamp('2016-11-01 07:45:00')): 18, ('a', Timestamp('2016-11-01 09:00:00')): 19, ('a', Timestamp('2016-11-01 07:15:00')): 16, ('a', Timestamp('2016-11-01 07:30:00')): 17, ('a', Timestamp('2016-11-01 06:00:00')): 14}})
In [27]: c
Out[27]:
dl_payload ul_payload
a 2016-11-01 06:00:00 14 4
2016-11-01 06:15:00 15 5
2016-11-01 07:15:00 16 6
2016-11-01 07:30:00 17 7
2016-11-01 07:45:00 18 8
2016-11-01 09:00:00 19 9
In [29]: c.groupby(level=0).rolling(window=3).agg(np.sum)
Out[29]:
dl_payload ul_payload
a a 2016-11-01 06:00:00 NaN NaN
2016-11-01 06:15:00 NaN NaN
2016-11-01 07:15:00 45.0 15.0
2016-11-01 07:30:00 48.0 18.0
2016-11-01 07:45:00 51.0 21.0
2016-11-01 09:00:00 54.0 24.0
But not with group_keys=False:
In [48]: c.groupby(level=0, group_keys=False).rolling(window=3).agg(np.sum)
Out[48]:
dl_payload ul_payload
a 2016-11-01 06:00:00 NaN NaN
2016-11-01 06:15:00 NaN NaN
2016-11-01 07:15:00 45.0 15.0
2016-11-01 07:30:00 48.0 18.0
2016-11-01 07:45:00 51.0 21.0
2016-11-01 09:00:00 54.0 24.0
Comment From: vss888
Why is the issue closed? The problem is still there (pandas 0.24.2).
Comment From: jreback
This is closed in 0.25, coming soon.
Comment From: chiachong
Still the same problem in 0.25.
Workaround: df.groupby('A').rolling(4).sum().reset_index(level=0, drop=True)
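Applying that workaround to the original example and assigning back by index (a sketch; B_rolling4 is just an illustrative column name):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

rolled = df.groupby('A').rolling(4).sum().reset_index(level=0, drop=True)
df['B_rolling4'] = rolled['B']  # aligns on the restored original index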
Comment From: andreas-vester
The problem still exists in v1.0.1
Comment From: davidgilbertson
Still an issue in v2. So for future travellers: .reset_index() is a good workaround, as mentioned above. You can also use .tolist() if you're just assigning back to a new column of the original data frame and the sort order isn't being changed by groupby.
df["RollingMean"] = df.groupby(["A", "B"]).rolling(2).Value.mean().reset_index(level=[0, 1], drop=True)
# OR, if your data is already sorted by A, B
df["RollingMean"] = df.groupby(["A", "B"]).rolling(2).Value.mean().tolist()
Comment From: asaph-marvin
Still an issue in v2.2.1:
df = pd.DataFrame(data=range(12),
                  index=pd.MultiIndex.from_product(
                      [['one', 'two', 'three'], ['a', 'b', 'c', 'd']]),
                  columns=['vals'])
df
Out[108]:
vals
one a 0
b 1
c 2
d 3
two a 4
b 5
c 6
d 7
three a 8
b 9
c 10
d 11
Regular .sum() output looks fine:
df.groupby(level=0).sum()
Out[103]:
vals
one 6
three 38
two 22
With .rolling(), the groupby index level is duplicated:
df.groupby(level=0).rolling(3).sum()
Out[104]:
vals
one one a NaN
b NaN
c 3.0
d 6.0
three three a NaN
b NaN
c 27.0
d 30.0
two two a NaN
b NaN
c 15.0
d 18.0
Is there a fix planned for this issue?
From the discussion above, it's not clear whether this is intended behavior or a bug. If it's intended, can someone explain why? As a user, my expectation is to get something like this:
df.groupby(level=0).rolling(3).sum().droplevel(0)
Out[110]:
vals
one a NaN
b NaN
c 3.0
d 6.0
three a NaN
b NaN
c 27.0
d 30.0
two a NaN
b NaN
c 15.0
d 18.0
but without having to call .droplevel() manually.
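Until the behavior changes, a small helper at least hides the extra call (a sketch; the function name is hypothetical and it assumes groupby prepends exactly one extra index level, as in the example above):
import pandas as pd

def rolling_sum_by_level(frame, level, window):
    # Drop the duplicated grouping level that groupby(...).rolling(...) prepends.
    return frame.groupby(level=level).rolling(window).sum().droplevel(0)

df = pd.DataFrame(data=range(12),
                  index=pd.MultiIndex.from_product(
                      [['one', 'two', 'three'], ['a', 'b', 'c', 'd']]),
                  columns=['vals'])
rolling_sum_by_level(df, level=0, window=3)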
Comment From: mayurankv
Also running into this