Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
from datetime import date

df = pd.DataFrame({'A': [i for i in range(10)],
                   'date': [date(1970, 1, i) for _ in range(2) for i in range(1, 6)],
                   'id': [i for i in range(2) for _ in range(5)],
                   'exp_date': [date(1970, 1, 3) for _ in range(2) for _ in range(1, 6)]})
df = df.set_index(['date', 'exp_date']).sort_index()

for window in df.groupby('id').A.rolling(2):
    print(window.loc[pd.IndexSlice[:, window.index[-1][0]:]].sum(min_count=2))

print(df.groupby('id').A.rolling(2).apply(lambda x: x.loc[pd.IndexSlice[:, x.index[-1][0]:]].sum(min_count=2)))
Issue Description
I have a dataframe with a MultiIndex on date and expiration date, as provided in the minimal code example above.
I want a rolling sum of a column (here A), grouped by an id, such that we only use values whose expiration date is greater than or equal to the actual date.
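To make the intended filter concrete, the per-window logic can be sketched in isolation (a toy two-row window, not taken from the example output):

```python
import pandas as pd
from datetime import date

# A single rolling window as a Series with a (date, exp_date) MultiIndex.
idx = pd.MultiIndex.from_tuples(
    [(date(1970, 1, 2), date(1970, 1, 3)),
     (date(1970, 1, 3), date(1970, 1, 3))],
    names=['date', 'exp_date'])
window = pd.Series([1, 2], index=idx, name='A')

# Keep only the rows whose exp_date is on or after the date of the
# window's last row, then sum, requiring at least two valid values.
last_date = window.index[-1][0]
result = window.loc[pd.IndexSlice[:, last_date:]].sum(min_count=2)
print(result)  # both rows survive the filter, so the sum is 3
```

Note that slicing on the second index level this way requires the window's MultiIndex to be sorted.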
When I loop over the windows, the results match my expectations:
nan
1
3
nan
nan
nan
11
13
nan
nan
But when I use the apply method, I get an unexpected result:
id date exp_date
0 1970-01-01 1970-01-03 NaN
1970-01-02 1970-01-03 1.0
1970-01-03 1970-01-03 3.0
1970-01-04 1970-01-03 5.0
1970-01-05 1970-01-03 7.0
1 1970-01-01 1970-01-03 NaN
1970-01-02 1970-01-03 NaN
1970-01-03 1970-01-03 NaN
1970-01-04 1970-01-03 NaN
1970-01-05 1970-01-03 NaN
Name: A, dtype: float64
whereas I would expect the same results as in my loop:
id date exp_date
0 1970-01-01 1970-01-03 NaN
1970-01-02 1970-01-03 1.0
1970-01-03 1970-01-03 3.0
1970-01-04 1970-01-03 NaN
1970-01-05 1970-01-03 NaN
1 1970-01-01 1970-01-03 NaN
1970-01-02 1970-01-03 11.0
1970-01-03 1970-01-03 13.0
1970-01-04 1970-01-03 NaN
1970-01-05 1970-01-03 NaN
Name: A, dtype: float64
Expected Behavior
first print:
nan
1
3
nan
nan
nan
11
13
nan
nan
second print:
id date exp_date
0 1970-01-01 1970-01-03 NaN
1970-01-02 1970-01-03 1.0
1970-01-03 1970-01-03 3.0
1970-01-04 1970-01-03 NaN
1970-01-05 1970-01-03 NaN
1 1970-01-01 1970-01-03 NaN
1970-01-02 1970-01-03 11.0
1970-01-03 1970-01-03 13.0
1970-01-04 1970-01-03 NaN
1970-01-05 1970-01-03 NaN
Name: A, dtype: float64
Installed Versions
Comment From: phofl
Hi, thanks for your report. Can you trim this down a bit? This is really complicated.
Comment From: Udneowi
@phofl Having had similar issues, I believe this has something to do with the indexing of the windows passed. Sorting the dataframe beforehand by id and then by time fixes the issue for me.
print(df.sort_values(['id', 'date']).groupby('id').A.rolling(2).apply(lambda x: x.loc[pd.IndexSlice[:, x.index[-1][0]:]].sum(min_count=2)))
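For reference, here is the workaround run end to end on the example from this report, with the sort added (the values asserted below are the ones the loop in the report produces):

```python
import pandas as pd
from datetime import date

df = pd.DataFrame({'A': [i for i in range(10)],
                   'date': [date(1970, 1, i) for _ in range(2) for i in range(1, 6)],
                   'id': [i for i in range(2) for _ in range(5)],
                   'exp_date': [date(1970, 1, 3) for _ in range(2) for _ in range(1, 6)]})
df = df.set_index(['date', 'exp_date']).sort_index()

# Sort by the group key first, then by time, so that each group's rows
# are contiguous before the grouped rolling apply.
res = (df.sort_values(['id', 'date'])
         .groupby('id').A.rolling(2)
         .apply(lambda x: x.loc[pd.IndexSlice[:, x.index[-1][0]:]].sum(min_count=2)))
print(res)
```

Sorting makes the positional window bounds line up with each group's rows, which is why it sidesteps the indexing problem described above.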
I was personally using a temporal rolling window where, for each user, each day was unique. I wanted to weight days further in the past lower, but ran into trouble running a custom function in apply.
df.groupby('user').rolling('7D', on = 'datetime', min_periods = 5).apply(custom_func)
Basically, the data passed to the custom function has duplicated time indexes, e.g.:
datetime
2016-10-17 0.0
2016-10-17 0.0
2016-10-17 0.0
2016-10-18 0.0
2016-10-20 0.0
dtype: float64
Even though that should not be possible (as the dates are unique for each group/user). Without knowing too much about pandas internals, it might have something to do with how the indexers are initialized. Calling RollingGroupby._get_window_indexer().get_window_bounds() gives:
(array([ 0, 0, 0, ..., 50455, 50455, 50456]),
array([ 1, 2, 3, ..., 50458, 50459, 50460]))
which are structured according to all of the data, not the individual groups. I might be totally off, though. Similar to the OP's problem, I fixed mine by sorting the data first.
df.sort_values(['user', 'datetime']).groupby('user').rolling('7D', on = 'datetime', min_periods = 5).apply(custom_func)
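The weighting idea can be sketched with a hypothetical custom_func (the original comment does not define one) and toy data; this variant puts the datetime on the index rather than using on='datetime':

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for custom_func: weight older observations lower
# using linearly increasing weights within each window.
def custom_func(values):
    weights = np.arange(1, len(values) + 1, dtype=float)
    return (values * weights).sum() / weights.sum()

df = pd.DataFrame({
    'user': ['a'] * 4 + ['b'] * 4,
    'datetime': pd.to_datetime(['2016-10-17', '2016-10-18',
                                '2016-10-19', '2016-10-20'] * 2),
    'x': np.arange(8, dtype=float),
})

# Sort by group key, then time, before the grouped temporal rolling apply.
out = (df.sort_values(['user', 'datetime'])
         .set_index('datetime')
         .groupby('user')['x']
         .rolling('2D', min_periods=1)
         .apply(custom_func, raw=True))
print(out)
```

With raw=True the function receives a plain ndarray per window, which avoids the duplicated-index problem entirely when the function does not need the index.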
Comment From: phofl
Closing for now. We need a simple reproducer so that we can see that this is a bug inside of pandas and not something in your code.