IMO somewhat unexpected behavior depending on the arguments provided to shift after a groupby:
In [26]: ser = pd.Series(range(3), index=pd.date_range('1/1/18', periods=3))
In [28]: ser.groupby([1,1,1]).shift()
Out[28]:
2018-01-01 NaN
2018-01-02 0.0
2018-01-03 1.0
Freq: D, dtype: float64
# No-op with freq keyword
In [34]: ser.groupby([1,1,1]).shift(0, freq='D')
Out[34]:
2018-01-01 0
2018-01-02 1
2018-01-03 2
Freq: D, Name: 1, dtype: int64
# Suddenly inserted the grouper into the first level of a MI?
In [30]: ser.groupby([1,1,1]).shift(freq='D')
Out[30]:
1 2018-01-02 0
2018-01-03 1
2018-01-04 2
Name: 1, dtype: int64
This is arguably in contrast to #22053 which is asking for the behavior of the last item consistently, but I think that is the one that is actually incorrect out of all examples and causes misalignment with the index of the original caller.
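For reference, a minimal sketch of how a plain Series.shift treats the two call styles, since that is what the alignment argument above rests on: without freq the values move and the index stays put, while with freq the index moves and the values stay put. The variable names are mine, and the grouped call with freq is deliberately left out because its result is the subject of this issue.

```python
import pandas as pd

ser = pd.Series(range(3), index=pd.date_range("2018-01-01", periods=3))

# Without freq: values move down one row, the index is unchanged.
by_periods = ser.shift()

# With freq: the index moves forward one day, the values are unchanged.
by_freq = ser.shift(freq="D")

# Grouped shift without freq also keeps the caller's index.
grouped = ser.groupby([1, 1, 1]).shift()

print(by_periods.index.equals(ser.index))
print(by_freq.index.equals(ser.index))
print(grouped.index.equals(ser.index))
```

Only the freq variant produces an index that no longer lines up with the original caller.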
cc @SimonAlecks who originally found this in #21235
Expected Output
In [30]: ser.groupby([1,1,1]).shift(freq='D')
Out[34]:
2018-01-01 NaN
2018-01-02 0.0
2018-01-03 1.0
Freq: D, dtype: float64
Comment From: simonariddell
So @WillAyd here is what I can tell so far,
https://github.com/pandas-dev/pandas/blob/d43ac979fdf3038c713de93c7b27b4140be6c9c5/pandas/core/groupby/groupby.py#L705-L707
In this method mutated comes back as True when the index is shifted, and is then passed into line 712 as not_indexed_same. keys has the same value in both the case with the expected output and the case with the unexpected output:
``` python
keys
> Int64Index([1], dtype='int64')
```
One question is, is this expected behavior here?
This then passes here:
https://github.com/pandas-dev/pandas/blob/d43ac979fdf3038c713de93c7b27b4140be6c9c5/pandas/core/groupby/groupby.py#L886-L888
If this block is hit, it's good.
However, the problem we see is here:
https://github.com/pandas-dev/pandas/blob/d43ac979fdf3038c713de93c7b27b4140be6c9c5/pandas/core/groupby/groupby.py#L906-L918
If we debug into it we can see what is happening
``` python
group_keys
> Int64Index([1], dtype='int64')
```
So this seems to be working as expected, it's just being given group_keys that don't really make sense.
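To make the effect of those keys concrete, here is a small sketch using the public pd.concat rather than the groupby internals. It assumes the linked block ends up concatenating the per-group results with the group keys passed as keys=, which I have not verified against that exact commit; the variable names are made up for illustration.

```python
import pandas as pd

# Stand-in for the single group's result after a freq-based shift.
piece = pd.Series([0, 1, 2], index=pd.date_range("2018-01-02", periods=3))

# Concatenating with keys prepends the key as a new outer index level.
keyed = pd.concat([piece], keys=[1])

# Concatenating without keys leaves the index as it was.
plain = pd.concat([piece])

print(keyed.index.nlevels)
print(plain.index.nlevels)
```

So a single key of 1 is enough to turn the flat date index into a two-level MultiIndex like the one in Out[30].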
In fact, we can generalize this problem a little by noticing that it's 'unrelated' to the shift function, so to speak.
``` python
gb = ser.groupby([1, 1, 1])
gb.shift(freq='D')

def _mutate(x):
    return x.reindex(x.index[0:2])

gb.apply(_mutate)
1  2018-01-01    0
   2018-01-02    1
Name: 1, dtype: int64
```
Essentially any groupby operation whose result no longer has the same index as the group induces this strange index behavior.
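As a side-by-side check (not a fix proposed in this thread), the public group_keys argument to groupby controls whether apply prepends the keys when the result's index differs from the group's index. A sketch, assuming a reasonably recent pandas; _mutate is the same helper as above and the other names are mine:

```python
import pandas as pd

ser = pd.Series(range(3), index=pd.date_range("2018-01-01", periods=3))

def _mutate(x):
    # Return a result whose index differs from the group's index.
    return x.reindex(x.index[0:2])

# Default: the group key is prepended as an outer index level.
with_keys = ser.groupby([1, 1, 1]).apply(_mutate)

# group_keys=False: the per-group results are concatenated as they are.
without_keys = ser.groupby([1, 1, 1], group_keys=False).apply(_mutate)

print(with_keys.index.nlevels)
print(without_keys.index.nlevels)
```

This suggests the extra level comes from how the keys are attached on the mutated path, not from anything specific to shift.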
What is interesting, though, is that the group_keys are wrong in all cases (I'm assuming Int64Index([1], dtype='int64') is wrong). It's only when the data is also mutated that we hit the conditional logic at groupby.py L906 (linked above), where this wrong key gives us the unexpected output.
This originates from the call in the apply method in ops.py, specifically here: https://github.com/pandas-dev/pandas/blob/d43ac979fdf3038c713de93c7b27b4140be6c9c5/pandas/core/groupby/ops.py#L152-L154
where self.levels[0] returns Int64Index([1], dtype='int64').
Anyway, I probably need to read more unit tests to understand the expected behavior and when and why having conditional logic for mutated indices is necessary, and when and why the group_keys works as intended. I'll continue looking into this and add updates.
https://github.com/pandas-dev/pandas/blob/d43ac979fdf3038c713de93c7b27b4140be6c9c5/pandas/core/groupby/ops.py#L164-L167
Comment From: jbrockmendel
@WillAyd on main I'm seeing a result that looks reasonable but also doesn't match your Expected. Can you take a look?
There's also a test_pct_change with an xfail that points back to this issue, where it looks like the current (raising) behavior is correct. Any idea what's going on there?