In pandas 0.17.1, calling `nth()` on a groupby with a monthly `TimeGrouper` produced a result labelled by the selected rows themselves (here, the 6th of each month). In 0.18, the result is unconditionally labelled by the end of each month.
Old Pandas
In [25]: pd.__version__
Out[25]: u'0.17.1'
In [26]: dates = pd.date_range('2014', periods=90)
In [27]: pd.Series(index=dates).groupby(pd.TimeGrouper(freq='M')).nth(5)
Out[27]:
2014-01-06 NaN
2014-02-06 NaN
2014-03-06 NaN
dtype: float64
Latest Pandas
In [71]: pd.__version__
Out[71]: u'0.18.1'
In [72]: dates = pd.date_range('2014', periods=90)
In [73]: pd.Series(index=dates).groupby(pd.TimeGrouper(freq='M')).nth(5)
Out[73]:
2014-01-31 NaN
2014-02-28 NaN
2014-03-31 NaN
dtype: float64
Expected Output
I expect the behavior from 0.17.1.
Comment From: ssanderson
It looks like there are similar issues for other resampling frequencies:
Weekly resample in 0.17:
In [28]: pd.Series(index=dates).groupby(pd.TimeGrouper(freq='W')).nth(5)
Out[28]:
2014-01-11 NaN
2014-01-18 NaN
2014-01-25 NaN
2014-02-01 NaN
2014-02-08 NaN
2014-02-15 NaN
2014-02-22 NaN
2014-03-01 NaN
2014-03-08 NaN
2014-03-15 NaN
2014-03-22 NaN
2014-03-29 NaN
dtype: float64
Weekly resample in 0.18:
In [4]: pd.Series(index=dates).groupby(pd.TimeGrouper(freq='W')).nth(5)
Out[4]:
2014-01-12 NaN
2014-01-19 NaN
2014-01-26 NaN
2014-02-02 NaN
2014-02-09 NaN
2014-02-16 NaN
2014-02-23 NaN
2014-03-02 NaN
2014-03-09 NaN
2014-03-16 NaN
2014-03-23 NaN
2014-03-30 NaN
dtype: float64
Comment From: jreback
`M` is the offset alias for month end, so I think 0.18.1 is correct here.
Comment From: jorisvandenbossche
I don't think this has to do with the resampling frequency, but rather with a change in `nth`. `nth` has been changed to a 'reducer', with the consequence that it takes the grouper as the index by default (instead of acting as a 'filter', which keeps the original index). For example, for `mean`, which is also a reducer, nothing has changed (so also in 0.17, the end of the month is used):
In [1]: dates = pd.date_range('2014', periods=90)
In [2]: pd.Series(index=dates).groupby(pd.TimeGrouper(freq='M')).nth(5)
Out[2]:
2014-01-06 NaN
2014-02-06 NaN
2014-03-06 NaN
dtype: float64
In [3]: pd.Series(index=dates).groupby(pd.TimeGrouper(freq='M')).mean()
Out[3]:
2014-01-31 NaN
2014-02-28 NaN
2014-03-31 NaN
Freq: M, dtype: float64
In [4]: pd.__version__
Out[4]: u'0.17.1'
In [1]: dates = pd.date_range('2014', periods=90)
In [2]: pd.Series(index=dates).groupby(pd.TimeGrouper(freq='M')).nth(5)
Out[2]:
2014-01-31 NaN
2014-02-28 NaN
2014-03-31 NaN
dtype: float64
In [3]: pd.Series(index=dates).groupby(pd.TimeGrouper(freq='M')).mean()
Out[3]:
2014-01-31 NaN
2014-02-28 NaN
2014-03-31 NaN
Freq: M, dtype: float64
In [4]: pd.__version__
Out[4]: '0.18.1+199.gc9a27ed'
In 0.17, the index you see in the `nth` case is the original index from the series, because `nth` acted like a 'filter' (keeping the original index for the filtered rows).
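The reducer-vs-filter distinction is easiest to see away from time groupers. A minimal, version-stable sketch (the data here is illustrative, not from the thread): `mean` is a reducer, so its result is indexed by the group keys, while `head` is a filter, so its result keeps the original row labels.

```python
import pandas as pd

df = pd.DataFrame({'key': [1, 1, 2], 'val': [10, 20, 30]})

reduced = df.groupby('key')['val'].mean()    # reducer: index is the group keys
filtered = df.groupby('key')['val'].head(1)  # filter: original row labels kept

print(reduced.index.tolist())   # [1, 2]
print(filtered.index.tolist())  # [0, 2]
```

The question in this issue is essentially which of these two shapes `nth` should produce.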
To get the original behaviour of `nth`, you can use `as_index=False`:
In [9]: pd.__version__
Out[9]: '0.18.1+199.gc9a27ed'
In [10]: pd.DataFrame(index=dates, columns=['a']).groupby(pd.TimeGrouper(freq='M'), as_index=False).nth(5)
Out[10]:
a
2014-01-06 NaN
2014-02-06 NaN
2014-03-06 NaN
The problem is that this only seems to work with a DataFrame (if I use the original Series example, you get an error). Somebody recently reported this on the mailing list as well (https://groups.google.com/forum/#!topic/pydata/4ZjOP0Lfjdc), and I am actually wondering if we should change the behaviour back to the 0.17 one.
Comment From: ssanderson
I could see use cases for getting the group label (the current behavior) or for getting the label of the selected row (the old behavior), but I would expect that to be an argument to `nth()` rather than to `TimeGrouper`. I don't know a ton about the internals of `groupby()`, but is it possible that the choice of filter vs. reducer could be made dynamically based on arguments to `nth()`? I'm imagining an API like:
pd.DataFrame(...).groupby(...).nth(5, label='group')      # Current behavior
pd.DataFrame(...).groupby(...).nth(5, label='selection')  # pandas 0.17 behavior
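The proposed API can be sketched in user code today. This is a rough, hypothetical helper (the name `nth_with_label` and the `label` parameter are mine, not pandas API) built on `cumcount`, so it works the same regardless of how `nth` itself is classified; negative `n` is not handled:

```python
import pandas as pd

def nth_with_label(df, by, n, label='selection'):
    """Pick the n-th row of each group (n >= 0 only).

    label='selection': keep the selected rows' original index (filter-style).
    label='group':     key the result by the group label (reducer-style).
    """
    pos = df.groupby(by).cumcount()  # position of each row within its group
    picked = df[pos == n]            # groups with fewer than n+1 rows drop out
    if label == 'group':
        picked = picked.set_index(by)
    return picked

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3], 'b': range(6)})
print(nth_with_label(df, 'a', 1, label='group'))      # indexed by a: 1, 2
print(nth_with_label(df, 'a', 1, label='selection'))  # original index: 1, 4
```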
Comment From: jorisvandenbossche
@ssanderson Note that `as_index` is an argument to `groupby`, not to `TimeGrouper`.
Comment From: rhshadrach
The current behavior appears correct. I think the only thing left in this issue is @jorisvandenbossche's comment that `as_index=False` does not work on a Series. I think this issue would be resolved if `as_index=False` worked on a SeriesGroupBy; see #36507. Currently, one could do:
s.to_frame(s.name).groupby(pd.TimeGrouper(freq='M'), as_index=False).nth(5)
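The `to_frame` trick is not specific to time grouping (and in current pandas `pd.TimeGrouper` has been removed in favor of `pd.Grouper`). A version-stable sketch of the same pattern with a plain key, where the data and the key stand in for the Series and the monthly buckets from the original example:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60], name='x')
key = pd.Series([1, 1, 1, 2, 2, 3])  # stands in for the TimeGrouper buckets

# Series.groupby has no as_index, so route through a one-column DataFrame
out = s.to_frame(s.name).groupby(key, as_index=False).nth(1)
print(out)
```

The exact index of `out` depends on the pandas version (group keys in 1.x, original row labels since 2.0), but the selected values are the second row of each group either way.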
Comment From: rhshadrach
Looking at this again, with the exception of `.nth(0)`, `.nth(-1)`, and the list/slice equivalents (e.g. `.nth([0])`), nth always acts like a filter. E.g.
df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3], 'b': range(6)})
gb = df.groupby('a')
result = gb.nth(1)
print(result)
# b
# a
# 1 1
# 2 4
This is not a reducer, as group `a=3` is dropped, but the output is in the shape of a reducer instead of a filter. In particular, there should be no difference between `groupby(...).head(5)` and `groupby(...).nth([0, 1, 2, 3, 4])`.
I think nth should have the result of a filter here, namely
   a  b
1  1  1
4  2  4
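For the record, pandas 2.0 did change `nth` to behave as a filtration: the original row labels are kept and in-axis grouping columns stay in the output. A quick check (the printed frame assumes pandas >= 2.0; on 1.x the result is instead keyed by the group label):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3], 'b': range(6)})
result = df.groupby('a').nth(1)
print(result)
# on pandas >= 2.0:
#    a  b
# 1  1  1
# 4  2  4
```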
Comment From: dalejung
Does it make sense for `nth` to be a filter while `last` and `first` are reducers? Relying on the index of aggregated output being the group keys is common behavior. This would break existing code, so maybe deprecate first?
Comment From: rhshadrach
Relying on the index of aggregated output being the group keys is common behavior. This would break existing code so maybe deprecate first?
But nth does not aggregate. Only `nth(0)` and `nth(-1)` will always aggregate; all other arguments to nth act as filters, which may happen to also aggregate depending on the values of the input. For example,
pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}).groupby('a').nth(1)
is an empty DataFrame. This is not an aggregation. pandas has (mostly :laughing:) well-defined behavior on how aggregations vs filters vs transformations should behave, and `nth` was misclassified. In my mind, this makes it a bug.
Comment From: cantudiaz
Hi, the change to the `nth()` method's behavior effectively broke a private library my team has been using and developing for the last 3 years. Aggregations are a very important part of our pipelines, and `nth()` was one of them. What options do we have to substitute for this behavior (if `first` is not good enough)?
Comment From: rhshadrach
@cantudiaz - can you give an example of what you are trying to replace? If you are using in-axis groupings, you can call `set_index` after the call to nth. E.g.
df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]})
gb = df.groupby('a')
print(gb.nth(1).set_index('a'))
# b
# a
# 1 4
Comment From: cantudiaz
Thanks, @rhshadrach. I think your suggestion works.