Hi,
while using a solution for my problem posted on SO, I stumbled upon this bug. Thanks @jorisvandenbossche for looking into the matter and confirming this to be a bug.
```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'temp_a': np.random.rand(50) * 50 + 20,
                   'temp_b': np.random.rand(50) * 30 + 40,
                   'power_deg': np.random.rand(50),
                   'eta': 1 - np.random.rand(50) / 5},
                  index=pd.date_range(start='20181201', freq='T', periods=50))
df_grpd = df.groupby(
    [pd.cut(df.temp_a, np.arange(0, 100, 5)),  # categorical for temp_a
     pd.cut(df.temp_b, np.arange(0, 100, 5)),  # categorical for temp_b
     pd.cut(df.power_deg, np.arange(0, 1, 1 / 20))  # categorical for power_deg
     ])

df_grpd.nth(1).T
# Out:
temp_a        (65, 70]
temp_b        (50, 55]
power_deg  (0.8, 0.85]
temp_a       36.147946
temp_b       57.832956
power_deg     0.983631
eta           0.974274

df_grpd.nth(0).loc[(pd.Interval(65, 70), pd.Interval(50, 55), pd.Interval(0.8, 0.85))].T
# Out:
temp_a        (65, 70]
temp_b        (50, 55]
power_deg  (0.8, 0.85]  (0.8, 0.85]
temp_a       69.038210    46.591379
temp_b       52.510666    57.173709
power_deg     0.846506     0.988345
eta           0.867026     0.885187
```
Problem description
When using `groupby` with `pd.cut`, taking the n-th row of a group with `df_grpd.nth(n)` for n >= 1 yields values which lie outside of the group boundaries, as shown with `df_grpd.nth(1).T` in my code example. Furthermore, sometimes there are multiple rows per group for n=0, as shown with `df_grpd.nth(0).loc[(pd.Interval(65, 70), pd.Interval(50, 55), pd.Interval(0.8, 0.85))].T`, also with values outside of the interval.
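The broken invariant can also be checked directly instead of by eye: every row that `groupby` assigns to a group must lie inside that group's interval. A minimal sanity-check sketch (single grouper for brevity; `observed=True` is assumed here so only non-empty bins are iterated):

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'temp_a': np.random.rand(50) * 50 + 20})
bins = pd.cut(df['temp_a'], np.arange(0, 100, 5))

# Invariant the bug violated: every row handed to a group must fall
# inside that group's interval.
n_groups = 0
for interval, group in df.groupby(bins, observed=True):
    n_groups += 1
    assert group['temp_a'].apply(lambda v: v in interval).all()
```

With correct grouping this loop passes silently; a version of the bug would trip the assertion.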
Expected Output
I guess the expected output is quite clear in this case...
For `df_grpd.nth(1).T`:

```
temp_a        (65, 70]
temp_b        (50, 55]
power_deg  (0.8, 0.85]
temp_a       66.147946  # some value within the interval...
temp_b       53.832956  # some value within the interval...
power_deg     0.833631  # some value within the interval...
eta           0.974274
```
Output of pd.show_versions()
Is there currently any workaround for this problem? Can I somehow access the group values in any other way? Thanks in advance!
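One possible workaround sketch (my own guess, assuming the stray rows come from values that fall outside the bin edges and thus get a NaN category): drop those rows before grouping, so `nth()` never sees a missing category.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'temp_a': np.random.rand(15) * 50 + 20,
                   'eta': 1 - np.random.rand(15) / 5})

# Bins deliberately too narrow: temp_a runs from 20 to 70, so values
# above 50 fall outside the edges and get a NaN category.
bins = pd.cut(df['temp_a'], np.arange(20, 60, 10))

# Workaround sketch: drop the rows with a NaN bin before grouping.
valid = bins.notna()
df_grpd = df[valid].groupby(bins[valid], observed=True)
first_rows = df_grpd.nth(0)
```

After the filter, every group key is a real interval and every returned row lies within its bin.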
Comment From: WillAyd
Can you create a minimal sample to recreate the issue? This is a rather large one, so it's harder to reason about than it needs to be.
Comment From: JoElfner
Yep, reduced it a bit:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'temp_a': np.random.rand(15) * 50 + 20,
                   'eta': 1 - np.random.rand(15) / 5},
                  index=pd.date_range(start='20181201', freq='T', periods=15))
df_grpd = df.groupby(
    [pd.cut(df.temp_a, np.arange(20, 70 + 1, 25))])  # categorical for temp_a

df_grpd.nth(0)
df_grpd.nth(1)
df_grpd.nth(2)  # And so on...
```
Comment From: jorisvandenbossche
@scootty1 your last example code produces a lot of NaNs in the intervals, though
Comment From: jorisvandenbossche
Aha, and that actually also seems to be the problem. So based on that, creating a smaller example:
```python
In [147]: idx = pd.MultiIndex(
     ...:     [pd.CategoricalIndex([pd.Interval(0, 1), pd.Interval(1, 2)]),
     ...:      pd.CategoricalIndex([pd.Interval(0, 10), pd.Interval(10, 20)])],
     ...:     [[0, 0, 0, 1, 1], [0, 1, 1, 0, -1]])

In [148]: df = pd.DataFrame({'col': range(len(idx))}, index=idx)

In [149]: df
Out[149]:
                  col
(0, 1]  (0, 10]     0
        (10, 20]    1
        (10, 20]    2
(1, 2]  (0, 10]     3
        NaN         4

In [150]: df.groupby(level=[0, 1]).nth(0)
Out[150]:
                  col
(0, 1]  (0, 10]     0
        (10, 20]    1
(1, 2]  (0, 10]     3
        (0, 10]     4
```
So the result is kind of correct, apart from the fact that the NaN in the index gets filled with the previous category.
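For reference, the `-1` in the codes passed to `pd.MultiIndex` above is the categorical sentinel for a missing value, so the last entry of that index level is NaN rather than a third interval. A minimal sketch showing the round-trip:

```python
import pandas as pd

# In categorical codes, -1 marks a missing value; the last entry of
# this level decodes to NaN, not to one of the intervals.
cat = pd.Categorical.from_codes(
    codes=[0, 1, 1, 0, -1],
    categories=pd.IntervalIndex.from_tuples([(0, 10), (10, 20)]))

assert pd.isna(cat[-1])                      # the -1 code decodes to NaN
assert cat.codes.tolist() == [0, 1, 1, 0, -1]
```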
I have a vague recollection we had an issue about this before.
Comment From: JoElfner
Interestingly it is not showing the NaNs in my case... But I extended the bins to not produce any NaNs.
Comment From: jorisvandenbossche
> But I extended the bins to not produce any NaNs.
But do you still see the bug now? (I don't with your updated example from https://github.com/pandas-dev/pandas/issues/24205#issuecomment-445870771)
Comment From: JoElfner
Nope, seems to be gone. I didn't check that thoroughly enough. But it's still strange that I don't see any NaNs, even when I reproduce joris' example.
Comment From: jorisvandenbossche
> But still strange that I don't see any NaNs, even when I reproduce joris' example.
Can you print the full code and output of what you did?
Comment From: JoElfner
No, unfortunately not. After restarting the console, my output is the same as in your example. I should have restarted right away... Before the restart I had just copied your example into the IPython console.
Comment From: mroeschke
The example in https://github.com/pandas-dev/pandas/issues/24205#issuecomment-445883162 looks fixed. Could use a test.

```python
In [6]: df.groupby(level=[0, 1]).nth(0)
<ipython-input-6-22766b7e6d27>:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df.groupby(level=[0, 1]).nth(0)
Out[6]:
                  col
(0, 1]  (0, 10]     0
        (10, 20]    1
(1, 2]  (0, 10]     3
```
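Since a test was requested, here is a regression-test sketch along the lines of the example above (the test name is my own; with `dropna=True`, the groupby default, the NaN-keyed row is excluded, matching the fixed output):

```python
import pandas as pd

def test_nth_preserves_nan_category():
    # Regression test sketch: the NaN entry in the second index level
    # must not be filled with a neighbouring interval; with the default
    # dropna=True its row is simply excluded from the groups.
    idx = pd.MultiIndex(
        levels=[pd.CategoricalIndex([pd.Interval(0, 1), pd.Interval(1, 2)]),
                pd.CategoricalIndex([pd.Interval(0, 10), pd.Interval(10, 20)])],
        codes=[[0, 0, 0, 1, 1], [0, 1, 1, 0, -1]])
    df = pd.DataFrame({'col': range(5)}, index=idx)

    result = df.groupby(level=[0, 1], observed=True).nth(0)
    # Only the first row of each real (non-NaN) group survives.
    assert result['col'].tolist() == [0, 1, 3]
```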
Comment From: ltartaro
take