Hi,
while using a solution for my problem posted on SO, I stumbled upon this bug. Thanks @jorisvandenbossche for looking into the matter and confirming this to be a bug.
```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'temp_a': np.random.rand(50) * 50 + 20,
                   'temp_b': np.random.rand(50) * 30 + 40,
                   'power_deg': np.random.rand(50),
                   'eta': 1 - np.random.rand(50) / 5},
                  index=pd.date_range(start='20181201', freq='T', periods=50))
df_grpd = df.groupby(
    [pd.cut(df.temp_a, np.arange(0, 100, 5)),  # categorical for temp_a
     pd.cut(df.temp_b, np.arange(0, 100, 5)),  # categorical for temp_b
     pd.cut(df.power_deg, np.arange(0, 1, 1 / 20))  # categorical for power_deg
     ])

df_grpd.nth(1).T
# Out:
temp_a        (65, 70]
temp_b        (50, 55]
power_deg  (0.8, 0.85]
temp_a       36.147946
temp_b       57.832956
power_deg     0.983631
eta           0.974274

df_grpd.nth(0).loc[(pd.Interval(65, 70), pd.Interval(50, 55), pd.Interval(0.8, 0.85))].T
# Out:
temp_a        (65, 70]
temp_b        (50, 55]
power_deg  (0.8, 0.85]  (0.8, 0.85]
temp_a       69.038210    46.591379
temp_b       52.510666    57.173709
power_deg     0.846506     0.988345
eta           0.867026     0.885187
```
Problem description
When using `groupby` with `pd.cut`, taking the n-th row of a group with `df_grpd.nth(n)` for n >= 1 yields values which lie outside of the group boundaries, as shown with `df_grpd.nth(1).T` in my code example. Furthermore, sometimes there are multiple rows per group for n=0, as shown with `df_grpd.nth(0).loc[(pd.Interval(65, 70), pd.Interval(50, 55), pd.Interval(0.8, 0.85))].T`, also with values outside of the interval.
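The broken invariant can also be checked directly instead of by eye: every row that `groupby` assigns to a group must lie inside that group's interval. A minimal sanity-check sketch (single grouper for brevity; `observed=True` is assumed here so only non-empty bins are iterated):

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'temp_a': np.random.rand(50) * 50 + 20})
bins = pd.cut(df['temp_a'], np.arange(0, 100, 5))

# Invariant the bug violated: every row handed to a group must fall
# inside that group's interval.
n_groups = 0
for interval, group in df.groupby(bins, observed=True):
    n_groups += 1
    assert group['temp_a'].apply(lambda v: v in interval).all()
```

With correct grouping this loop passes silently; a version of the bug would trip the assertion.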
Expected Output
I guess the expected output is quite clear in this case...
For `df_grpd.nth(1).T`:

```
temp_a        (65, 70]
temp_b        (50, 55]
power_deg  (0.8, 0.85]
temp_a       66.147946  # some value within the interval...
temp_b       53.832956  # some value within the interval...
power_deg     0.833631  # some value within the interval...
eta           0.974274
```
Output of pd.show_versions()
Is there currently any workaround for this problem? Can I somehow access the group values in any other way? Thanks in advance!
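One possible workaround sketch (my own guess, assuming the stray rows come from values that fall outside the bin edges and thus get a NaN category): drop those rows before grouping, so `nth()` never sees a missing category.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'temp_a': np.random.rand(15) * 50 + 20,
                   'eta': 1 - np.random.rand(15) / 5})

# Bins deliberately too narrow: temp_a runs from 20 to 70, so values
# above 50 fall outside the edges and get a NaN category.
bins = pd.cut(df['temp_a'], np.arange(20, 60, 10))

# Workaround sketch: drop the rows with a NaN bin before grouping.
valid = bins.notna()
df_grpd = df[valid].groupby(bins[valid], observed=True)
first_rows = df_grpd.nth(0)
```

After the filter, every group key is a real interval and every returned row lies within its bin.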
Comment From: WillAyd
Can you create a minimal sample to recreate the issue? This is a rather large one, so it's harder to reason about than it needs to be.
Comment From: JoElfner
Yep, reduced it a bit:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'temp_a': np.random.rand(15) * 50 + 20,
                   'eta': 1 - np.random.rand(15) / 5},
                  index=pd.date_range(start='20181201', freq='T', periods=15))
df_grpd = df.groupby(
    [pd.cut(df.temp_a, np.arange(20, 70 + 1, 25))])  # categorical for temp_a

df_grpd.nth(0)
df_grpd.nth(1)
df_grpd.nth(2)  # And so on...
```
Comment From: jorisvandenbossche
@scootty1 your last example code produces a lot of NaNs in the intervals, though
Comment From: jorisvandenbossche
Aha, and that actually also seems to be the problem. So based on that, creating a smaller example:
```python
In [147]: idx = pd.MultiIndex(
     ...:     [pd.CategoricalIndex([pd.Interval(0, 1), pd.Interval(1, 2)]),
     ...:      pd.CategoricalIndex([pd.Interval(0, 10), pd.Interval(10, 20)])],
     ...:     [[0, 0, 0, 1, 1], [0, 1, 1, 0, -1]])

In [148]: df = pd.DataFrame({'col': range(len(idx))}, index=idx)

In [149]: df
Out[149]:
                  col
(0, 1]  (0, 10]     0
        (10, 20]    1
        (10, 20]    2
(1, 2]  (0, 10]     3
        NaN         4

In [150]: df.groupby(level=[0, 1]).nth(0)
Out[150]:
                  col
(0, 1]  (0, 10]     0
        (10, 20]    1
(1, 2]  (0, 10]     3
        (0, 10]     4
```
So the result is kind of correct, apart from the fact that the NaN in the index gets filled with the previous category.
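For reference, the `-1` in the codes passed to `pd.MultiIndex` above is the categorical sentinel for a missing value, so the last entry of that index level is NaN rather than a third interval. A minimal sketch showing the round-trip:

```python
import pandas as pd

# In categorical codes, -1 marks a missing value; the last entry of
# this level decodes to NaN, not to one of the intervals.
cat = pd.Categorical.from_codes(
    codes=[0, 1, 1, 0, -1],
    categories=pd.IntervalIndex.from_tuples([(0, 10), (10, 20)]))

assert pd.isna(cat[-1])                      # the -1 code decodes to NaN
assert cat.codes.tolist() == [0, 1, 1, 0, -1]
```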
I have a vague recollection we had an issue about this before.
Comment From: JoElfner
Interestingly it is not showing the NaNs in my case... But I extended the bins to not produce any NaNs.
Comment From: jorisvandenbossche
> But I extended the bins to not produce any NaNs.
But do you still see the bug now? (I don't with your updated example from https://github.com/pandas-dev/pandas/issues/24205#issuecomment-445870771)
Comment From: JoElfner
Nope, seems to be gone. I didn't check that thoroughly enough. But it's still strange that I don't see any NaNs, even when I reproduce joris' example.
Comment From: jorisvandenbossche
> But still strange that I don't see any NaNs, even when I reproduce joris' example.
Can you print the full code and output of what you did?
Comment From: JoElfner
No, unfortunately not. After restarting the console, my output is the same as in your example. I should have restarted right away... Before the restart I had just copied your example into the IPython console.
Comment From: mroeschke
The example in https://github.com/pandas-dev/pandas/issues/24205#issuecomment-445883162 looks fixed. Could use a test.

```python
In [6]: df.groupby(level=[0, 1]).nth(0)
<ipython-input-6-22766b7e6d27>:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df.groupby(level=[0, 1]).nth(0)
Out[6]:
                  col
(0, 1]  (0, 10]     0
        (10, 20]    1
(1, 2]  (0, 10]     3
```
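Since a test was requested, here is a regression-test sketch along the lines of the example above (the test name is my own; with `dropna=True`, the groupby default, the NaN-keyed row is excluded, matching the fixed output):

```python
import pandas as pd

def test_nth_preserves_nan_category():
    # Regression test sketch: the NaN entry in the second index level
    # must not be filled with a neighbouring interval; with the default
    # dropna=True its row is simply excluded from the groups.
    idx = pd.MultiIndex(
        levels=[pd.CategoricalIndex([pd.Interval(0, 1), pd.Interval(1, 2)]),
                pd.CategoricalIndex([pd.Interval(0, 10), pd.Interval(10, 20)])],
        codes=[[0, 0, 0, 1, 1], [0, 1, 1, 0, -1]])
    df = pd.DataFrame({'col': range(5)}, index=idx)

    result = df.groupby(level=[0, 1], observed=True).nth(0)
    # Only the first row of each real (non-NaN) group survives.
    assert result['col'].tolist() == [0, 1, 3]
```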
Comment From: ltartaro
take