Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# read example dataset
df = pd.read_csv("example.csv")

df.groupby('key').nth([0,1,2])

Issue Description

Dataset: example.csv

df.groupby('key').nth([0,1,2]) returns something like:

           val
key       
0    60018
0    60022
0    60013
1    60045
1    60019
...

The output ordering seems arbitrary; I'm not sure how the values are being sorted here. I would expect relative positions to be preserved at the very least. I tried fiddling around with more things and noticed another discrepancy as well.

In [20]: df.set_index('key').sort_index()
Out[20]: 
       val
key       
0    60080
0    60022
0    60018
0    60053
0    60013
..     ...
23   60068
24   60079
24   60033
24   60027
24   60059

[100 rows x 1 columns]

If we try sorting with the stable sort, we get a result that maintains positional order.

In [21]: df.set_index('key').sort_index(kind='stable')
Out[21]: 
       val
key       
0    60013
0    60018
0    60022
0    60043
0    60053
..     ...
23   60068
24   60027
24   60033
24   60059
24   60079

[100 rows x 1 columns]

My feeling is that there might be a bug with quicksort or however things are being sorted with nth?
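For reference, here is a minimal self-contained sketch (with made-up data, not the example.csv dataset) of the guarantee a stable sort provides: within each duplicated key, rows keep their original relative order.

```python
import pandas as pd

# Made-up toy frame with duplicate keys; the positions of the
# duplicates are what a stable sort must preserve.
df = pd.DataFrame({"key": [1, 0, 1, 0, 1], "val": [10, 20, 30, 40, 50]})

stable = df.set_index("key").sort_index(kind="stable")

# key 1 appeared at rows 0, 2, 4 -> vals 10, 30, 50 in that order
print(stable.loc[1, "val"].tolist())  # [10, 30, 50]
# key 0 appeared at rows 1, 3 -> vals 20, 40 in that order
print(stable.loc[0, "val"].tolist())  # [20, 40]
```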

Expected Behavior

We should be returning something like this:

       val
key       
0    60013
0    60018
0    60022
1    60019
1    60045
...

I would expect that val should be in order based on the original position of the dataframe at the very least.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 2e218d10984e9919f0296931d92ea851c6a6faf5
python : 3.9.12.final.0
python-bits : 64
OS : Darwin
OS-release : 21.3.0
Version : Darwin Kernel Version 21.3.0: Wed Jan 5 21:37:58 PST 2022; root:xnu-8019.80.24~20/RELEASE_ARM64_T6000
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.5.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 61.2.0
pip : 22.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.5
brotli : None
fastparquet : 0.8.1
fsspec : 2022.7.1
gcsfs : None
matplotlib : None
numba : None
numexpr : 2.8.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Comment From: rhshadrach

Thanks for the report - can you provide a smaller, explicitly constructed DataFrame (rather than a CSV file) that reproduces the issue?

Comment From: pyrito

@rhshadrach I wasn't able to reproduce this with a smaller example. FWIW, the CSV file is only 100 lines. I don't think that's too big.

Comment From: phofl

Agreed with @rhshadrach - this seems like something that shouldn't need more than a couple of lines, e.g. 5 or so.

Comment From: pyrito

@phofl I suspect that if this is a sorting issue, it may not show up in an example as small as 5 lines. I've also been trying to recreate it with a smaller example and haven't been able to. I've tried a couple of different variations now, e.g.

In [195]: keys = [1,2,1,2,1]

In [196]: vals = [2,4,6,0,1]

In [197]: df = pd.DataFrame({"keys":keys, "vals":vals})

In [198]: df
Out[198]: 
   keys  vals
0     1     2
1     2     4
2     1     6
3     2     0
4     1     1

In [199]: df.groupby('keys').nth(n=[0,1,2])
Out[199]: 
      vals
keys      
1        2
1        6
1        1
2        4
2        0

Comment From: phofl

Try removing lines from your csv file till the bug disappears

Comment From: rhshadrach

df.groupby('key').nth([0,1,2]) returns something like:

       val
key       
0    60018
0    60022
0    60013
1    60045
1    60019
...

If this is enough to show the issue, then maybe you can reduce the input to just rows where keys are 0 or 1 (or maybe just 0)?

In pandas 2.0, nth has been changed to behave like a filter. The result when the input is filtered to keys being 0 or 1 is now:

      val  key
12  60013    0
16  60017    1
17  60018    0
18  60019    1
21  60022    0
44  60045    1

Does this agree with your expectations?
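A small self-contained sketch of this filter-like behavior (using the toy keys/vals frame from the comment above, not the csv; output shown assumes pandas >= 2.0):

```python
import pandas as pd

# Under pandas >= 2.0, nth acts as a filter: the selected rows come back
# in their original order, with their original index, and with the
# grouping column kept as a regular column.
df = pd.DataFrame({"keys": [1, 2, 1, 2, 1], "vals": [2, 4, 6, 0, 1]})

out = df.groupby("keys").nth([0, 1])  # first two rows of each group
print(out)
```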

I tried fiddling around with more things and noticed another discrepancy as well.

If we try sorting with the stable sort, we get a result that maintains positional order.

My feeling is that there might be a bug with quicksort or however things are being sorted with nth?

Are you suggesting quicksort should also preserve order? That is not the case - quicksort is not a stable sorting algorithm.
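To illustrate the distinction with made-up data (not the example.csv dataset): NumPy's `argsort` documents the stability guarantee only for `kind="stable"`, not for quicksort.

```python
import numpy as np

rng = np.random.default_rng(0)
keys = rng.integers(0, 5, size=1000)

# kind="stable" guarantees that ties keep their original relative order:
order = np.argsort(keys, kind="stable")
for k in range(5):
    positions = order[keys[order] == k]
    # original positions stay strictly increasing within each key
    assert (np.diff(positions) > 0).all()

# kind="quicksort" (introsort in NumPy) makes no such guarantee; with
# this many duplicates the tie order may differ from the stable result.
order_q = np.argsort(keys, kind="quicksort")
```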

Comment From: pyrito

@rhshadrach the pandas 2.0 behavior makes more sense to me. That's the behavior we get when as_index=False; the discrepancy appears only when as_index=True.

Are you suggesting quicksort should also preserve order? That is not the case - quicksort is not a stable sorting algorithm.

You're right - quicksort doesn't preserve the order of ties because of how pivoting is done. That point is valid. I was just pointing out that, if preserving position was the intention here, a stable sort would have maintained it - but I guess this won't really matter anymore with pandas 2.0.

Comment From: rhshadrach

Thanks @pyrito

@rhshadrach the pandas 2.0 behavior makes more sense to me. That's the behavior that we get when as_index=False. This appears to only happen when as_index=True.

My comment https://github.com/pandas-dev/pandas/issues/51250#issuecomment-1423589439 may not have been clear: the result I posted as the behavior for 2.0 is the same with either as_index=True or as_index=False. In general, as_index only has an impact on reductions.

I think the issue has been resolved; if you disagree, comment here and we can reopen.