Pandas version checks
-
[X] I have checked that this issue has not already been reported.
-
[X] I have confirmed this bug exists on the latest version of pandas.
-
[ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
# read example dataset
df = df.read_csv("example.csv")
df.groupby('key').nth([0,1,2])
Issue Description
Dataset: example.csv
df.groupby('key').nth([0,1,2])
returns something like:
val
key
0 60018
0 60022
0 60013
1 60045
1 60019
...
The output seems to be kind of arbitrary, so I'm not sure how the values are getting sorted here. I would expect that relative positions should be matched at the very least. I tried fiddling around with more things and noticed another discrepancy as well.
In [20]: df.set_index('key').sort_index()
Out[20]:
val
key
0 60080
0 60022
0 60018
0 60053
0 60013
.. ...
23 60068
24 60079
24 60033
24 60027
24 60059
[100 rows x 1 columns]
If we try sorting with the stable sort, we get a result that maintains positional order.
In [21]: df.set_index('key').sort_index(kind='stable')
Out[21]:
val
key
0 60013
0 60018
0 60022
0 60043
0 60053
.. ...
23 60068
24 60027
24 60033
24 60059
24 60079
[100 rows x 1 columns]
My feeling is that there might be a bug with quicksort or however things are being sorted with nth
?
Expected Behavior
We should be returning something like this:
val
key
0 60013
0 60018
0 60022
1 60019
1 60045
...
I would expect that val
should be in order based on the original position of the dataframe at the very least.
Installed Versions
Comment From: rhshadrach
Thanks for the report - can you use a smaller, explicitly constructed (rather than a csv file), dataframe that reproduces the issue.
Comment From: pyrito
@rhshadrach I wasn't able to reproduce this with a smaller example. FWIW, the CSV file is only a 100 lines. I don't think that's too big.
Comment From: phofl
Agreed with @rhshadrach, this seems like it shouldn't have more than a couple of lines, e.g. 5 or something similar
Comment From: pyrito
@phofl I suspect that if this is a sorting issue then a small example may not show up in 5 lines. I've also been trying to create this with a smaller example and I haven't been able to do so. I've been trying a couple of different variations now. E.g.
In [195]: keys = [1,2,1,2,1]
In [196]: vals = [2,4,6,0,1]
In [197]: df = pd.DataFrame({"keys":keys, "vals":vals})
In [198]: df
Out[198]:
keys vals
0 1 2
1 2 4
2 1 6
3 2 0
4 1 1
In [199]: df.groupby('keys').nth(n=[0,1,2])
Out[199]:
vals
keys
1 2
1 6
1 1
2 4
2 0
Comment From: phofl
Try removing lines from your csv file till the bug disappears
Comment From: rhshadrach
df.groupby('key').nth([0,1,2])
returns something like:
val key 0 60018 0 60022 0 60013 1 60045 1 60019 ...
If this is enough to show the issue, then maybe you can reduce the input to just rows where keys are 0 or 1 (or maybe just 0)?
In pandas 2.0, nth
has been changed to behave like a filter. The result when the input is filtered to keys
being 0 or 1 is now:
val key
12 60013 0
16 60017 1
17 60018 0
18 60019 1
21 60022 0
44 60045 1
Does this agree with your expectations?
I tried fiddling around with more things and noticed another discrepancy as well.
If we try sorting with the stable sort, we get a result that maintains positional order.
My feeling is that there might be a bug with quicksort or however things are being sorted with
nth
?
Are you suggesting quicksort should also preserve order? That is not the case - quicksort is not a stable sorting algorithm.
Comment From: pyrito
@rhshadrach the pandas 2.0 behavior makes more sense to me. That's the behavior that we get when as_index=False
. This appears to only happen when as_index=True
.
Are you suggesting quicksort should also preserve order? That is not the case - quicksort is not a stable sorting algorithm.
You're right, quicksort doesn't preserve order because of how pivoting is done. That point is valid. I was just pointing out that position should be maintained if that was the intention here (e.g. with a stable sort), but I guess this won't really matter anymore with pandas 2.0.
Comment From: rhshadrach
Thanks @pyrito
@rhshadrach the pandas 2.0 behavior makes more sense to me. That's the behavior that we get when
as_index=False
. This appears to only happen whenas_index=True
.
My comment https://github.com/pandas-dev/pandas/issues/51250#issuecomment-1423589439 may have not been clear, the result I posted as the behavior for 2.0 is with either as_index=True
or as_index=False
. In general, as_index
only has an impact on reductions.
I think the issue has been resolved, if you disagree comment here and we can reopen.