Pandas version checks

  • [x] I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

Documentation problem

The docs do not explain what sort=False does for groupby. Do the groups, with their keys, follow the order in which the keys first appear in the original DataFrame? Or may the groups come out in an arbitrary order?

Suggested fix for documentation

When setting sort=False for groupby, one may want the groups, with their keys, to follow the order in which the keys first appear in the original DataFrame. Can this be guaranteed?

Comment From: datapythonista

Thanks for reporting this @easysam. I'm not sure whether sort=False means that you'll get the keys in the order in which they are found, or in an arbitrary order. Do you mind running some tests to see the behavior and updating the docstring accordingly? That would be helpful for others with the same question. Thanks!

Comment From: rhshadrach

What happens to the keys is also missing from the User Guide; I think it would be good to add a description there:

https://pandas.pydata.org/pandas-docs/dev/user_guide/groupby.html#groupby-sorting

Comment From: easysam

@datapythonista @rhshadrach I tried to answer this question by reading the source code. https://github.com/pandas-dev/pandas/blob/2b1184dd5b7a80120cf4010996a1c91987d9f7fe/pandas/core/groupby/grouper.py#L685 It seems that algorithms.factorize is used to compute the unique keys, and algorithms.factorize in turn uses a hash table. https://github.com/pandas-dev/pandas/blob/2b1184dd5b7a80120cf4010996a1c91987d9f7fe/pandas/core/algorithms.py#L249

However, I came across several ".pxi.in" files in the hashtable source code. For example: https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/hashtable_class_helper.pxi.in I would like to know how .pxi.in files are used to generate .pxi files. Are there any tutorials or docs?

I also posted this question on Stack Overflow, hoping it helps others. https://stackoverflow.com/questions/72798626/what-is-a-pxi-in-file-and-how-to-use-it

Comment From: rhshadrach

The pxi.in files are built here:

https://github.com/pandas-dev/pandas/blob/f4ca4d3d0ea6a907262f8c842c691115b13d4cb7/setup.py#L77-L97

But algorithms.factorize will code the unique values by order of appearance when sort=False; we have a lot of testing on it in pandas.tests.test_algos as well as the groupby/indexing tests. The one exception is null values (they are always given the largest code regardless of appearance), but that should be fixed by #46601.
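
To make the order-of-appearance behavior concrete, here is a minimal sketch using the public pd.factorize and DataFrame.groupby. The DataFrame, its column names, and its values are made up for illustration, and nulls are deliberately left out because of the exception noted above.

```python
import pandas as pd

# Hypothetical data: keys first appear in the order b, a, c.
df = pd.DataFrame(
    {"key": ["b", "a", "c", "a", "b"], "val": [1, 2, 3, 4, 5]}
)

# factorize with the default sort=False codes uniques by first appearance.
codes, uniques = pd.factorize(df["key"])
codes_list = codes.tolist()
uniques_list = list(uniques)
print(codes_list)    # [0, 1, 2, 1, 0]
print(uniques_list)  # ['b', 'a', 'c']

# groupby with sort=False keeps the groups in order of first appearance.
unsorted_keys = list(df.groupby("key", sort=False)["val"].sum().index)
print(unsorted_keys)  # ['b', 'a', 'c']

# The default (sort=True) sorts the group keys instead.
sorted_keys = list(df.groupby("key")["val"].sum().index)
print(sorted_keys)    # ['a', 'b', 'c']
```

This only checks one small case; it is not a substitute for the tests in pandas.tests.test_algos mentioned above.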