Pandas version checks
- [x] I have checked that the issue still exists on the latest versions of the docs on main here
Location of the documentation
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
Documentation problem
The docs do not explain what sort=False does for groupby. Do the groups come out in the order in which their keys first appear in the original data frame, or can the groups be out of order?
Suggested fix for documentation
When setting sort=False for groupby, one may want the groups to follow the order in which their keys first appear in the original data frame. The docs should state whether this is guaranteed.
Comment From: datapythonista
Thanks for reporting this @easysam. I'm not sure whether sort=False means that you'll get the keys in the order in which they are found, or in an arbitrary order. Do you mind running some tests to see the behavior and updating the docstring accordingly? That would be helpful for others with the same question. Thanks!
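A quick check along these lines could look like the sketch below. The frame, the column names (`key`, `value`) and the variable names are made up for illustration; it only compares the group order from `sort=False` against the order of first appearance, so it shows the behavior on one small input rather than proving a guarantee.

```python
import pandas as pd

# Hypothetical example data: keys deliberately not in sorted order.
df = pd.DataFrame(
    {"key": ["b", "a", "c", "a", "b"], "value": [1, 2, 3, 4, 5]}
)

# Group keys as returned with sort=False and with the default sort=True.
unsorted_keys = list(df.groupby("key", sort=False)["value"].sum().index)
sorted_keys = list(df.groupby("key", sort=True)["value"].sum().index)

# Order in which each key first appears in the original frame.
first_seen = list(df["key"].drop_duplicates())

print("sort=False:", unsorted_keys)
print("sort=True: ", sorted_keys)
print("first seen:", first_seen)
```

On this input, the `sort=False` keys match the order of first appearance, while `sort=True` returns them sorted.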
Comment From: rhshadrach
What happens to the keys is also missing from the User Guide. I think it would be good to add a description there:
https://pandas.pydata.org/pandas-docs/dev/user_guide/groupby.html#groupby-sorting
Comment From: easysam
@datapythonista @rhshadrach
I tried to answer this question by reading the source code.
https://github.com/pandas-dev/pandas/blob/2b1184dd5b7a80120cf4010996a1c91987d9f7fe/pandas/core/groupby/grouper.py#L685
It seems that algorithms.factorize is used to compute the unique keys, and algorithms.factorize uses the hashtable.
https://github.com/pandas-dev/pandas/blob/2b1184dd5b7a80120cf4010996a1c91987d9f7fe/pandas/core/algorithms.py#L249
However, I ran into several ".pxi.in" files in the hashtable source code. For example: https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/hashtable_class_helper.pxi.in I want to know how .pxi.in files are used to generate .pxi files. Are there any tutorials or docs?
I also posted this question on Stack Overflow, hoping it helps others. https://stackoverflow.com/questions/72798626/what-is-a-pxi-in-file-and-how-to-use-it
Comment From: rhshadrach
The pxi.in files are built here:
https://github.com/pandas-dev/pandas/blob/f4ca4d3d0ea6a907262f8c842c691115b13d4cb7/setup.py#L77-L97
But algorithms.factorize will code the unique values by order of appearance when sort=False; we have a lot of testing on it in pandas.tests.test_algos as well as the groupby/indexing tests. The one exception is null values (they are always given the largest code regardless of appearance), but that should be fixed by #46601.
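For readers who want to see the factorize behavior directly, here is a minimal sketch using the public `pd.factorize` function (the input values are made up for illustration). It avoids null values on purpose, since their handling is the exception mentioned above and may differ between versions.

```python
import pandas as pd

# Hypothetical input with no null values; keys are not in sorted order.
values = pd.Series(["b", "a", "c", "a", "b"])

# sort=False: codes are assigned in order of first appearance.
codes, uniques = pd.factorize(values, sort=False)

# sort=True: uniques are sorted and codes are remapped to match.
sorted_codes, sorted_uniques = pd.factorize(values, sort=True)

print(codes.tolist(), list(uniques))
print(sorted_codes.tolist(), list(sorted_uniques))
```

With `sort=False` the first distinct value seen gets code 0, the next gets code 1, and so on, which is what makes the groupby group order follow the order of appearance.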