Feature Type
- [x] Adding new functionality to pandas
- [ ] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
I often create Categorical data structures. In some cases the number of unique categories is quite large, and the Categorical itself can be very long (hundreds of millions of records). I always create these arrays through Categorical.from_codes for performance (my codes are stored in a numpy array). Even so, I would like to bypass the expensive is_unique call that is made while the categories are created.
My simple (and somewhat contrived) example:
import numpy as np
import pandas as pd

arr = np.array(list(range(10_000_000)) * 10, dtype=np.int32, order="C")
cats = [f"a{i}" for i in range(10_000_000)]
pd.Categorical.from_codes(codes=arr, categories=cats, validate=False)
Profiling this with cProfile shows:
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.877 1.877 1.877 1.877 base.py:2313(is_unique)
93 1.539 0.017 1.539 0.017 {built-in method numpy.array}
1 0.709 0.709 4.574 4.574 extract_test.py:1(<module>)
1 0.120 0.120 0.120 0.120 missing.py:305(_isna_string_dtype)
4 0.092 0.023 0.098 0.024 cast.py:1579(construct_1d_object_array_from_listlike)
4 0.032 0.008 0.131 0.033 construction.py:517(sanitize_array)
Checking that the categories are unique takes a large chunk of the time. I've tried to bypass the public API to avoid this is_unique call, but I keep running into trouble, and in general I would like to stick to public features only. I know with certainty that my categories are unique.
Feature Description
There could be a couple of solutions here:
1) Perhaps someone knows how to create a Categorical array very fast, assuming that I have pristine data (no NaNs or bad codes, plus guaranteed unique categories)? I'd welcome a solution with current methods!
2) If no solution is currently available, perhaps a new is_unique argument could be introduced to the Categorical.from_codes classmethod (with a safe default of False)? The user could turn this on at their own peril. This doesn't seem to be without precedent:
validate : bool, default True
If True, validate that the codes are valid for the dtype.
If False, don't validate that the codes are valid. Be careful about skipping validation, as invalid codes can lead to severe problems, such as segfaults.
I'm willing to risk segfaults for speed.
Many hats off to the pandas team/community. I appreciate your hard work!
Alternative Solutions
I am not aware of any other package that would satisfy the goal here.
Additional Context
No response
Comment From: boxblox
The _simple_new classmethod in categorical.py seems to be the culprit. Specifically, the is_unique call comes in with the update_dtype call.
@classmethod
# error: Argument 2 of "_simple_new" is incompatible with supertype
# "NDArrayBacked"; supertype defines the argument type as
# "Union[dtype[Any], ExtensionDtype]"
def _simple_new(  # type: ignore[override]
    cls, codes: np.ndarray, dtype: CategoricalDtype
) -> Self:
    # NB: This is not _quite_ as simple as the "usual" _simple_new
    codes = coerce_indexer_dtype(codes, dtype.categories)
    dtype = CategoricalDtype(ordered=False).update_dtype(dtype)
    return super()._simple_new(codes, dtype)
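As a small illustration of where the uniqueness requirement shows up through the public API, the sketch below constructs a CategoricalDtype directly and checks whether duplicate categories are rejected. The helper name rejects_duplicates is hypothetical and exists only for this example; it says nothing about which internal call performs the check.

```python
import pandas as pd


def rejects_duplicates(categories) -> bool:
    """Return True if CategoricalDtype refuses the given categories."""
    try:
        pd.CategoricalDtype(categories=categories)
    except ValueError:
        # pandas raises ValueError when categories are not unique.
        return True
    return False


print(rejects_duplicates(["a", "b", "a"]))  # duplicate "a"
print(rejects_duplicates(["a", "b", "c"]))  # all unique
```

Because the check is part of building the dtype, any code path that creates a new CategoricalDtype from raw categories has to pay for it at least once.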