Feature Type

  • [x] Adding new functionality to pandas

  • [ ] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

I often create Categorical data structures. In some cases the number of unique categories is quite large, and the overall length of the Categorical can be very long indeed (hundreds of millions of records). I always create these arrays via the Categorical.from_codes path for performance (my codes are stored in a numpy array). Even so, I would like to bypass an expensive is_unique call that is made while the categories are being created.

My simple (and somewhat contrived) example:

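import numpy as np
import pandas as pd

# 10 million distinct codes repeated 10 times -> 100 million codes in total
arr = np.array(list(range(10_000_000)) * 10, dtype=np.int32, order="C")
cats = [f"a{i}" for i in range(10_000_000)]
pd.Categorical.from_codes(codes=arr, categories=cats, validate=False)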

shows with cProfile:

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.877    1.877    1.877    1.877 base.py:2313(is_unique)
       93    1.539    0.017    1.539    0.017 {built-in method numpy.array}
        1    0.709    0.709    4.574    4.574 extract_test.py:1(<module>)
        1    0.120    0.120    0.120    0.120 missing.py:305(_isna_string_dtype)
        4    0.092    0.023    0.098    0.024 cast.py:1579(construct_1d_object_array_from_listlike)
        4    0.032    0.008    0.131    0.033 construction.py:517(sanitize_array)

Checking that the categories are unique takes a large chunk of the time. I've tried to bypass the public API in order to avoid this is_unique call, but I keep running into trouble. And, generally, I would like to stick to public features only. I know with certainty that my categories are unique.
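To look at the uniqueness check on its own, apart from the rest of from_codes, a minimal sketch along these lines could be used. It assumes that the check is comparable to calling is_unique on a pd.Index built from the categories, and it uses a smaller category count than the example above so that it runs quickly. The exact timing will vary by machine and pandas version.

```python
import time

import pandas as pd

# Smaller than the 10_000_000 categories in the example above (assumption:
# the cost scales roughly with the number of categories).
n = 1_000_000
cats = [f"a{i}" for i in range(n)]

index = pd.Index(cats)

start = time.perf_counter()
unique = index.is_unique  # inspects every category once
elapsed = time.perf_counter() - start

print(f"is_unique={unique}, took {elapsed:.3f}s for {n:,} string categories")
```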

Feature Description

There could be a couple of solutions here:

1) Perhaps someone knows how to create a Categorical array very fast, assuming that I have pristine data (no NaNs or bad codes, plus guaranteed unique categories)? I'd welcome a solution with current methods!

2) If no solution is currently available, perhaps a new is_unique argument could be introduced to the Categorical.from_codes classmethod (with a safe default of False)? The user could turn this on at their own peril. This doesn't seem to be without precedent, given the existing validate argument:

validate : bool, default True

If True, validate that the codes are valid for the dtype.

If False, don't validate that the codes are valid. Be careful about skipping validation, as invalid codes can lead to severe problems, such as segfaults.

I'm willing to risk segfaults for speed.

Many hats off to the pandas team/community. I appreciate your hard work!

Alternative Solutions

I'm not aware of any other package that would satisfy the goal here.

Additional Context

No response

Comment From: boxblox

The _simple_new classmethod in categorical.py seems to be the culprit. Specifically, the is_unique call comes in via the update_dtype call.

@classmethod
# error: Argument 2 of "_simple_new" is incompatible with supertype
# "NDArrayBacked"; supertype defines the argument type as
# "Union[dtype[Any], ExtensionDtype]"
def _simple_new(  # type: ignore[override]
    cls, codes: np.ndarray, dtype: CategoricalDtype
) -> Self:
    # NB: This is not _quite_ as simple as the "usual" _simple_new
    codes = coerce_indexer_dtype(codes, dtype.categories)
    dtype = CategoricalDtype(ordered=False).update_dtype(dtype)
    return super()._simple_new(codes, dtype)
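As a small illustration of why a uniqueness check sits on this path: update_dtype returns a new CategoricalDtype, and constructing a CategoricalDtype through the public constructor rejects duplicate categories. The sketch below uses tiny made-up categories and only shows the observable behaviour; it does not trace the internal call chain.

```python
import pandas as pd

# The same update_dtype pattern as in _simple_new, on a tiny dtype.
source = pd.CategoricalDtype(categories=["a", "b"], ordered=False)
updated = pd.CategoricalDtype(ordered=False).update_dtype(source)
print(updated.categories.tolist())

# Constructing a dtype with duplicate categories is rejected.
raised = False
try:
    pd.CategoricalDtype(categories=["a", "b", "a"])
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")
```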