- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd
from pandas.api.types import is_categorical_dtype
s = pd.Series(
    ["a", "b", "c", "a", "b", "c"],
    dtype=pd.SparseDtype(pd.CategoricalDtype(['a', 'b', 'c']))
)
print(s.dtype)
# out: Sparse[category, nan]
print(is_categorical_dtype(s))
# out: False
Problem description
The is_categorical_dtype function returns False when the categorical dtype is wrapped in a sparse dtype. This is inconsistent with the checks for other types (example for unsigned integer below).
from pandas.api.types import is_unsigned_integer_dtype
s = pd.Series(
    [0, 1, 2, 3],
    name="pd_unsigned_integer",
    dtype=pd.SparseDtype(pd.UInt8Dtype(), pd.NA)
)
print(s.dtype)
# out: Sparse[UInt8, <NA>]
print(is_unsigned_integer_dtype(s))
# out: True
s = pd.Series(
    ["a", "b", "c", "a", "b", "c"],
    dtype=pd.CategoricalDtype(['a', 'b', 'c'])
)
print(s.dtype)
# out: category
print(is_categorical_dtype(s))
# out: True
Expected Output
print(is_categorical_dtype(s))
# out: True
Output of pd.show_versions()
Comment From: jorisvandenbossche
@sbrugman sidenote: I am assuming you are using this method yourself in one of your packages? If so, please import it from pandas.api.types (which is public) and not from pandas.core.dtypes.common (private API subject to change).
Comment From: jorisvandenbossche
@sbrugman Can you describe the use case where you encountered this as a problem? I know you mention the inconsistency with other types, but I am still wondering whether this is actually a good idea to change, and what the motivation would be (because for the sake of consistency, there is also the option to change the other types).
Comment From: sbrugman
@jorisvandenbossche At the moment, all other checks I have tested (is_complex_dtype, is_bool_dtype, is_unsigned_integer_dtype) return their value based on the underlying dtype of the SparseArray, whereas is_categorical_dtype does not. Users expect consistent behaviour.
The case you made, that a dtype check (e.g. is_***_dtype) should be informative about the API of that series, makes sense to me. The user loses no functionality, as the following snippet could be used to get the current result: is_sparse(series) and is_***_dtype(series.dtype.subtype).
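The two-step check described above could be sketched roughly as follows. This is only an illustration under a few assumptions: `check_sparse_subtype` is a hypothetical helper name, `isinstance(dtype, pd.SparseDtype)` stands in for `is_sparse`, and the example uses a plain NumPy `uint8` subtype rather than the extension dtypes from the report, since not every pandas version accepts those as sparse subtypes.

```python
import pandas as pd
from pandas.api.types import is_unsigned_integer_dtype


def check_sparse_subtype(series, dtype_check):
    """Hypothetical helper: True only if `series` is sparse AND its
    subtype passes `dtype_check` (any is_*_dtype-style callable)."""
    dtype = series.dtype
    # isinstance(...) is used here in place of is_sparse(series).
    if not isinstance(dtype, pd.SparseDtype):
        return False
    return bool(dtype_check(dtype.subtype))


# Sparse series with an unsigned integer subtype and fill value 0.
sparse_uint = pd.Series([0, 1, 2, 0], dtype=pd.SparseDtype("uint8", 0))
# Dense series with the same values, for comparison.
dense_uint = pd.Series([0, 1, 2, 0], dtype="uint8")

print(check_sparse_subtype(sparse_uint, is_unsigned_integer_dtype))
print(check_sparse_subtype(dense_uint, is_unsigned_integer_dtype))
```

Making the sparse test explicit keeps the two questions ("is it sparse?" and "what does it hold?") separate, which is the behaviour the comment argues users can always reconstruct themselves.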
Comment From: jbrockmendel
In the majority of cases where we use is_categorical_dtype we really mean isinstance(dtype, CategoricalDtype). We should use that instead.
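As a minimal sketch of the isinstance-based check suggested above (the helper name `has_categorical_dtype` is hypothetical and not part of pandas), the test reduces to comparing the dtype object against `pd.CategoricalDtype`:

```python
import pandas as pd


def has_categorical_dtype(series):
    """Hypothetical helper: True only if the Series dtype itself is a
    CategoricalDtype (no unwrapping of sparse or other wrappers)."""
    return isinstance(series.dtype, pd.CategoricalDtype)


cat_series = pd.Series(
    ["a", "b", "c", "a"],
    dtype=pd.CategoricalDtype(["a", "b", "c"]),
)
obj_series = pd.Series(["a", "b", "c", "a"], dtype=object)
sparse_series = pd.Series([0.0, 1.0, 0.0], dtype=pd.SparseDtype("float64"))

print(has_categorical_dtype(cat_series))
print(has_categorical_dtype(obj_series))
print(has_categorical_dtype(sparse_series))
```

Because the check is on the dtype object itself, a sparse dtype is never reported as categorical, which matches the reading that a dtype check should describe the API the series actually exposes.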