• [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
from pandas.api.types import is_categorical_dtype

s = pd.Series(
    ["a", "b", "c", "a", "b", "c"],
    dtype=pd.SparseDtype(pd.CategoricalDtype(['a', 'b', 'c']))
)

print(s.dtype)
# out: Sparse[category, nan]
print(is_categorical_dtype(s))
# out: False

Problem description

The is_categorical_dtype function returns False when the Series has a sparse categorical dtype. This is inconsistent with the other dtype checks (example for unsigned integers below).

from pandas.api.types import is_unsigned_integer_dtype

s = pd.Series(
    [0, 1, 2, 3],
    name="pd_unsigned_integer",
    dtype=pd.SparseDtype(pd.UInt8Dtype(), pd.NA)
)
print(s.dtype)
# out: Sparse[UInt8, <NA>]
print(is_unsigned_integer_dtype(s))
# out: True

s = pd.Series(
    ["a", "b", "c", "a", "b", "c"],
    dtype=pd.CategoricalDtype(['a', 'b', 'c'])
)

print(s.dtype)
# out: category
print(is_categorical_dtype(s))
# out: True

Expected Output

print(is_categorical_dtype(s))
# out: True

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : d9fff2792bf16178d4e450fe7384244e50635733
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 19.3.1
setuptools : 45.0.0.post20200113
Cython : None
pytest : 5.3.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.47.0

Comment From: jorisvandenbossche

@sbrugman sidenote: I am assuming you are using this method yourself in one of your packages? If so, please import it from pandas.api.types (which is public) and not from pandas.core.dtypes.common (private API subject to change)

Comment From: jorisvandenbossche

@sbrugman Can you describe the use case where you encountered this as a problem? I know you mention the inconsistency with other types, but I am still wondering if this is actually a good idea to change / what the motivation would be (because for sake of consistency, there is also the option to change the other types ..)

Comment From: sbrugman

@jorisvandenbossche At this moment, all other type checks I have tested (is_complex_dtype, is_bool_dtype, is_unsigned_integer_dtype) return their value based on the dtype of the SparseArray, whereas is_categorical_dtype does not. Users expect consistent behaviour.

The case you made, that a dtype check (e.g. is_***_dtype) should be informative about the API of that series, makes sense to me. The user loses no functionality either way, since the snippet is_sparse(series) and is_***_dtype(series.dtype.subtype) could be used to recover the current result.

Comment From: jbrockmendel

In the majority of cases where we use is_categorical_dtype we really mean isinstance(dtype, CategoricalDtype). We should use that instead.