• [X] I have checked that this issue has not already been reported.

• [X] I have confirmed this bug exists on the latest version of pandas.

• [X] (optional) I have confirmed this bug exists on the master branch of pandas.


Coming from here, I reran my example on the current HEAD of master, and the erroneous behaviour persists.

Code Sample, a copy-pastable example

import pandas as pd

cat = pd.CategoricalDtype(categories=[1, 2, 3, 4])
df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 2, 1], 'b': [4, 3, 2, 3, 1, 2, 3]}, dtype=cat)

# flat categorical index, no MultiIndex
df_cat = df.set_index('a', drop=False)

diff_ind = df_cat.index.difference(df_cat.head(3).index)
print(diff_ind.dtype)  # -> category

union_ind = df_cat.index.union(df_cat.head(3).index)
print(union_ind.dtype)  # -> category


# same set operations on MultiIndex, categorical gets lost
df_cat_multi = df.set_index(['a', 'b'], drop=False)
# just to make sure we have a categorical in the first place
print(df_cat_multi.index.get_level_values(0).dtype)  # -> category
print(df_cat_multi.index.get_level_values(1).dtype)  # -> category

diff_multi_ind = df_cat_multi.index.difference(df_cat_multi.head(3).index)
print(diff_multi_ind.get_level_values(0).dtype)  # -> int64
print(diff_multi_ind.get_level_values(1).dtype)  # -> int64

union_multi_ind = df_cat_multi.index.union(df_cat_multi.head(3).index)
print(union_multi_ind.get_level_values(0).dtype)  # -> int64
print(union_multi_ind.get_level_values(1).dtype)  # -> int64

Problem description

In contrast to set operations on flat categorical indices, the same operations on a MultiIndex with categorical levels do not preserve the categorical dtype: the levels of the result fall back to the dtype of the categories (int64 in the example above).

Expected Output

Set operations on a MultiIndex should preserve the categorical dtype of its level values.
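On affected versions, the expected result can be recovered by casting each level of the result back to the dtype of the original index. The following is only a minimal sketch of such a workaround, assuming the original MultiIndex is still available as a template; `restore_categorical_levels` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd


def restore_categorical_levels(result, template):
    """Rebuild `result` so that each level has the dtype of the same level in `template`.

    Hypothetical helper for illustration; not part of pandas.
    """
    arrays = []
    for i in range(result.nlevels):
        target_dtype = template.levels[i].dtype
        # Casting is a no-op if the level is already categorical with this dtype.
        arrays.append(result.get_level_values(i).astype(target_dtype))
    return pd.MultiIndex.from_arrays(arrays, names=result.names)


cat = pd.CategoricalDtype(categories=[1, 2, 3, 4])
df = pd.DataFrame(
    {"a": [1, 2, 3, 4, 3, 2, 1], "b": [4, 3, 2, 3, 1, 2, 3]}
).astype(cat)
multi = df.set_index(["a", "b"], drop=False).index

diff = multi.difference(multi[:3])
restored = restore_categorical_levels(diff, multi)
print(restored.get_level_values(0).dtype)  # category
print(restored.get_level_values(1).dtype)  # category
```

The cast only restores the dtype after the fact; it does not change which entries the set operation returns.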

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : ab687aec38668cdb0139ca02d6e37e8c8386ae7d
python : 3.8.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.cp1252
pandas : 1.3.0.dev0+849.gab687aec38
numpy : 1.19.5
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 53.1.0
Cython : 0.29.22
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

Comment From: mzeitlin11

Thanks for the detailed report, @tscheburaschka! Investigations to fix are certainly welcome!

Comment From: lukemanley

This appears to be fixed on main. Could use a test.
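A regression test along these lines might look like the sketch below. It reuses the data from the report and covers only the two operations shown there, and it assumes a pandas version that already contains the fix. The function name is hypothetical and is not taken from the pandas test suite.

```python
import pandas as pd


def check_multiindex_setops_keep_categorical():
    # Hypothetical name; a sketch, not the test that was added to pandas.
    cat = pd.CategoricalDtype(categories=[1, 2, 3, 4])
    df = pd.DataFrame(
        {"a": [1, 2, 3, 4, 3, 2, 1], "b": [4, 3, 2, 3, 1, 2, 3]}
    ).astype(cat)
    mi = df.set_index(["a", "b"], drop=False).index
    other = mi[:3]

    # Only the operations from the original report are checked here.
    for method in ("difference", "union"):
        result = getattr(mi, method)(other)
        for level in range(result.nlevels):
            dtype = result.get_level_values(level).dtype
            assert dtype == cat, f"{method}: level {level} has dtype {dtype}"
```

On a version that still shows the bug, the first assertion fails and the message names the operation, the level, and the dtype that was returned.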