This might be a duplicate of https://github.com/pandas-dev/pandas/issues/49337, but it is still an issue on 1.5.2.

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas. This seems to have been fixed on main by this line https://github.com/pandas-dev/pandas/blob/4d0a4365870893e232f639af13fea44c7d3ff9d4/pandas/core/indexes/base.py#L3235, but that addition seems unrelated to this bug.

Reproducible Example

import pandas as pd

a = pd.Categorical(["a","b"], categories=["a", "b"])
b = pd.Categorical(["a","b"], categories=["b", "a"])

one = pd.Categorical(["1","2"], categories=["1", "2"])
two = pd.Categorical(["1","2"], categories=["2", "1"])

Failing case

dfa = pd.DataFrame({"x": a, "y": one}).set_index(["x", "y"]).sort_index()
dfb = pd.DataFrame({"x": b, "y": two}).set_index(["x", "y"]).sort_index()
print(dfa.index)
print(dfb.index)
print(dfa.index.intersection(dfb.index))
print(dfb.index.intersection(dfa.index))

Output

MultiIndex([('a', '1'),
            ('b', '2')],
           names=['x', 'y'])
MultiIndex([('b', '2'),
            ('a', '1')],
           names=['x', 'y'])
MultiIndex([('b', '2')],
           names=['x', 'y'])
MultiIndex([('b', '2')],
           names=['x', 'y'])

('a', '1') is missing from both intersections.

Expected case

dfa = pd.DataFrame({"x": a, "y": one}).set_index(["x", "y"])
dfb = pd.DataFrame({"x": b, "y": two}).set_index(["x", "y"])
print(dfa.index)
print(dfb.index)
print(dfa.index.intersection(dfb.index))
print(dfb.index.intersection(dfa.index))

Output

MultiIndex([('a', '1'),
            ('b', '2')],
           names=['x', 'y'])
MultiIndex([('a', '1'),
            ('b', '2')],
           names=['x', 'y'])
MultiIndex([('a', '1'),
            ('b', '2')],
           names=['x', 'y'])
MultiIndex([('a', '1'),
            ('b', '2')],
           names=['x', 'y'])

Issue Description

When creating a MultiIndex from categoricals whose categories are not in the same order, the result of intersection is missing values. Looking at the code, it looks like certain performance optimizations are applied when both indexes are sorted. I assume these expect both indexes to be sorted in the same order, which is not the case when the categoricals are constructed as above. Using .join on the two frames does seem to work.
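The sorting assumption can be checked directly. The following is a minimal sketch (the variable names are illustrative, not taken from the pandas code base) showing that sorting a categorical follows the order of its categories rather than alphabetical order, which is why the two sorted indexes in the failing case list the same entries in opposite orders.

```python
import pandas as pd

# Same values, opposite category order (as in the reproducible example).
a = pd.Categorical(["a", "b"], categories=["a", "b"])
b = pd.Categorical(["a", "b"], categories=["b", "a"])

# Sorting a categorical uses each value's position in `categories`,
# not the alphabetical order of the values themselves.
sorted_a = pd.Series(a).sort_values().tolist()
sorted_b = pd.Series(b).sort_values().tolist()

print(sorted_a)
print(sorted_b)
```

Any code path that assumes two sorted indexes share one ordering would see these two as sorted differently, even though they hold the same values.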

This might be an unsupported way to use pandas; if that is the case, you can just ignore the above. I ended up with differently ordered categories because I read parquet data with pyarrow using the strings_to_categorical option, which creates the categories in the order the values are first seen in the column instead of in alphabetical order.
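A possible workaround on affected versions, offered as a sketch under assumptions rather than a confirmed fix, is to bring the categories into a common order before building the index. `with_sorted_categories` is a hypothetical helper, not part of pandas; it uses `Categorical.reorder_categories`, which changes the category order without changing the values.

```python
import pandas as pd

def with_sorted_categories(cat):
    # Hypothetical helper: same values, categories put in sorted order.
    return cat.reorder_categories(sorted(cat.categories))

a = pd.Categorical(["a", "b"], categories=["a", "b"])
b = pd.Categorical(["a", "b"], categories=["b", "a"])
one = pd.Categorical(["1", "2"], categories=["1", "2"])
two = pd.Categorical(["1", "2"], categories=["2", "1"])

def build_index(x, y):
    # Normalise both levels before set_index so sort_index agrees on order.
    frame = pd.DataFrame(
        {"x": with_sorted_categories(x), "y": with_sorted_categories(y)}
    )
    return frame.set_index(["x", "y"]).sort_index().index

idx_a = build_index(a, one)
idx_b = build_index(b, two)

common = idx_a.intersection(idx_b)
print(common)
```

With both sides sharing one category order, the sorted indexes are identical, so the intersection should contain both ('a', '1') and ('b', '2').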

Expected Behavior

The correct overlap of the indexes as shown above.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python : 3.9.13.final.0
python-bits : 64
OS : Darwin
OS-release : 21.4.0
Version : Darwin Kernel Version 21.4.0: Fri Mar 18 00:45:05 PDT 2022; root:xnu-8020.101.4~15/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 1.5.2
numpy : 1.23.5
pytz : 2022.6
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.7.0
pandas_datareader : None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Comment From: phofl

Works on main and is a duplicate of the other issue