Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
mi1 = pd.MultiIndex.from_frame(
    pd.DataFrame(dict(i1=pd.Series(["b", "a"]), i2=1)),
)
print(mi1)
# This index is the same except the left index labels belong to a categorical dtype
cat_dt = pd.CategoricalDtype(["b", "a"], ordered=True)
mi2 = pd.MultiIndex.from_frame(
    pd.DataFrame(dict(i1=pd.Series(["b", "a"], dtype=cat_dt), i2=1))
)
print(mi2)
# These behave as expected.
print(mi1.intersection(mi1[1:]))
print(mi1.intersection(mi2[1:]))
# These do not.
print(mi2.intersection(mi1[1:]))
print(mi2.intersection(mi2[1:]))
Issue Description
From the example you can see that index intersection doesn't work properly for MultiIndexes when there is a categorical dtype. I think what's happening is that the intersection method sees that mi2 is monotonic, so it passes the array [('b', 1), ('a', 1)] to inner_join_indexer(), which gets the wrong answer, because mi2 is only monotonic with respect to the categorical dtype.
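To illustrate the suspected failure mode, here is a minimal pure-Python sketch. It is not pandas' actual inner_join_indexer(); sorted_inner_join is a hypothetical stand-in for any merge-style join that assumes both inputs are sorted under the plain lexicographic order. Fed an array that is sorted only under the categorical order (b before a), it can skip past a match.

```python
def sorted_inner_join(left, right):
    """Two-pointer intersection (hypothetical helper, not a pandas function).

    Assumes both inputs are sorted ascending under Python's default
    tuple ordering. If that assumption is violated, matches can be missed.
    """
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1  # left is behind, advance it
        else:
            j += 1  # right is behind, advance it
    return out


# Sorted under the categorical order b < a, but not lexicographically.
categorical_order = [("b", 1), ("a", 1)]
other = [("a", 1)]

# The join compares ("b", 1) with ("a", 1), decides the right side is
# behind, advances past it, and never sees the match.
print(sorted_inner_join(categorical_order, other))

# With the left side sorted lexicographically, the match is found.
print(sorted_inner_join(sorted(categorical_order), other))
```

If this is indeed what happens, the monotonicity check and the join would need to agree on a single ordering.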
Expected Behavior
All four intersections in the example should return the following.
MultiIndex([('a', 1)],
names=['i1', 'i2'])
Installed Versions
Comment From: deepers
I just checked out and built the main branch of pandas (2.0.0.dev0+466.g218ab0930e4 218ab09), and the example does not reproduce. The bug seems to be fixed in the main branch.
Comment From: topper-123
Hi @deepers. Thanks for the bug report, it's much appreciated.
I've checked this with pandas v1.4, where it does give a wrong result, and with the current main branch, where it's all good. So I agree that this works now but previously didn't.
Could you add a test case for this in the pandas test suite, so this won't pop up again later?
Comment From: deepers
Hi @topper-123. Thanks, I will do this. But while following the instructions for building the pandas development environment with mamba, I ran into the following error. Any ideas?
deepee@entropy ~/r/pandas (multiindex_categorical_intersection)> python -m pip install -e . --no-build-isolation --no-use-pep517 (pandas-dev)
Obtaining file:///home/deepee/repo/pandas
Preparing metadata (setup.py) ... done
Requirement already satisfied: python-dateutil>=2.8.2 in /home/deepee/.local/opt/mambaforge/envs/pandas-dev/lib/python3.8/site-packages (from pandas==2.0.0.dev0+512.geb69d8943f) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/deepee/.local/opt/mambaforge/envs/pandas-dev/lib/python3.8/site-packages (from pandas==2.0.0.dev0+512.geb69d8943f) (2022.5)
Requirement already satisfied: numpy>=1.20.3 in /home/deepee/.local/opt/mambaforge/envs/pandas-dev/lib/python3.8/site-packages (from pandas==2.0.0.dev0+512.geb69d8943f) (1.23.4)
Requirement already satisfied: six>=1.5 in /home/deepee/.local/opt/mambaforge/envs/pandas-dev/lib/python3.8/site-packages (from python-dateutil>=2.8.2->pandas==2.0.0.dev0+512.geb69d8943f) (1.16.0)
Installing collected packages: pandas
Attempting uninstall: pandas
Found existing installation: pandas 1.5.1
Uninstalling pandas-1.5.1:
Successfully uninstalled pandas-1.5.1
Running setup.py develop for pandas
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
db-dtypes 1.0.4 requires pandas<2.0dev,>=0.24.2, but you have pandas 2.0.0.dev0+512.geb69d8943f which is incompatible.
Successfully installed pandas-2.0.0.dev0+512.geb69d8943f