Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

index1 = pd.period_range('2022-01-01', periods=5, freq='M')
index2 = pd.Index(['2022-02', '2022-03'])
result = index1.difference(index2)
print(result)

Issue Description

The example creates index1 with 5 months, and index2 with two months. The difference should be an index with three months. However, I'm getting the following:

PeriodIndex(['2022-01', '2022-02', '2022-03', '2022-04', '2022-05'], dtype='period[M]')

Looks like this issue was introduced by https://github.com/pandas-dev/pandas/pull/55108.

Expected Behavior

The example should print the following:

Index(['2022-02', '2022-03'], dtype='object')

Pandas 2.1.4 gives the expected behavior.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 6dbeeb4009bbfac5ea1ae2111346f5e9f05b81f4
python : 3.10.8.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-35-generic
Version : #35-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 26 11:23:57 UTC 2024
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.0rc0+28.g6dbeeb4009
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 63.2.0
pip : 24.0
Cython : 3.0.10
pytest : 8.2.0
hypothesis : 6.100.2
sphinx : 7.3.7
blosc : None
feather : None
xlsxwriter : 3.2.0
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 8.24.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.8
fastparquet : 2024.2.0
fsspec : 2024.3.1
gcsfs : 2024.3.1
matplotlib : 3.8.4
numba : 0.59.1
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.2
pyarrow : 16.0.0
pyreadstat : 1.2.7
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.3.1
scipy : 1.13.0
sqlalchemy : 2.0.29
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.3.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

Comment From: Aloqeely

Thanks for the report! On 2.2, index1.difference(index2) produces incorrect results, while index2.difference(index1) works correctly. On 2.1.4 and earlier it is the opposite (i.e. index2.difference(index1) produces the wrong result).

PRs are welcome to fix this, ideally fixing both cases here.

Comment From: TiffL

Taking a look. It looks like this may be due to the difference in dtype between the two indexes.

Comment From: tev-dixon

I'll take this.

Can confirm that this bug still exists on the main branch. It seems that the dtype period[M] is causing the issue. By changing the reproducible example to create index1 like index2 and changing the dtype, we can recreate this issue (see reproducible example below).

import pandas as pd
index1 = pd.Index(['2022-01', '2022-02', '2022-03', '2022-04', '2022-05'], dtype='period[M]')
index2 = pd.Index(['2022-02', '2022-03'])
result = index1.difference(index2)
print(result)
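
As a possible workaround while the mixed-dtype case is broken, the dtypes can be aligned explicitly before calling difference. The sketch below is only an illustration, not the fix for this issue: it assumes the strings in index2 are meant to be read as monthly periods, and the name index2_as_periods is made up for this example.

```python
import pandas as pd

index1 = pd.period_range('2022-01-01', periods=5, freq='M')
index2 = pd.Index(['2022-02', '2022-03'])

# Assumption: the strings in index2 represent monthly periods,
# so they can be parsed into the same period[M] dtype as index1.
index2_as_periods = pd.PeriodIndex(index2, freq='M')

# Both operands now share a dtype, so the mixed-dtype code path is avoided.
result = index1.difference(index2_as_periods)
print(result)
```

With both sides as period[M], the result keeps the PeriodIndex type rather than falling back to object.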

I'm currently looking into the Index.difference method in pandas/core/indexes/base.py.

Comment From: chilin0525

The bug can still be reproduced on the current main branch. I will continue the work from https://github.com/pandas-dev/pandas/pull/59148 to attempt a fix.

Comment From: chilin0525

take

Comment From: chilin0525

Hi @michaelpradel @Aloqeely, before I begin the implementation, I would like to confirm whether the results in the following test cases match the expected behavior. Thank you!

  • case 1:

    ```
    index1 = pd.Index(['2022-01', '2022-02', '2022-03', '2022-04', '2022-05'], dtype='period[M]')
    index2 = pd.Index(['2022-02', '2022-03'])
    index1.difference(index2)
    PeriodIndex(['2022-01', '2022-04', '2022-05'], dtype='period[M]')
    ```

  • case 2:

    ```
    index1 = pd.Index(['2022-01', '2022-02', '2022-03', '2022-04', '2022-05'], dtype='period[M]')
    index2 = pd.Index(['2022-02', '2022-03'])
    index2.difference(index1)
    Index(['2022-02', '2022-03'], dtype='object')
    ```

Comment From: michaelpradel

Looking at the API documentation of Index.difference, I believe case 1 is the desired output.

Note that the example I gave above seems wrong, now that I look at it again. The correct expected output for the above example should be Index(['2022-01', '2022-04', '2022-05'], dtype='object').
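
For comparison, here is a minimal sketch of how the same three labels can be obtained as plain strings by converting the PeriodIndex to strings before taking the difference. This is an illustration under an assumption, not the intended fix: it assumes that comparing the string form of each period ('2022-01', and so on) is acceptable, and index1_as_str is a name made up for this example.

```python
import pandas as pd

index1 = pd.period_range('2022-01-01', periods=5, freq='M')
index2 = pd.Index(['2022-02', '2022-03'])

# Assumption: the string form of each monthly period ('YYYY-MM')
# is directly comparable with the labels in index2.
index1_as_str = index1.astype(str)

# Both operands are now string-labelled indexes.
result = index1_as_str.difference(index2)
print(list(result))
```

The exact dtype shown in the repr of the resulting Index may vary between pandas versions, so the sketch prints only the labels.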