Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
index1 = pd.period_range('2022-01-01', periods=5, freq='M')
index2 = pd.Index(['2022-02', '2022-03'])
result = index1.difference(index2)
print(result)
Issue Description
The example creates index1 with 5 months, and index2 with two months. The difference should be an index with three months. However, I'm getting the following:
PeriodIndex(['2022-01', '2022-02', '2022-03', '2022-04', '2022-05'], dtype='period[M]')
Looks like this issue was introduced by https://github.com/pandas-dev/pandas/pull/55108.
Expected Behavior
The example should print the following:
Index(['2022-02', '2022-03'], dtype='object')
Pandas 2.1.4 gives the expected behavior.
Installed Versions
Comment From: Aloqeely
Thanks for the report! On 2.2, index1.difference(index2) produces incorrect results but index2.difference(index1) works correctly. On 2.1.4 and before it is the opposite (i.e. index2.difference(index1) produces the wrong result).
PRs are welcome to fix this, ideally fixing both cases here.
Comment From: TiffL
Taking a look. It looks like this may be due to the difference in dtype.
Comment From: tev-dixon
I'll take this.
Can confirm that this bug still exists on the main branch. It seems that the dtype period[M] is causing the issue. By changing the reproducible example to create index1 like index2 and changing the dtype, we can recreate this issue (see reproducible example below).
import pandas as pd
index1 = pd.Index(['2022-01', '2022-02', '2022-03', '2022-04', '2022-05'], dtype='period[M]')
index2 = pd.Index(['2022-02', '2022-03'])
result = index1.difference(index2)
print(result)
I'm currently looking into the Index.difference method in pandas/core/indexes/base.py.
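To illustrate the dtype hypothesis, here is a minimal sketch of a workaround (not a fix for Index.difference itself). It assumes the strings in index2 are meant to be monthly periods and converts them to period[M] before taking the difference, so both sides share a dtype. The name index2_as_periods is only for illustration.

```python
import pandas as pd

index1 = pd.period_range("2022-01-01", periods=5, freq="M")
index2 = pd.Index(["2022-02", "2022-03"])

# Assumption: the string labels represent monthly periods, so parse them
# into a PeriodIndex with the same frequency as index1.
index2_as_periods = pd.PeriodIndex(index2, freq="M")

# With matching dtypes on both sides, the mixed-dtype code path is avoided.
result = index1.difference(index2_as_periods)
print(result)
```

With both operands as period[M], the result should contain only the three months that are not in index2, which suggests the problem is in how the mixed-dtype case is handled.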
Comment From: chilin0525
The bug can still be reproduced on the current main branch. I will continue the work from https://github.com/pandas-dev/pandas/pull/59148 to attempt a fix.
Comment From: chilin0525
take
Comment From: chilin0525
Hi @michaelpradel @Aloqeely, before I begin implementation, I would like to clarify whether the results of the following test cases align with expectations. Thank you!
- case 1:
```
index1 = pd.Index(['2022-01', '2022-02', '2022-03', '2022-04', '2022-05'], dtype='period[M]')
index2 = pd.Index(['2022-02', '2022-03'])
index1.difference(index2)
PeriodIndex(['2022-01', '2022-04', '2022-05'], dtype='period[M]')
```
- case 2:
```
index1 = pd.Index(['2022-01', '2022-02', '2022-03', '2022-04', '2022-05'], dtype='period[M]')
index2 = pd.Index(['2022-02', '2022-03'])
index2.difference(index1)
Index(['2022-02', '2022-03'], dtype='object')
```
Comment From: michaelpradel
Looking at the API documentation of Index.difference, I believe case 1 is the desired output.
Note that the example I gave above seems wrong, now that I look at it again. The correct expected output for the above example should be Index(['2022-01', '2022-04', '2022-05'], dtype='object').
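For anyone checking a given pandas build against this expectation, here is a small sketch that compares only the labels of the result. It deliberately ignores the result dtype (PeriodIndex vs. object Index), since that part is still being discussed here. The names expected_labels and matches are only for illustration.

```python
import pandas as pd

index1 = pd.period_range("2022-01-01", periods=5, freq="M")
index2 = pd.Index(["2022-02", "2022-03"])

result = index1.difference(index2)

# Compare string labels only, so the check does not depend on whether
# the result comes back as a PeriodIndex or an object-dtype Index.
expected_labels = ["2022-01", "2022-04", "2022-05"]
labels = [str(x) for x in result]
matches = labels == expected_labels

print(f"pandas {pd.__version__}: labels={labels}, matches expectation: {matches}")
```

On an affected version, labels would still include 2022-02 and 2022-03 and matches would be False.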