- [x] I have searched the [pandas] tag on StackOverflow for similar questions.
- [x] I have asked my usage related question on StackOverflow.
Question about pandas
When using the | operator between two DataFrames, the order of the operands can give different results if the data contains NaN or the indexes differ.
import pandas as pd
df_1 = pd.DataFrame([None], index=[1])
df_2 = pd.DataFrame([True], index=[1])
print(df_1 | df_2)
print(df_2 | df_1)
will give you different results:
0
1 False
0
1 True
The same thing happens with different indexes:
df_1 = pd.DataFrame([True], index=[1])
df_2 = pd.DataFrame([True], index=[2])
print(df_1 | df_2)
print(df_2 | df_1)
will return
0
1 True
2 False
0
1 False
2 True
Is this normal behavior or a bug? If it is normal, what is the logic behind it?
Thanks!
Comment From: mzeitlin11
Thanks for the report @ThomasDelorme! These both look suspicious - I think both are bugs. Think the root cause is along the lines of weirdness with truthy operators for object arrays - see the whole host of issues around #12863 in both numpy and pandas.
What looks to be the reason for the lack of commutativity in the first case is that the logic for computing the binary op hits None or True, which raises and falls back to giving the left operand (which then gets filled with False when the result is coerced to a bool type). So None or True -> None -> False and True or None -> True. This logic can be seen in https://github.com/pandas-dev/pandas/blob/e5f1f9c0d3d26c09bea89f12144bfe732fedb1ea/pandas/_libs/ops.pyx#L244-L250
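A minimal pure-Python sketch of the fallback described above, to make the asymmetry concrete. It is not the pandas source: `or_with_left_fallback` is a hypothetical helper, and the "keep the left operand, then coerce a missing value to False" steps are assumptions taken from the description in this comment.

```python
import operator


def or_with_left_fallback(left, right):
    """Hypothetical helper mimicking the fallback described in this comment."""
    try:
        result = operator.or_(left, right)
    except TypeError:
        # Assumed fallback: `|` is unsupported for None, so keep the left operand.
        result = left
    # Assumed coercion to bool: a missing value becomes False.
    return False if result is None else bool(result)


print(or_with_left_fallback(None, True))
print(or_with_left_fallback(True, None))
```

Under these assumptions the two calls disagree, which matches the non-commutative output in the report.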
Didn't look into second case yet, but I'd guess it's an identical issue -> since the indices don't match the frames will be reindexed before computing the binary op, giving the same NaN to bool comparison issues.
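The reindexing guess above can be made visible with `DataFrame.align`. This is only an illustration: whether the `|` operator goes through exactly this call internally is an assumption, but the outer-aligned frames show where the missing values would come from.

```python
import pandas as pd

df_1 = pd.DataFrame([True], index=[1])
df_2 = pd.DataFrame([True], index=[2])

# Align both frames on the union of their indexes (labels 1 and 2).
left, right = df_1.align(df_2, join="outer")

# Each frame gains a missing value at the label it did not have originally.
print(left)
print(right)
```

After alignment, `left` is missing at label 2 and `right` is missing at label 1, so each row ends up comparing a bool against a missing value, as in the first case.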
Further investigations into fixing this are welcome!
Comment From: corneliusroemer
It would be great if this could be fixed. It contradicts normal logic that NaN | True == False. If one operand is True, then or should yield True, irrespective of what NaN is.
NaN | False is trickier: it could be either False or True, depending on what NaN is, so NaN seems appropriate. Or throwing an error, like numpy does.
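The behavior proposed here corresponds to three-valued (Kleene) logic. A small sketch, using `None` as a stand-in for NaN; `kleene_or` is a hypothetical helper written for illustration, not a pandas function:

```python
def kleene_or(a, b):
    """Three-valued OR; None stands in for a missing value (hypothetical helper)."""
    if a is True or b is True:
        # One True operand decides the result, whatever the other one is.
        return True
    if a is None or b is None:
        # No True operand and at least one unknown: the result stays unknown.
        return None
    return False


for a, b in [(None, True), (True, None), (None, False), (False, None)]:
    print(a, "|", b, "->", kleene_or(a, b))
```

Unlike the reported behavior, this definition is commutative: swapping the operands never changes the result.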
This is probably a duplicate: #51267
Also, it would be good to rename the issue from QST: to BUG: as this clearly is a bug ;)