- [x] I have checked that this issue has not already been reported. (sorry if I missed it)
- [x] I have confirmed this bug exists on the latest version of pandas. (present on 1.2.3)
Code Sample, a copy-pastable example
# Set up
import pandas as pd
str_serie = pd.Series(["hello", None, "hi"]) # Note the None
true_serie = pd.Series([True] * 3)
print("1st case")
print(str_serie.str.contains('h') | true_serie) # prints True, False, True
print("2nd case")
print(true_serie | str_serie.str.contains('h')) # prints True, True, True
Problem description
- the commutative property of or (|) isn't respected.
- in the 1st case, pandas seems to stop evaluating the boolean expression at the string comparison if it encounters a null value.
Expected Output
In both cases, the output should be True, True, True.
If pandas is going to keep ignoring boolean expressions after a str operation on a null value, it should at least emit a warning about it.
Output of pd.show_versions()
Comment From: VodopaDev
In fact, pandas seems to stop evaluating boolean expressions only for rows where the string is null.
str_serie = pd.Series(["1", None, "2"])
true_serie = pd.Series([True] * 3)
(str_serie.str.contains("1")) | true_serie # prints True, False, True
Comment From: mzeitlin11
Thanks for the report @VodopaDev! Can confirm on master, but interestingly this works for the string type:
str_serie = pd.Series(["1", None, "2"], dtype='string')
true_serie = pd.Series([True] * 3)
print((str_serie.str.contains("1")) | true_serie) # prints True, True, True
Investigations welcome to fix!
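One possible explanation for the difference, offered as a sketch rather than a diagnosis: on the string dtype, str.contains returns the nullable boolean dtype, and | on that dtype follows Kleene logic (NA | True is True, NA | False is NA). The example below builds such masks directly with dtype="boolean" instead of going through str.contains; the variable names are illustrative only.

```python
import pandas as pd

# Nullable boolean mask with a missing entry in the middle.
na_mask = pd.Series([True, None, False], dtype="boolean")
all_true = pd.Series([True] * 3, dtype="boolean")
all_false = pd.Series([False] * 3, dtype="boolean")

# Kleene logic: NA | True is True, so both operand orders agree.
left_first = na_mask | all_true
right_first = all_true | na_mask
print(left_first.tolist())   # [True, True, True]
print(right_first.tolist())  # [True, True, True]

# NA | False stays NA, so the missing entry survives here.
with_false = na_mask | all_false
print(with_false.isna().tolist())  # [False, True, False]
```

If this is indeed the mechanism, it would explain why only the object-dtype result of str.contains shows the asymmetry.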
Comment From: asishm
I guess this can be summarized into
In [119]: pd.Series([True, False, np.nan, None, pd.NA]) | pd.Series([True] * 5)
Out[119]:
0 True
1 True
2 False
3 False
4 True
dtype: bool
In [120]: pd.Series([True] * 5) | pd.Series([True, False, np.nan, None, pd.NA])
Out[120]:
0 True
1 True
2 True
3 True
4 True
dtype: bool
Comment From: simonjayhawkins
> I guess this can be summarized into

Agreed. There appears to be an issue with the boolean commutativity of a Series with bool dtype and a Series with object dtype containing None or np.nan.
This issue is not really related to the contains method on the string accessor.
Indeed, str.contains accepts a na parameter so that a Series with bool dtype may be returned.
>>> import pandas as pd
>>>
>>> str_serie = pd.Series(["hello", None, "hi"]) # Note the None
>>> true_serie = pd.Series([True] * 3)
>>>
>>> print("1st case")
1st case
>>> print(str_serie.str.contains("h", na=False) | true_serie) # prints True, True, True
0 True
1 True
2 True
dtype: bool
>>>
>>> print("2nd case")
2nd case
>>> print(true_serie | str_serie.str.contains("h", na=False)) # prints True, True, True
0 True
1 True
2 True
dtype: bool
>>>
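For object-dtype masks that do not come from str.contains (so na= is not available), a similar effect can be had by converting to a plain bool Series before combining. The helper below is a hypothetical sketch, not a pandas API: to_bool_mask and its na_value parameter are names invented for this example, and treating missing entries as False is an assumption that mirrors na=False above.

```python
import numpy as np
import pandas as pd


def to_bool_mask(mask: pd.Series, na_value: bool = False) -> pd.Series:
    """Hypothetical helper: copy `mask` as bool dtype, mapping missing entries to `na_value`."""
    # pd.isna recognises None, np.nan and pd.NA alike.
    values = [na_value if pd.isna(v) else bool(v) for v in mask]
    return pd.Series(values, index=mask.index, name=mask.name, dtype=bool)


# Same mix of values as in the summary earlier in the thread.
object_mask = pd.Series([True, False, np.nan, None, pd.NA], dtype=object)
true_serie = pd.Series([True] * 5)

clean = to_bool_mask(object_mask)
print(clean.tolist())  # [True, False, False, False, False]

# Both operands are now bool dtype, so operand order no longer matters.
print((clean | true_serie).equals(true_serie | clean))  # True
```

This only sidesteps the asymmetry; it does not address the underlying behaviour of | between bool and object dtypes.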