Code Sample, a copy-pastable example if possible
import pandas as pd
#Create sample dataframe
test = pd.DataFrame({"a":["12", float("nan"), "3", "meow" ]})
# Select rows that are null or numeric:
print len(test[ test["a"].isnull() | test["a"].str.isnumeric()])
# Select rows that are numeric or null
print len(test[ test["a"].str.isnumeric() | test["a"].isnull() ])
# These two should return the same result, right? Unfortunately they don't
Notice that Series.str.isnumeric
throws an exception on null entries. However, I believe that this may introduce bugs silently
Expected Output
Expected output is: 2 2
Instead, I get: 3 2
output of pd.show_versions()
INSTALLED VERSIONS
commit: None python: 2.7.12.final.0 python-bits: 64 OS: Darwin OS-release: 15.6.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: None
pandas: 0.18.1
Comment From: TomAugspurger
FWIW, we follow python's logic here
In [40]: float('nan') or 1
Out[40]: nan
In [41]: 1 or float('nan')
Out[41]: 1
Your example is similar, the NaN | True
vs True | NaN
is causing the difference.
Thanks to short-circuiting, I don't think you can rely on |
and or
to be symmetrical.
In [44]: True or this_variable_is_undefined
Out[44]: True
The reverse of that throws a NameError.
However, I believe that this may introduce bugs silently
I suspect you're right 😄 I don't think we can change this behavior though.
Comment From: TomAugspurger
I should also say that you can get the desired output by test.a.str.isnumeric().fillna(False) | test.a.isnull()
Comment From: josepablog
I think I didn't phrase my issue correctly. We agree that short-circuiting may break commutativity.
I'm not talking about NaN | True. I'm talking about: code that raises an exception | True
If you run isnumeric
by itself, on my sample code, an exception is raised for the null values. The exception is NEVER raised when it's inside an or!
Comment From: TomAugspurger
Ah, you're saying test['a'].str.isnumeric()
raises an exception? I don't get one running master.
Comment From: josepablog
@TomAugspurger thank you for getting back to me. This raises an exception to me:
test[ test["a"].str.isnumeric() ]
Which is the simplified version we were testing. The code you wrote, doesn't raise an exception for me either.
Comment From: sinhrks
Yeah it raises ValueError
because Series
contains NaN
.
# on current master
test["a"].str.isnumeric()
# 0 True
# 1 NaN
# 2 True
# 3 False
# Name: a, dtype: object
I think .str.is_xxx
method should return False
rather than NaN
if input is not string likes.
# expected
test["a"].str.isnumeric()
# 0 True
# 1 False
# 2 True
# 3 False
# Name: a, dtype: object
Comment From: josepablog
Possibly. I'm more worried about inconsistent behavior inside an or. Look:
- test[ test["a"].str.isnumeric() ]
Raises exception, which may be appropriate behavior
- test[ test["a"].isnull() | test["a"].str.isnumeric()]
Does not raise exception due to short-circuiting, therefore this is appropriate behavior
- test[ test["a"].str.isnumeric() | test["a"].isnull() ]
Does not raise an exception, although it should!
Comment From: jorisvandenbossche
This boils down to the Series behaviour for |
with Nans:
In [10]: pd.Series([True, np.nan]) | pd.Series([False, True])
Out[10]:
0 True
1 False
dtype: bool
In [11]: pd.Series([False, True]) | pd.Series([True, np.nan])
Out[11]:
0 True
1 True
dtype: bool
So the comparison never propagates the NaN
. For this reason, you do not get any exception.
But, the difference between both is a bit surprising.
Comment From: jorisvandenbossche
The issue about NaNs in booleans is already discussed in https://github.com/pydata/pandas/issues/6528, so closing here as a duplicate.
Comment From: jorisvandenbossche
test[ test["a"].str.isnumeric() | test["a"].isnull() ]Does not raise an exception, although it should!
@josepablog Whether it should raise or not depends on the outcome of np.nan | True/False
. This currently returns False. If we decide that False
is the correct outcome, then the above code snippet should not raise. However, it could be argued that np.nan | True/False
should return np.nan
, and in that case the above code snippet will raise. But the return value of np.nan | True/False
is rather independent of this specific use case, and as in my previous comment, this is covered in #6528
Comment From: sinhrks
@jorisvandenbossche, @TomAugspurger what do you think .str
accessor to return error case, NaN
or False
?
Comment From: jorisvandenbossche
Ah yes, forgot that part of the issue. Is there any precendence for other of the string methods on how they handle errors / NaNs?
Comment From: TomAugspurger
@sinhrks I prefer the current behavior of str.is_*
propagating NAs.
1. It seems more consistent with other non-reducing behavior for how it treats NAs (eg. nan + 1 is nan)
2. If you seems easier to followup the is_*
with a fillna than to somehow recover the NA
s if you do want to propagate them.
We could document this better. This is a common enough gotcha that we should include a warning box in the prose docs. Another option is to add a fill_value
to all the is*
methods for NA behavior, defaulting to None/np.nan
. That would give the behavior a bit more visibility.