Pandas "Or" does not have commutative property (possibly because it catches exceptions silently?)

Code Sample, a copy-pastable example if possible

import pandas as pd

#Create sample dataframe
test = pd.DataFrame({"a":["12", float("nan"), "3", "meow" ]})

# Select rows that are null or numeric:
print(len(test[ test["a"].isnull() | test["a"].str.isnumeric() ]))

# Select rows that are numeric or null
print(len(test[ test["a"].str.isnumeric() | test["a"].isnull() ]))

# These two should return the same result, right? Unfortunately they don't

Notice that Series.str.isnumeric throws an exception on null entries. However, I believe that this may introduce bugs silently

Expected Output

Expected output is: 2 2

Instead, I get: 3 2

output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1

Comment From: TomAugspurger

FWIW, we follow python's logic here

In [40]: float('nan') or 1
Out[40]: nan

In [41]: 1 or float('nan')
Out[41]: 1

Your example is similar: NaN | True vs. True | NaN is what causes the difference.

Thanks to short-circuiting, I don't think you can rely on | and or to be symmetrical.


In [44]: True or this_variable_is_undefined
Out[44]: True

The reverse of that throws a NameError.
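
As a side note, here is a small plain-Python sketch of how `or` and `|` differ in evaluation: `or` short-circuits, whereas `|` is an ordinary binary operator whose operands are both evaluated before they are combined. The helper names `left` and `right` are made up for illustration.

```python
calls = []

def left():
    # Hypothetical helper: records that it ran, then returns True.
    calls.append("left")
    return True

def right():
    # Hypothetical helper: records that it ran, then returns False.
    calls.append("right")
    return False

# `or` short-circuits: right() is never called once left() is truthy.
result_or = left() or right()
calls_or = list(calls)

calls.clear()

# `|` evaluates both operands first, then combines them.
result_pipe = left() | right()
calls_pipe = list(calls)

print(calls_or)
print(calls_pipe)
```

So an expression on the right-hand side of `|` always runs, regardless of the value on the left.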

> However, I believe that this may introduce bugs silently

I suspect you're right 😄 I don't think we can change this behavior though.

Comment From: TomAugspurger

I should also say that you can get the desired output by test.a.str.isnumeric().fillna(False) | test.a.isnull()
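
A minimal sketch of that workaround, using the sample frame from the report. The trailing `.astype(bool)` is an extra step added here (not part of the one-liner above) so that both operands are plain boolean Series; the exact dtype returned by `.str.isnumeric()` may differ between pandas versions.

```python
import pandas as pd

test = pd.DataFrame({"a": ["12", float("nan"), "3", "meow"]})

# Replace missing results with False before combining the masks.
# The .astype(bool) is an added assumption to force a boolean dtype.
is_numeric = test["a"].str.isnumeric().fillna(False).astype(bool)
is_null = test["a"].isnull()

# With no NaN left in either mask, the operand order no longer matters.
numeric_or_null = test[is_numeric | is_null]
null_or_numeric = test[is_null | is_numeric]

print(len(numeric_or_null), len(null_or_numeric))
```

Both orderings should select the same rows, because neither mask contains a missing value by the time `|` is applied.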

Comment From: josepablog

I think I didn't phrase my issue correctly. We agree that short-circuiting may break commutativity.

I'm not talking about NaN | True. I'm talking about: code that raises an exception | True

If you run isnumeric by itself, on my sample code, an exception is raised for the null values. The exception is NEVER raised when it's inside an or!

Comment From: TomAugspurger

Ah, you're saying test['a'].str.isnumeric() raises an exception? I don't get one running master.

Comment From: josepablog

@TomAugspurger thank you for getting back to me. This raises an exception for me:

test[ test["a"].str.isnumeric() ]

That is the simplified version we were testing. The code you wrote doesn't raise an exception for me either.

Comment From: sinhrks

Yeah, it raises ValueError because the Series contains NaN.

# on current master
test["a"].str.isnumeric() 
# 0     True
# 1      NaN
# 2     True
# 3    False
# Name: a, dtype: object

I think the .str.is_xxx methods should return False rather than NaN if the input is not string-like.

# expected
test["a"].str.isnumeric() 
# 0     True
# 1    False
# 2     True
# 3    False
# Name: a, dtype: object

Comment From: josepablog

Possibly. I'm more worried about inconsistent behavior inside an or. Look:

- test[ test["a"].str.isnumeric() ] raises an exception, which may be appropriate behavior.
- test[ test["a"].isnull() | test["a"].str.isnumeric() ] does not raise an exception due to short-circuiting, so this is appropriate behavior.
- test[ test["a"].str.isnumeric() | test["a"].isnull() ] does not raise an exception, although it should!

Comment From: jorisvandenbossche

This boils down to the Series behaviour for | with NaNs:

In [10]: pd.Series([True, np.nan]) | pd.Series([False, True])
Out[10]:
0     True
1    False
dtype: bool

In [11]: pd.Series([False, True]) | pd.Series([True, np.nan])
Out[11]:
0    True
1    True
dtype: bool

So the comparison never propagates the NaN. For this reason, you do not get any exception. But, the difference between both is a bit surprising.

Comment From: jorisvandenbossche

The issue about NaNs in booleans is already discussed in https://github.com/pydata/pandas/issues/6528, so closing here as a duplicate.

Comment From: jorisvandenbossche

> test[ test["a"].str.isnumeric() | test["a"].isnull() ] does not raise an exception, although it should!

@josepablog Whether it should raise or not depends on the outcome of np.nan | True/False. This currently returns False. If we decide that False is the correct outcome, then the above code snippet should not raise. However, it could be argued that np.nan | True/False should return np.nan, and in that case the above code snippet will raise. But the return value of np.nan | True/False is rather independent of this specific use case, and as in my previous comment, this is covered in #6528

Comment From: sinhrks

@jorisvandenbossche, @TomAugspurger what do you think the .str accessor should return in the error case: NaN or False?

Comment From: jorisvandenbossche

Ah yes, I forgot that part of the issue. Is there any precedent in the other string methods for how they handle errors / NaNs?

Comment From: TomAugspurger

@sinhrks I prefer the current behavior of str.is_* propagating NAs.

1. It seems more consistent with how other non-reducing operations treat NAs (e.g. nan + 1 is nan).
2. It seems easier to follow up the is_* call with a fillna than to somehow recover the NAs if you do want to propagate them.

We could document this better. This is a common enough gotcha that we should include a warning box in the prose docs. Another option is to add a fill_value to all the is* methods for NA behavior, defaulting to None/np.nan. That would give the behavior a bit more visibility.
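
To make the fill_value idea concrete, here is a rough sketch written as a standalone function. `isnumeric_filled` is a hypothetical helper, not an existing pandas API, and for simplicity it requires a boolean fill value rather than defaulting to None/np.nan as suggested above.

```python
import pandas as pd

def isnumeric_filled(series, fill_value=False):
    """Hypothetical helper (not part of pandas): like .str.isnumeric(),
    but returns `fill_value` wherever the input or the result is missing."""
    checked = series.str.isnumeric()
    # Treat an entry as missing if either the input or the check result is NA.
    missing = series.isna() | checked.isna()
    return checked.mask(missing, fill_value).astype(bool)

s = pd.Series(["12", float("nan"), "3", "meow"])

mask_false = isnumeric_filled(s, fill_value=False)
mask_true = isnumeric_filled(s, fill_value=True)

print(mask_false.tolist())
print(mask_true.tolist())
```

Making the caller choose the fill value keeps the NA handling visible at the call site, which is the visibility argument made above.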
