Code Sample, a copy-pastable example if possible

In [93]: ser.eq('nil').sum()
Out[93]: 1

In [94]: ser.replace('nil', pd.NA).eq('nil').sum()
/home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/missing.py:47: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask = arr == x
Out[94]: 1

In [95]: ser.loc[(ser == 'nil').fillna(False)] = pd.NA

In [104]: ser.eq('nil').sum()
Out[104]: 0


Output of pd.show_versions()

[paste the output of ``pd.show_versions()`` here below this line] INSTALLED VERSIONS ------------------ commit : None python : 3.7.5.final.0 python-bits : 64 OS : Linux OS-release : 5.4.18-100.fc30.x86_64 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : nb_NO.UTF-8 LOCALE : nb_NO.UTF-8 pandas : 1.0.1 numpy : 1.17.3 pytz : 2019.3 dateutil : 2.8.0 pip : 19.3.1 setuptools : 41.6.0.post20191030 Cython : 0.29.13 pytest : 5.2.2 hypothesis : None sphinx : 2.2.1 blosc : None feather : None xlsxwriter : 1.2.2 lxml.etree : 4.4.1 html5lib : 1.0.1 pymysql : None psycopg2 : 2.8.4 (dt dec pq3 ext lo64) jinja2 : 2.10.3 IPython : 7.9.0 pandas_datareader: None bs4 : 4.8.1 bottleneck : 1.2.1 fastparquet : None gcsfs : None lxml.etree : 4.4.1 matplotlib : 2.2.3 numexpr : 2.7.0 odfpy : None openpyxl : 3.0.0 pandas_gbq : None pyarrow : 0.15.1 pytables : None pytest : 5.2.2 pyxlsb : None s3fs : None scipy : 1.3.1 sqlalchemy : 1.3.10 tables : 3.5.2 tabulate : 0.8.5 xarray : None xlrd : 1.2.0 xlwt : 1.3.0 xlsxwriter : 1.2.2 numba : 0.46.0

Comment From: MarcoGorelli

Thanks @tsoernes

Could you include a minimal reproducible example? I couldn't reproduce this

In [1]: import pandas as pd                                                                                                                                                                                      
In [2]: ser = pd.DataFrame({'a': ['nil']})                                                                                                                                                                       

In [3]: ser                                                                                                                                                                                                      
Out[3]: 
     a
0  nil

In [4]: ser.replace('nil', pd.NA)                                                                                                                                                                                
Out[4]: 
      a
0  <NA>

In [5]: ser.replace('nil', pd.NA).eq('nil').sum()                                                                                                                                                                
Out[5]: 
a    0
dtype: int64

Comment From: tsoernes

@MarcoGorelli

In the following example, I extract out two rows from the dataframe and recreate it via a dict. The first row has a 'nil' in the 'amount' column; the second row has no 'nil'. If I only recreate the dataframe with the first row, then the problem does not show itself. When going via JSON, the problem does not persist.


In [282]: di = df.loc[[26123, 26122]].to_dict()

In [283]: di
Out[283]: 
<redacted since problem isolated later in thread>

In [284]: df = pd.DataFrame.from_dict(di)

In [285]: df.eq('nil').sum()
/home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py:253: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  res_values = method(rvalues)
Out[285]: 
id                  0
properties          0
url                 0
started_at          0
ended_at            0
investee_id         0
lead_investor_id    0
created_at          0
updated_at          0
round               0
amount              1
currency            0
dtype: int64

In [286]: df.replace('nil', pd.NA).eq('nil').sum()
/home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/missing.py:47: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask = arr == x
/home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py:253: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  res_values = method(rvalues)
Out[289]: 
id                  0
properties          0
url                 0
started_at          0
ended_at            0
investee_id         0
lead_investor_id    0
created_at          0
updated_at          0
round               0
amount              1
currency            0
dtype: int64

Comment From: MarcoGorelli

@tsoernes OK, got it - it fails when there is another row with pd.NA:

>>> ser = pd.DataFrame({'a': ['nil', pd.NA]}) 
>>> ser.replace('nil', 'anything else')                                                                                                                                                                  
/home/SERILOCAL/m.gorelli/pandas/pandas/core/missing.py:48: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask = arr == x
      a
0   nil
1  <NA>

Comment From: MarcoGorelli

Problem's related to this:

>>> np.array([['nil', pd.NA]]) == 'nil'
False
>>> np.array([['nil', 'anything else']]) == 'nil'
array([[ True, False]])

Comment From: AnnaDaglis

take

Comment From: AnnaDaglis

@MarcoGorelli Is there any particular approach here you could suggest to fix the bug? There is no issue when using np.nan, not pd.NA.

>>> np.array([['nil', np.nan]]) == 'nil'
array([[ True, False]])

Comment From: MarcoGorelli

Hi Anna - gonna be honest, I don't know :) I had a quick go at this one but couldn't figure it out. If you have any specific questions I'd imagine you could tag one of the core team members (e.g. Tom Augspurger) and ask.

On Tue, 3 Mar 2020 13:59 Anna Daglis, notifications@github.com wrote:

@MarcoGorelli https://github.com/MarcoGorelli

Is there any particular approach here you could suggest to fix the bug? There is no issue when using np.nan, not pd.NA.

np.array([['nil', np.nan]]) == 'nil' array([[ True, False]])

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/32075?email_source=notifications&email_token=AH7QVMAQXWDFDIOEHDX3T7TRFUENRA5CNFSM4KXDNN42YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENTSZQI#issuecomment-593964225, or unsubscribe https://github.com/notifications/unsubscribe-auth/AH7QVMHQJKUXY46CWPZW2NDRFUENRANCNFSM4KXDNN4Q .

Comment From: AnnaDaglis

The problem in this case lies in pandas.core.missing, mask_missing(arr, values_to_mask) function, the following clause:

if is_numeric_v_string_like(arr, x):
    # GH#29553 prevent numpy deprecation warnings
    mask = False
else:
    mask = arr == x

The problem arises here arr == x, which fails when arr contains pd.NA, as pointed out above.

@TomAugspurger @jorisvandenbossche Do you have any suggestions how to tackle this bug? I noticed that handling pd.NA behaviour has been discussed in other issues, e.g. this one: https://github.com/pandas-dev/pandas/issues/32265

Comment From: TomAugspurger

I'm not sure offhand, sorry. But it seems like NA values should be considered False there, so perhaps fill it with something other than the value to mask? Some of the methods in core.arrays.boolean may be helpful to look through.

Comment From: tsoernes

Can we get this for 1.0.2 that would be awesome

Comment From: jorisvandenbossche

Note that this is working fine with string dtype:

In [4]: ser = pd.DataFrame({'a': ['nil', pd.NA]}, dtype="string") 

In [5]: ser.replace('nil', 'anything else')
Out[5]: 
               a
0  anything else
1           <NA>

So it is specifically for NA in object dtype (something we actually don't really test yet, I think).

The problem arises here arr == x, which fails when arr contains pd.NA, as pointed out above.

This might be related to the comparison deprecations in numpy (there are other cases where comparisons return a scalar True/False, instead of doing it element-wise), you also see the deprecation warning. Now in general, numpy might not be able to handle this: the comparison with pd.NA returns pd.NA, so it cannot be put in a boolean ndarray (which numpy expects to do for a comparison, I suppose). So the deprecation might be correct that it will fail in the future.

If we want to handle pd.NA in object dtype better, we will need to start using masks as well, and not rely on numpy behaviour.

For example, also this is wrong:

In [9]: pd.Series([1, pd.NA], dtype=object) >= 1
Out[9]: 
0     True
1    False
dtype: bool

(we should probably open a separate issue about "pd.NA in object dtype", and the best way forward here generally)

Comment From: simonjayhawkins

(we should probably open a separate issue about "pd.NA in object dtype", and the best way forward here generally)

32931

Comment From: tsibley

@jorisvandenbossche

Note that this is working fine with string dtype:

As of 1.1.0 at least, it's not working fine:

>>> import pandas as pd
>>> pd.__version__
'1.1.0'

>>> s = pd.Series(["a", "", "b", pd.NA], dtype="string")
>>> s
0       a
1        
2       b
3    <NA>
dtype: string

>>> s.replace("", pd.NA)
.../lib/python3.6/site-packages/pandas/core/missing.py:49: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask = arr == x
0       a
1        
2       b
3    <NA>
dtype: object

Comment From: tsibley

Note however, that the one-dict-arg form of replace does appear to work (!):

>>> s.replace({"": pd.NA})
0       a
1    <NA>
2       b
3    <NA>
dtype: object

but as described in #35268, it changes the dtype of the Series from whatever it was to object.

I'd (naively, perhaps) expect two-arg replace and one-dict-arg replace to both use the same codepath internally!

Comment From: tsibley

But it seems like NA values should be considered False there, so perhaps fill it with something other than the value to mask?

This makes sense to me and matches my debugging. I'm testing this patch now:

diff --git a/pandas/core/missing.py b/pandas/core/missing.py
index 7802c5cbd..be1632315 100644
--- a/pandas/core/missing.py
+++ b/pandas/core/missing.py
@@ -46,7 +46,7 @@ def mask_missing(arr, values_to_mask):
                 # GH#29553 prevent numpy deprecation warnings
                 mask = False
             else:
-                mask = arr == x
+                mask = (arr == x).fillna(False)

             # if x is a string and arr is not, then we get False and we must
             # expand the mask to size arr.shape
@@ -57,7 +57,7 @@ def mask_missing(arr, values_to_mask):
                 # GH#29553 prevent numpy deprecation warnings
                 mask |= False
             else:
-                mask |= arr == x
+                mask |= (arr == x).fillna(False)

     if na_mask.any():
         if mask is None:

Comment From: tsibley

The patch above doesn't work (unsurprisingly) because sometimes arr is a Pandas array and sometimes it's a NumPy array, but only Pandas arrays have a fillna method.

The patch below seems to work with some manual testing and the result of ./test_fast.sh is the same before/after applying it:

diff --git a/pandas/core/missing.py b/pandas/core/missing.py
index 7802c5cbd..f8280c0a2 100644
--- a/pandas/core/missing.py
+++ b/pandas/core/missing.py
@@ -46,7 +46,7 @@ def mask_missing(arr, values_to_mask):
                 # GH#29553 prevent numpy deprecation warnings
                 mask = False
             else:
-                mask = arr == x
+                mask = np.where(~isna(arr), arr, np.full_like(arr, np.nan)) == x

             # if x is a string and arr is not, then we get False and we must
             # expand the mask to size arr.shape
@@ -57,7 +57,7 @@ def mask_missing(arr, values_to_mask):
                 # GH#29553 prevent numpy deprecation warnings
                 mask |= False
             else:
-                mask |= arr == x
+                mask |= np.where(~isna(arr), arr, np.full_like(arr, np.nan)) == x

     if na_mask.any():
         if mask is None:

This replaces pd.NA values with NumPy NaN and NaT values, which do compare with == correctly and produce a boolean vector instead of a boolean scalar.

I want to write an actual set of tests cases for this bug for various dtypes though before submitting a PR. I believe the patch above may still fail for pd.Period types.

Comment From: MarcusJellinghaus

Here is another example - at least I think it is the same bug:

import pandas as pd import numpy as np series1 = pd.Series([""]).replace("", np.nan) # works print(series1.values) # [nan] series2 = pd.Series(["", pd.NA]).replace("", np.nan) # fails print(series2.values) # ['' <NA>] series3 = pd.Series(["", pd.NA]).replace(pd.NA, np.nan).replace("", np.nan) # possible workaround print(series3.values) # [nan nan]

Maybe this would helps to fix the bug.

It would be great if you could fix the bug.

Comment From: phofl

This works now, may need tests