Code Sample, a copy-pastable example if possible
In [93]: ser.eq('nil').sum()
Out[93]: 1
In [94]: ser.replace('nil', pd.NA).eq('nil').sum()
/home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/missing.py:47: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
mask = arr == x
Out[94]: 1
In [95]: ser.loc[(ser == 'nil').fillna(False)] = pd.NA
In [104]: ser.eq('nil').sum()
Out[104]: 0
Output of pd.show_versions()
Comment From: MarcoGorelli
Thanks @tsoernes
Could you include a minimal reproducible example? I couldn't reproduce this
In [1]: import pandas as pd
In [2]: ser = pd.DataFrame({'a': ['nil']})
In [3]: ser
Out[3]:
a
0 nil
In [4]: ser.replace('nil', pd.NA)
Out[4]:
a
0 <NA>
In [5]: ser.replace('nil', pd.NA).eq('nil').sum()
Out[5]:
a 0
dtype: int64
Comment From: tsoernes
@MarcoGorelli
In the following example, I extract out two rows from the dataframe and recreate it via a dict. The first row has a 'nil' in the 'amount' column; the second row has no 'nil'. If I only recreate the dataframe with the first row, then the problem does not show itself. When going via JSON, the problem does not persist.
In [282]: di = df.loc[[26123, 26122]].to_dict()
In [283]: di
Out[283]:
<redacted since problem isolated later in thread>
In [284]: df = pd.DataFrame.from_dict(di)
In [285]: df.eq('nil').sum()
/home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py:253: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
res_values = method(rvalues)
Out[285]:
id 0
properties 0
url 0
started_at 0
ended_at 0
investee_id 0
lead_investor_id 0
created_at 0
updated_at 0
round 0
amount 1
currency 0
dtype: int64
In [286]: df.replace('nil', pd.NA).eq('nil').sum()
/home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/missing.py:47: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
mask = arr == x
/home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py:253: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
res_values = method(rvalues)
Out[289]:
id 0
properties 0
url 0
started_at 0
ended_at 0
investee_id 0
lead_investor_id 0
created_at 0
updated_at 0
round 0
amount 1
currency 0
dtype: int64
Comment From: MarcoGorelli
@tsoernes OK, got it - it fails when there is another row with pd.NA
:
>>> ser = pd.DataFrame({'a': ['nil', pd.NA]})
>>> ser.replace('nil', 'anything else')
/home/SERILOCAL/m.gorelli/pandas/pandas/core/missing.py:48: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
mask = arr == x
a
0 nil
1 <NA>
Comment From: MarcoGorelli
Problem's related to this:
>>> np.array([['nil', pd.NA]]) == 'nil'
False
>>> np.array([['nil', 'anything else']]) == 'nil'
array([[ True, False]])
Comment From: AnnaDaglis
take
Comment From: AnnaDaglis
@MarcoGorelli
Is there any particular approach here you could suggest to fix the bug? There is no issue when using np.nan
, not pd.NA
.
>>> np.array([['nil', np.nan]]) == 'nil'
array([[ True, False]])
Comment From: MarcoGorelli
Hi Anna - gonna be honest, I don't know :) I had a quick go at this one but couldn't figure it out. If you have any specific questions I'd imagine you could tag one of the core team members (e.g. Tom Augspurger) and ask.
On Tue, 3 Mar 2020 13:59 Anna Daglis, notifications@github.com wrote:
@MarcoGorelli https://github.com/MarcoGorelli
Is there any particular approach here you could suggest to fix the bug? There is no issue when using np.nan, not pd.NA.
np.array([['nil', np.nan]]) == 'nil' array([[ True, False]])
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/32075?email_source=notifications&email_token=AH7QVMAQXWDFDIOEHDX3T7TRFUENRA5CNFSM4KXDNN42YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENTSZQI#issuecomment-593964225, or unsubscribe https://github.com/notifications/unsubscribe-auth/AH7QVMHQJKUXY46CWPZW2NDRFUENRANCNFSM4KXDNN4Q .
Comment From: AnnaDaglis
The problem in this case lies in pandas.core.missing
, mask_missing(arr, values_to_mask)
function, the following clause:
if is_numeric_v_string_like(arr, x):
# GH#29553 prevent numpy deprecation warnings
mask = False
else:
mask = arr == x
The problem arises here arr == x
, which fails when arr
contains pd.NA
, as pointed out above.
@TomAugspurger @jorisvandenbossche
Do you have any suggestions how to tackle this bug? I noticed that handling pd.NA
behaviour has been discussed in other issues, e.g. this one: https://github.com/pandas-dev/pandas/issues/32265
Comment From: TomAugspurger
I'm not sure offhand, sorry. But it seems like NA values should be considered False there, so perhaps fill it with something other than the value to mask? Some of the methods in core.arrays.boolean
may be helpful to look through.
Comment From: tsoernes
Can we get this for 1.0.2 that would be awesome
Comment From: jorisvandenbossche
Note that this is working fine with string dtype:
In [4]: ser = pd.DataFrame({'a': ['nil', pd.NA]}, dtype="string")
In [5]: ser.replace('nil', 'anything else')
Out[5]:
a
0 anything else
1 <NA>
So it is specifically for NA in object dtype (something we actually don't really test yet, I think).
The problem arises here arr == x, which fails when arr contains pd.NA, as pointed out above.
This might be related to the comparison deprecations in numpy (there are other cases where comparisons return a scalar True/False, instead of doing it element-wise), you also see the deprecation warning. Now in general, numpy might not be able to handle this: the comparison with pd.NA returns pd.NA, so it cannot be put in a boolean ndarray (which numpy expects to do for a comparison, I suppose). So the deprecation might be correct that it will fail in the future.
If we want to handle pd.NA in object dtype better, we will need to start using masks as well, and not rely on numpy behaviour.
For example, also this is wrong:
In [9]: pd.Series([1, pd.NA], dtype=object) >= 1
Out[9]:
0 True
1 False
dtype: bool
(we should probably open a separate issue about "pd.NA in object dtype", and the best way forward here generally)
Comment From: simonjayhawkins
(we should probably open a separate issue about "pd.NA in object dtype", and the best way forward here generally)
32931
Comment From: tsibley
@jorisvandenbossche
Note that this is working fine with string dtype:
As of 1.1.0 at least, it's not working fine:
>>> import pandas as pd
>>> pd.__version__
'1.1.0'
>>> s = pd.Series(["a", "", "b", pd.NA], dtype="string")
>>> s
0 a
1
2 b
3 <NA>
dtype: string
>>> s.replace("", pd.NA)
.../lib/python3.6/site-packages/pandas/core/missing.py:49: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
mask = arr == x
0 a
1
2 b
3 <NA>
dtype: object
Comment From: tsibley
Note however, that the one-dict-arg form of replace
does appear to work (!):
>>> s.replace({"": pd.NA})
0 a
1 <NA>
2 b
3 <NA>
dtype: object
but as described in #35268, it changes the dtype of the Series from whatever it was to object.
I'd (naively, perhaps) expect two-arg replace and one-dict-arg replace to both use the same codepath internally!
Comment From: tsibley
But it seems like NA values should be considered False there, so perhaps fill it with something other than the value to mask?
This makes sense to me and matches my debugging. I'm testing this patch now:
diff --git a/pandas/core/missing.py b/pandas/core/missing.py
index 7802c5cbd..be1632315 100644
--- a/pandas/core/missing.py
+++ b/pandas/core/missing.py
@@ -46,7 +46,7 @@ def mask_missing(arr, values_to_mask):
# GH#29553 prevent numpy deprecation warnings
mask = False
else:
- mask = arr == x
+ mask = (arr == x).fillna(False)
# if x is a string and arr is not, then we get False and we must
# expand the mask to size arr.shape
@@ -57,7 +57,7 @@ def mask_missing(arr, values_to_mask):
# GH#29553 prevent numpy deprecation warnings
mask |= False
else:
- mask |= arr == x
+ mask |= (arr == x).fillna(False)
if na_mask.any():
if mask is None:
Comment From: tsibley
The patch above doesn't work (unsurprisingly) because sometimes arr
is a Pandas array and sometimes it's a NumPy array, but only Pandas arrays have a fillna
method.
The patch below seems to work with some manual testing and the result of ./test_fast.sh
is the same before/after applying it:
diff --git a/pandas/core/missing.py b/pandas/core/missing.py
index 7802c5cbd..f8280c0a2 100644
--- a/pandas/core/missing.py
+++ b/pandas/core/missing.py
@@ -46,7 +46,7 @@ def mask_missing(arr, values_to_mask):
# GH#29553 prevent numpy deprecation warnings
mask = False
else:
- mask = arr == x
+ mask = np.where(~isna(arr), arr, np.full_like(arr, np.nan)) == x
# if x is a string and arr is not, then we get False and we must
# expand the mask to size arr.shape
@@ -57,7 +57,7 @@ def mask_missing(arr, values_to_mask):
# GH#29553 prevent numpy deprecation warnings
mask |= False
else:
- mask |= arr == x
+ mask |= np.where(~isna(arr), arr, np.full_like(arr, np.nan)) == x
if na_mask.any():
if mask is None:
This replaces pd.NA
values with NumPy NaN
and NaT
values, which do compare with ==
correctly and produce a boolean vector instead of a boolean scalar.
I want to write an actual set of tests cases for this bug for various dtypes though before submitting a PR. I believe the patch above may still fail for pd.Period
types.
Comment From: MarcusJellinghaus
Here is another example - at least I think it is the same bug:
import pandas as pd
import numpy as np
series1 = pd.Series([""]).replace("", np.nan) # works
print(series1.values) # [nan]
series2 = pd.Series(["", pd.NA]).replace("", np.nan) # fails
print(series2.values) # ['' <NA>]
series3 = pd.Series(["", pd.NA]).replace(pd.NA, np.nan).replace("", np.nan) # possible workaround
print(series3.values) # [nan nan]
Maybe this would helps to fix the bug.
It would be great if you could fix the bug.
Comment From: phofl
This works now, may need tests