Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame({'A': [0, 1, 2]})
print(df)
# A
# 0 0
# 1 1
# 2 2
df['A'].replace(to_replace=2, value=99, inplace=True)
print(df)
# A
# 0 0
# 1 1
# 2 99
df.at[0, 'A'] = pd.NA
df['A'].replace(to_replace=1, value=100, inplace=True)
print(df)
# A
# 0 <NA>
# 1 1 <-- should be 100
# 2 99
Issue Description
The pandas replace function does not seem to work on a column that contains at least one pd.NA value.
Expected Behavior
The replace function should work even if the column contains pd.NA values.
Installed Versions
Comment From: phofl
Hi, thanks for your report.
Did you check on 1.4.2 and main? Because this works on both for me.
Comment From: phofl
might need tests
Comment From: johnmantios
Hi @phofl. I tried replicating the issue in 1.4.3 on my local machine and I got the following error:
df["A"].replace(to_replace=1, value=100, inplace=True)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/series.py", line 4960, in replace
return super().replace(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/generic.py", line 6747, in replace
new_data = self._mgr.replace(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 441, in replace
return self.apply(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 304, in apply
applied = getattr(b, f)(**kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 683, in replace
mask = missing.mask_missing(values, to_replace)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/missing.py", line 98, in mask_missing
new_mask = new_mask.to_numpy(dtype=bool, na_value=False)
AttributeError: 'bool' object has no attribute 'to_numpy'
What versions do you use in your own environment? I'd be curious to know since you say it works fine for you. Mine are:
INSTALLED VERSIONS
commit : e8093ba372f9adfe79439d90fe74b0b5b6dea9d6
python : 3.10.5.final.0
python-bits : 64
OS : Darwin
OS-release : 17.7.0
Version : Darwin Kernel Version 17.7.0: Fri Oct 30 13:34:27 PDT 2020; root:xnu-4570.71.82.8~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : el_GR.UTF-8
LOCALE : el_GR.UTF-8
pandas : 1.4.3
numpy : 1.23.0
pytz : 2022.1
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.0.4
Cython : 0.29.30
pytest : 7.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
Comment From: phofl
Hmm, good point: it does not work on 1.4.2, but it does on main.
Comment From: MikiPWata
Hi there, I'm interested in working on this issue (this will be my first contribution). I'm guessing I'm supposed to write some unit tests related to this bug?
Comment From: NaveenKaidbettu
Hi @phofl,
I get a similar error to @johnmantios; the issue seems to occur on 1.4.3 as well. The error is raised on line 100 of ..\pandas\core\missing.py. No definition of to_numpy is accessible from any of the imports.
Comment From: jcbedoyam
I tried solving this issue by overloading the comparison operator __eq__ for the NA class, since comparisons return <NA> again instead of a boolean. However, this fails the arithmetic comparison tests. Is there a reason why the comparison for this class is implemented this way, or can I rewrite the comparison tests to fit the new comparison functionality?
Comment From: phofl
NA = NA is again NA, this happens on purpose
Comment From: pspiagicw
Is anybody working on this? Can I investigate?
Comment From: Shadimrad
"NA = NA is again NA, this happens on purpose"
Just out of curiosity, may I ask why? @phofl
Comment From: phofl
https://en.m.wikipedia.org/wiki/Three-valued_logic
Kleene logic
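For illustration, the three-valued behaviour can be observed directly on pd.NA. This is a minimal sketch, assuming a pandas version that provides pd.NA (1.0 or later); the variable names are only for this example.

```python
import pandas as pd

# Comparing an unknown value with anything stays unknown.
eq_result = pd.NA == pd.NA      # <NA>, not True

# Kleene logic: the result is known only when the other operand decides it.
or_true = pd.NA | True          # True, whatever the unknown value is
and_false = pd.NA & False       # False, whatever the unknown value is
and_true = pd.NA & True         # <NA>, the result depends on the unknown

# Because <NA> is neither True nor False, it cannot be coerced to bool.
try:
    bool(pd.NA)
    truthiness_error = None
except TypeError as exc:
    truthiness_error = exc
```

This is why an elementwise comparison involving pd.NA cannot simply be collapsed into a plain boolean without deciding how to treat the unknown entries.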
Comment From: Shadimrad
Thank you so much!
Comment From: phofl
Works on main
df = pd.DataFrame({'A': [0, 1, 2]}, dtype="Int64")
df.at[0, 'A'] = pd.NA
df['A'].replace(to_replace=1, value=100, inplace=True)
returns
A
0 <NA>
1 100
2 2
Comment From: yuanx749
I would like to take this. @phofl
The bug exists if dtype is not specified, as in df = pd.DataFrame({"A": [pd.NA, 1, 2]}).
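A possible workaround sketch, assuming the column really holds integers: cast the object column to the nullable Int64 dtype before calling replace, which is the case reported above as working. The variable names are illustrative, and the non-inplace form of replace is an assumption made here so the example does not depend on chained-assignment behaviour.

```python
import pandas as pd

# Without an explicit dtype, a column mixing pd.NA and ints is stored as object.
df = pd.DataFrame({"A": [pd.NA, 1, 2]})

# Cast to the nullable integer dtype, then replace and assign the result back.
df["A"] = df["A"].astype("Int64")
df["A"] = df["A"].replace(to_replace=1, value=100)

replaced = df["A"]
```

This sidesteps the object-dtype code path rather than fixing it, so it is only suitable when a nullable dtype fits the data.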
Comment From: AkshayJain1995
Hi @phofl, if the issue still exists, can I pick this up? Please assign this to me. Thanks!
Comment From: Lahiry
I just used your reproducible example and it works fine for me. I guess you can mark this issue as closed, as it has probably been fixed by now. But please let me know if you continue having the issue, because I'm interested in helping!
Comment From: phofl
We try to add tests when something was fixed without the issue being closed.
Comment From: vsbits
take
Comment From: vsbits
Tested using the Docker image and the bug still exists. It seems to happen only when the pd.Series has dtype object. If it contains NAs but has its dtype declared as Float64 or Int64, it runs just fine.
From what I found, the problem is in the function mask_missing, in .\pandas\core\missing.py (as @NaveenKaidbettu stated), on the line:
new_mask = arr == x
In this situation arr has type numpy.ndarray and x is an int. The evaluation is expected to return a BooleanArray, but it returns a single bool instead, which raises the exception @johnmantios posted above. I still don't know why.
Should I send a pull request with failing tests?
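To illustrate the kind of NA-aware comparison the mask needs for object arrays, here is a minimal sketch. mask_equal_skipping_na is a hypothetical helper written only for this example; it is not pandas' mask_missing and not the actual fix, and it assumes a one-dimensional object array.

```python
import numpy as np
import pandas as pd


def mask_equal_skipping_na(arr, x):
    """Hypothetical helper: boolean mask of entries equal to x, with pd.NA treated as no match."""
    mask = np.zeros(arr.shape, dtype=bool)
    for i, val in enumerate(arr):
        if val is pd.NA:
            # An unknown value never counts as a match.
            continue
        mask[i] = bool(val == x)
    return mask


arr = np.array([pd.NA, 1, 2], dtype=object)
mask = mask_equal_skipping_na(arr, 1)
```

Comparing element by element keeps the result a boolean array of the same shape as arr, which is what the code after the comparison appears to expect.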