Pandas version checks

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [x] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

series = pd.Series([None,1,2,None,3,4,None])

series.mask(series <= 2, -99)
"""
0     NaN
1   -99.0
2   -99.0
3     NaN
4     3.0
5     4.0
6     NaN
dtype: float64
"""

series = series.convert_dtypes()
series.mask(series <= 2, -99)
"""
0    -99
1    -99
2    -99
3    -99
4      3
5      4
6    -99
dtype: Int64
"""

series = series.convert_dtypes(dtype_backend='pyarrow')
series.mask(series <= 2, -99)
"""
0    -99
1    -99
2    -99
3    -99
4      3
5      4
6    -99
dtype: int64[pyarrow]
"""

Issue Description

When Series.mask is called on a Series with a NumPy dtype, entries that are np.nan are left untouched. For a Series with a pandas nullable dtype or a PyArrow dtype, however, entries that are pd.NA are replaced by the fill value. This inconsistency makes the outcome difficult to predict.

Expected Behavior

import pandas as pd

series = pd.Series([None,1,2,None,3,4,None], dtype='int64[pyarrow]')
series.mask(series <= 2, -99)

"""
0    <NA>
1    -99
2    -99
3    <NA>
4      3
5      4
6    <NA>
dtype: int64[pyarrow]
"""
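
Until mask() behaves this way, the expected output can be obtained by explicitly filling the missing entries of the condition, so that they count as "do not replace". This is a minimal workaround sketch, not a fix in pandas itself. It uses the nullable Int64 dtype so that PyArrow is not required, and the names cond and result are only illustrative.

```python
import pandas as pd

series = pd.Series([None, 1, 2, None, 3, 4, None], dtype="Int64")

# Comparing a nullable Series yields <NA> wherever the input is missing.
# Treat those positions as False so that mask() leaves them untouched.
cond = (series <= 2).fillna(False)

result = series.mask(cond, -99)
print(result)
```

The same pattern should apply to an int64[pyarrow] Series, although that variant is not exercised here.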

Installed Versions

INSTALLED VERSIONS
------------------
commit : 0691c5cf90477d3503834d983f69350f250a6ff7
python : 3.9.18
python-bits : 64
OS : Linux
OS-release : 3.10.0-1160.15.2.el7.x86_64
Version : #1 SMP Wed Feb 3 15:06:38 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.3
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
pip : 24.3.1
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.1
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : 5.2.2
matplotlib : 3.9.2
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : 2.9.9
pymysql : None
pyarrow : 19.0.0
pyreadstat : 1.2.8
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlsxwriter : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Comment From: sanggon6107

Hi @kartoria, I think this is because:

  1. For missing entries, a comparison such as 'series <= 2' returns pandas.NA with pandas.Int64Dtype and int64[pyarrow], whereas it returns False with float64. (The examples here use 'series < 2'.)
import pandas as pd

series = pd.Series([None,1,2,None,3,4,None])
bool_1 = series < 2
print(bool_1)
"""
0    False
1     True
2    False
3    False
4    False
5    False
6    False
dtype: bool
"""
print(type(bool_1[0]))
"""
<class 'numpy.bool'>
"""

series = pd.Series([None,1,2,None,3,4,None])
series = series.convert_dtypes()
bool_2 = series < 2
print(bool_2)

"""
0     <NA>
1     True
2    False
3     <NA>
4    False
5    False
6     <NA>
dtype: boolean
"""
print(type(bool_2[0]))
"""
<class 'pandas._libs.missing.NAType'>
"""

series = pd.Series([None,1,2,None,3,4,None])
series = series.convert_dtypes(dtype_backend='pyarrow')
bool_3 = series < 2
print(bool_3)
"""
0     <NA>
1     True
2    False
3     <NA>
4    False
5    False
6     <NA>
dtype: bool[pyarrow]
"""
print(type(bool_3[0]))
"""
<class 'pandas._libs.missing.NAType'>
"""
  2. The condition is then passed through NDFrame.fillna(inplace), which replaces its pandas.NA entries with the value of the 'inplace' argument. You can see that the issue doesn't appear when mask() is called with inplace=True. I think we need to investigate whether this behaviour is intentional, and whether changing the code here would have any unexpected results.
series = pd.Series([None,1,2,None,3,4,None])
series = series.convert_dtypes(dtype_backend='pyarrow')
series.mask(series <= 2, -99, inplace=True)
print(series)
"""
0    <NA>
1     -99
2     -99
3    <NA>
4       3
5       4
6    <NA>
dtype: int64[pyarrow]
"""
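
The two points in this comment can be reproduced in isolation. The sketch below assumes, following the comment, that mask() inverts the condition and then fills its missing entries with the value of inplace. It imitates that step with public API calls only and is not the actual pandas implementation. The variable names are illustrative.

```python
import pandas as pd

# A nullable boolean condition, like the result of `series <= 2`
# on a Series that contains missing values.
cond = pd.Series([pd.NA, True, False], dtype="boolean")

# Inverting a nullable boolean Series keeps <NA> as <NA>.
inverted = ~cond

# Fill the missing entries with False and with True, standing in for
# inplace=False and inplace=True (an assumption taken from the comment).
filled_false = inverted.fillna(False)
filled_true = inverted.fillna(True)

print(inverted.tolist())
print(filled_false.tolist())
print(filled_true.tolist())
```

If the assumption holds, the missing positions end up on opposite sides of the condition depending on inplace, which would explain why the two calls give different results.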

I'll do further investigation and make a PR if there's a nice way to fix this.

Comment From: sanggon6107

take

Comment From: rhshadrach

Thanks for the report! Agreed with the OP's expectation. cc @jorisvandenbossche @WillAyd to double check.

Comment From: WillAyd

Thanks for the report. We should not be filling the pd.NA values; those should propagate through.

Comment From: alherrera-cs

Hi, I'd like to work on this issue. I can investigate the behavior of Series.mask() regarding pd.NA and propose a solution to standardize its behavior across different dtypes. Please let me know if I can be assigned to this issue.

Comment From: sanggon6107

Hi @alherrera-cs, sure! Since my PR hasn't been approved yet, I think you can take it if you have a nice idea.