Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
series = pd.Series([None,1,2,None,3,4,None])
series.mask(series <= 2, -99)
"""
0 NaN
1 -99.0
2 -99.0
3 NaN
4 3.0
5 4.0
6 NaN
dtype: float64
"""
series = series.convert_dtypes()
series.mask(series <= 2, -99)
"""
0 -99
1 -99
2 -99
3 -99
4 3
5 4
6 -99
dtype: Int64
"""
series = series.convert_dtypes(dtype_backend='pyarrow')
series.mask(series <= 2, -99)
"""
0 -99
1 -99
2 -99
3 -99
4 3
5 4
6 -99
dtype: int64[pyarrow]
"""
Issue Description
When Series.mask is applied to a Series with a NumPy dtype, positions holding np.nan are left untouched. For a Series with a pandas nullable or PyArrow dtype, however, positions holding pd.NA are replaced as well. This inconsistency makes the result hard to predict.
Expected Behavior
import pandas as pd
series = pd.Series([None,1,2,None,3,4,None], dtype='int64[pyarrow]')
series.mask(series <= 2, -99)
"""
0 <NA>
1 -99
2 -99
3 <NA>
4 3
5 4
6 <NA>
dtype: int64[pyarrow]
"""
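A possible workaround until the behavior is made consistent, sketched under the assumption that missing positions in the condition should simply be treated as "do not replace": fill the NA entries of the condition with False before passing it to mask. The variable names are illustrative, and Int64 is used instead of int64[pyarrow] so that the sketch runs without PyArrow installed.

```python
import pandas as pd

series = pd.Series([None, 1, 2, None, 3, 4, None], dtype="Int64")

# For nullable dtypes the comparison yields <NA> at missing positions.
cond = series <= 2

# Treat "unknown" as "do not replace" by filling <NA> with False,
# so mask() only touches positions where the condition is really True.
result = series.mask(cond.fillna(False), -99)
print(result)
```

Because the condition no longer contains any NA, the missing values in the original Series are kept as they are.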
Installed Versions
Comment From: sanggon6107
Hi @kartoria, I think this is because:
- At missing positions, a comparison such as 'series <= 2' (the examples below use 'series < 2') returns pandas.NA for pandas.Int64Dtype and int64[pyarrow], whereas it returns False for float64.
series = pd.Series([None,1,2,None,3,4,None])
bool_1 = series < 2
print(bool_1)
"""
0 False
1 True
2 False
3 False
4 False
5 False
6 False
dtype: bool
"""
print(type(bool_1[0]))
"""
<class 'numpy.bool'>
"""
series = pd.Series([None,1,2,None,3,4,None])
series = series.convert_dtypes()
bool_2 = series < 2
print(bool_2)
"""
0 <NA>
1 True
2 False
3 <NA>
4 False
5 False
6 <NA>
dtype: boolean
"""
print(type(bool_2[0]))
"""
<class 'pandas._libs.missing.NAType'>
"""
series = pd.Series([None,1,2,None,3,4,None])
series = series.convert_dtypes(dtype_backend='pyarrow')
bool_3 = series < 2
print(bool_3)
"""
0 <NA>
1 True
2 False
3 <NA>
4 False
5 False
6 <NA>
dtype: bool[pyarrow]
"""
print(type(bool_3[0]))
"""
<class 'pandas._libs.missing.NAType'>
"""
- The condition is filled via NDFrame.fillna(inplace), which replaces pandas.NA in the condition with the value of the 'inplace' argument. You can see that the issue doesn't appear when you call mask() with inplace=True. I think we need to investigate whether this behaviour is intentional, and whether changing the code here would cause any unexpected results.
series = pd.Series([None,1,2,None,3,4,None])
series = series.convert_dtypes(dtype_backend='pyarrow')
series.mask(series <= 2, -99, inplace=True)
print(series)
"""
0 <NA>
1 -99
2 -99
3 <NA>
4 3
5 4
6 <NA>
dtype: int64[pyarrow]
"""
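The effect described in the second point can be reproduced from the outside. The sketch below assumes that mask(cond) behaves like where(~cond) and that missing entries of the inverted condition are filled with the boolean value of inplace. It only illustrates the mechanism and is not the actual pandas implementation. Int64 is used so that PyArrow is not required.

```python
import pandas as pd

series = pd.Series([None, 1, 2, None, 3, 4, None], dtype="Int64")

# mask(cond) is treated here as where(~cond); ~<NA> stays <NA>.
keep = ~(series <= 2)

# Filling <NA> with False (the inplace=False case): missing rows count as
# "do not keep", so they are overwritten with -99.
filled_false = series.where(keep.fillna(False), -99)

# Filling <NA> with True (the inplace=True case): missing rows count as
# "keep", so <NA> survives.
filled_true = series.where(keep.fillna(True), -99)

print(filled_false)
print(filled_true)
```

The first result matches the output reported in the issue, and the second matches the inplace=True output above, which is consistent with the fill value being what decides how missing rows are handled.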
I'll do further investigation and make a PR if there's a nice way to fix this.
Comment From: sanggon6107
take
Comment From: rhshadrach
Thanks for the report! Agreed with the OP's expectation. cc @jorisvandenbossche @WillAyd to double check.
Comment From: WillAyd
Thanks for the report. We should not be filling the pd.NA values; those should propagate through.
Comment From: alherrera-cs
Hi, I'd like to work on this issue. I can investigate the behavior of Series.mask() regarding pd.NA and propose a solution to standardize its behavior across different dtypes. Please let me know if I can be assigned to this issue.
Comment From: sanggon6107
Hi @alherrera-cs, sure! Since my PR hasn't been approved, I think you can take it if you have a nice idea.