Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
for extension_dtype in ["Int64", "Float64"]:
    pandas_float_data = pd.Series([1, None, 3], dtype=extension_dtype)
    numpy_float_data = pd.Series([1, 2, None], dtype="float64")
    subtracted_floats = pandas_float_data.subtract(numpy_float_data)
    # Resulting series is Float64 extension dtype
    assert str(subtracted_floats.dtype) == "Float64"
    # The null value from numpy_float_data is np.nan
    np_nan_value = subtracted_floats.iloc[-1]
    assert np.isnan(np_nan_value)
    assert np_nan_value is not pd.NA
    # The null value from pandas_float_data is pd.NA
    pd_na_value = subtracted_floats.iloc[-2]
    assert pd_na_value is pd.NA
    # Only the pd.NA null value is dropped - doesn't recognize np.nan
    dropped_nans = subtracted_floats.dropna()
    assert len(dropped_nans) == 2
Issue Description
There are two problems here:
1. Subtracting data with the pandas extension dtypes from data with numpy dtype float64 can result in np.nan being present in the resulting Float64 data if a null value was present in the original float64 data. This is incorrect, as I would expect to only have pd.NA as the null value in a Float64 series.
2. Because there is an unexpected null value in the resulting Float64 series, Series.dropna doesn't recognize that null value and will not drop it from the Series.
Note: The behavior is the same whether we subtract the pandas dtype from the numpy dtype or vice versa. It is also the same if you use a.subtract(b) or a - b.
Expected Behavior
The following should work:
for extension_dtype in ["Int64", "Float64"]:
    pandas_float_data = pd.Series([1, None, 3], dtype=extension_dtype)
    numpy_float_data = pd.Series([1, 2, None], dtype="float64")
    subtracted_floats = pandas_float_data - numpy_float_data
    # Resulting series is Float64 extension dtype
    assert str(subtracted_floats.dtype) == "Float64"
    # The null value from numpy_float_data is pd.NA
    np_nan_value = subtracted_floats.iloc[-1]
    # assert np.isnan(np_nan_value)
    assert np_nan_value is pd.NA
    # The null value from pandas_float_data is pd.NA
    pd_na_value = subtracted_floats.iloc[-2]
    assert pd_na_value is pd.NA
    # All null values are dropped
    dropped_nans = subtracted_floats.dropna()
    assert len(dropped_nans) == 1
Installed Versions
Comment From: phofl
Hi, thanks for your report. This is as expected for now; there is an ongoing discussion about NA and NaN in Float64.
Comment From: tamargrey
@phofl I can understand the presence of both NA and nan as being expected, but how is the dropna behavior expected here? Shouldn't it be able to recognize any and all null values and remove them all? I'd be happy to open up a more specific issue if needed.
Is there a better way for me to be removing nans in this situation? If it's converting to "float64" or replacing nan with NA and then dropping, I don't see how it could be the expected behavior.
Comment From: phofl
There are also open issues about that (I think more related to fillna). As I said, it’s still an ongoing discussion.
Comment From: tamargrey
Got it. I see this has been ongoing for quite a while in https://github.com/pandas-dev/pandas/issues/32265.
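On the question raised earlier in the thread about removing every null in the meantime, one possible workaround is sketched below. It is not an official recommendation from the pandas maintainers. It assumes that casting to NumPy float64 is acceptable for the data, since that cast represents pd.NA as np.nan so that dropna sees every null the same way. The helper name drop_all_nulls is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def drop_all_nulls(series: pd.Series) -> pd.Series:
    """Drop both pd.NA and np.nan entries from a nullable numeric Series.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Casting to NumPy float64 turns pd.NA into np.nan, so every null
    # value ends up with the same representation.
    as_numpy_float = series.astype("float64")
    # dropna() recognizes np.nan in a plain float64 Series.
    cleaned = as_numpy_float.dropna()
    # Cast back so callers still receive the nullable Float64 dtype.
    return cleaned.astype("Float64")


pandas_data = pd.Series([1, None, 3], dtype="Int64")
numpy_data = pd.Series([1, 2, None], dtype="float64")

result = drop_all_nulls(pandas_data - numpy_data)
print(result)
```

With the inputs from the report, only the first row (1 - 1) has a non-null value on both sides, so a single row should remain. The round trip through float64 is a design trade-off: it sidesteps the NA/NaN distinction entirely, at the cost of an extra copy of the data.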