- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
#### Problem description
When trying to replace all `pd.NA` values with `None` using `where(...)` on a `pd.Series` with dtype `pd.Int64Dtype`, `None` values are automatically coerced to `pd.NA` in a way that is inconsistent with other dtypes. Additionally, the `try_cast` flag doesn't control this behavior.
#### Expected Output
I expect that when `try_cast=False`, `None` will not be coerced into `pd.NA`, and the dtype of the series will instead be cast to `object` to accommodate the newly introduced `None` value, as occurs for other dtypes.
#### Example
In the following:
```python
s = pd.Series([1, 2, 3, pd.NA], dtype=pd.Int64Dtype())
s.where(pd.notnull(s), None)
```
the `None` I'm trying to replace `pd.NA` with is automatically cast back into `pd.NA`. This occurs even though `try_cast` is set to `False` by default, which I would expect to control this casting behavior.
```
0       1
1       2
2       3
3    <NA>
dtype: Int64
```
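The report above can be condensed into a standalone repro (a minimal sketch; the exact rendering of the output may vary slightly between pandas versions):

```python
import pandas as pd

# Nullable Int64 series containing a missing value.
s = pd.Series([1, 2, 3, pd.NA], dtype=pd.Int64Dtype())

# Attempt to replace pd.NA with None: the masked dtype treats None as
# just another missing-value sentinel, so it is coerced back to pd.NA
# and the dtype stays Int64 rather than upcasting to object.
result = s.where(pd.notnull(s), None)
print(result.dtype)             # Int64
print(result.iloc[3] is pd.NA)  # True
```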
This is in contrast to the behavior of other dtypes, like numpy's float dtype:
```python
s = pd.Series([1, 2, 3, np.nan], dtype=np.dtype("float64"))
s.where(pd.notnull(s), None)
```
which outputs the series with the dtype coerced to `object` after incorporating `None`:
```
0       1
1       2
2       3
3    None
dtype: object
```
#### Output of `pd.show_versions()`
Comment From: jorisvandenbossche
If you convert it to `object` dtype first, then it actually replaces NA with None:
```python
>>> s.astype(object).where(pd.notnull(s), None)
0       1
1       2
2       3
3    None
dtype: object
```
Personally, I might find it good behaviour that it tries to preserve the dtype by default. But it's certainly inconsistent with how it works for other dtypes.
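The workaround above can be wrapped in a small helper (a sketch; `replace_na_with_none` is an illustrative name, not a pandas API):

```python
import pandas as pd

def replace_na_with_none(s: pd.Series) -> pd.Series:
    # Cast to object *before* calling where(), so that None is stored
    # as-is instead of being coerced back to the dtype's NA sentinel.
    return s.astype(object).where(pd.notnull(s), None)

s = pd.Series([1, 2, 3, pd.NA], dtype="Int64")
result = replace_na_with_none(s)
print(result.dtype)    # object
print(result.iloc[3])  # None
```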
Comment From: jorisvandenbossche
The question is a bit: which values are considered "valid" for the current dtype, and which not? Because, for example, for the numpy integer dtype, we also coerce the `other` value where possible, to some extent. Here I pass a float, but it still results in an int series:
```python
In [51]: s = pd.Series([1, 2, 3])

In [52]: s.where(s == 1, 2.0)
Out[52]:
0    1
1    2
2    2
dtype: int64
```
And only when the value cannot be interpreted as an int does it get upcast to float or object dtype.
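That cutoff can be checked by passing a float with no exact integer representation (a minimal sketch against a plain numpy-backed series; exact upcasting rules can differ between pandas versions):

```python
import pandas as pd

s = pd.Series([1, 2, 3])  # plain numpy int64 series

# 2.5 cannot be represented as an int, so instead of coercing the
# replacement value, pandas upcasts the result to float64.
result = s.where(s == 1, 2.5)
print(result.dtype)     # float64
print(result.tolist())  # [1.0, 2.5, 2.5]
```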
Now, for the nullable integer dtype, we actually raise an error if the value doesn't fit in integers:
```python
In [62]: s = pd.Series([1, 2, 3], dtype="Int64")

In [63]: s.where(s == 1, 2.5)
...
TypeError: cannot safely cast non-equivalent object to int64
```
(so in that light coercing None to NA certainly makes sense)
Comment From: jorisvandenbossche
Related `where` issue: https://github.com/pandas-dev/pandas/issues/24144
Comment From: jbrockmendel
@jorisvandenbossche is right, the solution here is to cast to object dtype
Comment From: phofl
Closing as expected behavior: masked dtypes don't upcast when setting incompatible values.
Comment From: jbrockmendel
I think this is worth keeping open, as the lack of consistency with other dtypes is a pretty common complaint.
Comment From: phofl
I'd rather open an issue about documenting this? I was planning on doing this anyway after the RC for 2.0 is out.
Comment From: jbrockmendel
The issue isn't the docs, it's the desired behavior. I'd be open to changing/deprecating the non-nullable behavior so it doesn't cast, but I think we should be consistent throughout.