We try to consistently return Python objects (instead of numpy scalars) in certain functions like tolist, to_dict, and itertuples/items (we have had quite a few issues fixing this in several cases).
However, we currently don't do that for extension dtypes (and don't have any mechanism to ask for it):
In [33]: type(pd.Series([1, 2], dtype='int64').tolist()[0])
Out[33]: int
In [34]: type(pd.Series([1, 2], dtype='Int64').tolist()[0])
Out[34]: numpy.int64
In [36]: type(pd.Series([1, 2], dtype='int64').to_dict()[0])
Out[36]: int
In [37]: type(pd.Series([1, 2], dtype='Int64').to_dict()[0])
Out[37]: numpy.int64
In [45]: s = pd.Series([1, 2], dtype='int64')
In [46]: type(list(s.iteritems())[0][1])
Out[46]: int
In [47]: s = pd.Series([1, 2], dtype='Int64')
In [48]: type(list(s.iteritems())[0][1])
Out[48]: numpy.int64
Should we add some API to ExtensionArray to provide this? Eg a method to iterate through the elements that returns "native" objects?
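For illustration, one purely hypothetical shape such an API could take; to_pylist here is a made-up name, not an existing pandas method, and this is only a sketch of the idea:

import pandas as pd

# Hypothetical helper (not an existing pandas API): convert an
# ExtensionArray's elements to Python-native scalars.
def to_pylist(arr):
    # numpy scalars expose .item(); anything else (e.g. pd.NA) is passed through unchanged
    return [x.item() if hasattr(x, "item") else x for x in arr]

to_pylist(pd.array([1, 2], dtype="Int64"))  # -> [1, 2] as Python ints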
Comment From: jorisvandenbossche
Actually, for Series, __iter__ also returns native types, so maybe fixing that is enough: ensure that __iter__ on IntegerArray etc. does return Python objects (i.e. by not simply reusing __getitem__, which correctly returns numpy scalars):
In [57]: s = pd.Series([1, 2], dtype='int64')
In [58]: type(s[0])
Out[58]: numpy.int64
In [59]: type(list(iter(s))[0])
Out[59]: int
In [60]: s = pd.Series([1, 2], dtype='Int64')
In [61]: type(s[0])
Out[61]: numpy.int64
In [62]: type(list(iter(s))[0])
Out[62]: numpy.int64 # <-- fixing this might fix the other cases?
Comment From: jorisvandenbossche
Let's consider this a bug in __iter__ then, because I think all mentioned cases can be solved with (for IntegerArray):
--- a/pandas/core/arrays/integer.py
+++ b/pandas/core/arrays/integer.py
@@ -456,7 +456,7 @@ class IntegerArray(ExtensionArray, ExtensionOpsMixin):
if self._mask[i]:
yield self.dtype.na_value
else:
- yield self._data[i]
+ yield self._data[i].item()
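With a change along those lines, I would expect the earlier Int64 example to behave like the plain int64 case (hypothetical output, assuming the patch above):
In [63]: s = pd.Series([1, 2], dtype='Int64')
In [64]: type(list(iter(s))[0])
Out[64]: int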
Comment From: marco-neumann-by
So ExtensionArray.__iter__ should return native types then? Is this also true for ExtensionArray.__getitem__ with an integer? I think it should at least be documented then so that other (external) implementations can get this right.
Comment From: jorisvandenbossche
Is this also true for ExtensionArray.getitem with some integer?
If we mimic what a Series with a plain numpy dtype does, then __getitem__ should keep returning the numpy scalar.
Comment From: victorbr92
Hi @marco-neumann-jdas and @jorisvandenbossche, are you working on this issue? Could I help somehow and make a PR for it (it was marked as a good first issue)? I tested the changes proposed by @jorisvandenbossche locally and they indeed fix the reported issue.
Comment From: marco-neumann-by
I am not working on it.
Keep in mind that this issue is not only about fixing the behavior of IntegerArray, but also about adjusting the docs of ExtensionArray.__iter__ to state that every EA should return Python-native types.
Comment From: mroeschke
Similarly, should pd.NA (ExtensionDtype.na_value) be converted to None as the native equivalent?
In [1]: pd.Series([1, 2, None], dtype="Int64").tolist()
Out[1]: [1, 2, <NA>] # should <NA> be None?
Comment From: Peque
Stumbled upon this issue when trying to serialize a DataFrame resulting from .isocalendar():
import json
from pandas import date_range
df = date_range("2021-01-01", freq="D", periods=7).isocalendar()
json.dumps(df.to_dict(orient="list"))
# TypeError: Object of type uint32 is not JSON serializable
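(For anyone hitting this on an affected pandas version, a minimal workaround sketch using only the standard-library json module: pass a default converter so the non-serializable scalars are coerced to int.)

import json
from pandas import date_range

df = date_range("2021-01-01", freq="D", periods=7).isocalendar()
# default=int is called for any object json can't serialize (here numpy.uint32)
json.dumps(df.to_dict(orient="list"), default=int)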
Of course, this can be reduced to an example like those presented above by @jorisvandenbossche:
type(pd.Series([1, 2], dtype='UInt32').tolist()[0])
# numpy.uint32
Just leaving this comment in case someone looks for "isocalendar" or "not JSON serializable".
Comment From: lukemanley
@Peque - your example cases have been fixed on the main branch and will be included in 2.0. On main, your first example works without raising and your second example returns int.
Comment From: lukemanley
Similarly, should pd.NA (ExtensionDtype.na_value) be converted to None as the native equivalent?
@mroeschke - was there ever an answer to this? pd.NA is still being returned.
Comment From: Peque
@lukemanley Great to know and thanks for sharing! :blush:
Comment From: mroeschke
@mroeschke - was there ever an answer to this? pd.NA is still being returned
Personally I think it makes sense to convert pd.NA to None as the Python native type. @jorisvandenbossche might have thoughts on this as well.
Comment From: lukemanley
#50796 changed Series.to_dict to return None in place of pd.NA, which seems to be the preferred behavior as it is consistent with returning native types. I took a look at doing the same for tolist and __iter__.
I'll note that changing the value for __iter__ would cause the following list comprehension to start raising:
arr = pd.array([1, 2, pd.NA], dtype="Int64")
[v+1 for v in arr] # <- TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
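(If __iter__ did yield None, user code would need an explicit guard; a hedged sketch of the kind of adjustment callers would have to make:)

import pandas as pd
arr = pd.array([1, 2, pd.NA], dtype="Int64")
[v + 1 if v is not None else None for v in arr]  # handle the missing value explicitly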
I'm curious whether that is too big of a change. Since series.tolist() == list(series) is tested behavior, I suspect tolist and __iter__ need to remain consistent with each other.
Any suggestions for moving this forward from here? It's probably not ideal that to_dict and tolist are inconsistent with pd.NA at the moment.
cc @phofl @mroeschke
Comment From: lukemanley
One more comment here. I think there may be a case for making tolist return None in place of pd.NA but not __iter__. This would make tolist consistent with to_dict. It would also be consistent with how numpy and pyarrow operate: both have .tolist methods that return Python-native types and __iter__ methods that return non-native types when iterating through an array.
If we were to take this approach, the currently tested series.tolist() == list(series) behavior would change.
Example with pyarrow:
In [1]: import pyarrow as pa
In [2]: arr = pa.array([1, None])
In [3]: arr.tolist()
Out[3]: [1, None]
In [4]: list(arr)
Out[4]: [<pyarrow.Int64Scalar: 1>, <pyarrow.Int64Scalar: None>]
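For comparison, the analogous numpy behavior (output shown for a platform where the default integer dtype is int64):
In [5]: import numpy as np
In [6]: a = np.array([1, 2])
In [7]: type(a.tolist()[0])
Out[7]: int
In [8]: type(list(a)[0])
Out[8]: numpy.int64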
Comment From: mroeschke
My initial feeling is that __iter__ and list should both return Python types, with the justification that if a user calls list, I think they would expect the elements to behave like native Python objects.
Comment From: lukemanley
My initial feeling is that __iter__ and list should both return Python types, with the justification that if a user calls list, I think they would expect the elements to behave like native Python objects.
Sorry for all the questions. Just want to confirm that you're talking about both Series.__iter__ and EA.__iter__ (e.g. BaseMaskedArray). I ask because with non-EA dtypes, there is a native/non-native type difference between Series.__iter__ and Series.values.__iter__:
import pandas as pd
ser = pd.Series([1])
for v in ser:
    print(type(v))  # -> <class 'int'>
for v in ser.values:
    print(type(v))  # -> <class 'numpy.int64'>
Comment From: mroeschke
Series.values
is already a numpy array, so I think there's an understanding that np.array.__iter__
is not yield similar Python native types as Series.__iter__
, so yes I would expect Series.__iter__
and EA.__iter__
to align.
Note I think Series[datetime64/timedelta64]
is the only time where we return pandas objects which make sense due to Timestamp
and Timedelta
objects holding more resolution that datetime.datetime
, `datetime.timedelta.
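For reference, the datetime64 case mentioned above, where iteration intentionally yields pandas objects:

import pandas as pd
list(pd.Series(pd.to_datetime(["2021-01-01"])))
# -> [Timestamp('2021-01-01 00:00:00')]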
Comment From: lukemanley
Thanks. One concern with replacing pd.NA with None in __iter__ is that code like this will start to break:
import pandas as pd
idx = pd.Index([1, 2, pd.NA], dtype="Int64")
ser = pd.Series(1, index=idx)
for v in ser.index:
    print(ser[v])  # -> KeyError: None
or simply:
import pandas as pd
idx = pd.Index([1, 2, pd.NA], dtype="Int64")
[v in idx for v in idx] # -> [True, True, False]
e.g. test_array_interface:
import pandas as pd
import numpy as np
arr = pd.array([1, 2, pd.NA], dtype="Int64")
np.array(arr) == np.array(list(arr)) # -> [True, True, False]
The test suite alone has hundreds of failures when replacing pd.NA with None in __iter__. If the test suite is indicative of what users may experience, I suspect this would be a big change and maybe not a desirable one given the examples above. Might there be a case for returning native types for non-NA values but still returning pd.NA for missing values? That happens to be the behavior of ArrowExtensionArray at the moment:
import pandas as pd
arr = pd.array([1, pd.NA], dtype='int64[pyarrow]')
for v in arr:
    print(type(v))
# <class 'int'>
# <class 'pandas._libs.missing.NAType'>
If you still think __iter__ should return None, I will go through the test suite and see how much actually needs to change to get everything to pass.
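For concreteness, a minimal sketch of that intermediate option (Python-native scalars for valid values, pd.NA for missing), assuming a masked array that stores _data/_mask as in the IntegerArray diff earlier in this thread; this is not the current pandas implementation:

def __iter__(self):
    for i in range(len(self)):
        if self._mask[i]:
            # keep pd.NA (self.dtype.na_value) for missing values
            yield self.dtype.na_value
        else:
            # convert the numpy scalar to a Python-native scalar
            yield self._data[i].item()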