We try to consistently return Python objects (instead of numpy scalars) from certain methods such as tolist, to_dict, itertuples/items, etc. (we have already had quite a few issues fixing this in individual cases).
However, currently we don't do that for extension dtypes (and don't have any mechanism to ask for this):
In [33]: type(pd.Series([1, 2], dtype='int64').tolist()[0])
Out[33]: int
In [34]: type(pd.Series([1, 2], dtype='Int64').tolist()[0])
Out[34]: numpy.int64
In [36]: type(pd.Series([1, 2], dtype='int64').to_dict()[0])
Out[36]: int
In [37]: type(pd.Series([1, 2], dtype='Int64').to_dict()[0])
Out[37]: numpy.int64
In [45]: s = pd.Series([1, 2], dtype='int64')
In [46]: type(list(s.iteritems())[0][1])
Out[46]: int
In [47]: s = pd.Series([1, 2], dtype='Int64')
In [48]: type(list(s.iteritems())[0][1])
Out[48]: numpy.int64
Should we add some API to ExtensionArray to provide this? E.g., a method that iterates through the elements and returns "native" objects?
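For anyone who wants to check how their installed pandas behaves, here is a minimal sketch that collects the element type seen through each access path. The helper name `element_types` is made up for illustration and is not pandas API. The result for the `Int64` series depends on the pandas version, so no output is claimed for it here.

```python
import pandas as pd


def element_types(ser):
    """Return the type name of the first element as seen through each access path."""
    return {
        "iloc": type(ser.iloc[0]).__name__,
        "tolist": type(ser.tolist()[0]).__name__,
        "to_dict": type(ser.to_dict()[0]).__name__,
        "iter": type(next(iter(ser))).__name__,
        "items": type(next(iter(ser.items()))[1]).__name__,
    }


# numpy-backed dtype: positional access gives a numpy scalar,
# the conversion/iteration methods give built-in ints.
numpy_backed = element_types(pd.Series([1, 2], dtype="int64"))

# masked extension dtype: compare this dict against the one above.
masked_backed = element_types(pd.Series([1, 2], dtype="Int64"))

print(numpy_backed)
print(masked_backed)
```

Any key where the two dicts disagree is a place where the extension dtype does not yet follow the numpy-backed behavior.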
Comment From: jorisvandenbossche
Actually, for Series, __iter__ also returns native types, so maybe fixing that is enough: ensure that __iter__ on IntegerArray etc. returns Python objects (i.e. by not simply delegating to __getitem__, which correctly returns numpy scalars):
In [57]: s = pd.Series([1, 2], dtype='int64')
In [58]: type(s[0])
Out[58]: numpy.int64
In [59]: type(list(iter(s))[0])
Out[59]: int
In [60]: s = pd.Series([1, 2], dtype='Int64')
In [61]: type(s[0])
Out[61]: numpy.int64
In [62]: type(list(iter(s))[0])
Out[62]: numpy.int64 # <-- fixing this might fix the other cases?
Comment From: jorisvandenbossche
Let's consider this a bug in __iter__ then, because I think all mentioned cases can be solved with (for IntegerArray):
--- a/pandas/core/arrays/integer.py
+++ b/pandas/core/arrays/integer.py
@@ -456,7 +456,7 @@ class IntegerArray(ExtensionArray, ExtensionOpsMixin):
             if self._mask[i]:
                 yield self.dtype.na_value
             else:
-                yield self._data[i]
+                yield self._data[i].item()
Comment From: marco-neumann-by
So ExtensionArray.__iter__ should return native types then? Is this also true for ExtensionArray.__getitem__ with some integer? I think it should at least be documented then so that other (external) implementations can get this right.
Comment From: jorisvandenbossche
> Is this also true for ExtensionArray.__getitem__ with some integer?
If we mimic what Series with plain numpy dtype does, then __getitem__ should keep returning the numpy scalar.
Comment From: victorbr92
Hi @marco-neumann-jdas and @jorisvandenbossche, are you working on this issue? Could I help and make a PR for it (it was marked as a good first issue)? I tested the changes proposed by @jorisvandenbossche locally and they do fix the reported issue.
Comment From: marco-neumann-by
I am not working on it.
Keep in mind that this issue is not only about fixing the behavior of IntegerArray but also about adjusting the docs of ExtensionArray.__iter__ to state that EVERY EA should return Python-native types.
Comment From: mroeschke
Similarly, should pd.NA (ExtensionDtype.na_value) be converted to None as the native equivalent?
In [1]: pd.Series([1, 2, None], dtype="Int64").tolist()
Out[1]: [1, 2, <NA>] # should <NA> be None?
Comment From: Peque
Stumbled upon this issue when trying to serialize a DataFrame resulting from .isocalendar():
import json
from pandas import date_range
df = date_range("2021-01-01", freq="D", periods=7).isocalendar()
json.dumps(df.to_dict(orient="list"))
# TypeError: Object of type uint32 is not JSON serializable
Of course, this can be reduced to an example like those presented above by @jorisvandenbossche:
type(pd.Series([1, 2], dtype='UInt32').tolist()[0])
# numpy.uint32
Just leaving this comment in case someone looks for "isocalendar" or "not JSON serializable".
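For pandas versions that still hand back numpy scalars here, one possible workaround (a sketch, not something proposed in this thread) is to give json.dumps a `default` hook that unwraps numpy scalars. The function name `numpy_scalar_default` and the sample payload are made up for illustration.

```python
import json

import numpy as np


def numpy_scalar_default(obj):
    """Fallback for json.dumps: unwrap numpy scalars into built-in Python values."""
    if isinstance(obj, np.generic):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Stand-in for the kind of values df.to_dict(orient="list") returned above.
payload = {"week": [np.uint32(53), np.uint32(1)]}

encoded = json.dumps(payload, default=numpy_scalar_default)
print(encoded)  # {"week": [53, 1]}
```

The hook is only called for objects the encoder cannot handle itself, so regular ints, floats and strings pass through untouched.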
Comment From: lukemanley
@Peque - your example cases have been fixed on the main branch and will be included in 2.0. On main, your first example works without raising and your second example returns int.
Comment From: lukemanley
> Similarly, should pd.NA (ExtensionDtype.na_value) be converted to None as the native equivalent?
@mroeschke - was there ever an answer to this? pd.NA is still being returned
Comment From: Peque
@lukemanley Great to know and thanks for sharing! :blush:
Comment From: mroeschke
> @mroeschke - was there ever an answer to this? pd.NA is still being returned
Personally I think it makes sense to convert pd.NA to None as the Python native type. @jorisvandenbossche might have thoughts on this as well
Comment From: lukemanley
#50796 changed Series.to_dict to return None in place of pd.NA, which seems to be the preferred behavior as it is consistent with returning native types. I took a look at doing the same for tolist and __iter__.
I'll note that changing the value for __iter__ would cause the following list comprehension to start raising:
arr = pd.array([1, 2, pd.NA], dtype="Int64")
[v+1 for v in arr] # <- TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
I'm curious if that is too big of a change? Since series.tolist() == list(series) is tested behavior I suspect tolist and __iter__ need to remain consistent with each other.
Any suggestions for moving this forward from here? It's probably not ideal that to_dict and tolist are inconsistent with pd.NA at the moment.
cc @phofl @mroeschke
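The difference behind that TypeError can be shown with plain lists, independent of what any array's __iter__ currently yields. This is only a small sketch of the two scalar behaviors: pd.NA propagates through arithmetic, None does not support it.

```python
import pandas as pd

with_na = [1, 2, pd.NA]
with_none = [1, 2, None]

# pd.NA propagates through "+", so the comprehension succeeds.
incremented = [v + 1 for v in with_na]

# None does not support "+", so the same comprehension raises.
try:
    [v + 1 for v in with_none]
    none_raised = False
except TypeError:
    none_raised = True

print(incremented)
print(none_raised)
```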
Comment From: lukemanley
One more comment here. I think there may be a case for making tolist return None in place of pd.NA but not __iter__. This would make tolist consistent with to_dict. It would also be consistent with how numpy and pyarrow operate. Both numpy and pyarrow have .tolist methods that return python native types and __iter__ methods that return non-native types when iterating through an array.
If we were to take this approach, the currently tested series.tolist() == list(series) behavior would change.
Example with pyarrow:
In [1]: arr = pa.array([1, None])
In [2]: arr.tolist()
Out[2]: [1, None]
In [3]: list(arr)
Out[3]: [<pyarrow.Int64Scalar: 1>, <pyarrow.Int64Scalar: None>]
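The numpy side of the same comparison, as a short sketch: tolist converts to built-in Python values, while iterating the array yields numpy scalars.

```python
import numpy as np

arr = np.array([1, 2], dtype="int64")

from_tolist = arr.tolist()  # elements are built-in ints
from_iter = list(arr)       # elements are numpy.int64 scalars

print(type(from_tolist[0]))
print(type(from_iter[0]))
```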
Comment From: mroeschke
My initial feeling is that __iter__ and list should both return Python types, with the justification that if a user calls list, I think they would expect the elements to behave like native Python objects.
Comment From: lukemanley
> My initial feeling is that both __iter__ and list should both return python types with justification that if a user calls list I think they would expect the elements to behave like native Python objects
Sorry for all the questions.
Just want to confirm that you're talking about both Series.__iter__ and EA.__iter__ (e.g. BaseMaskedArray). I ask because with non-EA, there is a native/non-native type difference between Series.__iter__ and Series.values.__iter__:
import pandas as pd
ser = pd.Series([1])
for v in ser:
    print(type(v)) # -> <class 'int'>
for v in ser.values:
    print(type(v)) # -> <class 'numpy.int64'>
Comment From: mroeschke
Series.values is already a numpy array, so I think there's an understanding that np.array.__iter__ does not yield Python native types the way Series.__iter__ does. So yes, I would expect Series.__iter__ and EA.__iter__ to align.
Note that I think Series[datetime64/timedelta64] is the only case where we return pandas objects, which makes sense because Timestamp and Timedelta objects hold more resolution than datetime.datetime and datetime.timedelta.
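A small sketch of the resolution point: a Timestamp can carry nanoseconds, whereas datetime.datetime stops at microseconds, so converting drops the extra digits. The timestamp value is an arbitrary example.

```python
import pandas as pd

ts = pd.Timestamp("2021-01-01 00:00:00.123456789")

# warn=False suppresses the warning about discarding nonzero nanoseconds.
py_dt = ts.to_pydatetime(warn=False)

print(ts.nanosecond)      # 789
print(py_dt.microsecond)  # 123456
```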
Comment From: lukemanley
Thanks. One concern with replacing pd.NA with None in __iter__ is that code like this will start to break:
import pandas as pd
idx = pd.Index([1, 2, pd.NA], dtype="Int64")
ser = pd.Series(1, index=idx)
for v in ser.index:
    print(ser[v]) # -> KeyError: None
or simply:
import pandas as pd
idx = pd.Index([1, 2, pd.NA], dtype="Int64")
[v in idx for v in idx] # -> [True, True, False]
or, from the test suite (e.g. test_array_interface):
import pandas as pd
import numpy as np
arr = pd.array([1, 2, pd.NA], dtype="Int64")
np.array(arr) == np.array(list(arr)) # -> [True, True, False]
The test suite alone has hundreds of failures when replacing pd.NA with None in __iter__. If the test suite is indicative of what users may experience, I suspect this would be a big change and maybe not a desirable one given the examples above. Might there be a case to return native types for non-na values but still return pd.NA for missing values? That happens to be the behavior of ArrowExtensionArray at the moment:
import pandas as pd
arr = pd.array([1, pd.NA], dtype='int64[pyarrow]')
for v in arr:
    print(type(v))
# <class 'int'>
# <class 'pandas._libs.missing.NAType'>
If you still think __iter__ should return None I will go through the test suite and see how much actually needs to change to get everything to pass.
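To compare the two candidate policies side by side, here is a rough sketch of a converter with a switch for the missing-value treatment. `to_native` and its `na_as_none` flag are hypothetical names for illustration only, not existing or proposed pandas API.

```python
import numpy as np
import pandas as pd


def to_native(values, na_as_none=False):
    """Unwrap numpy scalars; keep pd.NA unless na_as_none is set (hypothetical helper)."""
    out = []
    for v in values:
        if v is pd.NA:
            # Policy switch: keep the pandas sentinel or use Python's None.
            out.append(None if na_as_none else pd.NA)
        elif isinstance(v, np.generic):
            out.append(v.item())
        else:
            out.append(v)
    return out


raw = [np.int64(1), np.int64(2), pd.NA]

keep_na = to_native(raw)                       # native ints, pd.NA preserved
none_for_na = to_native(raw, na_as_none=True)  # native ints, None for missing

print(keep_na)
print(none_for_na)
```

With `na_as_none=False` the result matches the "native for non-NA, pd.NA for missing" option; with `na_as_none=True` it matches what to_dict does after #50796.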