We try to consistently return Python objects (instead of numpy scalars) from certain methods such as tolist, to_dict, itertuples/items, etc. (we have had quite a few issues fixing this in several cases).

However, currently we don't do that for extension dtypes (and don't have any mechanism to ask for this):

In [33]: type(pd.Series([1, 2], dtype='int64').tolist()[0]) 
Out[33]: int

In [34]: type(pd.Series([1, 2], dtype='Int64').tolist()[0])  
Out[34]: numpy.int64

In [36]: type(pd.Series([1, 2], dtype='int64').to_dict()[0]) 
Out[36]: int

In [37]: type(pd.Series([1, 2], dtype='Int64').to_dict()[0])
Out[37]: numpy.int64

In [45]: s = pd.Series([1, 2], dtype='int64') 

In [46]: type(list(s.iteritems())[0][1])  
Out[46]: int

In [47]: s = pd.Series([1, 2], dtype='Int64')      

In [48]: type(list(s.iteritems())[0][1])  
Out[48]: numpy.int64

Should we add some API to ExtensionArray to provide this? Eg a method to iterate through the elements that returns "native" objects?

Comment From: jorisvandenbossche

Actually, for Series, __iter__ also returns native types, so maybe fixing that is enough: ensure that __iter__ on IntegerArray etc. returns Python objects (that is, by not simply delegating to __getitem__, which correctly returns numpy scalars):

In [57]: s = pd.Series([1, 2], dtype='int64')  

In [58]: type(s[0])  
Out[58]: numpy.int64

In [59]: type(list(iter(s))[0])  
Out[59]: int

In [60]: s = pd.Series([1, 2], dtype='Int64')         

In [61]: type(s[0])    
Out[61]: numpy.int64

In [62]: type(list(iter(s))[0])  
Out[62]: numpy.int64   # <-- fixing this might fix the other cases?

Comment From: jorisvandenbossche

Let's consider this a bug in __iter__ then, because I think all mentioned cases can be solved with (for IntegerArray):

--- a/pandas/core/arrays/integer.py
+++ b/pandas/core/arrays/integer.py
@@ -456,7 +456,7 @@ class IntegerArray(ExtensionArray, ExtensionOpsMixin):
             if self._mask[i]:
                 yield self.dtype.na_value
             else:
-                yield self._data[i]
+                yield self._data[i].item()

Comment From: marco-neumann-by

So ExtensionArray.__iter__ should return native types then? Is this also true for ExtensionArray.__getitem__ with some integer? I think it should at least be documented then so that other (external) implementations can get this right.

Comment From: jorisvandenbossche

Is this also true for ExtensionArray.__getitem__ with some integer?

If we mimic what Series with plain numpy dtype does, then getitem should keep returning the numpy scalar.
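
To make the distinction concrete, here is a minimal sketch using a toy masked container rather than the real IntegerArray. `ToyMaskedArray` and its `na_value` argument are hypothetical names used only for illustration; the only real API relied on is NumPy's scalar `.item()` method.

```python
import numpy as np


class ToyMaskedArray:
    """Hypothetical stand-in for a masked extension array (not pandas code)."""

    def __init__(self, data, mask, na_value=None):
        self._data = np.asarray(data, dtype="int64")
        self._mask = np.asarray(mask, dtype=bool)
        self._na_value = na_value  # placeholder for dtype.na_value

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        # Scalar access keeps the NumPy scalar, mirroring Series[int64].
        if self._mask[i]:
            return self._na_value
        return self._data[i]

    def __iter__(self):
        # Iteration unboxes each value to a Python int via .item().
        for i in range(len(self)):
            if self._mask[i]:
                yield self._na_value
            else:
                yield self._data[i].item()


arr = ToyMaskedArray([1, 2, 3], mask=[False, False, True])
print(type(arr[0]))        # a NumPy integer scalar type
print(type(list(arr)[0]))  # the built-in int
```

Under this split, indexing stays cheap and NumPy-compatible, while anything built on iteration (such as a list conversion) sees plain Python values.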

Comment From: victorbr92

Hi @marco-neumann-jdas and @jorisvandenbossche, are you working on this issue? Could I help by making a PR for it (it was marked as a good first issue)? I tested the changes proposed by @jorisvandenbossche locally and they do fix the reported issue.

Comment From: marco-neumann-by

I am not working on it. Keep in mind that this issue is not only about fixing the behavior of IntegerArray but also about adjusting the docs of ExtensionArray.__iter__ to state that EVERY EA should return Python-native types.

Comment From: mroeschke

Similarly, should pd.NA (ExtensionDtype.na_value) be converted to None as the native equivalent?

In [1]: pd.Series([1, 2, None], dtype="Int64").tolist()
Out[1]: [1, 2, <NA>] # should <NA> be None?

Comment From: Peque

Stumbled upon this issue when trying to serialize a DataFrame resulted from .isocalendar():

import json

from pandas import date_range

df = date_range("2021-01-01", freq="D", periods=7).isocalendar()
json.dumps(df.to_dict(orient="list"))
# TypeError: Object of type uint32 is not JSON serializable

Of course, this can be reduced to an example like those presented above by @jorisvandenbossche:

type(pd.Series([1, 2], dtype='UInt32').tolist()[0])
# numpy.uint32

Just leaving this comment in case someone looks for "isocalendar" or "not JSON serializable".
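
For anyone hitting this on a pandas version without the fix, one possible workaround (a sketch, not something proposed in this thread) is to pass a `default` handler to `json.dumps` that unboxes NumPy scalars. The helper name `to_native` is hypothetical; `json.dumps(default=...)`, `np.generic` and `.item()` are existing APIs.

```python
import json

import numpy as np


def to_native(obj):
    """Fallback for json.dumps: unbox NumPy scalars, reject anything else."""
    if isinstance(obj, np.generic):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Stand-in for the kind of values df.to_dict(orient="list") produced here.
payload = {"week": [np.uint32(53), np.uint32(1)]}
encoded = json.dumps(payload, default=to_native)
print(encoded)  # {"week": [53, 1]}
```

The handler is only invoked for objects the encoder cannot serialize by itself, so values that are already native pass through untouched.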

Comment From: lukemanley

@Peque - your example cases have been fixed on the main branch and will be included in 2.0. On main, your first example works without raising and your second example returns int.

Comment From: lukemanley

Similarly, should pd.NA (ExtensionDtype.na_value) be converted to None as the native equivalent?

@mroeschke - was there ever an answer to this? pd.NA is still being returned

Comment From: Peque

@lukemanley Great to know and thanks for sharing! :blush:

Comment From: mroeschke

@mroeschke - was there ever an answer to this? pd.NA is still being returned

Personally I think it makes sense to convert pd.NA to None as the Python native type. @jorisvandenbossche might have thoughts on this as well

Comment From: lukemanley

50796 changed `Series.to_dict` to return `None` in place of `pd.NA` which seems to be the preferred behavior as it is consistent with returning native types. I took a look at doing the same for `tolist` and `__iter__`.

I'll note that changing the value yielded by __iter__ would cause the following list comprehension to start raising:

arr = pd.array([1, 2, pd.NA], dtype="Int64")

[v+1 for v in arr]  # <- TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

I'm curious if that is too big of a change? Since series.tolist() == list(series) is tested behavior I suspect tolist and __iter__ need to remain consistent with each other.

Any suggestions for moving this forward from here? It's probably not ideal that to_dict and tolist are inconsistent with pd.NA at the moment.

cc @phofl @mroeschke

Comment From: lukemanley

One more comment here. I think there may be a case for making tolist return None in place of pd.NA but not __iter__. This would make tolist consistent with to_dict. It would also be consistent with how numpy and pyarrow operate. Both numpy and pyarrow have .tolist methods that return python native types and __iter__ methods that return non-native types when iterating through an array.

If we were to take this approach, the currently tested series.tolist() == list(series) behavior would change.

Example with pyarrow:

In [1]: arr = pa.array([1, None])

In [2]: arr.tolist()
Out[2]: [1, None]

In [3]: list(arr)
Out[3]: [<pyarrow.Int64Scalar: 1>, <pyarrow.Int64Scalar: None>]
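
The "tolist converts, __iter__ does not" option can be approximated today with a standalone helper. This is only a sketch of the idea: `native_list` is a hypothetical name, not a pandas API, and it is written to tolerate pandas versions whose iteration still yields NumPy scalars.

```python
import numpy as np
import pandas as pd


def native_list(values):
    """Hypothetical helper: list of Python objects, with pd.NA mapped to None."""
    out = []
    for v in values:
        if v is pd.NA:
            out.append(None)
        elif isinstance(v, np.generic):
            out.append(v.item())  # older pandas may yield NumPy scalars here
        else:
            out.append(v)
    return out


arr = pd.array([1, 2, pd.NA], dtype="Int64")
result = native_list(arr)
print(result)
```

Because the conversion lives outside `__iter__`, code that iterates over the array and relies on pd.NA semantics is unaffected.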

Comment From: mroeschke

My initial feeling is that both __iter__ and list should both return python types with justification that if a user calls list I think they would expect the elements to behave like native Python objects

Comment From: lukemanley

My initial feeling is that both __iter__ and list should both return python types with justification that if a user calls list I think they would expect the elements to behave like native Python objects

Sorry for all the questions.

Just want to confirm that you're talking about both Series.__iter__ and EA.__iter__ (e.g. BaseMaskedArray). I ask because with non-EA, there is a native/non-native type difference between Series.__iter__ and Series.values.__iter__:

import pandas as pd

ser  = pd.Series([1])

for v in ser:
    print(type(v))      # -> <class 'int'>

for v in ser.values:
    print(type(v))      # -> <class 'numpy.int64'>

Comment From: mroeschke

Series.values is already a numpy array, so I think there's an understanding that np.ndarray.__iter__ does not yield the same Python native types as Series.__iter__. So yes, I would expect Series.__iter__ and EA.__iter__ to align.

Note I think Series[datetime64/timedelta64] is the only case where we return pandas objects, which makes sense because Timestamp and Timedelta objects hold more resolution than datetime.datetime and datetime.timedelta.

Comment From: lukemanley

Thanks. One concern with replacing pd.NA with None in __iter__ is that code like this will start to break:

import pandas as pd

idx = pd.Index([1, 2, pd.NA], dtype="Int64")
ser = pd.Series(1, index=idx)

for v in ser.index:
    print(ser[v])       # -> KeyError: None

or simply:

import pandas as pd

idx = pd.Index([1, 2, pd.NA], dtype="Int64")

[v in idx for v in idx]    # -> [True, True, False]

or, as in e.g. test_array_interface:

import pandas as pd
import numpy as np

arr = pd.array([1, 2, pd.NA], dtype="Int64")

np.array(arr) == np.array(list(arr))    # -> [True, True, False]

The test suite alone has hundreds of failures when replacing pd.NA with None in __iter__. If the test suite is indicative of what users may experience, I suspect this would be a big change and maybe not a desirable one given the examples above. Might there be a case to return native types for non-na values but still return pd.NA for missing values? That happens to be the behavior of ArrowExtensionArray at the moment:

import pandas as pd

arr = pd.array([1, pd.NA], dtype='int64[pyarrow]')

for v in arr:
    print(type(v))

# <class 'int'>
# <class 'pandas._libs.missing.NAType'>

If you still think __iter__ should return None I will go through the test suite and see how much actually needs to change to get everything to pass.

Pandas API: ExtensionArrays and conversion to "native" types (eg in tolist, to_dict, iteration, ..)
