Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd

s1 = pd.Series({"a": np.int64(64), "b": 10})
for v in s1.to_dict().values():
    print(type(v))  # prints <class 'int'> twice

s2 = pd.Series({"a": np.int64(64), "b": 10, "c": "ABC"})
for v in s2.to_dict().values():
    print(type(v))  # prints <class 'numpy.int64'> for the first value, "a"

for k, v in s1.items():
    print(k, type(v))  # prints <class 'int'> twice

for k, v in s2.items():
    print(k, type(v))  # prints <class 'numpy.int64'> again for the first value, "a"
Problem description
pd.Series.to_dict can return different value types depending on the composition of the Series. This also affects iteration, e.g., for k, v in series.items(): ... This is inconsistent and, critically, leads to hard-to-debug type issues downstream, especially around JSON conversion (the built-in json module and many other serializers raise an error when they encounter NumPy scalar types).
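Until the coercion is consistent, one possible workaround is to normalize the values explicitly before serializing. The following is only a minimal sketch: to_native is a hypothetical helper name (not part of pandas or NumPy), and it assumes that unwrapping NumPy scalars with their .item() method is acceptable for the data at hand.

```python
import json

import numpy as np
import pandas as pd


def to_native(value):
    """Hypothetical helper: unwrap NumPy scalars, pass everything else through."""
    # np.generic is the common base class of NumPy scalar types such as np.int64.
    if isinstance(value, np.generic):
        return value.item()  # .item() returns the equivalent built-in Python scalar
    return value


s2 = pd.Series({"a": np.int64(64), "b": 10, "c": "ABC"})

# Normalize every value explicitly instead of relying on Series.to_dict coercion.
native = {key: to_native(value) for key, value in s2.to_dict().items()}

# With only built-in types left, the standard json module can serialize the dict.
payload = json.dumps(native)
print(payload)
```

Because values that are already built-in Python objects pass through unchanged, the helper should behave the same whether or not a given pandas version performs the coercion itself.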
I cannot find this exact issue open in the issue tracker, though there are a number of related issues including:
* An issue related to DataFrame.to_dict and inconsistent types (closed in 0.24): https://github.com/pandas-dev/pandas/issues/24908
* This issue also related to scalar coercion on DataFrame.to_dict calls (also closed recently): https://github.com/pandas-dev/pandas/issues/23753
* This PR fixes the issue deriving from iteration, but it looks like the above case is either an untested edge case or a regression: https://github.com/pandas-dev/pandas/pull/17491
Expected Output
The expected output is for coercion to Python ints to occur regardless of the exact composition of the Series. https://github.com/pandas-dev/pandas/issues/24908 is a related issue in which DataFrame coercions behave irregularly.
Output of pd.show_versions()
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.4.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
Comment From: mroeschke
This probably occurs because s2 is object dtype and it's trying to preserve the dtype of each input argument, while the arguments in s1 can both be coerced to int64.
Investigation and PRs welcome.
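The dtype difference described above can be checked directly. This is a small illustrative sketch using the two Series from the report; it only inspects the inferred dtypes and makes no claim about the internal code path that to_dict or items() takes.

```python
import numpy as np
import pandas as pd

s1 = pd.Series({"a": np.int64(64), "b": 10})
s2 = pd.Series({"a": np.int64(64), "b": 10, "c": "ABC"})

# Both values in s1 fit a single integer dtype, so the Series is stored as int64.
print(s1.dtype)

# The string in s2 forces object dtype, so each element is kept as an
# individual Python object rather than as part of a typed integer array.
print(s2.dtype)

# Inspect the element types actually held by the object-dtype Series.
print([type(v).__name__ for v in s2.to_numpy()])
```

In an object-dtype Series, whatever scalar type was passed in can survive unchanged, which is consistent with np.int64 showing up for "a" in the report.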
Comment From: drew-heenan
I'm having a go at this issue - quick note @boydgreenfield, it looks like iterating over a Series object as in the last two loops in your example results in an iteration only over the int values in the Series. Did you mean to iterate over s1.items() or similar?
Comment From: boydgreenfield
@drew-heenan Yes, you're right, I meant .items(). I have updated the above code snippet. Thanks for taking a look at the issue!
Comment From: simonjayhawkins
from https://github.com/pandas-dev/pandas/pull/37648#issue-516125361
This resolves the issue of return types from to_dict. #25969 also discusses return types from .items(), which relates to an outstanding NumPy issue numpy/numpy#14139, and I don't address that part here atm
Comment From: ghost
Some findings about the root cause of this casting issue on Series.items(): https://github.com/pandas-dev/pandas/issues/50125#issuecomment-1342886489