EDIT: After more investigation, I found the root cause of the issue described below: https://github.com/pandas-dev/pandas/issues/50125#issuecomment-1342886489
Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.rand(2,2).astype(np.float32))
print("original float32 data:\n", df.to_dict())
df = df.round(3)
print("to_records() after rounding:\n", df.to_records())
print("to_dict() after rounding:\n", df.to_dict())
Issue Description
The reproducible example returns:
original float32 data:
{0: {0: 0.3745401203632355, 1: 0.7319939136505127}, 1: {0: 0.9507142901420593, 1: 0.5986585021018982}}
to_records() after rounding:
[(0, 0.375, 0.951) (1, 0.732, 0.599)]
to_dict() after rounding:
{0: {0: 0.375, 1: 0.7319999933242798}, 1: {0: 0.9509999752044678, 1: 0.5989999771118164}}
We can see that .round(3) has been properly carried through .to_records() but not through .to_dict(). to_dict() does not return the original data either, but something like the rounding result with extra decimals. This does not happen when using np.float64.
(Similar to issue #35124)
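For illustration, a minimal sketch (not part of the original report, assuming only NumPy) showing that the extra decimals come from converting a rounded np.float32 value to a Python float:
import numpy as np

x = np.float32(0.732)  # a "rounded" float32 value
print(x)               # 0.732  (the float32 repr hides the inexactness)
print(float(x))        # 0.7319999933242798, the value to_dict() shows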
Expected Behavior
Result when using np.float64:
original float64 data:
{0: {0: 0.3745401188473625, 1: 0.7319939418114051}, 1: {0: 0.9507143064099162, 1: 0.5986584841970366}}
to_records() after rounding:
[(0, 0.375, 0.951) (1, 0.732, 0.599)]
to_dict() after rounding:
{0: {0: 0.375, 1: 0.732}, 1: {0: 0.951, 1: 0.599}}
Installed Versions
Comment From: ghost
This demonstrates that the problem is somewhere in Series.items():
serie = pd.Series(np.random.rand(3).astype(np.float32))
serie = serie.round(3)
[(k, v) for k, v in serie.items()]
[(0, 0.29100000858306885), (1, 0.6119999885559082), (2, 0.13899999856948853)]
We can go one level below and find the problem in iter:
list(iter(serie))
[0.29100000858306885, 0.6119999885559082, 0.13899999856948853]
Ruling out numpy as a suspect:
list(iter(np.random.rand(3).astype(np.float32).round(3)))
[0.171, 0.065, 0.949]
Comment From: ghost
Clearly the problem is that iter(serie) casts to float instead of keeping np.float32:
[(v, type(v)) for v in iter(serie)]
[(0.375, float), (0.9509999752044678, float), (0.7319999933242798, float)]
That said, I don't understand where this casting happens.
I looked at serie.__getitem__, which sent me to serie._get_value(), but that one behaves properly:
type(serie._get_value(1)), serie._get_value(1)
(numpy.float32, 0.951)
So, what is called by iter on a Series?
This really bugs me!
[serie.__getitem__(i) for i in range(3)]
[0.375, 0.951, 0.732]
[i for i in serie]
[0.375, 0.9509999752044678, 0.7319999933242798]
Comment From: ghost
Yes, found it!
It's in the parent class base.IndexOpsMixin. There we find the __iter__ method, which in our case returns:
map(self._values.item, range(self._values.size))
self._values.item() is a NumPy function that casts the np.float32 to a Python float.
Indeed:
serie._values[1], serie._values.item(1)
(0.951, 0.9509999752044678)
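The same behaviour can be reproduced with a plain NumPy array, outside of pandas; a minimal sketch mimicking what that __iter__ does:
import numpy as np

arr = np.random.rand(3).astype(np.float32).round(3)
# ndarray.item(i) returns the element as the closest *Python* scalar (float),
# which is what map(self._values.item, range(self._values.size)) produces
print([arr.item(i) for i in range(arr.size)])  # Python floats with the extra decimals
print(list(arr))                               # np.float32 scalars, which keep the rounded value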
To summarise:
Series.to_dict() calls Series.items(), which iterates over the Series, and that iteration is not done via Series.__getitem__ but via the parent's __iter__, which casts our np.float32 to float.
Is it really intended that iter on a Series calls the __iter__ from base.IndexOpsMixin instead of __getitem__ from Series?
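As a quick sanity check (just an introspection sketch; the exact qualified name may differ between pandas versions), one can confirm where Series.__iter__ is actually defined:
import pandas as pd

# On the affected versions this reports IndexOpsMixin.__iter__ rather than
# anything defined on Series itself
print(pd.Series.__iter__.__qualname__)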
Comment From: ghost
Closing the issue based on the following conclusions:
- to_dict() must return Python's built-in types.
- Converting from np.float32 to Python's float explicitly shows the floating point accuracy limitations, which are "hidden" when working with NumPy objects.
- Hence a rounded float32 value of 0.951 becomes 0.9509999752044678 when converted to a Python float.
Thus, this problem is hardly avoidable, unless we add an argument to to_dict() to optionally return the values in their original (NumPy) types. That could be a new feature (also applicable to some other I/O functions).
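As a practical workaround given these conclusions (a sketch, not an official recommendation): upcasting to np.float64 before rounding makes the Python floats returned by to_dict() match the rounded values, mirroring the float64 behaviour shown in the expected output above:
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.rand(2, 2).astype(np.float32))
# Round in float64 so the later float32 -> float conversion no longer applies
print(df.astype(np.float64).round(3).to_dict())
# expected: {0: {0: 0.375, 1: 0.732}, 1: {0: 0.951, 1: 0.599}}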