EDIT: After more investigation, I found the root cause of the issue described below: https://github.com/pandas-dev/pandas/issues/50125#issuecomment-1342886489
Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.rand(2,2).astype(np.float32))
print("original float32 data:\n", df.to_dict())
df = df.round(3)
print("to_records() after rounding:\n", df.to_records())
print("to_dict() after rounding:\n", df.to_dict())
Issue Description
The reproducible example returns:
original float32 data:
{0: {0: 0.3745401203632355, 1: 0.7319939136505127}, 1: {0: 0.9507142901420593, 1: 0.5986585021018982}}
to_records() after rounding:
[(0, 0.375, 0.951) (1, 0.732, 0.599)]
to_dict() after rounding:
{0: {0: 0.375, 1: 0.7319999933242798}, 1: {0: 0.9509999752044678, 1: 0.5989999771118164}}
We can see that .round(3) has been properly carried through .to_records() but not through .to_dict(). to_dict() does not return the original data either, but something like the rounding result with extra decimals. This does not happen when using np.float64.
(Similar to issue #35124)
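For illustration, a minimal sketch (not part of the original report, assuming only NumPy) showing that the extra decimals come from converting a rounded np.float32 value to a Python float:
import numpy as np

x = np.float32(0.732)  # a "rounded" float32 value
print(x)               # 0.732  (the float32 repr hides the inexactness)
print(float(x))        # 0.7319999933242798, the value to_dict() shows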
Expected Behavior
Result when using np.float64:
original float64 data:
{0: {0: 0.3745401188473625, 1: 0.7319939418114051}, 1: {0: 0.9507143064099162, 1: 0.5986584841970366}}
to_records() after rounding:
[(0, 0.375, 0.951) (1, 0.732, 0.599)]
to_dict() after rounding:
{0: {0: 0.375, 1: 0.732}, 1: {0: 0.951, 1: 0.599}}
Installed Versions
Comment From: ghost
This demonstrates that the problem is somewhere in Series.items():
serie = pd.Series(np.random.rand(3).astype(np.float32))
serie = serie.round(3)
[(k, v) for k, v in serie.items()]
[(0, 0.29100000858306885), (1, 0.6119999885559082), (2, 0.13899999856948853)]
We can go one level below and find the problem in iter:
list(iter(serie))
[0.29100000858306885, 0.6119999885559082, 0.13899999856948853]
Ruling out numpy as a suspect:
list(iter(np.random.rand(3).astype(np.float32).round(3)))
[0.171, 0.065, 0.949]
Comment From: ghost
Clearly the problem is that iter(serie) casts to float instead of keeping np.float32:
[(v, type(v)) for v in iter(serie)]
[(0.375, float), (0.9509999752044678, float), (0.7319999933242798, float)]
That said, I don't understand where this casting happens.
I looked at serie.__getitem__, which sent me to serie._get_value(), but that one behaves properly:
type(serie._get_value(1)), serie._get_value(1)
(numpy.float32, 0.951)
So, what is called by iter on a Series?
This really bugs me!
[serie.__getitem__(i) for i in range(3)]
[0.375, 0.951, 0.732]
[i for i in serie]
[0.375, 0.9509999752044678, 0.7319999933242798]
Comment From: ghost
Yes, found it!
It's in the parent class base.IndexOpsMixin. There we find the __iter__ method, which in our case returns:
map(self._values.item, range(self._values.size))
self._values.item() is a NumPy function that casts the np.float32 to a Python float.
Indeed:
serie._values[1], serie._values.item(1)
(0.951, 0.9509999752044678)
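The same behaviour can be reproduced with a plain NumPy array, outside of pandas; a minimal sketch mimicking what that __iter__ does:
import numpy as np

arr = np.random.rand(3).astype(np.float32).round(3)
# ndarray.item(i) returns the element as the closest *Python* scalar (float),
# which is what map(self._values.item, range(self._values.size)) produces
print([arr.item(i) for i in range(arr.size)])  # Python floats with the extra decimals
print(list(arr))                               # np.float32 scalars, which keep the rounded value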
To summarise:
Series.to_dict() calls Series.items(), which iterates over the Series, and that iteration is not done via Series.__getitem__ but via the parent's __iter__, which casts our np.float32 to float.
Is it really intended that iter on a Series calls the __iter__ from base.IndexOpsMixin instead of __getitem__ from Series?
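As a quick sanity check (just an introspection sketch; the exact qualified name may differ between pandas versions), one can confirm where Series.__iter__ is actually defined:
import pandas as pd

# On the affected versions this reports IndexOpsMixin.__iter__ rather than
# anything defined on Series itself
print(pd.Series.__iter__.__qualname__)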
Comment From: ghost
Closing the issue based on the following conclusions:
- to_dict() must return Python's built-in types.
- Converting from np.float32 to Python's float explicitly shows the floating point accuracy limitations, which are "hidden" when working with NumPy objects.
- Hence a rounded float32 value of 0.951 becomes 0.9509999752044678 when converted to a Python float.
Thus, this problem is hardly avoidable, unless we add an argument to to_dict() to optionally return the values in their original (NumPy) types. That could be a new feature (also applicable to some other I/O functions).
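As a practical workaround given these conclusions (a sketch, not an official recommendation): upcasting to np.float64 before rounding makes the Python floats returned by to_dict() match the rounded values, mirroring the float64 behaviour shown in the expected output above:
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.rand(2, 2).astype(np.float32))
# Round in float64 so the later float32 -> float conversion no longer applies
print(df.astype(np.float64).round(3).to_dict())
# expected: {0: {0: 0.375, 1: 0.732}, 1: {0: 0.951, 1: 0.599}}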