Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
>>> import decimal
>>> import pandas as pd
>>> data = pd.DataFrame(
...     [
...         ['a', 1, decimal.Decimal('1')],
...         ['a', 2, decimal.Decimal('2')],
...         ['b', 3, decimal.Decimal('3')],
...         ['b', 4, decimal.Decimal('4')]
...     ], columns=['dim', 'float', 'decimal']
... )
>>> data
  dim  float decimal
0   a      1       1
1   a      2       2
2   b      3       3
3   b      4       4
>>> # for pd.Series
>>> type(data['float'].mean())
numpy.float64
>>> type(data['decimal'].mean())
numpy.float64
>>> type(data['decimal'].sum() / len(data['decimal']))
decimal.Decimal
>>> # for DataFrame
>>> type(data.sum()['decimal'])
decimal.Decimal
>>> type(data.mean()['decimal'])
numpy.float64
Issue Description
When mean is computed over a column of decimal.Decimal values, it returns a numpy.float64, whereas sum over the same column returns a decimal.Decimal. The result of mean should also be a decimal.Decimal.
Expected Behavior
>>> type(data['decimal'].mean())
decimal.Decimal
>>> type(data.mean()['decimal'])
decimal.Decimal
Installed Versions
Comment From: evgepab
take
Comment From: eirinikafourou
take
Comment From: haiyashah
take
Comment From: haiyashah
Here is a workaround. We can define a custom function that converts the output of numpy.mean() to decimal.Decimal.
Code:
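import decimal

import numpy as np
import pandas as pd

def mean_decimal(arr):
    # np.mean(arr) yields a float for the Decimal column; str() keeps the float's short repr
    return decimal.Decimal(str(np.mean(arr)))

# Example usage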
data = pd.DataFrame(
[
['a', 1, decimal.Decimal('1')],
['a', 2, decimal.Decimal('2')],
['b', 3, decimal.Decimal('3')],
['b', 4, decimal.Decimal('4')]
], columns=['dim', 'float', 'decimal']
)
# For pd.Series
print(type(data['float'].mean()))  # numpy.float64
print(type(mean_decimal(data['decimal'])))  # decimal.Decimal
# For DataFrame
print(type(data.mean()['decimal']))  # numpy.float64
print(type(data.sum()['decimal']))  # decimal.Decimal
Explanation:
The function mean_decimal() takes an array as input and returns the mean as a decimal.Decimal. It converts the mean to a string using str() and passes that string to the decimal.Decimal() constructor, so the result has the expected type. Note that the mean itself is still computed as a float, so only the return type changes, not the precision of the calculation.
We can then use this function to calculate the mean of the 'decimal' column in our DataFrame and get a decimal.Decimal back:
Usage:
print(type(mean_decimal(data['decimal'])))  # decimal.Decimal
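For comparison, here is a minimal sketch of a mean that stays in Decimal arithmetic from start to finish, following the sum() / len() pattern from the original report. The names decimal_mean and float_round_trip_mean are hypothetical helpers written for this illustration, not pandas or NumPy APIs; float_round_trip_mean only imitates the float-then-str conversion of the workaround above without depending on NumPy.

```python
from decimal import Decimal


def decimal_mean(values):
    """Mean of Decimal values, computed entirely in Decimal arithmetic."""
    values = list(values)
    if not values:
        raise ValueError("decimal_mean() requires at least one value")
    # Start the sum from Decimal(0) so no int or float enters the arithmetic.
    return sum(values, Decimal(0)) / len(values)


def float_round_trip_mean(values):
    """Mean computed in float, then converted back to Decimal via str()."""
    values = list(values)
    return Decimal(str(sum(float(v) for v in values) / len(values)))


# One value carries more digits than a float can represent.
values = [Decimal("1.0000000000000000000001"), Decimal("2")]

exact = decimal_mean(values)
lossy = float_round_trip_mean(values)

print(exact)  # 1.50000000000000000000005
print(lossy)  # 1.5
```

Because decimal_mean only iterates over its input and calls len(), it should also accept a pandas Series of Decimal objects, though that is an assumption rather than something tested here. The design point is that a float round trip fixes the return type but silently drops any digits beyond float precision.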
Comment From: rhshadrach
We currently do the computation (via NumPy object dtype) as Decimal and then convert the result to float. So if we choose to return Decimal here, I think we can do so without a performance regression.
Comment From: jbrockmendel
I'd suggest using dtype = pd.ArrowDtype(pa.decimal128(32)) here (the precision of 32 is an arbitrary choice).