Currently there is some inconsistency in the scalar type returned from a Series aggregation, both in whether it is a numpy or a python type and in the behavior for an empty Series - see the table below.
Normally this isn't a big deal, as the numpy types mostly behave like the python types, but it can be an issue with serialization, which is where I ran into this.
Is the desired behavior to make all of these python types?
function | type | type_empty |
---|---|---|
sum | <class 'float'> | <class 'int'> |
mean | <class 'float'> | <class 'float'> |
median | <class 'float'> | <class 'float'> |
count | <class 'numpy.int32'> | <class 'numpy.int32'> |
var | <class 'float'> | <class 'float'> |
std | <class 'float'> | <class 'float'> |
sem | <class 'numpy.float64'> | <class 'numpy.float64'> |
nunique | <class 'int'> | <class 'int'> |
prod | <class 'numpy.float64'> | <class 'float'> |
min | <class 'numpy.float64'> | <class 'float'> |
max | <class 'numpy.float64'> | <class 'float'> |
Code for the table:

```python
import pandas as pd

fxs = ['sum', 'mean', 'median', 'count', 'var', 'std', 'sem', 'nunique', 'prod', 'min', 'max']
s = pd.Series([1., 2., 3.])
s_empty = pd.Series([], dtype='f8')

data = []
for f in fxs:
    row = dict(function=f)
    # type of the aggregation result on a non-empty Series
    res = getattr(s, f)()
    row['type'] = type(res)
    # type of the same aggregation on an empty Series
    res = getattr(s_empty, f)()
    row['type_empty'] = type(res)
    data.append(row)

pd.DataFrame(data).to_html(index=False)
```
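To make the serialization issue concrete, here is a minimal sketch using only the stdlib `json` module (the numpy scalars stand in for the `count` and `sem` results above):

```python
import json

import numpy as np

# np.float64 subclasses Python float, so json serializes it fine...
json.dumps({'sem': np.float64(0.5)})  # '{"sem": 0.5}'

# ...but numpy integer scalars do not subclass int, so this raises a
# TypeError ("Object of type int32 is not JSON serializable").
try:
    json.dumps({'count': np.int32(3)})
except TypeError as exc:
    print(exc)
```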
Comment From: jreback
yes this would be good to make consistent. scalars should always be python scalars; numpy scalars are a weird hybrid that can have odd effects.
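For illustration, a minimal sketch of that hybrid behavior: numpy float scalars pass an `isinstance` check against `float`, while numpy integer scalars fail the corresponding check against `int`:

```python
import numpy as np

isinstance(np.float64(1.5), float)  # True  - np.float64 subclasses float
isinstance(np.int32(3), int)        # False - numpy ints don't subclass int

# .item() converts any numpy scalar to its plain Python equivalent.
type(np.int32(3).item())            # <class 'int'>
```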
Comment From: chris-b1
Extra test case from #19381
```python
import pandas as pd
import json
import datetime

data = [
    datetime.date(1987, 2, 12),
    datetime.date(1987, 2, 12),
    datetime.date(1987, 2, 12),
    None,
    None,
    datetime.date(1989, 6, 15),
    datetime.date(1989, 6, 15),
    datetime.date(1989, 6, 15),
    datetime.date(1989, 6, 15),
    datetime.date(1989, 6, 15),
    datetime.date(1989, 6, 15),
    datetime.date(1989, 6, 15),
    datetime.date(1989, 6, 15),
    datetime.date(1989, 6, 15),
    datetime.date(1989, 6, 15)
]

df = pd.DataFrame(columns=['foo'])
df['foo'] = data

ds = df['foo'].describe()
d = ds.to_dict()
# json.dumps raises a TypeError here: the describe() output contains
# numpy scalars (count, unique, freq) and a datetime.date (top), none
# of which the stdlib json module can serialize.
j = json.dumps(d)
print(j)
```
Comment From: torlenor
This is still an issue and, as the original poster described, it leads to trouble with serialization when using packages that don't know how to handle numpy types.
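Until the return types are made consistent, one common workaround (a sketch, not a pandas API) is a `json.JSONEncoder` subclass that converts numpy scalars via `.item()`:

```python
import json

import numpy as np

class NumpyScalarEncoder(json.JSONEncoder):
    # Hypothetical helper: fall back to .item() for numpy scalars.
    def default(self, obj):
        # np.generic is the common base class of all numpy scalar types;
        # .item() returns the equivalent plain Python object.
        if isinstance(obj, np.generic):
            return obj.item()
        return super().default(obj)

json.dumps({'count': np.int32(3)}, cls=NumpyScalarEncoder)  # '{"count": 3}'
```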
Comment From: jreback
@torlenor pull requests for tests and patches are always welcome, and are how things get fixed