Code Sample, a copy-pastable example if possible
In [31]: import numpy as np
In [32]: import pandas as pd
In [33]: df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])
In [34]: df
Out[34]:
   A  B
0  0  1
1  2  3
2  4  5
In [35]: df.agg(np.std) # Behavior of ddof=0
Out[35]:
A 1.632993
B 1.632993
dtype: float64
In [36]: df.agg([np.std]) # Behavior of ddof=1
Out[36]:
       A    B
std  2.0  2.0
In [37]: # So how to get the ddof=0 behavior when giving a list of functions?
In [39]: df.agg([lambda x: np.std(x)]) # This gives a numerically unexpected result.
Out[39]:
         A        B
  <lambda> <lambda>
0      0.0      0.0
1      0.0      0.0
2      0.0      0.0
In [40]: df.agg([np.mean, lambda x: np.std(x)]) # This gives an error.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-40-52f4ec4195b5> in <module>()
----> 1 df.agg([np.mean, lambda x: np.std(x)])
/Users/ikeda/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/pandas/core/frame.py in aggregate(self, func, axis, *args, **kwargs)
4740 if axis == 0:
4741 try:
-> 4742 result, how = self._aggregate(func, axis=0, *args, **kwargs)
4743 except TypeError:
4744 pass
/Users/ikeda/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
537 return self._aggregate_multiple_funcs(arg,
538 _level=_level,
--> 539 _axis=_axis), None
540 else:
541 result = None
/Users/ikeda/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/pandas/core/base.py in _aggregate_multiple_funcs(self, arg, _level, _axis)
594 # if we are empty
595 if not len(results):
--> 596 raise ValueError("no results")
597
598 try:
ValueError: no results
Problem description
When using, e.g., df.agg, the ddof (delta degrees of freedom) value used by np.std changes depending on how the function is passed (as a single function or as a list of functions), which is likely to confuse many users. I believe the behavior should be unified in some way.
Furthermore, I could not find a way to obtain the np.std result with ddof=0 when np.std is supplied as one member of a list of functions. A lambda expression does not work in a list of functions (it gives numerically unexpected results or raises an error). This prevents us from using many useful methods such as df.agg, df.apply, and df.describe when we want the ddof=0 behavior.
From https://github.com/pandas-dev/pandas/issues/13344, I gather that the developers prefer the ddof=1 behavior in pandas, so the expected behavior should be as below.
Expected Output
In [35]: df.agg(np.std) # Behavior of ddof=1
Out[35]:
A 2.0
B 2.0
dtype: float64
In [38]: df.agg([lambda x: np.std(x)]) # To obtain the ddof=0 results
Out[38]:
                 A         B
<lambda>  1.632993  1.632993
In [41]: df.agg([np.mean, lambda x: np.std(x)])
                 A         B
mean      2.000000  3.000000
<lambda>  1.632993  1.632993
Output of pd.show_versions()
Comment From: mroeschke
This looks to work on master. Could use a test.
In [15]: df.agg(np.std)
Out[15]:
A 2.0
B 2.0
dtype: float64
Comment From: jcrvz
It works by doing:
df.agg([lambda x: np.mean(x), lambda x: np.std(x, ddof=0)])
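The workaround above relies on passing ddof explicitly rather than trusting a default. A minimal sketch of the same idea that avoids the list-of-functions dispatch altogether is shown here. The helper name std_pop and the variable names are illustrative assumptions, not part of pandas, and the sketch uses only DataFrame.std, DataFrame.mean, and DataFrame.apply:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["A", "B"])

# Population standard deviation (ddof=0), which is NumPy's default.
pop_std = df.std(ddof=0)

# Sample standard deviation (ddof=1), which is pandas' default.
sample_std = df.std(ddof=1)


def std_pop(col):
    """Hypothetical helper: standard deviation with ddof pinned to 0.

    Converting to a plain ndarray first means np.std cannot dispatch
    back to Series.std and pick up a different default.
    """
    return np.std(col.to_numpy(), ddof=0)


# Build a small summary table with one row per statistic.
summary = pd.DataFrame({"mean": df.mean(), "std_pop": df.apply(std_pop)}).T
print(summary)
```

Pinning ddof inside a named function also gives the result row a readable label instead of <lambda>.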
Comment From: sappersapper
I encountered a related inconsistency:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': ['a', 'a', 'b', 'b'], 'value': [1, 3, 5, 7]})
df.groupby('id').agg(np.var) # calc with ddof=1
| id | value |
|---|---|
| a | 2 |
| b | 2 |
df.groupby('id').apply(np.var) # calc with ddof=0
| id | value |
|---|---|
| a | 1 |
| b | 1 |
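One way to sidestep the difference between agg and apply in the groupby case is to call the groupby var method directly and state ddof explicitly. The following is a minimal sketch under that assumption; the variable names are illustrative, and it does not explain or change how agg and apply dispatch np.var:

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", "a", "b", "b"], "value": [1, 3, 5, 7]})

grouped = df.groupby("id")["value"]

# Sample variance (ddof=1): divides by n - 1.
sample_var = grouped.var(ddof=1)

# Population variance (ddof=0): divides by n.
pop_var = grouped.var(ddof=0)

print(sample_var)
print(pop_var)
```

With ddof spelled out, the result no longer depends on which default the chosen entry point happens to use.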