Note that for a constant series, Series.skew() returns 0, while rolling/expanding_skew() return NaN.
In [532]: s = Series([1]*4)
In [533]: s.skew()
Out[533]: 0
In [534]: expanding_skew(s)
Out[534]:
0 NaN
1 NaN
2 NaN
3 NaN
dtype: float64
In [535]: rolling_skew(s, 4)
Out[535]:
0 NaN
1 NaN
2 NaN
3 NaN
dtype: float64
While rolling/expanding_kurt() similarly return NaN for a constant series, Series.kurt() produces an error.
In [541]: expanding_kurt(s)
Out[541]:
0 NaN
1 NaN
2 NaN
3 NaN
dtype: float64
In [542]: rolling_kurt(s, 4)
Out[542]:
0 NaN
1 NaN
2 NaN
3 NaN
dtype: float64
In [543]: s.kurt()
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-543-e41f2ec435bd> in <module>()
----> 1 s.kurt()
C:\Python34\lib\site-packages\pandas\core\generic.py in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
3783 skipna=skipna)
3784 return self._reduce(f, axis=axis,
-> 3785 skipna=skipna, numeric_only=numeric_only)
3786 stat_func.__name__ = name
3787 return stat_func
C:\Python34\lib\site-packages\pandas\core\series.py in _reduce(self, op, axis, skipna, numeric_only, filter_type, **kwds)
2007 filter_type=None, **kwds):
2008 """ perform a reduction operation """
-> 2009 return op(_values_from_object(self), skipna=skipna, **kwds)
2010
2011 def _reindex_indexer(self, new_index, indexer, copy):
C:\Python34\lib\site-packages\pandas\core\nanops.py in _f(*args, **kwargs)
46 'this dtype'.format(f.__name__.replace('nan',
47 '')))
---> 48 return f(*args, **kwargs)
49 return _f
50
C:\Python34\lib\site-packages\pandas\core\nanops.py in nankurt(values, axis, skipna)
504 D = _zero_out_fperr(D)
505
--> 506 result = (((count * count - 1.) * D / (B * B) - 3 * ((count - 1.) ** 2)) /
507 ((count - 2.) * (count - 3.)))
508 if isinstance(result, np.ndarray):
ZeroDivisionError: float division by zero
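The failing expression in the traceback can be sketched in plain Python. This is only an illustration, not the pandas implementation: it assumes that B and D in nankurt are the second and fourth central moments (that is not visible in the traceback), and the function name guarded_kurt is hypothetical. For a constant series both moments are 0, so D / (B * B) becomes 0.0 / 0.0; the sketch returns NaN whenever a denominator would be zero.

```python
import math


def guarded_kurt(values):
    """Sample excess kurtosis, returning NaN when a denominator is zero.

    Hypothetical helper: m2 and m4 play the role that B and D are assumed
    to play in the traceback's formula.
    """
    count = len(values)
    # (count - 2) * (count - 3) is zero or meaningless below 4 observations
    if count < 4:
        return math.nan
    mean = sum(values) / count
    m2 = sum((v - mean) ** 2 for v in values) / count  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / count  # fourth central moment
    # A constant series has m2 == 0, which is the 0.0 / 0.0 case
    if m2 == 0:
        return math.nan
    return (((count * count - 1.0) * m4 / (m2 * m2)
             - 3.0 * (count - 1.0) ** 2)
            / ((count - 2.0) * (count - 3.0)))


constant_result = guarded_kurt([1.0] * 4)
varying_result = guarded_kurt([1.0, 2.0, 3.0, 4.0, 5.0])
```

With this guard, the constant series gives NaN, matching what rolling/expanding_kurt() return above, rather than raising.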
Comment From: seth-p
My 2c: there really shouldn't be two different implementations of skew/kurtosis functions.
Comment From: jreback
@seth-p maybe edit the top section with a code-reference so that when this is addressed the section can be fixed (as you have commented for #7926)
Comment From: seth-p
@jreback, afraid I'm not sure I follow your last comment. Do you mean the following?
As a test for a fix, un-comment the following lines in test_moments.py:
#(mom.expanding_skew, lambda v: Series(v).skew(), 3), # restore once GH 8086 is fixed
#(mom.expanding_kurt, lambda v: Series(v).kurt(), 4), # restore once GH 8086 is fixed
#(mom.rolling_skew, lambda v: Series(v).skew(), 3), # restore once GH 8086 is fixed
#(mom.rolling_kurt, lambda v: Series(v).kurt(), 4), # restore once GH 8086 is fixed
Comment From: jreback
@seth-p I realize now why you commented these out
Comment From: jaimefrio
The solution to this is to modify the behavior of Series.skew() and Series.kurt() so that they return np.nan instead of 0, right?
Comment From: seth-p
I think so, assuming they are supposed to return unbiased estimates. Though to be honest I haven't examined them closely beyond observing the inconsistencies noted above.
Comment From: jaimefrio
I was looking into this to remove the commented tests for the rolling versions. There are a bunch of tests making sure that the return value is 0, and the code itself explicitly sets things to 0. So even if it is a bug, it was clearly considered a feature when the code was written. Not sure whether changing the behavior can be happily done without breaking someone's code...
Comment From: jreback
related to this: https://github.com/pydata/pandas/pull/7928
I wouldn't return nan unless you have all missing values or it's empty. 0 is the calculated value, no?
Comment From: seth-p
I think, ideally, the commented-out consistency tests (https://github.com/pydata/pandas/issues/8086#issuecomment-55687801) should be there. Obviously some of the cases are 0 / NaN, but others aren't.
Separately, I think we should add to _test_moments_consistency() additional parameters skew_unbiased=None, kurt_unbiased=None and skew_biased=None, kurt_biased=None, and consistency checks between them and the corresponding lower-order moments (biased or unbiased, as appropriate), i.e. comparable to this test for biased variance estimates:
if var is var_biased:
    # check that biased var(x) == mean(x^2) - mean(x)^2
    mean_x2 = mean(x * x)
    assert_equal(var_x, mean_x2 - (mean_x * mean_x))
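For illustration, the identity behind that check can be tried outside the test suite. This is a minimal stdlib sketch, not the code in test_moments.py; the helpers mean_of and var_biased_of and the sample data are hypothetical.

```python
import math


def mean_of(xs):
    """Arithmetic mean of a non-empty list."""
    return sum(xs) / len(xs)


def var_biased_of(xs):
    """Biased (population) variance: divides by N rather than N - 1."""
    m = mean_of(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


x = [1.0, 2.0, 4.0, 7.0]  # arbitrary sample data
mean_x = mean_of(x)
mean_x2 = mean_of([v * v for v in x])
var_x = var_biased_of(x)

# biased var(x) == mean(x^2) - mean(x)^2, up to floating-point error
identity_holds = math.isclose(var_x, mean_x2 - mean_x * mean_x)
```

Analogous identities relate the biased skew and kurtosis estimates to lower-order moments, which is what the proposed extra consistency checks would exercise.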
Comment From: seth-p
Regarding 0 vs NaN, even if for now we are keeping the disparate behavior of rolling/expanding_kurt/skew() and Series.skew/kurt(), I would leave those commented-out tests there (still commented out), as an aspirational goal...
Comment From: seth-p
@jreback, if we need to divide by a denominator that is 0, then I think we should return NaN (analogous to calculating an unbiased standard deviation, where we need to divide by N-1 when N=1).
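A minimal sketch of that convention, assuming a hypothetical helper var_with_ddof (this is not the pandas code): the estimate divides by N - ddof over the non-NaN observations and yields NaN when that denominator is not positive.

```python
import math


def var_with_ddof(values, ddof=1):
    """Variance over non-NaN values, dividing by N - ddof.

    Returns NaN when N - ddof <= 0, e.g. a single observation with ddof=1.
    """
    obs = [v for v in values if not math.isnan(v)]
    denom = len(obs) - ddof
    if denom <= 0:
        return math.nan
    m = sum(obs) / len(obs)
    return sum((v - m) ** 2 for v in obs) / denom


single_unbiased = var_with_ddof([5.0])          # N - 1 == 0, so NaN
single_biased = var_with_ddof([5.0], ddof=0)    # divides by N == 1
three_values = var_with_ddof([1.0, 2.0, 3.0])   # ordinary unbiased case
```

The same reasoning would apply to skew and kurt: a constant series makes the variance term in their denominators zero, so NaN is the analogous result.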
Comment From: jreback
@seth-p yep that too! I believe var/std do that as well shouldn't be filled by 0 by default (though is ATM)
Comment From: seth-p
@jreback, no, at least in master, calculating unbiased (ddof=1, the default) var/std will produce NaN for a single non-NaN value (since it wants to divide by N-1 = 0). This should be the case for rolling/expanding_var(), ewmvar() (with bias=False, the default), and Series.var(). If that's not the case, and any of these produces 0 for a single non-NaN value, then I missed something in my tests...
Comment From: jreback
@seth-p what I mean is that var/std DO provide a nan when dividing by 0. These should all be consistent (hence this issue).
Comment From: seth-p
Ah, sorry, I misunderstood you. Yes, all the var/std functions do produce a NaN when dividing by 0...
Comment From: jreback
As far as the tests go, yes, I think we will have to change those tests and test for nan in those edge cases.
Comment From: TomAugspurger
In [133]: s.rolling(4).kurt()
Out[133]:
0 NaN
1 NaN
2 NaN
3 NaN
dtype: float64
In [134]: s.kurt()
Out[134]: 0