Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
data = [1, -1, 0, 1, 3, 2, -2, 10000000000, 1, 2, 0, -2, 1, 3, 0, 1]
df = pd.DataFrame(data, columns=['data'])
df.data.rolling(6).std()[-1:]
# 15    57.250852
df.data.tail(6).std()
# 1.6431676725154984
Issue Description
When there is a large outlier in the data, rolling().std() and tail().std() give different results for the same final six values.
Expected Behavior
The two results must be equal.
Installed Versions
Comment From: GYHHAHA
The current rolling method, which updates its statistics incrementally, cannot reach high precision when the data range is abnormally large. The same applies to #47461.
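To illustrate how an incremental update can lose precision, here is a minimal sketch of a sliding-window variance that adds the newest value and removes the oldest one at each step. It is a simplified illustration under stated assumptions, not pandas' actual implementation; the function name `sliding_variance` is hypothetical, and it assumes `window` is larger than `ddof`.

```python
import math
import statistics


def sliding_variance(values, window, ddof=1):
    """Sliding-window variance kept up to date by add/remove steps.

    Simplified illustration of an online update, NOT pandas' actual code.
    Assumes window > ddof.
    """
    n = 0        # number of observations currently in the window
    mean = 0.0   # running mean of the window
    m2 = 0.0     # running sum of squared deviations from the mean
    result = []
    for i, x in enumerate(values):
        # Add the newest observation (Welford-style update).
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if i >= window:
            # Remove the observation that just left the window.
            old = values[i - window]
            n -= 1
            delta = old - mean
            mean -= delta / n
            m2 -= delta * (old - mean)
        # Only report a value once the window is full.
        result.append(m2 / (n - ddof) if n == window else math.nan)
    return result


data = [1, -1, 0, 1, 3, 2, -2, 10000000000, 1, 2, 0, -2, 1, 3, 0, 1]

online_var = sliding_variance(data, 6)[-1]
exact_var = statistics.variance(data[-6:])  # last window recomputed from scratch

# Any gap between the two is rounding error carried over from the
# window positions that contained the outlier.
print("online:", online_var, "exact:", exact_var)
```

While the outlier is inside the window, the running sum of squares is on the order of 1e20, far beyond the range where a 64-bit float can also hold the small contributions of the other values. Removing the outlier later cannot restore the digits that were rounded away.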
Comment From: sappersapper
Maybe an ad hoc workaround for rolling().std() when the data ranges abnormally is rolling.agg(lambda x: np.std(x, ddof=1)):
import numpy as np
df.rolling(6).agg(lambda x: np.std(x, ddof=1))[-1:]
# 15    1.643168
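The reason this kind of workaround is robust is that every window is evaluated independently, so nothing is carried over from windows that contained the outlier. A minimal standard-library sketch of the same idea follows; the helper name `rolling_std_recompute` is hypothetical, and it is an analogue of the approach rather than what pandas runs internally.

```python
import math
import statistics


def rolling_std_recompute(values, window):
    """Sample standard deviation (ddof=1) of each full window,
    recomputed from scratch for every position."""
    out = []
    for end in range(1, len(values) + 1):
        if end < window:
            out.append(math.nan)  # window not yet full
        else:
            out.append(statistics.stdev(values[end - window:end]))
    return out


data = [1, -1, 0, 1, 3, 2, -2, 10000000000, 1, 2, 0, -2, 1, 3, 0, 1]
print(rolling_std_recompute(data, 6)[-1])
```

The trade-off is cost: recomputing each window takes work proportional to the window size at every position, whereas an incremental update does a constant amount of work per step.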
Comment From: mahmoudmarayef
The rolling().std() and tail().std() methods in pandas return different results here because they do not process the same values: rolling() works through the entire series, while tail() only sees the last six values.
In the case of df.data.rolling(6).std()[-1:], a rolling window of size 6 slides across the whole series, so the outlier value of 10 billion enters the calculation at earlier window positions. This causes the reported standard deviation to be much larger than when the outlier is excluded.
On the other hand, df.data.tail(6).std() only considers the last 6 values in the DataFrame, excluding the outlier value. This results in a smaller standard deviation compared to when the outlier is included.
To ensure that the results are equal, you can apply the rolling window only to the last 6 values by using df.data.tail(6).rolling(6).std()[-1:]. This calculates the standard deviation for the last 6 values of the DataFrame using a rolling window of size 6, so the outlier value never enters the calculation.
Alternatively, you can remove the outlier value from the DataFrame before calculating the standard deviation using the drop method, like so:
df_without_outlier = df.drop(df.loc[df['data'] == 10000000000].index)
df_without_outlier.rolling(6).std()[-1:]
This removes the outlier value from the DataFrame and then calculates the standard deviation using a rolling window of size 6, resulting in a smaller standard deviation that is more in line with the tail().std() method.
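Which values each call actually sees can be checked directly by slicing the data. The sketch below uses only the standard library and the `data` list from the reproducible example; the names `OUTLIER`, `WINDOW`, `last_window` and `affected_ends` are introduced here for illustration.

```python
import statistics

data = [1, -1, 0, 1, 3, 2, -2, 10000000000, 1, 2, 0, -2, 1, 3, 0, 1]
OUTLIER = 10000000000
WINDOW = 6

# Position of the outlier in the series.
outlier_pos = data.index(OUTLIER)

# The six values covered by tail(6) and by the final rolling window.
last_window = data[-WINDOW:]

# End positions whose 6-value window contains the outlier.
affected_ends = [
    end for end in range(WINDOW - 1, len(data))
    if OUTLIER in data[end - WINDOW + 1:end + 1]
]

print("outlier position:", outlier_pos)
print("last window:", last_window)
print("window ends containing the outlier:", affected_ends)
print("std of last window:", statistics.stdev(last_window))
```

The outlier sits at position 7 and only falls inside the windows ending at positions 7 through 12, so the final window (positions 10 through 15) does not contain it.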