groupby.var does not give me the same result as numpy. Checking against the variance formula by hand, numpy's result seems to be the correct one. I think groupby.var divides by N-1 instead of N (the number of observations).
# data
import numpy as np
import pandas as pd
df = pd.DataFrame({'id':['aa','aa','aa','aa','dd','dd'],'a':[2,2,3,4,2,2], 'b':[1,2,3,4,5,6]})
# code pandas
df[['id','b']].groupby('id').var()
#[out]:
           b
id
aa  1.666667
dd  0.500000
# with numpy
for i in df['id'].drop_duplicates().tolist():
    nb_var = np.var(df[df.id==i]['b'])
    print(i, nb_var)
#[out]:
aa 1.25
dd 0.25
Comment From: mroeschke
As specified in the docs, the pandas variance calculation defaults to the unbiased estimate (dividing by n-1), while numpy defaults to the maximum likelihood estimate (dividing by n). If you want to calculate the variance the way numpy does, pass ddof=0 to var():
In [5]: df[['id','b']].groupby('id').var(ddof=0)
Out[5]:
       b
id
aa  1.25
dd  0.25
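To make the role of ddof concrete, here is a minimal sketch that reuses the data from the question (only the 'id' and 'b' columns). The variable names are illustrative and not part of either library. It computes both estimates, shows how to get the n-1 version from numpy as well, and checks that the two differ only by the factor n / (n - 1).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['aa', 'aa', 'aa', 'aa', 'dd', 'dd'],
                   'b': [1, 2, 3, 4, 5, 6]})

grouped = df.groupby('id')['b']

# pandas default is ddof=1: sum of squared deviations divided by n - 1
sample_var = grouped.var()

# ddof=0 divides by n instead, which is numpy's default
population_var = grouped.var(ddof=0)

# The other direction: ask numpy for the n - 1 estimate via ddof=1
numpy_sample_var = grouped.apply(lambda s: np.var(s.to_numpy(), ddof=1))

# The two estimates differ only by the factor n / (n - 1)
n = grouped.size()
rescaled = population_var * n / (n - 1)

print(sample_var)
print(population_var)
```

For group aa, the values 1, 2, 3, 4 have mean 2.5 and a sum of squared deviations of 5, so dividing by n = 4 gives 1.25 and dividing by n - 1 = 3 gives about 1.666667, matching both outputs in this thread.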
Comment From: laurazh
It works, thanks. I had not understood this option in the docs.