Code Sample, a copy-pastable example if possible
import pandas as pd

a = pd.Series([1e5, 0, 0, 0, 0])
b = pd.Series([9.45] * 5)  # constant series, so its standard deviation is 0

c1 = a.rolling(5).corr(b).iloc[4]  # rolling correlation over the full window
c2 = a.corr(b)                     # vanilla correlation
v1 = a.rolling(5).cov(b).iloc[4]   # rolling covariance over the full window
v2 = a.cov(b)                      # vanilla covariance

# Both assertions fail: v1 carries a tiny floating-point residue,
# and c1 comes out as inf while c2 is nan.
assert c1 == c2
assert v1 == v2
Problem description
I came across some strange behavior in pandas' rolling correlation. In the code snippet above, I would expect v1 == v2 to hold, but it turns out not to. This produces inf in the rolling correlation (c1 vs. c2, where c2 is fine but c1 is "wrong" in my opinion). Since the standard deviation of a constant sequence is 0, its correlation with any other sequence is a 0/0. Returning nan, as the vanilla corr does, is fine, but returning inf is annoying and misleading.
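To make the nan-versus-inf distinction concrete, here is a minimal sketch of the two divisions using plain NumPy scalars (not pandas internals): an exact 0/0 yields nan, while a tiny nonzero residue over 0 yields inf.

```python
import numpy as np

with np.errstate(divide="ignore", invalid="ignore"):
    # Exactly zero covariance over a zero denominator: 0/0 -> nan.
    exact = np.float64(0.0) / np.float64(0.0)
    # A tiny floating-point residue over a zero denominator: x/0 -> inf.
    residue = np.float64(1e-11) / np.float64(0.0)

print(exact, residue)  # nan inf
```

So any rounding residue left in the rolling covariance flips the result from nan to inf once the denominator is exactly zero.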
Expected Output
assertions pass
Output of pd.show_versions()
Comment From: navicor90
Hi, I was reading the implementation of the cov function, and at the end it computes:
(mean(X * Y) - mean(X) * mean(Y))
where X = a and Y = b in your situation.
You need to know that inside the cov function the a and b series are cast to float64. And since floating-point arithmetic in Python 3 has limited precision, the subtraction mean(X * Y) - mean(X) * mean(Y) is not exactly zero.
I thought that this other implementation would avoid this kind of situation:
(summarize(X * Y) - (n * mean(X) * mean(Y))) * bias_adj
I tested it and it returned zero.
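Both forms can be checked directly on the data from the report. This is a sketch in plain NumPy, not the actual pandas internals, and the bias adjustment is omitted since it only scales the result and cannot turn a nonzero residue into zero:

```python
import numpy as np

a = np.array([1e5, 0, 0, 0, 0])
b = np.array([9.45] * 5)
n = len(a)

# Single-pass form from the cov implementation: mean(X*Y) - mean(X)*mean(Y).
naive = (a * b).mean() - a.mean() * b.mean()

# Proposed alternative: sum(X*Y) - n*mean(X)*mean(Y).
alt = (a * b).sum() - n * a.mean() * b.mean()

print(naive)  # tiny nonzero residue from cancellation
print(alt)    # 0.0
```

For this particular data the alternative cancels exactly because both terms round to the same double, while the mean-based form leaves a residue of roughly one ulp.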
Comment From: jreback
yep would take a patch for this
Comment From: navicor90
I was trying to implement this, but I found many other situations where
(summarize(X * Y) - (n * mean(X) * mean(Y))) * bias_adj
still does not return exactly the expected value.
Why does this happen? Because we still have fractions (mean(X) and mean(Y)). I looked for other ways to do the math, but you always need to preserve at least one fraction, so eventually the imprecision appears.
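The residual-fraction problem is easiest to see with a large constant offset, where a single-pass formula can lose every significant digit. A minimal variance sketch in plain Python (made-up data, not the pandas code):

```python
xs = [1e9 + 4, 1e9 + 7, 1e9 + 13]
n = len(xs)
mean = sum(xs) / n  # 1000000008.0, exact for this data

# Single-pass: mean(X*X) - mean(X)**2. The squares are ~1e18, where
# doubles are spaced 128 apart, so the true variance (14) drowns in
# rounding error and the subtraction cancels catastrophically.
naive_var = sum(x * x for x in xs) / n - mean * mean

# Two-pass: subtract the mean first, then square. The deviations are
# small integers, exactly representable, so the result is exact.
two_pass_var = sum((x - mean) ** 2 for x in xs) / n

print(naive_var)     # not the true variance
print(two_pass_var)  # 14.0, the true population variance
```

A two-pass (demeaned) formula avoids the cancellation entirely here, which is why numerically careful variance/covariance code usually prefers it over any rearrangement of the single-pass moments.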
As a workaround, the user could use the round, floor, or ceil functions.