It would be great if corrwith could calculate correlations between dataframes that have different column names but the same index. For example, take the two (10, 5) dataframes df1 and df2 below.
import pandas as pd
import numpy as np
import string
nrow = 10
ncol = 5
axis = 0
index = list(string.ascii_lowercase[:nrow])
columns = list(string.ascii_uppercase[:ncol])
df1 = pd.DataFrame(np.random.randn(nrow, ncol), index=index, columns=columns)
df2 = pd.DataFrame(np.random.randn(nrow, ncol), index=index)
df1 and df2 have different column labels, and I'd like to create a 5x5 matrix of the correlations between their columns, computed over the values in each shared row.
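To make the request concrete, here is a brute-force sketch of the desired behaviour (the nested loop is illustrative only, not a proposed implementation):

```python
import numpy as np
import pandas as pd

nrow, ncol = 10, 5
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((nrow, ncol)), columns=list("ABCDE"))
df2 = pd.DataFrame(rng.standard_normal((nrow, ncol)))  # default integer columns

# correlate every column of df1 with every column of df2 over the shared
# index, giving the requested (5, 5) cross-correlation matrix
result = pd.DataFrame(
    {c2: {c1: df1[c1].corr(df2[c2]) for c1 in df1.columns} for c2 in df2.columns}
)
```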
I've implemented a stopgap measure here: https://github.com/YeoLab/flotilla/blob/d9e53c219320c5d5dbbbfa41769abb2ab6f25574/flotilla/compute/generic.py#L429
Is this a planned feature for future releases?
Probably also related to the method issue: https://github.com/pydata/pandas/issues/9490
Comment From: shoyer
I don't think this feature is currently planned, but we're always open to new contributions!
This certainly seems like a useful feature in some form, but I'm not sure it should be squeezed into corrwith, which currently pairs up matching columns in the two dataframes instead of considering the outer product, e.g.,
In [20]: df1.corrwith(df1 ** 1.3)
Out[20]:
A 0.997052
B 0.999656
C 0.994640
D 0.997947
E 0.996514
dtype: float64
This might make more sense as an optional other argument to DataFrame.corr, which already returns a matrix of correlations:
In [21]: df1.corr()
Out[21]:
A B C D E
A 1.000000 0.333927 0.174839 -0.202988 0.071950
B 0.333927 1.000000 -0.012221 -0.407869 0.258222
C 0.174839 -0.012221 1.000000 -0.332829 -0.028436
D -0.202988 -0.407869 -0.332829 1.000000 0.434950
E 0.071950 0.258222 -0.028436 0.434950 1.000000
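With the API as it stands, one way to get the full cross matrix is one corrwith call per column of the other frame; this is a sketch of the desired semantics, not the proposed other argument itself:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df1 = pd.DataFrame(rng.standard_normal((10, 5)), columns=list("ABCDE"))
df2 = pd.DataFrame(rng.standard_normal((10, 5)))

# each apply call correlates one df2 column against every df1 column, so the
# result has df1's columns as its index and df2's columns as its columns
cross = df2.apply(lambda col: df1.corrwith(col))
```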
Comment From: jreback
This would be trivial if .set_index() took an axis argument
In [51]: df1.T.corrwith(df2.T.set_index(df1.columns))
Out[51]:
a 0.072140
b -0.025166
c 0.355389
d 0.461372
e 0.051302
f -0.031054
g 0.761556
h -0.960473
i 0.858428
j 0.159165
dtype: float64
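In later pandas versions, set_axis covers the relabelling, so the same row-wise trick works without transposing; a sketch assuming a pandas release where set_axis accepts an axis keyword and corrwith accepts axis=1:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df1 = pd.DataFrame(rng.standard_normal((10, 5)),
                   index=list("abcdefghij"), columns=list("ABCDE"))
df2 = pd.DataFrame(rng.standard_normal((10, 5)), index=list("abcdefghij"))

# relabel df2's columns to match df1's so corrwith pairs them up, then
# correlate row-wise (axis=1), mirroring the transpose trick above
rowwise = df1.corrwith(df2.set_axis(df1.columns, axis=1), axis=1)
```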
Comment From: shoyer
@jreback that's not quite what @olgabot is looking for, I think -- she wants a 5x5 matrix of cross correlations, if I understand her correctly
Comment From: olgabot
@shoyer that's correct - in this example, I want a 5x5 matrix of the cross correlations. Generally speaking, if df1 is (n, m) and df2 is (n, p), then I want an (m, p) matrix of correlations
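That (n, m) x (n, p) -> (m, p) spec reduces to a single matrix product on z-scored columns, avoiding any pairwise looping. A sketch (cross_corr is an illustrative name, not a pandas API):

```python
import numpy as np
import pandas as pd

def cross_corr(left, right):
    """Pearson correlation between every column of left ((n, m)) and every
    column of right ((n, p)), returned as an (m, p) DataFrame."""
    # z-score each column; .std() uses ddof=1, matching DataFrame.corr
    a = (left - left.mean()) / left.std()
    b = (right - right.mean()) / right.std()
    # the whole cross matrix is then one matmul, scaled by n - 1
    vals = a.to_numpy().T @ b.to_numpy() / (len(left) - 1)
    return pd.DataFrame(vals, index=left.columns, columns=right.columns)

rng = np.random.default_rng(3)
df1 = pd.DataFrame(rng.standard_normal((10, 5)), columns=list("ABCDE"))
df2 = pd.DataFrame(rng.standard_normal((10, 7)))
out = cross_corr(df1, df2)
```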
Comment From: olgabot
That's a solution I've used before, but the issue is that it takes a long time if the matrices are large (thousands of columns), and I'm not necessarily interested in the df1,df2 correlations.
On Mon, Apr 6, 2015 at 4:58 PM jreback notifications@github.com wrote:
In [58]: pd.concat([df1,df2],axis=1).corr()
Out[58]:
          A         B         C         D         E         0         1         2         3         4
A  1.000000  0.160876  0.368649  0.252098  0.108760  0.566828 -0.466817 -0.314223 -0.131133  0.590715
B  0.160876  1.000000  0.545311  0.056020 -0.460356 -0.091174  0.354608 -0.378924 -0.003633 -0.048977
C  0.368649  0.545311  1.000000 -0.451701 -0.165686 -0.036762  0.253964  0.036151  0.125651  0.145658
D  0.252098  0.056020 -0.451701  1.000000 -0.257068  0.735342 -0.185773 -0.589945  0.138441  0.006919
E  0.108760 -0.460356 -0.165686 -0.257068  1.000000  0.035644 -0.570018 -0.012733 -0.293044  0.464915
0  0.566828 -0.091174 -0.036762  0.735342  0.035644  1.000000 -0.188574 -0.525456  0.359605  0.347547
1 -0.466817  0.354608  0.253964 -0.185773 -0.570018 -0.188574  1.000000 -0.056621  0.627198 -0.289038
2 -0.314223 -0.378924  0.036151 -0.589945 -0.012733 -0.525456 -0.056621  1.000000  0.041765 -0.069550
3 -0.131133 -0.003633  0.125651  0.138441 -0.293044  0.359605  0.627198  0.041765  1.000000  0.287847
4  0.590715 -0.048977  0.145658  0.006919  0.464915  0.347547 -0.289038 -0.069550  0.287847  1.000000
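For reference, the relevant block of that concat-based matrix can be sliced out afterwards, though the approach still computes the within-frame blocks as well, which is exactly the cost being objected to here:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df1 = pd.DataFrame(rng.standard_normal((10, 5)), columns=list("ABCDE"))
df2 = pd.DataFrame(rng.standard_normal((10, 5)))

# the full (10, 10) matrix contains df1-df1, df2-df2 and df1-df2 blocks;
# keep only the cross block
block = pd.concat([df1, df2], axis=1).corr().loc[df1.columns, df2.columns]
```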
Comment From: pstjohn
This is an old issue, but it doesn't seem like pandas has this functionality yet. here's a code sample that works for me, when left
and right
are two dataframes with the same columns but different indices.
import numpy as np
import pandas as pd

def corrwith(left, right):
    # broadcast left's rows against right's rows:
    # both tiled arrays have shape (n_left, n_right, n_cols)
    left_tiled = np.repeat(left.values[:, np.newaxis, :], right.shape[0], 1)
    right_tiled = np.repeat(right.values[np.newaxis, :, :], left.shape[0], 0)
    # demean each row pair along the shared column axis
    ldem = left_tiled - left_tiled.mean(-1)[:, :, np.newaxis]
    rdem = right_tiled - right_tiled.mean(-1)[:, :, np.newaxis]
    # Pearson r = covariance / product of standard deviations
    num = (ldem * rdem).sum(-1)
    dom = np.sqrt((ldem**2).sum(-1)) * np.sqrt((rdem**2).sum(-1))
    correl = num / dom
    return pd.DataFrame(correl, index=left.index, columns=right.index)
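A quick sanity check of that sample against NumPy's own pairwise correlation (the function is repeated here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

def corrwith(left, right):
    # same row-wise cross-correlation routine as in the comment above
    left_tiled = np.repeat(left.values[:, np.newaxis, :], right.shape[0], 1)
    right_tiled = np.repeat(right.values[np.newaxis, :, :], left.shape[0], 0)
    ldem = left_tiled - left_tiled.mean(-1)[:, :, np.newaxis]
    rdem = right_tiled - right_tiled.mean(-1)[:, :, np.newaxis]
    num = (ldem * rdem).sum(-1)
    dom = np.sqrt((ldem**2).sum(-1)) * np.sqrt((rdem**2).sum(-1))
    return pd.DataFrame(num / dom, index=left.index, columns=right.index)

rng = np.random.default_rng(5)
left = pd.DataFrame(rng.standard_normal((4, 6)))
right = pd.DataFrame(rng.standard_normal((3, 6)))
result = corrwith(left, right)
```

Note that the tiled intermediates are (n_left, n_right, n_cols) arrays, so memory grows quickly for large frames; the matmul formulation sketched earlier avoids that.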