It would be great if corrwith could calculate correlations between dataframes that have different column names but the same index. For example, take the two (10, 5) dataframes df1 and df2 below.
import pandas as pd
import numpy as np
import string
nrow = 10
ncol = 5
axis = 0
index = list(string.ascii_lowercase[:nrow])
columns = list(string.ascii_uppercase[:ncol])
df1 = pd.DataFrame(np.random.randn(nrow, ncol), index=index, columns=columns)
df2 = pd.DataFrame(np.random.randn(nrow, ncol), index=index)
df1 and df2 have different column labels, and I'd like to create a 5x5 matrix of the correlations between their columns, computed over the values in each shared row.
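To make the request concrete, here is a brute-force sketch of the desired behaviour (the nested loop is illustrative only, not a proposed implementation):

```python
import numpy as np
import pandas as pd

nrow, ncol = 10, 5
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((nrow, ncol)), columns=list("ABCDE"))
df2 = pd.DataFrame(rng.standard_normal((nrow, ncol)))  # default integer columns

# correlate every column of df1 with every column of df2 over the shared
# index, giving the requested (5, 5) cross-correlation matrix
result = pd.DataFrame(
    {c2: {c1: df1[c1].corr(df2[c2]) for c1 in df1.columns} for c2 in df2.columns}
)
```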
I've implemented a stopgap measure here: https://github.com/YeoLab/flotilla/blob/d9e53c219320c5d5dbbbfa41769abb2ab6f25574/flotilla/compute/generic.py#L429
Is this a planned feature for future releases?
Probably also related to the method issue: https://github.com/pydata/pandas/issues/9490
Comment From: shoyer
I don't think this feature is currently planned, but we're always open to new contributions!
This certainly seems like a useful feature in some form, but I'm not sure it should be squeezed into corrwith, which currently pairs up matching columns in the two dataframes instead of considering the outer product, e.g.,
In [20]: df1.corrwith(df1 ** 1.3)
Out[20]:
A 0.997052
B 0.999656
C 0.994640
D 0.997947
E 0.996514
dtype: float64
This might make more sense as an optional other argument to DataFrame.corr, which already returns a matrix of correlations:
In [21]: df1.corr()
Out[21]:
A B C D E
A 1.000000 0.333927 0.174839 -0.202988 0.071950
B 0.333927 1.000000 -0.012221 -0.407869 0.258222
C 0.174839 -0.012221 1.000000 -0.332829 -0.028436
D -0.202988 -0.407869 -0.332829 1.000000 0.434950
E 0.071950 0.258222 -0.028436 0.434950 1.000000
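With the API as it stands, one way to get the full cross matrix is one corrwith call per column of the other frame; this is a sketch of the desired semantics, not the proposed other argument itself:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df1 = pd.DataFrame(rng.standard_normal((10, 5)), columns=list("ABCDE"))
df2 = pd.DataFrame(rng.standard_normal((10, 5)))

# each apply call correlates one df2 column against every df1 column, so the
# result has df1's columns as its index and df2's columns as its columns
cross = df2.apply(lambda col: df1.corrwith(col))
```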
Comment From: jreback
This would be trivial if .set_index() took an axis argument
In [51]: df1.T.corrwith(df2.T.set_index(df1.columns))
Out[51]:
a 0.072140
b -0.025166
c 0.355389
d 0.461372
e 0.051302
f -0.031054
g 0.761556
h -0.960473
i 0.858428
j 0.159165
dtype: float64
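In later pandas versions, set_axis covers the relabelling, so the same row-wise trick works without transposing; a sketch assuming a pandas release where set_axis accepts an axis keyword and corrwith accepts axis=1:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df1 = pd.DataFrame(rng.standard_normal((10, 5)),
                   index=list("abcdefghij"), columns=list("ABCDE"))
df2 = pd.DataFrame(rng.standard_normal((10, 5)), index=list("abcdefghij"))

# relabel df2's columns to match df1's so corrwith pairs them up, then
# correlate row-wise (axis=1), mirroring the transpose trick above
rowwise = df1.corrwith(df2.set_axis(df1.columns, axis=1), axis=1)
```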
Comment From: shoyer
@jreback that's not quite what @olgabot is looking for, I think -- she wants a 5x5 matrix of cross correlations, if I understand her correctly
Comment From: olgabot
@shoyer that's correct - in this example, I want a 5x5 matrix of the cross correlations. Generally speaking, if df1 is (n, m) and df2 is (n, p), then I want an (m, p) matrix of correlations
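That (n, m) x (n, p) -> (m, p) spec reduces to a single matrix product on z-scored columns, avoiding any pairwise looping. A sketch (cross_corr is an illustrative name, not a pandas API):

```python
import numpy as np
import pandas as pd

def cross_corr(left, right):
    """Pearson correlation between every column of left ((n, m)) and every
    column of right ((n, p)), returned as an (m, p) DataFrame."""
    # z-score each column; .std() uses ddof=1, matching DataFrame.corr
    a = (left - left.mean()) / left.std()
    b = (right - right.mean()) / right.std()
    # the whole cross matrix is then one matmul, scaled by n - 1
    vals = a.to_numpy().T @ b.to_numpy() / (len(left) - 1)
    return pd.DataFrame(vals, index=left.columns, columns=right.columns)

rng = np.random.default_rng(3)
df1 = pd.DataFrame(rng.standard_normal((10, 5)), columns=list("ABCDE"))
df2 = pd.DataFrame(rng.standard_normal((10, 7)))
out = cross_corr(df1, df2)
```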
Comment From: olgabot
That's a solution I've used before, but the issue is that it takes a long time if the matrices are large (thousands of columns), and I'm not necessarily interested in the df1,df2 correlations.
On Mon, Apr 6, 2015 at 4:58 PM jreback notifications@github.com wrote:
In [58]: pd.concat([df1,df2],axis=1).corr()
Out[58]:
          A         B         C         D         E         0         1         2         3         4
A  1.000000  0.160876  0.368649  0.252098  0.108760  0.566828 -0.466817 -0.314223 -0.131133  0.590715
B  0.160876  1.000000  0.545311  0.056020 -0.460356 -0.091174  0.354608 -0.378924 -0.003633 -0.048977
C  0.368649  0.545311  1.000000 -0.451701 -0.165686 -0.036762  0.253964  0.036151  0.125651  0.145658
D  0.252098  0.056020 -0.451701  1.000000 -0.257068  0.735342 -0.185773 -0.589945  0.138441  0.006919
E  0.108760 -0.460356 -0.165686 -0.257068  1.000000  0.035644 -0.570018 -0.012733 -0.293044  0.464915
0  0.566828 -0.091174 -0.036762  0.735342  0.035644  1.000000 -0.188574 -0.525456  0.359605  0.347547
1 -0.466817  0.354608  0.253964 -0.185773 -0.570018 -0.188574  1.000000 -0.056621  0.627198 -0.289038
2 -0.314223 -0.378924  0.036151 -0.589945 -0.012733 -0.525456 -0.056621  1.000000  0.041765 -0.069550
3 -0.131133 -0.003633  0.125651  0.138441 -0.293044  0.359605  0.627198  0.041765  1.000000  0.287847
4  0.590715 -0.048977  0.145658  0.006919  0.464915  0.347547 -0.289038 -0.069550  0.287847  1.000000
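For reference, the relevant block of that concat-based matrix can be sliced out afterwards, though the approach still computes the within-frame blocks as well, which is exactly the cost being objected to here:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df1 = pd.DataFrame(rng.standard_normal((10, 5)), columns=list("ABCDE"))
df2 = pd.DataFrame(rng.standard_normal((10, 5)))

# the full (10, 10) matrix contains df1-df1, df2-df2 and df1-df2 blocks;
# keep only the cross block
block = pd.concat([df1, df2], axis=1).corr().loc[df1.columns, df2.columns]
```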
Comment From: pstjohn
This is an old issue, but it doesn't seem like pandas has this functionality yet. here's a code sample that works for me, when left
and right
are two dataframes with the same columns but different indices.
import numpy as np
import pandas as pd

def corrwith(left, right):
    # broadcast left's rows against right's rows:
    # both tiled arrays have shape (n_left, n_right, n_cols)
    left_tiled = np.repeat(left.values[:, np.newaxis, :], right.shape[0], 1)
    right_tiled = np.repeat(right.values[np.newaxis, :, :], left.shape[0], 0)
    # demean each row pair along the shared column axis
    ldem = left_tiled - left_tiled.mean(-1)[:, :, np.newaxis]
    rdem = right_tiled - right_tiled.mean(-1)[:, :, np.newaxis]
    # Pearson r = covariance / product of standard deviations
    num = (ldem * rdem).sum(-1)
    dom = np.sqrt((ldem**2).sum(-1)) * np.sqrt((rdem**2).sum(-1))
    correl = num / dom
    return pd.DataFrame(correl, index=left.index, columns=right.index)
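A quick sanity check of that sample against NumPy's own pairwise correlation (the function is repeated here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

def corrwith(left, right):
    # same row-wise cross-correlation routine as in the comment above
    left_tiled = np.repeat(left.values[:, np.newaxis, :], right.shape[0], 1)
    right_tiled = np.repeat(right.values[np.newaxis, :, :], left.shape[0], 0)
    ldem = left_tiled - left_tiled.mean(-1)[:, :, np.newaxis]
    rdem = right_tiled - right_tiled.mean(-1)[:, :, np.newaxis]
    num = (ldem * rdem).sum(-1)
    dom = np.sqrt((ldem**2).sum(-1)) * np.sqrt((rdem**2).sum(-1))
    return pd.DataFrame(num / dom, index=left.index, columns=right.index)

rng = np.random.default_rng(5)
left = pd.DataFrame(rng.standard_normal((4, 6)))
right = pd.DataFrame(rng.standard_normal((3, 6)))
result = corrwith(left, right)
```

Note that the tiled intermediates are (n_left, n_right, n_cols) arrays, so memory grows quickly for large frames; the matmul formulation sketched earlier avoids that.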