Research

  • [X] I have searched the [pandas] tag on StackOverflow for similar questions.

  • [X] I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://codereview.stackexchange.com/questions/284462/python-correlation-function

Question about pandas

I am looking to calculate the correlation between two sets of columns, namely between features and targets, where the number of features is much larger than the number of targets. The documented way to calculate correlation is to compute the whole correlation matrix and then select a subset of its rows/columns, as shown in the code snippet below. However, since the number of features is large, this performs a lot of unnecessary calculations, as I filter most of the coefficients out afterwards.

import pandas as pd

train = pd.read_csv('train.csv')
targets = ['y1', 'y2']
features = [c for c in train.columns if c not in targets]
# computes the full (n+m) x (n+m) matrix, then keeps only the n x m block
train.corr().loc[features, targets]

Is there a more efficient way to do so? I've tried corrwith too, but it only seems to accept DataFrames with the same subset of columns.
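For reference, the wanted block can also be computed directly as a matrix product of the standardized columns, avoiding the full matrix entirely. A minimal sketch with synthetic data (the column names here are made up, and numeric columns with no NaNs are assumed):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(100, 5)),
                     columns=['x1', 'x2', 'x3', 'y1', 'y2'])
targets = ['y1', 'y2']
features = [c for c in train.columns if c not in targets]

# Standardize with the population std; the feature/target correlation
# block is then a single matrix product: corr(X, Y) = Z_X.T @ Z_Y / n
zx = (train[features] - train[features].mean()) / train[features].std(ddof=0)
zy = (train[targets] - train[targets].mean()) / train[targets].std(ddof=0)
block = pd.DataFrame(zx.to_numpy().T @ zy.to_numpy() / len(train),
                     index=features, columns=targets)

# Agrees with the full-matrix route on clean numeric data
assert np.allclose(block, train.corr().loc[features, targets])
```

This only works for Pearson correlation; the pandas route additionally handles non-numeric columns and pairwise NaN exclusion.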

Comment From: rhshadrach

You'd like to calculate the correlation between each (feature, target) pair, is that correct? So if features are '1', '2', '3', and targets are 'a', 'b', then you want ('a', '1'), ('a', '2'), ('a', '3'), ('b', '1'), ('b', '2'), ('b', '3'). Is that right?

Comment From: lcrmorin

Yes, that's right. Ideally in a DataFrame format. I haven't found any practical way to do this in pandas without calculating the other coefficients (('a', 'a'), ('a', 'b'), ...).

Comment From: lcrmorin

I have found that the following code works as expected:

corr_matrix = pd.concat({target: train[features].corrwith(train[target]) for target in targets}, axis=1)
corr_matrix

However, it is not clear to me: 1) whether the internal loop is optimised, and 2) why this is not the default behavior of

train[targets].corrwith(train[features])

Comment From: rhshadrach

However, it is not clear to me that:

  1. whether the internal loop is optimised

Can you profile it against using .corr(...)? They will be the same operation when features and targets consist of all columns in the frame.

This does seem to me to be a natural use of corr; I'd support adding two arguments to specify the subsets of columns to take the pairwise correlation over (each defaulting to all columns). However, I don't have good names for these arguments; I think features and targets is too domain-specific.

Comment From: lcrmorin

It seems to me that it would be more appropriate to improve corrwith. Currently corrwith gives the expected result with a Series (the correlation coefficient of each column with that Series), but seems to break when used with a DataFrame as a parameter. It now seems more natural to me that:

df1.corrwith(df2)

would give an n x m correlation matrix, just as

df1.corrwith(s)

gives an n x 1 correlation matrix.

This way I could use the straightforward/explicit syntax:

train[features].corrwith(train[targets])

Comment From: lcrmorin

Now that I think about it more, your solution would be welcome too. I also use df.groupby('time').corr() a lot, and your solution would work for that use case as well.
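A workaround for the grouped case today might look like the following sketch (the column names 'time', 'x1', 'x2', 'y1' are illustrative, not from any real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    'time': np.repeat([1, 2], 50),
    'x1': rng.normal(size=100),
    'x2': rng.normal(size=100),
    'y1': rng.normal(size=100),
})
features, target = ['x1', 'x2'], 'y1'

# One row per group: the correlation of each feature with the target,
# built group-by-group with corrwith instead of a full corr() per group
per_group = pd.concat(
    {t: g[features].corrwith(g[target]) for t, g in df.groupby('time')},
    axis=1,
).T
print(per_group)
```

This still loops over groups in Python, which is part of why first-class support in corr would be attractive.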

Comment From: rhshadrach

It seems to me that it would be more appropriate to improve corr_with.

corrwith is not for computing pairwise correlations on DataFrames. It aligns on both index and columns (when available). The reason it works for you with a Series is that there are no columns to align on. Changing corrwith not to align seems to me to require a large change in its implementation.

On the other hand, corr is for computing pairwise correlations. It currently iterates over every pair of columns (in Cython), computing the correlation. I think that to modify it, all we need to do is specify iterables of columns to use instead.
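In pure-Python terms, the proposed behavior might look like this sketch (corr_subset is a hypothetical helper, not an existing pandas API; the real change would restrict the Cython loop instead):

```python
import pandas as pd

def corr_subset(df, row_cols, col_cols, method='pearson'):
    # Hypothetical helper mirroring the proposal: compute only the
    # (row_cols x col_cols) block of the pairwise correlation matrix.
    out = pd.DataFrame(index=row_cols, columns=col_cols, dtype=float)
    for r in row_cols:
        for c in col_cols:
            out.loc[r, c] = df[r].corr(df[c], method=method)
    return out

df = pd.DataFrame({'x1': [1, 2, 3, 4], 'x2': [4, 3, 2, 2], 'y': [1, 3, 2, 5]})
print(corr_subset(df, ['x1', 'x2'], ['y']))
```

The result matches the corresponding block of df.corr(), while only len(row_cols) * len(col_cols) coefficients are ever computed.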

Comment From: rhshadrach

corrwith is not for computing pairwise correlations on DataFrames

To make this more explicit: in this example it is only computing the correlation of 'a' with 'a' and of 'b' with 'b':

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]})
df2 = pd.DataFrame({'a': [2, 2, 4], 'b': [6, 8, 9]})
print(df.corrwith(df2))
# a    1.000000
# b    0.981981
# dtype: float64

However, I can see why this is confusing given the behavior when other is a Series, as mentioned in https://github.com/pandas-dev/pandas/issues/52776#issuecomment-1519131101

Comment From: lcrmorin

OK. I'm not sure I see which use case this covers (two different DataFrames with the same column names: is that common practice?), but OK.