Code Sample, a copy-pastable example if possible
Functionalized example of what I'm seeking to implement in pandas as a corr argument or separate function:
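import numpy as np
import pandas as pd

def correlate_sort(df: pd.DataFrame, method: str = 'pearson') -> pd.DataFrame:
    """
    pd.DataFrame.corr() without redundancy and sorted by strength
    """
    df = df.corr(method=method)
    # Mask the lower triangle and the diagonal so each pair appears once.
    # np.bool is not available in every NumPy release; the builtin bool always works.
    df = df.mask(np.tril(np.ones(df.shape)).astype(bool))
    # dropna() covers pandas versions whose stack() keeps NaN entries.
    df = df.stack().dropna().reset_index()
    df = df.rename(columns={0: method})
    df['sort'] = df[method].abs()
    df = df.sort_values('sort', ascending=False)
    return df.drop('sort', axis=1).reset_index(drop=True)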
Problem description
pd.DataFrame.corr() returns a symmetric table with redundancies: every pair of columns appears twice, and the diagonal is always 1. I'm interested in implementing an enhancement (as an argument option, a separate function, etc.) that returns a DataFrame without that redundancy, sorted by correlation strength.
import pandas as pd
data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data'
df = pd.read_csv(data_url, names=['Age', 'op_year', 'pos_nodes', '5_yr_outcome'])
df.corr()
                    Age    op_year  pos_nodes  5_yr_outcome
Age            1.000000   0.089529  -0.063176     -0.067950
op_year        0.089529   1.000000  -0.003764      0.004768
pos_nodes     -0.063176  -0.003764   1.000000     -0.286768
5_yr_outcome  -0.067950   0.004768  -0.286768      1.000000
Expected Output
     level_0       level_1    pearson
0  pos_nodes  5_yr_outcome  -0.286768
1        Age       op_year   0.089529
2        Age  5_yr_outcome  -0.067950
3        Age     pos_nodes  -0.063176
4    op_year  5_yr_outcome   0.004768
5    op_year     pos_nodes  -0.003764
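For comparison, the same long-format result can probably be reached without a temporary sort column. The following is only a sketch under a few assumptions: the name correlate_pairs and the demo data are made up for illustration, the key= argument of sort_values needs pandas 1.1 or later, and the level_0/level_1 column names are set explicitly to match the expected output above.

```python
import numpy as np
import pandas as pd


def correlate_pairs(df: pd.DataFrame, method: str = "pearson") -> pd.DataFrame:
    """Hypothetical helper: one row per unordered column pair, strongest first."""
    corr = df.corr(method=method)
    # True strictly above the diagonal; k=1 skips the self-correlations.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = (
        corr.where(upper)   # keep the upper triangle, NaN everywhere else
        .stack()            # long format: (row label, column label) -> value
        .dropna()           # needed on pandas versions whose stack() keeps NaN
        .rename(method)
        .rename_axis(["level_0", "level_1"])
        .reset_index()
    )
    # Sort by absolute correlation; key= avoids adding a helper column.
    pairs = pairs.sort_values(method, key=lambda s: s.abs(), ascending=False)
    return pairs.reset_index(drop=True)


# Small made-up frame so the sketch runs without downloading anything.
demo = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 11], "c": [5, 3, 4, 1, 2]}
)
result = correlate_pairs(demo)
print(result)
```

Passing method="spearman" or method="kendall" should work the same way, since the value is handed straight to DataFrame.corr and reused as the name of the result column.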
Output of pd.show_versions()
Comment From: mroeschke
This sounds like more of a usage question. We recommend using StackOverflow for these types of questions.
Issues are reserved for bug tracking and enhancement requests.
Comment From: chrisluedtke
@mroeschke I want to make an enhancement. Is this a feature that would be useful?
Comment From: mroeschke
Oh I see. I'll open it back up for discussion. I could see this as a useful cookbook example in our documentation; we are somewhat hesitant to expand pandas' already large API without strong interest from the community.
Comment From: jbrockmendel
I agree with @mroeschke; I have used this pattern before myself and don't need it implemented in pandas.