Pandas corr return without duplicates and sorted by correlation strength

Code Sample, a copy-pastable example if possible

Functionalized example of what I'm seeking to implement in pandas as a corr argument or separate function:

import numpy as np
import pandas as pd

def correlate_sort(df: pd.DataFrame, method: str = 'pearson') -> pd.DataFrame:
  """
  pd.DataFrame.corr() without redundancy and sorted by strength
  """
  df = df.corr(method)
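  # Mask the diagonal and lower triangle so each pair appears only once.
  # The builtin bool is used because np.bool is unavailable in some NumPy releases.
  df = df.mask(np.tril(np.ones(df.shape)).astype(bool))
  # dropna() guards against stack() versions that keep the masked NaN cells.
  df = df.stack().dropna().reset_index()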
  df = df.rename(columns={0:method})

  df['sort'] = df[method].abs()
  df = df.sort_values('sort', ascending=False)

  return df.drop('sort', axis=1).reset_index(drop=True)

Problem description

pd.DataFrame.corr() returns a table with redundancies. I'm interested in implementing an enhancement (as an argument option or function, etc.) to return a DataFrame without redundancy and sorted by correlation strength.

import pandas as pd

data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data'
df = pd.read_csv(data_url, names=['Age', 'op_year', 'pos_nodes', '5_yr_outcome'])
df.corr()

                   Age   op_year  pos_nodes  5_yr_outcome
Age           1.000000  0.089529  -0.063176     -0.067950
op_year       0.089529  1.000000  -0.003764      0.004768
pos_nodes    -0.063176 -0.003764   1.000000     -0.286768
5_yr_outcome -0.067950  0.004768  -0.286768      1.000000
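The redundancy in the table above can be checked directly. The following is a minimal sketch on synthetic data (the column names `w`, `x`, `y`, `z` and the random seed are illustrative assumptions, not part of the Haberman dataset): the correlation matrix is symmetric with a unit diagonal, so only n(n-1)/2 of its n*n cells carry distinct information.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names and seed are hypothetical.
rng = np.random.default_rng(1)
frame = pd.DataFrame(rng.normal(size=(30, 4)), columns=list("wxyz"))
corr = frame.corr()

values = corr.to_numpy()
n = values.shape[0]

# The matrix mirrors itself across the diagonal ...
is_symmetric = np.allclose(values, values.T)
# ... and every column correlates perfectly with itself.
unit_diagonal = np.allclose(np.diag(values), 1.0)
# Number of distinct off-diagonal pairs.
unique_pairs = n * (n - 1) // 2

print(is_symmetric, unit_diagonal, n * n, unique_pairs)
```

For four columns this leaves 6 informative values out of 16 cells, which matches the six rows in the expected output.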

Expected Output

     level_0       level_1   pearson
0  pos_nodes  5_yr_outcome -0.286768
1        Age       op_year  0.089529
2        Age  5_yr_outcome -0.067950
3        Age     pos_nodes -0.063176
4    op_year  5_yr_outcome  0.004768
5    op_year     pos_nodes -0.003764
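An output of this shape can also be produced without a temporary sort column. This is only a sketch under stated assumptions: it relies on the `key` argument of `sort_values`, which requires pandas 1.1 or newer (newer than the 0.22.0 reported in this issue), and it uses synthetic data with hypothetical column names `a` to `d` so that it runs offline.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names and seed are hypothetical.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])

corr = frame.corr(method="pearson")

# Boolean mask of the strict upper triangle (k=1 excludes the diagonal).
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Keep only the upper triangle, flatten to long form, drop the masked cells.
pairs = corr.where(upper).stack().dropna().reset_index()
pairs.columns = ["level_0", "level_1", "pearson"]

# Sort by absolute correlation, strongest first, without a helper column.
pairs = pairs.sort_values(
    "pearson", key=lambda s: s.abs(), ascending=False
).reset_index(drop=True)

print(pairs)
```

Using `where` with an upper-triangle mask is equivalent to the `mask` call with a lower-triangle mask in the function above; which one reads better is a matter of taste.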

Output of `pd.show_versions()`

/usr/local/lib/python3.6/dist-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: . """)

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.79+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.10.1
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.14.6
scipy: 1.1.0
pyarrow: None
xarray: 0.11.2
IPython: 5.5.0
sphinx: 1.8.3
patsy: 0.5.1
dateutil: 2.5.3
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 2.1.2
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: 0.2.0
fastparquet: None
pandas_gbq: 0.4.1
pandas_datareader: 0.7.0

Comment From: mroeschke

This sounds like more of a usage question. We recommend using StackOverflow for these types of questions.

Issues are reserved for bug tracking and enhancement requests.

Comment From: chrisluedtke

@mroeschke I want to make an enhancement. Is this a feature that would be useful?

Comment From: mroeschke

Oh I see. I'll open it back up for discussion. I could see this as a useful cookbook example in our documentation; we are somewhat hesitant to expand pandas' already large API without substantial interest from the community.

Comment From: jbrockmendel

I agree with @mroeschke; I have used this pattern before myself and don't need it implemented in pandas.
