Pandas ENH: pd.select_quantiles() -- Select those rows that are at a given quantile

Feature Type

[X] Adding new functionality to pandas
[ ] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas

Problem Description

I often find myself in the situation that I not only want to know the minimum and the maximum of a given column, but the row(s) that are at that value. For example, to form sentences that read like "We achieve up to XXX% Improvement (at benchmark foobar)". As I could not directly find an easy solution for this in pandas, I write a small function that achieves that in a configurable manner.

Feature Description

def select_quantiles(df, q=[0.0, 0.5, 1.0], columns=None, compress=True,
                     q_col='q', q_labels=False, column_col='column',
                     value_col='value', cardinality_col='cardinality',
                     index_cols=True, value_cols=True):
    """Select the rows that are at the given quantile (e.g, min, median, max) of the resepective column.

    df              : DataFrame or Series to inspect
    q               : Which quantile rows to report
    q_labels        : Generate human readable labels instead of q values
    columns         : For which columns are the quantiles searched?
    compress        : If multiple rows are at a given quantile, select the first (random) and report the cardinality.
    q_col           : column name for the selected quantile
    column_col      : column name for the inspected column
    value_col       : column name for the quantile value
    cardinality_col : column name for the cardinality
    index_cols      : Are the index columns included in the result?
    value_cols      : Are the other columns included in the result?


    >>> df
                                     uniform   genetic  original_mse
    benchmark icache cachesize CPs
    adpcm_c   False  2         2    0.499697  0.499801  5.084280e+08
                               3    0.666426  0.666493  5.084280e+08
                               4    0.749660  0.749781  5.084280e+08
                               5    0.799851  0.799958  5.084280e+08
                               6    0.832983  0.833108  5.084280e+08
    ...                                  ...       ...           ...
    typeset   True   64        12   0.946788  0.983514  4.774040e+09
                               13   0.955670  0.985238  4.774040e+09
                               14   0.953097  0.986249  4.774040e+09
                               15   0.956851  0.987022  4.774040e+09
                               16   0.962734  0.987791  4.774040e+09

    >>> select_quantiles(df, q=[0,0.25,0.5,0.75,1.0], columns=['uniform', 'genetic'], value_cols=False)
                  cardinality     value   benchmark  icache  cachesize  CPs
    column  q
    uniform 0.00          130  0.000000  rijndael_e   False         64   11
            0.25            1  0.661176      jpeg_d    True          2    3
            0.50            1  0.828055     bitcnts   False          4   16
            0.75            1  0.898309     adpcm_c    True          8   15
            1.00            1  0.973457       pgp_e   False         64   15
    genetic 0.00            1  0.329892      jpeg_d   False         64    2
            0.25            1  0.856994   basicmath   False          8    9
            0.50            1  0.922826      jpeg_d    True          2   13
            0.75            1  0.969853       qsort   False         64   11
            1.00            1  0.999749     bitcnts    True          8   16
    """
    if type(df) is pd.Series:
        df = pd.DataFrame(data=df)
        # For a series it makes no sense to also include the values as
        # this is already in the value_col
        value_cols = False

    # Get those rows from df that are at the quantile q in the column col.
    def get_rows(col, q):
        idx = int(q * (len(df)-1))
        val = df.iloc[idx][col]
        rows = df[df[col] == val].copy()
        if compress:
            cardinality = len(rows)
            rows = rows.iloc[:1]

        index_cols = rows.reset_index()

        r = pd.DataFrame(data=index_cols)
        if q_labels:
            if   q == 0:   label = 'min'
            elif q == 0.5: label = 'median'
            elif q == 1.0: label = 'max'
            else:          label = f'{int(q*100)}%'
            r[q_col] = label
        else:
            r[q_col]      = q
        r[column_col] = col
        r[value_col]  = val

        if compress:
            r[cardinality_col] = cardinality
        return r

    # Get the rows for every given quantile
    ret = []
    columns = columns or df.columns
    for col in columns:
        df = df.sort_values(col)
        for _q in q:
            ret += [get_rows(col, _q)]

    # Concatenate the individual rows
    ret = pd.concat(ret).set_index([column_col, q_col])

    # Select those columns, the user requested
    ordered = [value_col]
    if index_cols:
        if  isinstance(index_cols, Iterable):
            ordered += list(index_cols)
        else:
            ordered += list(df.index.names)
    if value_cols:
        if  isinstance(value_cols, Iterable):
            ordered += list(value_cols)
        else:
            ordered += list(df.columns)
    if compress:
        ordered = [cardinality_col] + ordered

    return ret[ordered]

Alternative Solutions

Hint me to the functionality in Pandas that achieves my goal easily

Additional Context

No response

Comment From: jreback

-1 on adding api

not sure it's worth support for functions like this that are composable

sure maybe these should exist in an addon package but note core pandas

Comment From: MarcoGorelli

thanks for the suggestion

agree that this doesn't need to be in pandas - closing for now then