Feature Type

[X] Adding new functionality to pandas
[X] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas

Problem Description

When working with a large, imbalanced dataset, I wanted to subsample the training data so that each class in the label column keeps its original proportion.

For example, if female accounts for 90% and male for 10% of a dataset of 100,000 rows, I would like pandas to provide an API that subsamples 10,000 rows with 90% female and 10% male.
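
To make the target concrete, the per-group row counts for such a sample can be worked out from the label proportions. This is only a sketch on synthetic data; the column name `gender` and the helper name `target_counts` are made up for illustration and are not part of pandas.

```python
import pandas as pd

def target_counts(labels: pd.Series, sample_num: int) -> pd.Series:
    """Rows to draw per label so the sample mirrors the label proportions."""
    proportions = labels.value_counts(normalize=True)
    # Rounding can make the total differ slightly from sample_num.
    return (proportions * sample_num).round().astype(int)

# Synthetic stand-in for the dataset described above (hypothetical column name).
df = pd.DataFrame({"gender": ["female"] * 90_000 + ["male"] * 10_000})
counts = target_counts(df["gender"], 10_000)
print(counts.to_dict())  # {'female': 9000, 'male': 1000}
```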

The similar functions I found in pandas only provide subsampling with groupby and a specified number (or fraction) of rows, as in issue #33777, which is slightly different from the API proposed above.

Feature Description

Proportionate stratified subsampling could be implemented as follows: first calculate the proportion of each group in df[column_name], then subsample each group according to that proportion, and return the concatenated dataframe.

import pandas as pd

def propotionate_sampling(df, column_name, sample_num):
    '''
    Proportionate stratified sampling: sample each group in proportion
    to its share of the rows in df[column_name].

    Parameters
    ----------
    df : pandas.DataFrame
        Input dataframe.
    column_name : str
        Column to group by.
    sample_num : int
        Approximate number of rows in the output sample.
    '''
    total_entry = float(len(df))
    # Rows to draw per group; int() truncates, so the total can be
    # slightly smaller than sample_num.
    sampling_num = {
        name: int(sample_num * (count / total_entry))
        for name, count in df[column_name].value_counts().items()
    }

    # Sample each group separately, then concatenate once.
    samples = [
        gb_df.sample(n=sampling_num[group])
        for group, gb_df in df.groupby(column_name)
    ]
    return pd.concat(samples) if samples else pd.DataFrame()

Alternative Solutions

None

Additional Context

240 and #33777 are both related to stratified sampling, but they do not discuss proportionate (or proportional) subsampling.

I've been working with imbalanced data this semester, and I found that proportionate stratified sampling serves me better when sampling from training data than stratified sampling with a fixed number of rows per group, because I want to preserve the class imbalance in my sample. So I think it would be useful for pandas to add this feature.

Reference1: https://www.investopedia.com/terms/stratified_random_sampling.asp

Reference2: https://towardsdatascience.com/stratified-sampling-and-how-to-perform-it-in-r-8b753efde1ef

Gist:

Comment From: rhshadrach

Thanks for the request. Can you clarify how this is different from:

df.groupby(column_name).sample(frac=sample_num/len(df))
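
For illustration, a quick check on synthetic data suggests the one-liner keeps the label proportions. The column names `gender` and `value` and the fixed `random_state` are assumptions added here so the sketch is reproducible.

```python
import pandas as pd

# Synthetic frame: 90% "female", 10% "male" (hypothetical column names).
df = pd.DataFrame({
    "gender": ["female"] * 900 + ["male"] * 100,
    "value": range(1000),
})
sample_num = 100

# Draw the same fraction from every group, so proportions carry over.
sample = df.groupby("gender").sample(frac=sample_num / len(df), random_state=0)
print(sample["gender"].value_counts().to_dict())  # {'female': 90, 'male': 10}
```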

Comment From: sandy273040

I've tested the two pieces of code and found that they yield almost the same result. So it would be better to use the code you provided than to add a new function to pandas. I hadn't found a way to do proportionate stratified sampling this concisely. Thank you for your response.
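
One plausible source of the small difference is how group sizes are computed: the function in the issue truncates each group's size with int(), so its total can fall short of sample_num, while the groupby one-liner may size groups differently. The sketch below prints both sets of sizes on synthetic data; the column name `label` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"label": ["a"] * 66 + ["b"] * 34})
sample_num = 10

# Per-group sizes as the issue's function computes them: int() truncates.
truncated = {
    name: int(sample_num * (count / float(len(df))))
    for name, count in df["label"].value_counts().items()
}

# Per-group sizes actually drawn by the groupby one-liner.
drawn = (
    df.groupby("label")
    .sample(frac=sample_num / len(df), random_state=0)["label"]
    .value_counts()
    .to_dict()
)

print("truncated:", truncated, "total:", sum(truncated.values()))
print("groupby.sample:", drawn, "total:", sum(drawn.values()))
```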

Pandas ENH: Proportionate stratified subsampling