Pandas Feature Request: Bootstrap sample from DataFrame with MultiIndex

DataFrame.sample draws individual rows by default. Often one wants to sample at a coarser level, for example when the top level of a MultiIndex identifies the sampling unit. In that case the number of rows per sampling unit typically varies.
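To make the distinction concrete, here is a minimal sketch of a row-level bootstrap on a toy frame. The frame `toy` and its level names `unit` and `obs` are made up for illustration. Because rows are drawn independently, the number of rows per unit in the resample generally does not match the original, so units are not kept intact.

```python
import pandas as pd

# Toy frame (illustrative only): the first index level, "unit", is the
# intended sampling unit, with 3, 2 and 1 rows per unit respectively.
toy = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]},
    index=pd.MultiIndex.from_arrays(
        [[1, 1, 1, 2, 2, 3], [0, 1, 2, 0, 1, 0]], names=["unit", "obs"]
    ),
)

# Row-level bootstrap: each row is drawn independently of its unit.
row_sample = toy.sample(n=len(toy), replace=True, random_state=0)

# Compare how many rows each unit contributes before and after resampling.
rows_per_unit_original = toy.groupby(level="unit").size()
rows_per_unit_sampled = row_sample.groupby(level="unit").size()
print(rows_per_unit_original)
print(rows_per_unit_sampled)
```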

Two promising Stack Overflow answers came close to my use case, but didn't quite fit:

The first answer assumes a known, deterministic pattern for the MultiIndex, whereas a general solution would allow arbitrary sub-indices. It selects rows by index values (labels), so it runs into issues when the index contains duplicates. https://stackoverflow.com/questions/34890207/sampling-from-multiindex-dataframe

This clever answer unstacks and re-stacks, but it runs into problems with duplicate indices, as well. https://stackoverflow.com/questions/38731858/how-to-get-a-random-bootstrap-sample-from-pandas-multiindex
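A small sketch of the duplicate-index problem, using a made-up frame `dup` (not taken from either answer): reshaping with unstack raises a ValueError when the index has duplicate entries, while positional selection with iloc is unaffected. That is the reason the work-around in this issue relies on positions rather than labels.

```python
import pandas as pd

# Illustrative frame whose MultiIndex contains the entry (1, 0) twice.
dup = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=pd.MultiIndex.from_arrays([[1, 1, 2], [0, 0, 1]], names=["unit", "obs"]),
)
has_duplicates = not dup.index.is_unique

# Unstacking needs unique index entries to build the reshaped frame.
try:
    dup.unstack("obs")
    unstack_failed = False
except ValueError as err:
    unstack_failed = True
    print(err)

# Positional selection does not care about duplicate labels.
first_two_rows = dup.iloc[[0, 1]]
```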

Below is the work-around that works for me. I would be grateful if anyone spots a bug. There could be fancier and faster methods, as well as more general methods that work for arbitrary level slicing. Would love to see this as a core part of the toolkit!

import itertools
import pandas as pd
import numpy as np

df = pd.DataFrame({'value1': [1.1, 2, 3, 4, 5, 6],
                   'value2': [7.1, 8, 9, 10, 11, 12]},
                  index=pd.MultiIndex.from_arrays(
                    [[1, 1, 1, 2, 2, 3],
                     [10, 10, 30, 50, 50, 60],
                     [0, 1, 2, 3, 3, 5]],
                    names=['group1', 'group2', 'group3']))

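# Group by the desired level and get the positional row indices for each group.
grouped = df.groupby(level=0)
index_groups = list(grouped.indices.values())
# Sample group positions with replacement. Passing the ragged list of index
# arrays to np.random.choice directly fails, because it expects a 1-D input.
sampled_positions = np.random.choice(len(index_groups), size=len(index_groups))
ig_sampled = [index_groups[i] for i in sampled_positions]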

# Flatten the index groups and select the rows.
new_indices = list(itertools.chain.from_iterable(ig_sampled))
df_sample = df.iloc[new_indices]

# Update index to have unique labels for each sample.
# Note: MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24.
levels_0 = range(len(ig_sampled))
codes_0 = np.repeat(levels_0, [len(c) for c in ig_sampled])
levels = [levels_0] + list(df_sample.index.levels[1:])
codes = [codes_0] + list(df_sample.index.codes[1:])
df_sample.index = pd.MultiIndex(levels=levels, codes=codes, names=df_sample.index.names)
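As a rough sketch of how the same idea might be wrapped up for an arbitrary level, the function below takes the level as a parameter. The name `bootstrap_by_level`, its parameters, and the extra `draw` index level are assumptions made for illustration, not an existing pandas API. Unlike the snippet above, it keeps the original level values and prepends a new level identifying each draw, instead of overwriting the top level.

```python
import numpy as np
import pandas as pd


def bootstrap_by_level(df, level=0, random_state=None):
    """Hypothetical helper: resample whole groups of `level` with replacement."""
    rng = np.random.default_rng(random_state)
    # Positional row indices of each group, so duplicate labels are harmless.
    groups = list(df.groupby(level=level).indices.values())
    # Draw as many groups as exist, with replacement, by group position.
    drawn = rng.integers(0, len(groups), size=len(groups))
    sampled = [groups[i] for i in drawn]
    out = df.iloc[np.concatenate(sampled)].copy()
    # Prepend a "draw" level so repeated draws of one group stay distinguishable.
    draw_id = np.repeat(np.arange(len(sampled)), [len(g) for g in sampled])
    old_levels = [out.index.get_level_values(i) for i in range(out.index.nlevels)]
    out.index = pd.MultiIndex.from_arrays(
        [draw_id] + old_levels, names=["draw"] + list(out.index.names)
    )
    return out


# Usage on a small frame with a variable number of rows per group.
example = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]},
    index=pd.MultiIndex.from_arrays(
        [[1, 1, 1, 2, 2, 3], [10, 10, 30, 50, 50, 60]], names=["group1", "group2"]
    ),
)
resampled = bootstrap_by_level(example, level="group1", random_state=0)
print(resampled)
```

Passing `random_state` makes a resample reproducible. The sketch assumes the frame has at least one group and does not handle an existing level that is already named `draw`.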

Comment From: mroeschke

Thanks for the suggestion, but given the lack of community or core dev interest this isn't likely to be implemented in pandas, so closing.