Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
from argparse import ArgumentParser
from hashlib import md5
import json
import numpy as np
import pandas as pd
import random

if __name__ == '__main__':
    parser = ArgumentParser(description='Test reproducibility of pandas random_state')
    parser.add_argument('--seed', type=int, default=1234, help='seed for numpy, random, and pandas')
    parser.add_argument('--nrows', type=int, default=2500, help='number of rows in randomly generated data')
    parser.add_argument('--subset', action='store_true', help='filter rows before grouping')
    parser.add_argument('--applied_col', action='store_true', help='run groupby on a new apply column')
    args = parser.parse_args()

    # Generate consistent, random data
    np.random.seed(args.seed)
    random.seed(args.seed)
    df = pd.DataFrame({
        'subset': random.choices(['train', 'test', 'val'], k=args.nrows),
        'text': [''.join([str(i) for i in random.choices(['A', 'B'], k=3)]) for _ in range(args.nrows)],
        'value': np.random.random(args.nrows)})
    df_bytes = json.dumps(df.to_dict(), sort_keys=True, indent=2).encode()
    md5sum = md5(df_bytes).hexdigest()
    print(f'Hash of `df` content = {md5sum}')

    # Testing various setups to figure out which piece causes the bug
    if args.subset:
        df = df[df.subset == 'test'].copy()
    if args.applied_col:
        df['chars'] = df.text.apply(lambda text: tuple(set(sorted(text))))  # Sample requires hashable col
        groupby_col = 'chars'
    else:
        groupby_col = 'text'

    # Perform sampling and inspect results
    sampled = df.groupby(groupby_col).sample(frac=0.1, random_state=args.seed)
    idxs = sampled.index.tolist()
    print(f'Sampled index: sum {sum(idxs):,} | first {idxs[:3]} | last {idxs[-3:]}')
Issue Description
NOTE: you must run the code as a script so that a separate process is invoked each time. Re-running within the same Python shell will not reproduce the bug.
Running the script with no arguments produces consistent results. This demonstrates that my randomly generated dataframe is consistent and the random_state is correctly applied. Example results from my machine (I can get identical results for 10+ consecutive runs):
$ python demo.py
Hash of `df` content = 198a37fcc034748378b0b4b1c1c89213
Sampled index: sum 319,444 | first [631, 33, 785] | last [2294, 1092, 1984]
However, if I add the --applied_col flag, the results now alternate between two different sets of sampled indices. The dataframe MD5 sum is still the same, which proves we're starting with the same content as before. Example results from my machine below (note you may get several matching runs in a row, but eventually another set of indices will show up):
$ python demo.py --applied_col
Hash of `df` content = 198a37fcc034748378b0b4b1c1c89213
Sampled index: sum 318,327 | first [631, 33, 785] | last [749, 2412, 313]
$ python demo.py --applied_col
Hash of `df` content = 198a37fcc034748378b0b4b1c1c89213
Sampled index: sum 292,969 | first [631, 33, 785] | last [567, 1565, 2190]
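Since the differing result only shows up across separate interpreter processes, checking by hand means re-running the command many times. A minimal sketch of a helper that automates this is below; the name tally_outputs is hypothetical, and the usage comment assumes the reproducer is saved as demo.py in the current directory.

```python
import subprocess
import sys
from collections import Counter


def tally_outputs(cmd, runs=10):
    """Run `cmd` in a fresh process `runs` times and count each distinct stdout."""
    counts = Counter()
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        counts[result.stdout.strip()] += 1
    return counts


# Hypothetical usage, assuming the reproducer is saved as demo.py:
#   tally_outputs([sys.executable, 'demo.py', '--applied_col'], runs=20)
```

A tally with more than one key means the output is not reproducible across processes; a single key over many runs suggests, but does not prove, that it is.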
Expected Behavior
Changing the groupby_col should not affect the reproducibility of random_state across invoked interpreters. Perhaps there is something unsupported about the way I use df.apply() to generate the chars column?
For those curious about the point of a chars column: in my real dataframe, text has more characters than just {A, B}, and I want to sample a representative subset with the common character combinations.
* set() provides the unique characters of each row
* tuple() conversion allows it to be hashable for .sample()
* sorted() ensures AB and BA get placed in the same group
Installed Versions
Comment From: addisonklinke
TLDR not a Pandas issue, but there are some interesting lessons to learn
Perhaps there is something unsupported about the way I use df.apply() to generate the chars column?
Figured out the root issue here: converting to set() second undoes the sorting, since sets are unordered. This means BA could randomly evaluate to either (A, B) or (B, A). Run this code to prove it:
for _ in range(100):
    chars = tuple(set(sorted('BA')))
    if chars != ('A', 'B'):
        raise AssertionError(f'Unexpected sorting result: {chars}')
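Run it as a script enough times and you'll see AssertionError: Unexpected sorting result: ('B', 'A'). The order is fixed within a single interpreter process, because string hashing is randomized once per process unless PYTHONHASHSEED is set, so in any given run the loop either fails on its first iteration or never fails. That also explains why there are only ever two distinct sets of sample indices. The fix is simply changing the order: convert to a set first, and then sort.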
df['chars'] = df.text.apply(lambda text: tuple(sorted(set(text))))
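To check the fix across separate interpreters, here is a rough sketch that runs both orderings in fresh processes with different values of the PYTHONHASHSEED environment variable. It assumes CPython's per-process string hash randomization is what changes the set iteration order between runs; the helper name outputs_across_seeds and the seed range are arbitrary choices for illustration.

```python
import os
import subprocess
import sys

BUGGY = "print(tuple(set(sorted('BA'))))"   # set() applied last: order not guaranteed
FIXED = "print(tuple(sorted(set('BA'))))"   # sorted() applied last: order is stable


def outputs_across_seeds(snippet, seeds=range(10)):
    """Run `snippet` in a fresh interpreter per hash seed; return the distinct outputs."""
    seen = set()
    for seed in seeds:
        # PYTHONHASHSEED pins the string hash seed for the child process only
        env = dict(os.environ, PYTHONHASHSEED=str(seed))
        result = subprocess.run([sys.executable, '-c', snippet],
                                capture_output=True, text=True, check=True, env=env)
        seen.add(result.stdout.strip())
    return seen


buggy_results = outputs_across_seeds(BUGGY)
fixed_results = outputs_across_seeds(FIXED)
print('set last: ', buggy_results)
print('sort last:', fixed_results)
```

The sort-last version should only ever print ('A', 'B'), while the set-last version may print either ordering depending on the seed. Setting PYTHONHASHSEED when invoking the original script would likely also make it repeatable, but that hides the ordering bug rather than fixing it.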