Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from argparse import ArgumentParser
from hashlib import md5
import json
import numpy as np
import pandas as pd
import random


if __name__ == '__main__':

    parser = ArgumentParser(description='Test reproducibility of pandas random_state')
    parser.add_argument('--seed', type=int, default=1234, help='seed for numpy, random, and pandas')
    parser.add_argument('--nrows', type=int, default=2500, help='number of rows in randomly generated data')
    parser.add_argument('--subset', action='store_true', help='filter rows before grouping')
    parser.add_argument('--applied_col', action='store_true', help='run groupby on a new apply column')
    args = parser.parse_args()

    # Generate consistent, random data
    np.random.seed(args.seed)
    random.seed(args.seed)
    df = pd.DataFrame({
        'subset': random.choices(['train', 'test', 'val'], k=args.nrows),
        'text': [''.join([str(i) for i in random.choices(['A', 'B'], k=3)]) for _ in range(args.nrows)],
        'value': np.random.random(args.nrows)})
    df_bytes = json.dumps(df.to_dict(), sort_keys=True, indent=2).encode()
    md5sum = md5(df_bytes).hexdigest()
    print(f'Hash of `df` content = {md5sum}')

    # Testing various setups to figure out which piece causes the bug
    if args.subset:
        df = df[df.subset == 'test'].copy()
    if args.applied_col:
        df['chars'] = df.text.apply(lambda text: tuple(set(sorted(text))))  # Sample requires hashable col
        groupby_col = 'chars'
    else:
        groupby_col = 'text'

    # Perform sampling and inspect results
    sampled = df.groupby(groupby_col).sample(frac=0.1, random_state=args.seed)
    idxs = sampled.index.tolist()
    print(f'Sampled index: sum {sum(idxs):,} | first {idxs[:3]} | last {idxs[-3:]}')

Issue Description

NOTE: you must run the code as a script so that a separate interpreter process is invoked each time. Re-running within the same Python shell will not reproduce the bug.

Running the script with no arguments produces consistent results. This demonstrates that my randomly generated dataframe is consistent and the random_state is correctly applied. Example results from my machine (I get identical results for 10+ consecutive runs):

$ python demo.py
Hash of `df` content = 198a37fcc034748378b0b4b1c1c89213
Sampled index: sum 319,444 | first [631, 33, 785] | last [2294, 1092, 1984]

However, if I add the --applied_col flag, the results now alternate between two different sets of sampled indices. The dataframe's MD5 sum is still the same, which proves we're starting with the same content as before. Example results from my machine below (note you may get several matching runs in a row, but eventually another set of indices will show up):

$ python demo.py --applied_col
Hash of `df` content = 198a37fcc034748378b0b4b1c1c89213
Sampled index: sum 318,327 | first [631, 33, 785] | last [749, 2412, 313]

$ python demo.py --applied_col
Hash of `df` content = 198a37fcc034748378b0b4b1c1c89213
Sampled index: sum 292,969 | first [631, 33, 785] | last [567, 1565, 2190]

Expected Behavior

Changing the groupby_col should not affect the reproducibility of random_state across interpreter invocations. Perhaps there is something unsupported about the way I use df.apply() to generate the chars column?

For those curious about the point of a chars column: in my real dataframe, text has more characters than just {A, B} and I want to sample a representative subset covering the common character combinations (see the sketch after this list).

  • set() provides the unique characters of each row

  • tuple() conversion allows it to be hashable for .sample()

  • sorted() ensures AB and BA get placed in the same group
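As an illustration of that intent (my addition, not from the original report), the expression from the script above should collapse every text containing the same characters onto one hashable group key:

# Hypothetical illustration of the intended group keys (not from the report):
# 'AB', 'BA', and 'ABB' should all land in one group, 'BBB' in its own.
for text in ('AB', 'BA', 'ABB', 'BBB'):
    print(text, '->', tuple(set(sorted(text))))  # expression used in the script above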

Installed Versions

INSTALLED VERSIONS
------------------
commit : 91111fd99898d9dcaa6bf6bedb662db4108da6e6
python : 3.9.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.15-76051515-generic
Version : #202201160435~1642693824~20.04~97db1bb~dev-Ubuntu SMP Fri Jan 21
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.5.1
numpy : 1.23.3
pytz : 2021.3
dateutil : 2.8.1
setuptools : 59.5.0
pip : 22.0.4
Cython : 0.29.30
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : None
IPython : 8.2.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.02.0
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : 1.3.18
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Comment From: addisonklinke

TL;DR: not a pandas issue, but there are some interesting lessons to learn.

Perhaps there is something unsupported about the way I use df.apply() to generate the chars column?

Figured out the root issue here: converting to set() after sorting undoes the sort, since sets are unordered. The iteration order of a set of strings depends on Python's per-process hash randomization (PYTHONHASHSEED), so BA can evaluate to either ('A', 'B') or ('B', 'A') depending on the interpreter invocation, which is also why the bug only shows up across separate processes and never within the same shell. Run this code to prove it:

for _ in range(100):
    chars = tuple(set(sorted('BA')))
    if chars != ('A', 'B'):
        raise AssertionError(f'Unexpected sorting result: {chars}')

Because the ordering is fixed within a single process, every iteration of the loop gives the same tuple, so a given run either passes or raises immediately; run the snippet as a script a few times and eventually you'll see AssertionError: Unexpected sorting result: ('B', 'A'). That also explains why there are only ever two distinct sets of sampled indices. The fix is simply to swap the order, converting to set() first and then sorting:

df['chars'] = df.text.apply(lambda text: tuple(sorted(set(text))))
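To make the per-process behaviour concrete, here is a minimal sketch (my addition, assuming CPython's default string hash randomization; not part of the original thread) that evaluates both the buggy and the fixed expressions in fresh interpreters pinned to different PYTHONHASHSEED values:

# Minimal sketch (assumption: CPython str hash randomization): compare the
# buggy and fixed expressions across fresh interpreters with different seeds.
import os
import subprocess
import sys

snippet = "print(tuple(set(sorted('BA'))), tuple(sorted(set('BA'))))"
for seed in ('1', '2', '3', '4', '5'):
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run([sys.executable, '-c', snippet],
                            env=env, capture_output=True, text=True)
    print(f'PYTHONHASHSEED={seed}: {result.stdout.strip()}')

# The first tuple may flip between ('A', 'B') and ('B', 'A') from seed to seed,
# while the second is always ('A', 'B'), so the fixed group keys (and therefore
# the sampled indices) are stable across interpreter invocations.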