Pandas ENH: Please add immutable/hashable FrozenDataFrame

Is your feature request related to a problem?

I would like to use (algorithm, DataFrame) tuples as dict keys to record results of various experiments on various datasets, but while we can make immutable algorithms, we cannot make immutable DataFrames.

There are a bajillion benefits we miss by not enforcing immutability. For example, in parallelism we need to pickle data to share it, which requires hashing. Caching often requires hash comparison to be performant (functools.lru_cache). Dict keys must be hashable. Programs built on immutable data are easier to reason about. The code can be simpler, which allows faster, easier development. Many bugs can be prevented, and so on.

Describe the solution you'd like

It's a big ask, but a FrozenDataFrame class (or whatever name the maintainers like for it) would be handy because it could be hashable, and updates could return a new FrozenDataFrame with pointers to the updated data. Zero-copy operations would be preferred since they would be faster and save memory.

API breaking implications

FrozenDataFrame can be a subclass of DataFrame and thus this would be an addition, so there wouldn't be any requirement to change previous stuff. Perhaps the subclass can override methods which mutate data, with new methods which raise helpful error messages, or return new pointers to new FrozenDataFrame instances if possible.

Perhaps down the road if the API were stabilized and well tested, this new subclass could replace the pd.DataFrame in a major version change, if the maintainers wanted.

Describe alternatives you've considered

There are hacks around this, but it seems like something the pandas community would use. It seems the pandas community already favors immutability in a lot of places. immutables.Map can hold google/jax arrays, but this loses a LOT of useful features.

Additional context

algo: str = 'string representation of an algorithm'
candles: pd.DataFrame = get_candles("SPY", 5, "min")

key: tuple = (algo, candles)
value: float = backtest(algo, candles)
results: immutables.Map = immutables.Map()

results_with_backtest = R.assoc(key, value, results)

~/miniconda3/envs/py38/lib/python3.8/site-packages/pandas/core/generic.py in __hash__(self)
   1666
   1667     def __hash__(self):
-> 1668         raise TypeError(
   1669             f"{repr(type(self).__name__)} objects are mutable, "
   1670             f"thus they cannot be hashed"

TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed

Comment From: phofl

This was briefly discussed in #16567

Comment From: jreback

you can already hash data frames: https://pandas.pydata.org/docs/reference/api/pandas.util.hash_pandas_object.html?highlight=hash#pandas.util.hash_pandas_object

this is used in other libraries as well (dask)
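For readers who have not used the utility linked above, here is a minimal sketch of how it might be applied. Note that `pandas.util.hash_pandas_object` returns a Series of uint64 hashes (one per row), not a single integer. Collapsing those into one digest with `hashlib` is an assumption made for illustration, and `frame_digest` is a hypothetical helper name, not part of pandas.

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


def frame_digest(df: pd.DataFrame) -> str:
    """Hypothetical helper: collapse the per-row hashes into one hex digest."""
    # One uint64 hash per row; index=True folds the index into each row hash.
    row_hashes = hash_pandas_object(df, index=True)
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()


df1 = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df2 = df1.copy()

# Equal content gives equal digests, even for distinct objects.
same = frame_digest(df1) == frame_digest(df2)
```

The digest reflects the values and the index at the moment of the call. As far as I know the column labels are not part of the row hashes, so two frames with the same data under different column names may produce the same digest.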

Comment From: mroeschke

Thanks for the suggestion. Given that we provide hashing functionality and there's already an active library with this functionality https://github.com/InvestmentSystems/static-frame, I would also be -1 for adding this directly to pandas.

Closing, but happy to reopen if the team reconsiders this enhancement.

Comment From: alexlenail

I wasn't able to use @lru_cache with a function which takes dataframes as arguments and found this issue.

What's the reasoning for having a hash function at pandas.util.hash_pandas_object but not df.__hash__() (which would work natively with std lib things like @lru_cache)?

Comment From: TomAugspurger

__hash__ requires that the objects be immutable. If we implemented that on a mutable data structure, all sorts of things would break (including lru_cache: if the dataframe were mutated you'd potentially get incorrect results).
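The failure mode described here can be illustrated without pandas. The sketch below uses a hypothetical `HashableBox` class (not a pandas type) that is mutable but hashes by its contents. Once it has been used as a dict key and then mutated, the dict can no longer find it.

```python
class HashableBox:
    """Hypothetical mutable container that (unwisely) hashes by its contents."""

    def __init__(self, values):
        self.values = list(values)

    def __hash__(self):
        return hash(tuple(self.values))

    def __eq__(self, other):
        return isinstance(other, HashableBox) and self.values == other.values


box = HashableBox([1, 2, 3])
results = {box: "cached result"}

# Mutate the object after it has been used as a key.
box.values[0] = 99

# The dict stored the old hash, so a lookup with the new hash misses,
# even though the very same object is still sitting in the dict.
found_after_mutation = box in results
```

A cache keyed on such an object has the same problem in the other direction: if the hash did not follow the contents, a lookup could hit an entry computed from data that has since changed.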

Comment From: alexlenail

Thanks @TomAugspurger. Is there a way to use @lru_cache with a function that operates on dataframes?

Comment From: TomAugspurger

No, not directly. You would need to somehow transform the dataframe into something that is hashable before it got to the cache (maybe with another decorator that calls pandas’ hashing methods? I’m not sure).
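One way the "another decorator" idea could look is sketched below. This is only an illustration under stated assumptions, not an approach endorsed in this thread: `_FrameKey` and `cache_on_frame` are hypothetical names, the wrapped function is assumed to take a single DataFrame argument, and the key is a SHA-256 digest of the output of `pandas.util.hash_pandas_object`.

```python
import functools
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


class _FrameKey:
    """Hypothetical wrapper: hashes and compares a DataFrame by a content digest."""

    def __init__(self, df: pd.DataFrame):
        self.df = df
        row_hashes = hash_pandas_object(df, index=True)
        self.digest = hashlib.sha256(row_hashes.values.tobytes()).hexdigest()

    def __hash__(self) -> int:
        return hash(self.digest)

    def __eq__(self, other) -> bool:
        return isinstance(other, _FrameKey) and self.digest == other.digest


def cache_on_frame(func):
    """Hypothetical decorator: lru_cache keyed on the DataFrame's content digest."""

    @functools.lru_cache(maxsize=32)
    def cached(key: _FrameKey):
        return func(key.df)

    @functools.wraps(func)
    def wrapper(df: pd.DataFrame):
        # The digest is computed at call time, so a mutated frame gets a new key.
        return cached(_FrameKey(df))

    wrapper.cache_info = cached.cache_info
    return wrapper


calls = []


@cache_on_frame
def column_total(df: pd.DataFrame) -> int:
    calls.append(1)  # record each real computation
    return int(df["a"].sum())


df = pd.DataFrame({"a": [1, 2, 3]})
first = column_total(df)
second = column_total(df.copy())  # equal content, so served from the cache
```

Because the digest is recomputed on every call, mutating the frame between calls leads to a cache miss rather than a stale result. The trade-off is that every call pays for hashing the whole frame, and the cache keeps a reference to each frame it has seen, so this only helps when the wrapped function is considerably more expensive than the hash.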
