Hello,

Problem description

When we create a data frame with pandas ≤ 0.19.2 and pickle it (using pickle.dump), it is not possible to unpickle it using pandas 0.20.1.

# Using pandas 0.19.2
import pandas as pd
import pickle as pkl
data = pd.DataFrame({'x': [1, 2]})
pkl.dump(data, open("data_pd_0.19.2.pkl", "wb"))
# After upgrade to pandas 0.20.1
import pandas as pd
import pickle as pkl
pkl.load(open("data_pd_0.19.2.pkl", "rb"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pandas.indexes'

First analysis

  • It seems that pandas.indexes has been refactored to pandas.core.indexes.
  • I don't know if there are other such incompatible changes

Proposal

It would be great to have:

  • A deprecation warning when unpickling an old data frame
  • Support for loading old data frames, with automatic conversion to the new format, so that we can upgrade by re-pickling the unpickled data frames

Thanks a lot for your help, Best regards.

Comment From: jreback

The big red box in the documentation makes clear that pd.read_pickle is the pickle reader and keeps things backward compatible. Furthermore, the whatsnew notes have quite a large section on what changed here.

Sure, a direct call to pickle.loads will work, but this is not guaranteed across versions.

Comment From: matjazk

Going from pandas 0.18.1 to 0.20.1, I encountered the same problem when loading with joblib. joblib.load fails with exactly the same error: ImportError: No module named 'pandas.indexes'

When you fix this (see the first workaround), there is an error AttributeError: module 'pandas.core.base' has no attribute 'FrozenNDArray'

After workaround 2, files load. It seems that in my case this is more of a question for joblib devs.

Two (ugly) workarounds:

import sys
# 1: alias the old module path to its new location
import pandas.core.indexes
sys.modules['pandas.indexes'] = pandas.core.indexes
# 2: expose FrozenNDArray again under its old home, pandas.core.base
import pandas.core.base, pandas.core.indexes.frozen
setattr(sys.modules['pandas.core.base'], 'FrozenNDArray', pandas.core.indexes.frozen.FrozenNDArray)
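
A less invasive variant of the same idea, for cases where you control the unpickling call yourself, is to leave sys.modules alone and redirect names inside a pickle.Unpickler subclass. The following is only a sketch: the class and function names (RelocatingUnpickler, loads_relocated) are hypothetical, the two relocation maps contain only the moves reported in this thread, and it does not help when another library such as joblib constructs its own unpickler.

```python
import io
import pickle


class RelocatingUnpickler(pickle.Unpickler):
    """Unpickler that redirects moved modules and classes before importing them."""

    def __init__(self, file, module_map=None, class_map=None, **kwargs):
        super().__init__(file, **kwargs)
        self._module_map = dict(module_map or {})
        self._class_map = dict(class_map or {})

    def find_class(self, module, name):
        # Exact (module, name) relocations take priority over module prefixes.
        if (module, name) in self._class_map:
            module, name = self._class_map[(module, name)]
        else:
            for old, new in self._module_map.items():
                if module == old or module.startswith(old + "."):
                    module = new + module[len(old):]
                    break
        return super().find_class(module, name)


def loads_relocated(data, module_map=None, class_map=None):
    """Unpickle a bytes object, applying the given relocations."""
    return RelocatingUnpickler(io.BytesIO(data), module_map, class_map).load()


# Relocations reported in this thread (pandas 0.19.x -> 0.20.x); others may exist.
PANDAS_MODULE_MAP = {"pandas.indexes": "pandas.core.indexes"}
PANDAS_CLASS_MAP = {
    ("pandas.core.base", "FrozenNDArray"): ("pandas.core.indexes.frozen", "FrozenNDArray"),
}
```

With those maps, loads_relocated(raw_bytes, PANDAS_MODULE_MAP, PANDAS_CLASS_MAP) would resolve the two names that failed above; any other relocated name would still raise and would need its own entry.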

Comment From: jreback

see the above and simply use pd.read_pickle

Comment From: matjazk

I would if I could. But... I have a complex class (consisting of numpy objects, pandas series and dataframes, dictionaries ...), stored in a compressed joblib archive, so pd.read_pickle is of no use to me. As I said, this might be useful for joblib developers as for now it is impossible to load any joblib archive created when pandas < 0.20. I first had to downgrade pandas and now I'm using the above workarounds.

Comment From: jorisvandenbossche

@matjazk Would you like to open an issue at joblib for this?

Comment From: matjazk

Already did, and passed along @jreback's suggestion.

Comment From: mhooreman

Thanks. pd.read_pickle works but, for your information, it is extremely slow (see the benchmark). I've made a script that runs pd.read_pickle and then pd.to_pickle on every file.
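
For readers who want to do the same one-time conversion, a minimal sketch of such a script could look like the following. This is not the original script: the function name convert_pickles, the "*.pkl" pattern and the choice to write into a separate output directory (rather than overwrite the old files) are all assumptions.

```python
from pathlib import Path

import pandas as pd


def convert_pickles(src_dir, dst_dir, pattern="*.pkl"):
    """Re-pickle every matching file from src_dir into dst_dir.

    Each file is read with pd.read_pickle (the compatibility-aware reader)
    and written back with pd.to_pickle using the installed pandas version.
    Returns the list of files written.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted = []
    for path in sorted(src.glob(pattern)):
        obj = pd.read_pickle(str(path))
        target = dst / path.name
        pd.to_pickle(obj, str(target))
        converted.append(target)
    return converted
```

Writing to a second directory keeps the old pickles available until the converted ones have been checked.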

Comment From: jorisvandenbossche

@mhooreman the timings of "reading old" look suspiciously consistent with "writing". Are you sure you timed the correct thing?

Comment From: mhooreman

I need to double check, but I'm sure about the performance difference while reading: I got performance issues and I converted to fix those.

Comment From: jreback

@mhooreman of course it's slower. It's falling back to the Python-based unpickler, which is much more flexible. So you can have either speed or correctness. You get to choose.

Comment From: TheodoreZhao

I got the same problem when unpickling the data in pandas 0.20.2. I have used df.to_pickle() to pickle my dataframe in pandas 0.19.2 but failed to unpickle it using pandas.read_pickle() in pandas 0.20.2. I got the error message

ImportError: No module named 'pandas.indexes'

pandas.read_pickle() and pickle.load() both generate this error message.

Comment From: jorisvandenbossche

@TheodoreZhao If you have this error with read_pickle as well, please open a new issue with a reproducible example.

Comment From: seandickert

@jreback similar to @matjazk, pd.read_pickle doesn't work if you're using pickle.loads to load a string (retrieved from some store other than the filesystem). Can pd.read_pickle be updated to handle a file-like object rather than just a path?

Comment From: jreback

It's an open issue: https://github.com/pandas-dev/pandas/issues/5924

If you want to submit a PR to do this, it's not difficult.
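
Until pd.read_pickle accepts file-like objects, one possible workaround for pickled bytes fetched from a non-filesystem store is to spill them to a temporary file and pass that path to pd.read_pickle. This is only a sketch under that assumption; the helper name read_pickle_bytes is hypothetical, and it costs an extra disk write per load.

```python
import os
import tempfile

import pandas as pd


def read_pickle_bytes(data):
    """Load pickled bytes through pd.read_pickle via a temporary file."""
    fd, path = tempfile.mkstemp(suffix=".pkl")
    try:
        # Write the raw pickle bytes so read_pickle can open them by path.
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        return pd.read_pickle(path)
    finally:
        # Always clean up the temporary file, even if loading fails.
        os.remove(path)
```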