Hello,
Problem description
When we create a data frame with pandas ≤ 0.19.2 and pickle it (using pickle.dump), it is not possible to unpickle it using pandas 0.20.1.
# Using pandas 0.19.2
import pandas as pd
import pickle as pkl
data = pd.DataFrame({'x': [1, 2]})
pkl.dump(data, open("data_pd_0.19.2.pkl", "wb"))
# After upgrade to pandas 0.20.1
import pandas as pd
import pickle as pkl
pkl.load(open("data_pd_0.19.2.pkl", "rb"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pandas.indexes'
First analysis
- It seems that pandas.indexes has been refactored to pandas.core.indexes.
- I don't know if there are other such incompatible changes.
Proposal
It would be great to have:
- A deprecation warning when unpickling an old data frame
- Loading of old data frames supported, with automatic conversion to the new format, so that we can upgrade by re-pickling the unpickled data frames
Thanks a lot for your help, Best regards.
Comment From: jreback
The big red box is clear that pd.read_pickle is the pickle reader and makes things backward compatible. Furthermore, the whatsnew notes have a quite large section on what changed here.
Sure, a direct call to pickle.loads will work, but this is not guaranteed across versions.
Comment From: matjazk
Going from pandas 0.18.1 to 0.20.1, I encountered the same problem when loading with joblib. joblib.load fails with exactly the same error:
ImportError: No module named 'pandas.indexes'
Once you fix this (see the first workaround below), a second error appears:
AttributeError: module 'pandas.core.base' has no attribute 'FrozenNDArray'
After workaround 2, files load. It seems that in my case this is more of a question for joblib devs.
Two (ugly) workarounds:
import sys
# 1
import pandas.core.indexes
sys.modules['pandas.indexes'] = pandas.core.indexes
# 2
import pandas.core.base, pandas.core.indexes.frozen
setattr(sys.modules['pandas.core.base'],'FrozenNDArray', pandas.core.indexes.frozen.FrozenNDArray)
Comment From: jreback
see the above and simply use pd.read_pickle
Comment From: matjazk
I would if I could. But I have a complex class (consisting of numpy objects, pandas series and dataframes, dictionaries ...) stored in a compressed joblib archive, so pd.read_pickle is of no use to me. As I said, this might be useful for joblib developers, as for now it is impossible to load any joblib archive created with pandas < 0.20. I first had to downgrade pandas, and now I'm using the above workarounds.
Comment From: jorisvandenbossche
@matjazk Would you like to open an issue at joblib for this?
Comment From: matjazk
Already did, and passed along @jreback's suggestion.
Comment From: mhooreman
Thanks. pd.read_pickle works, but, for your information, it is extremely slow - see benchmark.
I've made a script to pd.read_pickle and then pd.to_pickle every file.
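A conversion script along those lines might look like the following sketch. It is not the original script: the function name resave_pickles, the "*.pkl" pattern, and the choice to overwrite files in place are all assumptions made for illustration.

```python
from pathlib import Path

import pandas as pd


def resave_pickles(directory, pattern="*.pkl"):
    """Re-write every matching pickle using the currently installed pandas.

    Hypothetical helper: the name, the glob pattern, and in-place
    overwriting are illustrative assumptions.
    """
    converted = []
    for path in sorted(Path(directory).glob(pattern)):
        # pd.read_pickle is the compatibility-aware reader discussed above.
        obj = pd.read_pickle(str(path))
        # Overwrites the original file; keep a backup if the data matters.
        pd.to_pickle(obj, str(path))
        converted.append(path)
    return converted
```

Calling resave_pickles("some_directory") once after the upgrade would leave files that were written by the new pandas version.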
Comment From: jorisvandenbossche
@mhooreman the timings of "reading old" look suspiciously consistent with "writing". Are you sure you timed the correct thing?
Comment From: mhooreman
I need to double check, but I'm sure about the performance difference while reading: I got performance issues and I converted to fix those.
Comment From: jreback
@mhooreman of course it's slower. It's falling back to the Python-based unpickler, which is much more flexible. So you can have either speed or correctness; you get to choose.
Comment From: TheodoreZhao
I got the same problem when unpickling the data in pandas 0.20.2. I used df.to_pickle() to pickle my dataframe in pandas 0.19.2 but failed to unpickle it using pandas.read_pickle() in pandas 0.20.2. I got the error message
ImportError: No module named 'pandas.indexes'
pandas.read_pickle() and pickle.load() both generate this error message.
Comment From: jorisvandenbossche
@TheodoreZhao If you have this error with read_pickle as well, please open a new issue with a reproducible example.
Comment From: seandickert
@jreback similar to @matjazk, pd.read_pickle doesn't work if you're using pickle.loads to load a string (retrieved from some store other than the filesystem). Can pd.read_pickle be updated to handle a file-like object rather than just a path?
Comment From: jreback
It's an open issue: https://github.com/pandas-dev/pandas/issues/5924
If you want to submit a PR to do this, it's not difficult.
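For the in-memory case raised above (pickled bytes fetched from a store rather than a file), one possible approach is a pickle.Unpickler subclass that overrides find_class to redirect old import paths, instead of patching sys.modules globally. This is a minimal sketch, not what pandas does internally: RenamingUnpickler and loads_renamed are hypothetical names, and the two rename tables contain only the relocations mentioned in this thread, so other moved classes may still fail to load.

```python
import io
import pickle

# Old module prefix -> new module prefix (from the first analysis above).
MODULE_RENAMES = {
    "pandas.indexes": "pandas.core.indexes",
}

# (old module, class name) -> (new module, class name), from workaround 2.
CLASS_RENAMES = {
    ("pandas.core.base", "FrozenNDArray"):
        ("pandas.core.indexes.frozen", "FrozenNDArray"),
}


class RenamingUnpickler(pickle.Unpickler):
    """Unpickler that rewrites module paths before importing them."""

    def __init__(self, file, module_renames=None, class_renames=None):
        super().__init__(file)
        self.module_renames = (
            MODULE_RENAMES if module_renames is None else module_renames
        )
        self.class_renames = (
            CLASS_RENAMES if class_renames is None else class_renames
        )

    def find_class(self, module, name):
        # First handle individually relocated classes.
        module, name = self.class_renames.get((module, name), (module, name))
        # Then handle whole packages that moved.
        for old, new in self.module_renames.items():
            if module == old or module.startswith(old + "."):
                module = new + module[len(old):]
                break
        return super().find_class(module, name)


def loads_renamed(data, **kwargs):
    """Like pickle.loads, but applying the rename tables above."""
    return RenamingUnpickler(io.BytesIO(data), **kwargs).load()
```

Because the remapping lives in the unpickler instance, it does not affect other code that imports pandas, unlike the sys.modules workaround. It does not help with joblib.load, which uses its own unpickler.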