Code Sample, a copy-pastable example if possible
import pandas as pd
df = pd.read_hdf('data.h5')
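For context, the write side that presumably produced data.h5 on Python 2.7 looked roughly like the sketch below; the frame contents here are illustrative, and a complete write/read reproduction appears later in the thread.
# Python 2.7: strings containing non-ASCII (UTF-8 encoded) bytes,
# written with the default fixed format
import pandas as pd
df = pd.DataFrame(['Instructor \xe2\x80\x93 Welding (Automotive)',
                   'Maintenance Tech'])
df.to_hdf('data.h5', 'data')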
Problem description
The HDF5 dataset was created with pandas to_hdf in Python 2.7 and can be read back in Python 2.7. When I try to read it with Python 3.5 or Python 3.6, I get the following:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-2-53006689fd2c> in <module>()
----> 1 df = pd.read_hdf('data.h5')
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_hdf(path_or_buf, key, **kwargs)
356 'contains multiple datasets.')
357 key = candidate_only_group._v_pathname
--> 358 return store.select(key, auto_close=auto_close, **kwargs)
359 except:
360 # if there is an error, close the store
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
720 chunksize=chunksize, auto_close=auto_close)
721
--> 722 return it.get_result()
723
724 def select_as_coordinates(
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in get_result(self, coordinates)
1426
1427 # directly return the result
-> 1428 results = self.func(self.start, self.stop, where)
1429 self.close()
1430 return results
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in func(_start, _stop, _where)
713 return s.read(start=_start, stop=_stop,
714 where=_where,
--> 715 columns=columns, **kwargs)
716
717 # create the iterator
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read(self, start, stop, **kwargs)
2864 blk_items = self.read_index('block%d_items' % i)
2865 values = self.read_array('block%d_values' % i,
-> 2866 start=_start, stop=_stop)
2867 blk = make_block(values,
2868 placement=items.get_indexer(blk_items))
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_array(self, key, start, stop)
2413 import tables
2414 node = getattr(self.group, key)
-> 2415 data = node[start:stop]
2416 attrs = node._v_attrs
2417
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in __getitem__(self, key)
673 start, stop, step = self._process_range(
674 key.start, key.stop, key.step)
--> 675 return self.read(start, stop, step)
676 # Try with a boolean or point selection
677 elif type(key) in (list, tuple) or isinstance(key, numpy.ndarray):
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in read(self, start, stop, step)
813 atom = self.atom
814 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 815 outlistarr = [atom.fromarray(arr) for arr in listarr]
816 else:
817 # Convert the list to the right flavor
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in <listcomp>(.0)
813 atom = self.atom
814 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 815 outlistarr = [atom.fromarray(arr) for arr in listarr]
816 else:
817 # Convert the list to the right flavor
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/atom.py in fromarray(self, array)
1226 if array.size == 0:
1227 return None
-> 1228 return six.moves.cPickle.loads(array.tostring())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 23: ordinal not in range(128)
Expected Output
In [1]: import pandas as pd
In [2]: df = pd.read_hdf('data.h5')
Output of pd.show_versions()
Comment From: gfyoung
@zoof : Thanks for reporting this. Strange that it's in this order and not vice-versa (support for unicode is much better in Python 3.x than in Python 2.x).
I see that you are using 0.20.1. Just for reference, can you try upgrading and see if that changes anything?
@jreback : I seem to recall a previous issue similar to this. Am I right about that or not?
Comment From: zoof
Updated to 0.20.3:
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.3
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
Comment From: jreback
Show what you wrote and how.
Comment From: zoof
Sorry, basically the same as before:
In [1]: import pandas as pd
In [2]: pd.read_hdf('data.h5')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-2-21d8820a6af9> in <module>()
----> 1 pd.read_hdf('data.h5')
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_hdf(path_or_buf, key, mode, **kwargs)
370 'contains multiple datasets.')
371 key = candidate_only_group._v_pathname
--> 372 return store.select(key, auto_close=auto_close, **kwargs)
373 except:
374 # if there is an error, close the store
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
740 chunksize=chunksize, auto_close=auto_close)
741
--> 742 return it.get_result()
743
744 def select_as_coordinates(
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in get_result(self, coordinates)
1447
1448 # directly return the result
-> 1449 results = self.func(self.start, self.stop, where)
1450 self.close()
1451 return results
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in func(_start, _stop, _where)
733 return s.read(start=_start, stop=_stop,
734 where=_where,
--> 735 columns=columns, **kwargs)
736
737 # create the iterator
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read(self, start, stop, **kwargs)
2885 blk_items = self.read_index('block%d_items' % i)
2886 values = self.read_array('block%d_values' % i,
-> 2887 start=_start, stop=_stop)
2888 blk = make_block(values,
2889 placement=items.get_indexer(blk_items))
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_array(self, key, start, stop)
2434 import tables
2435 node = getattr(self.group, key)
-> 2436 data = node[start:stop]
2437 attrs = node._v_attrs
2438
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in __getitem__(self, key)
673 start, stop, step = self._process_range(
674 key.start, key.stop, key.step)
--> 675 return self.read(start, stop, step)
676 # Try with a boolean or point selection
677 elif type(key) in (list, tuple) or isinstance(key, numpy.ndarray):
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in read(self, start, stop, step)
813 atom = self.atom
814 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 815 outlistarr = [atom.fromarray(arr) for arr in listarr]
816 else:
817 # Convert the list to the right flavor
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in <listcomp>(.0)
813 atom = self.atom
814 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 815 outlistarr = [atom.fromarray(arr) for arr in listarr]
816 else:
817 # Convert the list to the right flavor
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/atom.py in fromarray(self, array)
1226 if array.size == 0:
1227 return None
-> 1228 return six.moves.cPickle.loads(array.tostring())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 23: ordinal not in range(128)
Comment From: jreback
you are not answering the question; show an example of WRITING
Comment From: zoof
I guess you want a sample dataset? I extracted a small sample from the troublesome series in the large HDF file: https://ufile.io/l94bf. This file also works with Python 2.7 but fails with Python 3.x.
Comment From: jreback
you need to show a complete example that includes writing and reading
Comment From: zoof
Like this? The data in each instance is the same, just different sources.
In [3]: pd.DataFrame(['Executive Director of HR',
'Assistant Director of HR',
'Instructor Chair of Paramedics',
'Proctor Testing Center \xe2\x80\x93 PT',
'Instructor \xe2\x80\x93 Welding (Automotive)',
'Lab Tech \xe2\x80\x93 Automotive \xe2\x80\x93 PT',
'Lab Tech Technology \xe2\x80\x93 PT',
'Maintenance Tech',
'Business Services Coordinator',
'Scheduler']).to_hdf('data.h5','data')
In [4]: pd.read_hdf('data.h5')
Out[4]:
0
0 Executive Director of HR
1 Assistant Director of HR
2 Instructor Chair of Paramedics
3 Proctor Testing Center â PT
4 Instructor â Welding (Automotive)
5 Lab Tech â Automotive â PT
6 Lab Tech Technology â PT
7 Maintenance Tech
8 Business Services Coordinator
9 Scheduler
In [5]: pd.read_hdf('data16.h5')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-5-2a895d140f15> in <module>()
----> 1 pd.read_hdf('data16.h5')
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_hdf(path_or_buf, key, mode, **kwargs)
370 'contains multiple datasets.')
371 key = candidate_only_group._v_pathname
--> 372 return store.select(key, auto_close=auto_close, **kwargs)
373 except:
374 # if there is an error, close the store
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
740 chunksize=chunksize, auto_close=auto_close)
741
--> 742 return it.get_result()
743
744 def select_as_coordinates(
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in get_result(self, coordinates)
1447
1448 # directly return the result
-> 1449 results = self.func(self.start, self.stop, where)
1450 self.close()
1451 return results
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in func(_start, _stop, _where)
733 return s.read(start=_start, stop=_stop,
734 where=_where,
--> 735 columns=columns, **kwargs)
736
737 # create the iterator
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read(self, **kwargs)
2751 kwargs = self.validate_read(kwargs)
2752 index = self.read_index('index', **kwargs)
-> 2753 values = self.read_array('values', **kwargs)
2754 return Series(values, index=index, name=self.name)
2755
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_array(self, key, start, stop)
2434 import tables
2435 node = getattr(self.group, key)
-> 2436 data = node[start:stop]
2437 attrs = node._v_attrs
2438
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in __getitem__(self, key)
673 start, stop, step = self._process_range(
674 key.start, key.stop, key.step)
--> 675 return self.read(start, stop, step)
676 # Try with a boolean or point selection
677 elif type(key) in (list, tuple) or isinstance(key, numpy.ndarray):
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in read(self, start, stop, step)
813 atom = self.atom
814 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 815 outlistarr = [atom.fromarray(arr) for arr in listarr]
816 else:
817 # Convert the list to the right flavor
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in <listcomp>(.0)
813 atom = self.atom
814 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 815 outlistarr = [atom.fromarray(arr) for arr in listarr]
816 else:
817 # Convert the list to the right flavor
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/atom.py in fromarray(self, array)
1226 if array.size == 0:
1227 return None
-> 1228 return six.moves.cPickle.loads(array.tostring())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 23: ordinal not in range(128)
Comment From: zoof
Write it in Python 2.7:
pd.DataFrame(['Executive Director of HR',
'Assistant Director of HR',
'Instructor Chair of Paramedics',
'Proctor Testing Center \xe2\x80\x93 PT',
'Instructor \xe2\x80\x93 Welding (Automotive)',
'Lab Tech \xe2\x80\x93 Automotive \xe2\x80\x93 PT',
'Lab Tech Technology \xe2\x80\x93 PT',
'Maintenance Tech',
'Business Services Coordinator',
'Scheduler']).to_hdf('data.h5','data')
Then try to read it in Python 3.6:
pd.read_hdf('data.h5')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-2-8ce48fe594b7> in <module>()
----> 1 pd.read_hdf('data.h5')
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_hdf(path_or_buf, key, mode, **kwargs)
370 'contains multiple datasets.')
371 key = candidate_only_group._v_pathname
--> 372 return store.select(key, auto_close=auto_close, **kwargs)
373 except:
374 # if there is an error, close the store
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
740 chunksize=chunksize, auto_close=auto_close)
741
--> 742 return it.get_result()
743
744 def select_as_coordinates(
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in get_result(self, coordinates)
1447
1448 # directly return the result
-> 1449 results = self.func(self.start, self.stop, where)
1450 self.close()
1451 return results
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in func(_start, _stop, _where)
733 return s.read(start=_start, stop=_stop,
734 where=_where,
--> 735 columns=columns, **kwargs)
736
737 # create the iterator
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read(self, start, stop, **kwargs)
2885 blk_items = self.read_index('block%d_items' % i)
2886 values = self.read_array('block%d_values' % i,
-> 2887 start=_start, stop=_stop)
2888 blk = make_block(values,
2889 placement=items.get_indexer(blk_items))
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_array(self, key, start, stop)
2434 import tables
2435 node = getattr(self.group, key)
-> 2436 data = node[start:stop]
2437 attrs = node._v_attrs
2438
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in __getitem__(self, key)
673 start, stop, step = self._process_range(
674 key.start, key.stop, key.step)
--> 675 return self.read(start, stop, step)
676 # Try with a boolean or point selection
677 elif type(key) in (list, tuple) or isinstance(key, numpy.ndarray):
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in read(self, start, stop, step)
813 atom = self.atom
814 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 815 outlistarr = [atom.fromarray(arr) for arr in listarr]
816 else:
817 # Convert the list to the right flavor
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in <listcomp>(.0)
813 atom = self.atom
814 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 815 outlistarr = [atom.fromarray(arr) for arr in listarr]
816 else:
817 # Convert the list to the right flavor
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/atom.py in fromarray(self, array)
1226 if array.size == 0:
1227 return None
-> 1228 return six.moves.cPickle.loads(array.tostring())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 23: ordinal not in range(128)
Comment From: jreback
This is not supported for fixed stores; try using format='table' when you save in 2.7.
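For reference, a minimal sketch of that suggestion; the file and key names here are placeholders, not taken from the report.
# Python 2.7: write with the table format instead of the default fixed format
import pandas as pd
df = pd.DataFrame(['Proctor Testing Center \xe2\x80\x93 PT', 'Scheduler'])
df.to_hdf('data_table.h5', 'data', format='table')

# Python 3.x: read it back
import pandas as pd
df = pd.read_hdf('data_table.h5', 'data')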
Comment From: jreback
You can also see https://github.com/pandas-dev/pandas/issues/11126, and try passing encoding='utf-8' in 2.7.
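And a minimal sketch of the encoding variant, keeping the default fixed format; zoof reports below that this alone did not make the file readable on 3.x.
# Python 2.7: write the fixed-format store with an explicit encoding
import pandas as pd
df = pd.DataFrame(['Lab Tech \xe2\x80\x93 Automotive \xe2\x80\x93 PT'])
df.to_hdf('data_utf8.h5', 'data', encoding='utf-8')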
Comment From: zoof
The former works but the latter does not. I don't see why this is not a bug, though, since 2.7 can read the file produced without format='table' but 3.x cannot.
Comment From: jreback
It is simply not supported by the underlying infrastructure (e.g. in PyTables).
Comment From: zoof
Just a postscript: format='table' only works for a single column of data. When trying to save the entire dataset in Python 2.7, I get:
TypeError: Cannot serialize the column [task_list] because its data contents are [unicode] object dtype
When saving with encoding='utf-8', the file is saved but again cannot be read in 3.x:
TypeError: lookup() argument must be str, not numpy.bytes_
Comment From: asanakoy
I have the same issue. Why did you decide not to fix it?
Comment From: asanakoy
As a workaround I'm currently converting my Python 2.7 dataframes to JSON and then reading them in Python 3.6.
# Run this in py2.7
#####################
import pandas as pd
import gzip

# read the dataframe that was saved in py2.7
path = 'df.hdf5'  # path to dataframe saved in py2.7
df = pd.read_hdf(path)

# to_json is a DataFrame method; write the JSON string out gzip-compressed
json_string = df.to_json()
with gzip.open('df.json.gz', 'wb') as fp:
    fp.write(json_string)
#####################
# Now run in py3.6
#####################
import pandas as pd
import gzip

with gzip.open('df.json.gz', 'rt') as fp:
    json_string = fp.read()
df = pd.read_json(json_string)
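This sidesteps the problem because the JSON file stores the text as UTF-8 rather than as a Python 2 pickle (the six.moves.cPickle.loads call visible in the tracebacks above), so the same file reads cleanly on both interpreters.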
Comment From: envhyf
Just a postscript: format='table' only works for a single column of data. When trying to save the entire dataset in Python 2.7, I get:
TypeError: Cannot serialize the column [task_list] because its data contents are [unicode] object dtype
When saving with encoding='utf-8', the file is saved but again cannot be read in 3.x:
TypeError: lookup() argument must be str, not numpy.bytes_

Hi, I met a similar issue. The dataframe was saved in Python 2.7 with format='table', encoding='utf-8'. However, when I read it in Python 3.7 with pd.read_hdf('xxx.hdf', key='xx', encoding='utf-8'), the error is:
TypeError: lookup() argument must be str, not numpy.bytes_