Code Sample, a copy-pastable example if possible
import pandas as pd
df = pd.read_hdf('data.h5')
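For context, the write side that presumably produced data.h5 on Python 2.7 looked roughly like the sketch below; the frame contents here are illustrative, and a complete write/read reproduction appears later in the thread.
# Python 2.7: strings containing non-ASCII (UTF-8 encoded) bytes,
# written with the default fixed format
import pandas as pd
df = pd.DataFrame(['Instructor \xe2\x80\x93 Welding (Automotive)',
                   'Maintenance Tech'])
df.to_hdf('data.h5', 'data')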
Problem description
The HDF5 dataset was created with pandas to_hdf in Python 2.7 and can be read back in Python 2.7. When I try to read it with Python 3.5 or Python 3.6, I get the following:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-2-53006689fd2c> in <module>()
----> 1 df = pd.read_hdf('data.h5')
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_hdf(path_or_buf, key, **kwargs)
356 'contains multiple datasets.')
357 key = candidate_only_group._v_pathname
--> 358 return store.select(key, auto_close=auto_close, **kwargs)
359 except:
360 # if there is an error, close the store
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
720 chunksize=chunksize, auto_close=auto_close)
721
--> 722 return it.get_result()
723
724 def select_as_coordinates(
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in get_result(self, coordinates)
1426
1427 # directly return the result
-> 1428 results = self.func(self.start, self.stop, where)
1429 self.close()
1430 return results
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in func(_start, _stop, _where)
713 return s.read(start=_start, stop=_stop,
714 where=_where,
--> 715 columns=columns, **kwargs)
716
717 # create the iterator
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read(self, start, stop, **kwargs)
2864 blk_items = self.read_index('block%d_items' % i)
2865 values = self.read_array('block%d_values' % i,
-> 2866 start=_start, stop=_stop)
2867 blk = make_block(values,
2868 placement=items.get_indexer(blk_items))
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_array(self, key, start, stop)
2413 import tables
2414 node = getattr(self.group, key)
-> 2415 data = node[start:stop]
2416 attrs = node._v_attrs
2417
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in __getitem__(self, key)
673 start, stop, step = self._process_range(
674 key.start, key.stop, key.step)
--> 675 return self.read(start, stop, step)
676 # Try with a boolean or point selection
677 elif type(key) in (list, tuple) or isinstance(key, numpy.ndarray):
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in read(self, start, stop, step)
813 atom = self.atom
814 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 815 outlistarr = [atom.fromarray(arr) for arr in listarr]
816 else:
817 # Convert the list to the right flavor
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in <listcomp>(.0)
813 atom = self.atom
814 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 815 outlistarr = [atom.fromarray(arr) for arr in listarr]
816 else:
817 # Convert the list to the right flavor
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/atom.py in fromarray(self, array)
1226 if array.size == 0:
1227 return None
-> 1228 return six.moves.cPickle.loads(array.tostring())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 23: ordinal not in range(128)
Expected Output
In [1]: import pandas as pd
In [2]: df = pd.read_hdf('data.h5')
Output of pd.show_versions()
Comment From: gfyoung
@zoof : Thanks for reporting this. Strange that it's in this order and not vice-versa (support for unicode is much better in Python 3.x than in Python 2.x).
I see that you are using 0.20.1. Just for reference, can you try upgrading and see if that changes anything?
@jreback : I seem to recall a previous issue similar to this. Am I right about that or not?
Comment From: zoof
Updated to 0.20.3:
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.3
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
Comment From: jreback
Show what you wrote and how.
Comment From: zoof
Sorry, basically the same as before:
In [1]: import pandas as pd
In [2]: pd.read_hdf('data.h5')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-2-21d8820a6af9> in <module>()
----> 1 pd.read_hdf('data.h5')
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_hdf(path_or_buf, key, mode, **kwargs)
370 'contains multiple datasets.')
371 key = candidate_only_group._v_pathname
--> 372 return store.select(key, auto_close=auto_close, **kwargs)
373 except:
374 # if there is an error, close the store
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
740 chunksize=chunksize, auto_close=auto_close)
741
--> 742 return it.get_result()
743
744 def select_as_coordinates(
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in get_result(self, coordinates)
1447
1448 # directly return the result
-> 1449 results = self.func(self.start, self.stop, where)
1450 self.close()
1451 return results
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in func(_start, _stop, _where)
733 return s.read(start=_start, stop=_stop,
734 where=_where,
--> 735 columns=columns, **kwargs)
736
737 # create the iterator
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read(self, start, stop, **kwargs)
2885 blk_items = self.read_index('block%d_items' % i)
2886 values = self.read_array('block%d_values' % i,
-> 2887 start=_start, stop=_stop)
2888 blk = make_block(values,
2889 placement=items.get_indexer(blk_items))
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_array(self, key, start, stop)
2434 import tables
2435 node = getattr(self.group, key)
-> 2436 data = node[start:stop]
2437 attrs = node._v_attrs
2438
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in __getitem__(self, key)
673 start, stop, step = self._process_range(
674 key.start, key.stop, key.step)
--> 675 return self.read(start, stop, step)
676 # Try with a boolean or point selection
677 elif type(key) in (list, tuple) or isinstance(key, numpy.ndarray):
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in read(self, start, stop, step)
813 atom = self.atom
814 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 815 outlistarr = [atom.fromarray(arr) for arr in listarr]
816 else:
817 # Convert the list to the right flavor
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in <listcomp>(.0)
813 atom = self.atom
814 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 815 outlistarr = [atom.fromarray(arr) for arr in listarr]
816 else:
817 # Convert the list to the right flavor
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/atom.py in fromarray(self, array)
1226 if array.size == 0:
1227 return None
-> 1228 return six.moves.cPickle.loads(array.tostring())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 23: ordinal not in range(128)
Comment From: jreback
you are not answering the question; show an example of WRITING
Comment From: zoof
I guess you want a sample dataset? I extracted a small sample from the troublesome series in the large HDF file: https://ufile.io/l94bf. This file also works with Python 2.7 but fails with Python 3.x.
Comment From: jreback
you need to show a complete example that includes writing and reading
Comment From: zoof
Like this? The data in each instance is the same, just different sources.
In [3]: pd.DataFrame(['Executive Director of HR',
'Assistant Director of HR',
'Instructor Chair of Paramedics',
'Proctor Testing Center \xe2\x80\x93 PT',
'Instructor \xe2\x80\x93 Welding (Automotive)',
'Lab Tech \xe2\x80\x93 Automotive \xe2\x80\x93 PT',
'Lab Tech Technology \xe2\x80\x93 PT',
'Maintenance Tech',
'Business Services Coordinator',
'Scheduler']).to_hdf('data.h5','data')
In [4]: pd.read_hdf('data.h5')
Out[4]:
0
0 Executive Director of HR
1 Assistant Director of HR
2 Instructor Chair of Paramedics
3 Proctor Testing Center â PT
4 Instructor â Welding (Automotive)
5 Lab Tech â Automotive â PT
6 Lab Tech Technology â PT
7 Maintenance Tech
8 Business Services Coordinator
9 Scheduler
In [5]: pd.read_hdf('data16.h5')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-5-2a895d140f15> in <module>()
----> 1 pd.read_hdf('data16.h5')
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_hdf(path_or_buf, key, mode, **kwargs)
370 'contains multiple datasets.')
371 key = candidate_only_group._v_pathname
--> 372 return store.select(key, auto_close=auto_close, **kwargs)
373 except:
374 # if there is an error, close the store
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
740 chunksize=chunksize, auto_close=auto_close)
741
--> 742 return it.get_result()
743
744 def select_as_coordinates(
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in get_result(self, coordinates)
1447
1448 # directly return the result
-> 1449 results = self.func(self.start, self.stop, where)
1450 self.close()
1451 return results
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in func(_start, _stop, _where)
733 return s.read(start=_start, stop=_stop,
734 where=_where,
--> 735 columns=columns, **kwargs)
736
737 # create the iterator
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read(self, **kwargs)
2751 kwargs = self.validate_read(kwargs)
2752 index = self.read_index('index', **kwargs)
-> 2753 values = self.read_array('values', **kwargs)
2754 return Series(values, index=index, name=self.name)
2755
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_array(self, key, start, stop)
2434 import tables
2435 node = getattr(self.group, key)
-> 2436 data = node[start:stop]
2437 attrs = node._v_attrs
2438
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in __getitem__(self, key)
673 start, stop, step = self._process_range(
674 key.start, key.stop, key.step)
--> 675 return self.read(start, stop, step)
676 # Try with a boolean or point selection
677 elif type(key) in (list, tuple) or isinstance(key, numpy.ndarray):
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in read(self, start, stop, step)
813 atom = self.atom
814 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 815 outlistarr = [atom.fromarray(arr) for arr in listarr]
816 else:
817 # Convert the list to the right flavor
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in <listcomp>(.0)
813 atom = self.atom
814 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 815 outlistarr = [atom.fromarray(arr) for arr in listarr]
816 else:
817 # Convert the list to the right flavor
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/atom.py in fromarray(self, array)
1226 if array.size == 0:
1227 return None
-> 1228 return six.moves.cPickle.loads(array.tostring())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 23: ordinal not in range(128)
Comment From: zoof
Write it in Python 2.7:
pd.DataFrame(['Executive Director of HR',
'Assistant Director of HR',
'Instructor Chair of Paramedics',
'Proctor Testing Center \xe2\x80\x93 PT',
'Instructor \xe2\x80\x93 Welding (Automotive)',
'Lab Tech \xe2\x80\x93 Automotive \xe2\x80\x93 PT',
'Lab Tech Technology \xe2\x80\x93 PT',
'Maintenance Tech',
'Business Services Coordinator',
'Scheduler']).to_hdf('data.h5','data')
Then try to read it in Python 3.6:
pd.read_hdf('data.h5')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-2-8ce48fe594b7> in <module>()
----> 1 pd.read_hdf('data.h5')
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_hdf(path_or_buf, key, mode, **kwargs)
370 'contains multiple datasets.')
371 key = candidate_only_group._v_pathname
--> 372 return store.select(key, auto_close=auto_close, **kwargs)
373 except:
374 # if there is an error, close the store
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
740 chunksize=chunksize, auto_close=auto_close)
741
--> 742 return it.get_result()
743
744 def select_as_coordinates(
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in get_result(self, coordinates)
1447
1448 # directly return the result
-> 1449 results = self.func(self.start, self.stop, where)
1450 self.close()
1451 return results
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in func(_start, _stop, _where)
733 return s.read(start=_start, stop=_stop,
734 where=_where,
--> 735 columns=columns, **kwargs)
736
737 # create the iterator
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read(self, start, stop, **kwargs)
2885 blk_items = self.read_index('block%d_items' % i)
2886 values = self.read_array('block%d_values' % i,
-> 2887 start=_start, stop=_stop)
2888 blk = make_block(values,
2889 placement=items.get_indexer(blk_items))
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_array(self, key, start, stop)
2434 import tables
2435 node = getattr(self.group, key)
-> 2436 data = node[start:stop]
2437 attrs = node._v_attrs
2438
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in __getitem__(self, key)
673 start, stop, step = self._process_range(
674 key.start, key.stop, key.step)
--> 675 return self.read(start, stop, step)
676 # Try with a boolean or point selection
677 elif type(key) in (list, tuple) or isinstance(key, numpy.ndarray):
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in read(self, start, stop, step)
813 atom = self.atom
814 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 815 outlistarr = [atom.fromarray(arr) for arr in listarr]
816 else:
817 # Convert the list to the right flavor
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in <listcomp>(.0)
813 atom = self.atom
814 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 815 outlistarr = [atom.fromarray(arr) for arr in listarr]
816 else:
817 # Convert the list to the right flavor
/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/atom.py in fromarray(self, array)
1226 if array.size == 0:
1227 return None
-> 1228 return six.moves.cPickle.loads(array.tostring())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 23: ordinal not in range(128)
Comment From: jreback
This is not supported for fixed stores; try using format='table' when you save in 2.7.
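For reference, a minimal sketch of that suggestion; the file and key names here are placeholders, not taken from the report.
# Python 2.7: write with the table format instead of the default fixed format
import pandas as pd
df = pd.DataFrame(['Proctor Testing Center \xe2\x80\x93 PT', 'Scheduler'])
df.to_hdf('data_table.h5', 'data', format='table')

# Python 3.x: read it back
import pandas as pd
df = pd.read_hdf('data_table.h5', 'data')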
Comment From: jreback
You can also see https://github.com/pandas-dev/pandas/issues/11126, and try passing encoding='utf-8' in 2.7.
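And a minimal sketch of the encoding variant, keeping the default fixed format; zoof reports below that this alone did not make the file readable on 3.x.
# Python 2.7: write the fixed-format store with an explicit encoding
import pandas as pd
df = pd.DataFrame(['Lab Tech \xe2\x80\x93 Automotive \xe2\x80\x93 PT'])
df.to_hdf('data_utf8.h5', 'data', encoding='utf-8')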
Comment From: zoof
The former works but the latter does not. I don't see why this is not a bug, though, since 2.7 can read the file produced without format='table' but 3.x cannot.
Comment From: jreback
It is simply not supported by the underlying infrastructure (e.g. in PyTables).
Comment From: zoof
Just a postscript: format='table' only works for a single column of data. When trying to save the entire dataset in Python 2.7, I get:
TypeError: Cannot serialize the column [task_list] because its data contents are [unicode] object dtype
When saving with encoding='utf-8', the file is saved but again cannot be read in 3.x:
TypeError: lookup() argument must be str, not numpy.bytes_
Comment From: asanakoy
I have the same issue. Why did you decide not to fix it?
Comment From: asanakoy
As a workaround I'm currently converting my Python 2.7 dataframes to JSON and then reading them in Python 3.6.
# Run this in py2.7
#####################
import pandas as pd
import gzip

# read the dataframe that was saved in py2.7
path = 'df.hdf5'  # path to dataframe saved in py2.7
df = pd.read_hdf(path)

# to_json is a DataFrame method; write the JSON string out gzip-compressed
json_string = df.to_json()
with gzip.open('df.json.gz', 'wb') as fp:
    fp.write(json_string)
#####################
# Now run in py3.6
#####################
import pandas as pd
import gzip

with gzip.open('df.json.gz', 'rt') as fp:
    json_string = fp.read()
df = pd.read_json(json_string)
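This sidesteps the problem because the JSON file stores the text as UTF-8 rather than as a Python 2 pickle (the six.moves.cPickle.loads call visible in the tracebacks above), so the same file reads cleanly on both interpreters.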
Comment From: envhyf
Just a postscript: format='table' only works for a single column of data. When trying to save the entire dataset in Python 2.7, I get:
TypeError: Cannot serialize the column [task_list] because its data contents are [unicode] object dtype
When saving with encoding='utf-8', the file is saved but again cannot be read in 3.x:
TypeError: lookup() argument must be str, not numpy.bytes_

Hi, I met a similar issue. The dataframe was saved in Python 2.7 with format='table', encoding='utf-8'. However, when I read it in Python 3.7 with pd.read_hdf('xxx.hdf', key='xx', encoding='utf-8'), the error is:
TypeError: lookup() argument must be str, not numpy.bytes_