When attempting to store data in an HDF5 table, I found that an error is raised if there are multiple object columns containing different types of data.
import pandas as pd
data = {'ints':pd.Series([1,2,3], index=index), 'Timestamps': pd.Series([pd.Timestamp('2014-1-1 12:00', tz='UTC'), pd.Timestamp('2014-1-2 12:00', tz='UTC'), pd.Timestamp('2014-1-3 12:00', tz='UTC')], index=index), 'strings': pd.Series(['r','g','b'], index=index)}
df = pd.DataFrame(data)
df.to_hdf('test.h5', 'data', format='table')
This leads to an exception: TypeError: Cannot serialize the column [Timestamps] because its data contents are [datetime] object dtype
However, if I remove the string column:
del df['strings']
df.to_hdf('test.h5', 'data', format='table')
Now it works fine - so it isn't a problem with using the pd.Timestamp type.
Digging a little deeper, it appears the problem is that pandas.io.pytables.Table.create_axes groups the columns by data type, with all columns of type object being grouped into one set of data. Then when set_atom is called, it does this:
rvalues = block.values.ravel()
inferred_type = lib.infer_dtype(rvalues)
This leads to an inferred type of 'mixed' since there are multiple types of objects present, and this isn't handled and throws the exception.
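The inference step described above can be illustrated without PyTables. This is a minimal sketch, assuming the consolidated object block behaves like a 2-D object array with one row per column; `block_values` is a stand-in built by hand, not the actual pandas internal block, and the public `pandas.api.types.infer_dtype` is used in place of `lib.infer_dtype`.

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype

# Two object-dtype columns holding different kinds of Python objects.
stamps = np.array(
    [
        pd.Timestamp("2014-01-01 12:00", tz="UTC"),
        pd.Timestamp("2014-01-02 12:00", tz="UTC"),
        pd.Timestamp("2014-01-03 12:00", tz="UTC"),
    ],
    dtype=object,
)
strings = np.array(["r", "g", "b"], dtype=object)

# Assumption: stand-in for the consolidated object block (one row per column).
block_values = np.vstack([stamps, strings])

# Inferring per column gives a specific type for each column...
per_column = {
    "Timestamps": infer_dtype(stamps),
    "strings": infer_dtype(strings),
}
# ...while inferring over the raveled block sees both kinds of object at once.
whole_block = infer_dtype(block_values.ravel())

print(per_column)
print(whole_block)
```

Each column on its own has a well-defined inferred type; only the combined, raveled values come out as 'mixed'.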
As a fix, it seems that each object column should be handled separately, or at least grouped by the inferred type. I haven't committed to pandas before, or dug this deeply into this section of code, so I'm not sure of the best way to fix this and what other implications there may be, but I'd be happy to help however I can.
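To make the "grouped by the inferred type" idea concrete, here is a rough sketch of what such a grouping could look like at the DataFrame level. `group_object_columns` is a hypothetical helper written for illustration only; it is not part of pandas and is not how `create_axes` is implemented.

```python
from collections import defaultdict

import pandas as pd
from pandas.api.types import infer_dtype


def group_object_columns(df):
    """Hypothetical helper: map inferred type -> names of object columns."""
    groups = defaultdict(list)
    for name in df.columns:
        col = df[name]
        # Only object-dtype columns need inference; others already have a dtype.
        if col.dtype == object:
            groups[infer_dtype(col, skipna=True)].append(name)
    return dict(groups)


df = pd.DataFrame(
    {
        "ints": [1, 2, 3],
        "Timestamps": pd.Series(
            [
                pd.Timestamp("2014-01-01 12:00", tz="UTC"),
                pd.Timestamp("2014-01-02 12:00", tz="UTC"),
                pd.Timestamp("2014-01-03 12:00", tz="UTC"),
            ],
            dtype=object,  # force object dtype to mimic the reported case
        ),
        "strings": pd.Series(["r", "g", "b"], dtype=object),
    }
)

groups = group_object_columns(df)
print(groups)
```

Each resulting group contains columns of a single inferred type, so a per-group inference would no longer come out as 'mixed'.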
Comment From: jreback
you can work around this by setting the non-string object columns as data_columns (that will segregate them up front)
if these are truly utc tz aware then to be honest you should simply make them datetime64[ns] columns and the problem also goes away
you are right though see here : https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L1734 for the inference on an object column (note that they could be a period type, datetime tz aware, or an actual string)
so the object block handling needs to be fixed up a bit - by further splitting of object blocks if necessary
pull requests welcome!
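A small sketch of the second suggestion above, assuming the column really holds UTC tz-aware timestamps stored as Python objects. It only shows the dtype conversion and does not write an HDF5 file; the variable names are illustrative.

```python
import pandas as pd

# An object-dtype column of tz-aware Timestamps (forced with dtype=object).
raw = pd.Series(
    [
        pd.Timestamp("2014-01-01 12:00", tz="UTC"),
        pd.Timestamp("2014-01-02 12:00", tz="UTC"),
        pd.Timestamp("2014-01-03 12:00", tz="UTC"),
    ],
    dtype=object,
)

# Convert to a native tz-aware datetime column, so it is no longer
# part of the object block.
converted = pd.to_datetime(raw, utc=True)

print(raw.dtype)
print(converted.dtype)
```

The first workaround would presumably look like `df.to_hdf('test.h5', key='data', format='table', data_columns=['Timestamps'])`. That call needs PyTables installed and is not exercised in this sketch.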
Comment From: jreback
see #7796 as well (for the period support)
Comment From: jreback
FYI u normally don't handle the columns separately and instead store them as a single block as it's much more efficient (can be controlled by specifying data_columns though)
Comment From: kvncp
Thanks for the fast response. They aren't actually UTC in my application, that was just the easiest way to create a simple example. Setting as a data_column will work though, thanks for the tip.
If I get a bit of time I'll look into a fix.
Comment From: TomAugspurger
I can't reproduce the original example. index is not defined.
This simple example seems to work
In [36]: df = pd.DataFrame({"A": [1, 2], 'B': ['a', 'b'], 'C': pd.to_datetime(['2017', '2018']).tz_localize("UTC")})
In [37]: df.to_hdf('test.h5', 'data', format='table')
Let me know if that isn't representative of the original.
Comment From: petiop
I am having the same issue where the use-case is storing multidimensional and variable-shape np arrays (unflattened images). I store in 'table' format and I tried adding the column to data_columns. Still getting the same error:
TypeError: Cannot serialize the column [image] because its data contents are [mixed] object dtype
Are there other workarounds that I can try? Also, is this issue still open to contributions (beefing up the object-block handling to work with types other than strings)?
Comment From: jreback
there is no support for non scalar types at all
Comment From: petiop
I don’t mind converting them to bytes and saving that, but that too is not supported atm
Comment From: jreback
@petiop you are welcome to submit a PR for this but it’s non-trivial
i would use parquet for this
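For the bytes approach mentioned above, one possible sketch is to serialize each array with `np.save` into an in-memory buffer, which keeps shape and dtype alongside the data. The helper names `array_to_bytes` and `bytes_to_array` are made up for this example, and whether this fits a given storage backend is an assumption to verify.

```python
import io

import numpy as np
import pandas as pd


def array_to_bytes(arr):
    """Hypothetical helper: serialize an array (shape and dtype included)."""
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()


def bytes_to_array(blob):
    """Hypothetical helper: inverse of array_to_bytes."""
    return np.load(io.BytesIO(blob), allow_pickle=False)


# Variable-shape "images", as in the use case described above.
images = [
    np.zeros((2, 3), dtype=np.uint8),
    np.ones((4, 4, 3), dtype=np.uint8),
]

# One bytes value per row, so the column holds scalars instead of arrays.
df = pd.DataFrame({"image": [array_to_bytes(a) for a in images]})
restored = [bytes_to_array(b) for b in df["image"]]

print([r.shape for r in restored])
```

A frame like this could then be written with `df.to_parquet(...)`, which requires pyarrow or fastparquet and is not run here.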