Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, high=2**16 -1, size=(20,5), dtype=np.uint16),
columns=['a', 'b', 'c', 'd', 'e'])
df = df.set_index(['a', 'b'])
store = pd.HDFStore('test.h5')
store.append('test', df, append=True)
>>>NotImplementedError: indexing 64-bit unsigned integer columns is not supported yet, sorry
Problem description
When using a MultiIndex, pandas is coercing the uint16 columns to uint64. Then when I try to write that to an H5 file (in table format), the NotImplementedError is raised by pytables.
Expected Output
I expect the H5 file to be created, with pandas coercing types as required (in this case uint16 to int64 is safe). With a UInt64Index (no MultiIndex), pandas does coerce correctly:
df = pd.DataFrame(np.random.randint(0, high=2**16 -1, size=(20,5), dtype=np.uint16),
columns=['a', 'b', 'c', 'd', 'e'])
df = df.set_index(['a'])
df.index.dtype
>>>dtype('uint64')
store = pd.HDFStore('test.h5')
store.append('test2', df, append=True)
store.test2.index.dtype
>>>dtype('int64')
Output of pd.show_versions()
Comment From: gfyoung
@kylekeppler: Thanks for reporting this! I think you should file an issue with pytables, as we've been spending a lot of time trying to bulk up support for uint64. However, it does seem like this is an issue across many libraries, which seem to stop at int64.
As a workaround, I think you should cast to float, though that is going to destroy precision.
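To make the precision concern concrete, here is a small sketch (not from the thread) of what a float64 cast does to large unsigned integers. The sample values are chosen purely for illustration; float64 has a 53-bit significand, so integers above 2**53 are not all representable.

```python
import numpy as np

# Illustrative values only: one just above 2**53, one just above 2**63.
original = np.array([2**53 + 1, 2**63 + 1], dtype=np.uint64)

# Cast to float64 (the suggested workaround), then back to uint64.
as_float = original.astype(np.float64)
round_trip = as_float.astype(np.uint64)

# Each value has been rounded to a neighbouring representable float.
changed = round_trip != original
```

Values that fit in 53 bits, such as anything that started life as uint16, survive the round trip unchanged, so the loss only matters for genuinely large integers.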
Comment From: kylekeppler
@gfyoung: Agreed, it would be nice for this to be fixed in pytables, but this NotImplementedError has been thrown since at least version 2.3 from 2011, so it doesn't look like they are in any hurry to fix it.
In my case at least it made sense to cast to int64 manually, as is done without a MultiIndex. I'd say pandas should do that for the levels of a MultiIndex as well.
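The manual cast described above could look roughly like the following. This is a minimal sketch, not pandas behaviour: the helper name `cast_unsigned_index_levels` is hypothetical, it assumes every index level is named, and it refuses to cast values that would not fit in int64.

```python
import numpy as np
import pandas as pd

INT64_MAX = np.iinfo(np.int64).max  # 2**63 - 1


def cast_unsigned_index_levels(df):
    """Return a copy of df whose unsigned-integer index levels are int64.

    Hypothetical helper; assumes all index levels have names.
    """
    names = list(df.index.names)
    flat = df.reset_index()
    for name in names:
        if flat[name].dtype.kind == "u":  # any unsigned integer dtype
            if len(flat) and int(flat[name].max()) > INT64_MAX:
                raise OverflowError(f"level {name!r} does not fit in int64")
            flat[name] = flat[name].astype(np.int64)
    return flat.set_index(names)


# Same shape of data as in the report, built with a seeded generator.
rng = np.random.default_rng(0)
raw = pd.DataFrame(
    rng.integers(0, 2**16 - 1, size=(20, 5), dtype=np.uint16),
    columns=["a", "b", "c", "d", "e"],
)
indexed = raw.set_index(["a", "b"])
converted = cast_unsigned_index_levels(indexed)
```

The converted frame could then be passed to `store.append(...)` in place of the original; whether that avoids the error depends on the installed pandas and PyTables versions.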
Comment From: gfyoung
this NotImplementedError has been thrown since at least version 2.3 from 2011, so it doesn't look like they are in any hurry to fix that.
I'm not sure I follow you here. The issue may have existed for some time because no one has asked about it. We have had the same issue in pandas and only began patching it when people started asking about it. I would suggest that you file an issue in pytables and see how they respond.
On our end, I would be hesitant to cast to int64 (or perform any downcasting acrobatics) just for the sake of accommodating another library. We don't like to destroy dtype if possible. That being said, if we were to do that, I guess we could always cast to float64 only in cases when there are elements with values greater than 2**63 - 1 and cast to an int* dtype otherwise.
However, I'm not sure. @jreback ? @jorisvandenbossche ?
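The conditional cast floated above can be sketched as a small dtype-selection function. This is only an illustration of the idea, not anything pandas implements: the function name is hypothetical, and int64 is picked as the concrete "int*" dtype for simplicity.

```python
import numpy as np

INT64_MAX = np.iinfo(np.int64).max  # 2**63 - 1


def storage_dtype_for_uint64(values):
    """Pick int64 when every value fits, float64 otherwise (hypothetical)."""
    values = np.asarray(values, dtype=np.uint64)
    if values.size == 0 or int(values.max()) <= INT64_MAX:
        return np.dtype("int64")  # lossless
    return np.dtype("float64")  # lossy for large values


small = np.array([0, 65535], dtype=np.uint64)
large = np.array([0, 2**63], dtype=np.uint64)

small_dtype = storage_dtype_for_uint64(small)
large_dtype = storage_dtype_for_uint64(large)
```

One consequence of such a rule is that the stored dtype depends on the data rather than on the declared dtype, which is part of why the thread treats it as a reluctant fallback.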
Comment From: kylekeppler
@gfyoung, agree with your comments. I am only proposing that the UInt64Index case and the MultiIndex with a uint64 level behave the same.
Comment From: jreback
Storing your data works fine. The issue is that you are trying to index on these columns, which is what happens when you store a MultiIndex: its levels become indexed columns.
This works:
In [7]: df.reset_index().to_hdf('test.h5', 'df', mode='w', format='table')
In [8]: pd.read_hdf('test.h5', 'df').dtypes
Out[8]:
a uint64
b uint64
c uint16
d uint16
e uint16
dtype: object
But it fails if you try to index:
In [9]: df.reset_index().to_hdf('test2.h5', 'df', mode='w', format='table', data_columns=True)
NotImplementedError: indexing 64-bit unsigned integer columns is not supported yet, sorry
So not really sure what pandas can do about this. I would simply not try to index using uint64 columns until the support is there.
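One way to follow that advice while still indexing the other columns is to pass an explicit `data_columns` list that leaves out the uint64 ones. The sketch below is an assumption-laden illustration: both helper names are hypothetical, `write_table` needs PyTables installed and is defined but not called here, and it has not been checked against any particular PyTables version.

```python
import numpy as np
import pandas as pd


def non_uint64_columns(df):
    """Return the names of columns whose dtype is not uint64."""
    return [col for col, dtype in df.dtypes.items() if dtype != np.uint64]


def write_table(df, path, key):
    """Write df in table format, indexing only the non-uint64 columns.

    Hypothetical helper; requires PyTables. Note that reset_index() adds
    an 'index' column when df has a default, unnamed index.
    """
    flat = df.reset_index()
    flat.to_hdf(
        path,
        key=key,
        mode="w",
        format="table",
        data_columns=non_uint64_columns(flat),
    )


# Small illustrative frame: one uint64 column, one uint16 column.
frame = pd.DataFrame(
    {
        "a": np.array([1, 2], dtype=np.uint64),
        "c": np.array([3, 4], dtype=np.uint16),
    }
)
indexable = non_uint64_columns(frame)
```

The uint64 columns are still stored; they just cannot be used in `where` queries, which matches the behaviour shown for the plain `format='table'` call.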
Closing as won't fix. You should open an issue on the pytables tracker.