Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np
data = np.random.rand(3,3,3)
major_axis = ['maj_ax_{}'.format(i) for i in range(3)]
minor_axis = ['min_ax_{}'.format(i) for i in range(3)]
items = ['item_{}'.format(i) for i in range(3)]
pnl = pd.Panel(
    data=data,
    major_axis=major_axis,
    minor_axis=minor_axis,
    items=items,
)
pnl.to_hdf('test.h5', key='df')
reread_data = pd.read_hdf('test.h5')
reread_data.to_excel('data.xlsx')
Problem description
With the Panel class deprecated and scheduled for removal in a future release, the ability to export and read 3D data to and from an HDF5 file will be lost, as will Excel I/O for 3D data. At the moment, xarray, which is proposed as a replacement for handling 3D data, does not provide this functionality.
Expected Output
An HDF5 file and an Excel file that contain the 3D data.
Output of pd.show_versions()
Comment From: jorisvandenbossche
cc @shoyer @MaximilianR
I think the read_excel part can be "rather easily" done manually by the user if he/she wants that (since it is just writing different parts of the frame to different sheets; of course it will be less convenient, but not impossible). The HDF5 side seems more problematic to overcome with the currently proposed alternatives.
Is there support for HDF5 planned in xarray?
Comment From: TomAugspurger
xarray already has netcdf support; is that sufficient?
Comment From: shsymoen
Is this possible with xarray and netCDF: appending a 2D DataFrame or Series to an already created HDF5 file?
import pandas as pd
import numpy as np
store = pd.HDFStore('test.h5')
data = np.random.rand(4,4)
df_raw_data = pd.DataFrame(data=data)
df_raw_data.to_hdf(store, key='raw_data', mode='a')
store.close()
Comment From: TomAugspurger
@shsymoen I'm not sure I understand your example. That's writing a DataFrame, which will continue to work.
Since panel is deprecated, in future versions of pandas you'll need to either
- use a MultiIndexed pandas object, and write that to HDF5
- Build an xarray object and serialize it as documented here
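A minimal sketch of both options, assuming xarray and a NetCDF backend (e.g. netCDF4) are installed; the file and axis names below are illustrative only:
import numpy as np
import pandas as pd
import xarray as xr

data = np.random.rand(3, 3, 3)  # (items, major_axis, minor_axis)

# Option 1: MultiIndexed DataFrame written to HDF5 (same layout as pnl.to_frame())
idx = pd.MultiIndex.from_product([range(3), range(3)], names=['major', 'minor'])
df = pd.DataFrame(
    data.transpose(1, 2, 0).reshape(9, 3),  # rows: (major, minor), columns: items
    index=idx,
    columns=['item_0', 'item_1', 'item_2'],
)
df.to_hdf('test_mi.h5', key='df')

# Option 2: xarray DataArray serialized to NetCDF
da = xr.DataArray(data, dims=['items', 'major_axis', 'minor_axis'])
da.to_netcdf('test.nc')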
Comment From: shsymoen
@TomAugspurger I realized my example was not really clear, so I added the line showing that it appends the DataFrame to the already created HDF5 file. For me, MultiIndexed pandas objects are less intuitive to use, and I don't know whether it is possible to append a DataFrame to a serialized xarray object (I have no experience with xarray or netCDF).
I will try to clarify my use case a bit better: I start from a single DataFrame called 'raw'. I perform some operations on the raw DataFrame, resulting in a new DataFrame with the same indices and column names: 'processed_1'. A different set of operations gives 'processed_2'.
I add the newly created DataFrames one by one to a dictionary. In the end I convert the dict to a Panel so I can export everything to a single HDF5 file with the DataFrames under different keys. I hope this explanation makes it clearer.
Comment From: TomAugspurger
Can you iterate over the key-value pairs of your dictionary and write each to the HDF5 file under a separate key?
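Something along these lines, with hypothetical key names; each frame gets its own key and can be read back individually:
import numpy as np
import pandas as pd

# Hypothetical dict of results keyed by processing step
frames = {
    'raw': pd.DataFrame(np.random.rand(4, 4)),
    'processed_1': pd.DataFrame(np.random.rand(4, 4)),
}

with pd.HDFStore('results.h5', mode='w') as store:
    for key, frame in frames.items():
        store.put(key, frame)  # each DataFrame lives under its own key

reread = pd.read_hdf('results.h5', 'processed_1')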
Comment From: shsymoen
In principle that's possible, but I think the resulting file will be larger because you are saving the same column names and indices for every DataFrame. Reading the data back from the HDF5 file is also more complicated due to the extra iteration step.
Comment From: max-sixty
That's a good case, thanks for sharing @shsymoen
I think the options are:
1. Build workarounds, such as a dictionary of dataframes
2. Force people to use tabular data, and use MultiIndexes
3. Add this functionality into xarray
4. Leave Panel hanging around in a deprecated state
5. Reverse the deprecation of Panel

I would imagine we can manage through with a combination of 1-3, depending on the case.
There is already some similar functionality in xarray, as @TomAugspurger mentioned re netCDF. That could be expanded to a more general HDF5 case. I think there's less overlap between the xarray and MS Excel communities, so 1 & 2 are likely more appropriate there.
Comment From: shoyer
Pandas uses a customized HDF5 format (based on PyTables) for serializing pandas.Panel. In principle, we could read/write it in xarray, but there are no plans to do so. Our primary file format is NetCDF, which is a different file format also based on HDF5 (when using NetCDF4, which is the default).
Xarray could be used for reading Excel data but I agree that it's not a great fit, in part because the flexibility of spreadsheets means they come in lots of different formats. My inclination would be to read Excel files with multiple sheets into a DataFrame with a MultiIndex of some sort. That would keep all Excel IO within pandas. If you really need 3D data structures, converting into xarray from pandas data structures with a MultiIndex is usually quite straightforward.
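For reference, a minimal sketch of that last step (a MultiIndex DataFrame converted to xarray), assuming xarray and a NetCDF backend are installed; the labels are illustrative:
import numpy as np
import pandas as pd

# Hypothetical MultiIndex frame standing in for the 3D data
idx = pd.MultiIndex.from_product(
    [['item_0', 'item_1'], ['maj_ax_0', 'maj_ax_1']], names=['item', 'major']
)
df = pd.DataFrame(np.random.rand(4, 2), index=idx, columns=['min_ax_0', 'min_ax_1'])

ds = df.to_xarray()      # xarray Dataset with dims ('item', 'major')
ds.to_netcdf('data.nc')  # serialize via xarray's NetCDF writer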
Comment From: bashtage
pnl.to_frame().to_hdf('test.h5', key='df')
works fine and is a valid representation for a 3D data set.
Of course one could also use
store = pd.HDFStore('test.h5', mode='w')
for item in pnl:
    store.put(item, pnl[item])
store.close()
as a 3D store for data.
Comment From: bashtage
To be clear, having recently worked with both Panel and MultiIndex DataFrames, there is a strong case for the MI DataFrame over Panel. In particular, in a Panel there is no way to correctly store variables like a Categorical. It is also easy to end up with different data types for the same variable, e.g. int for a column in one of a panel's items and float64 for the same variable for a different "entity" that has one or more missing values. Using a MI DataFrame avoids both of these, at the cost of learning a different access pattern.
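A small illustration of that point, with made-up entity/date labels: in a long-format MI DataFrame each column keeps its own dtype, including Categorical.
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['entity_a', 'entity_b'], pd.date_range('2018-01-01', periods=3)],
    names=['entity', 'date'],
)
df = pd.DataFrame(
    {
        'grade': pd.Categorical(['x', 'y', 'x', 'y', 'x', 'y']),
        'count': np.arange(6, dtype='int64'),
        'value': np.random.randn(6),
    },
    index=idx,
)
print(df.dtypes)  # category, int64, float64 -- no silent upcasting across entities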
Comment From: shsymoen
I agree with the solution of iterating over the key-value pairs of the dict for outputting to Excel, but I think you lose some functionality if one has to do the same for storing to HDF5. Is there a possibility to make a MultiIndex df of the dictionary that holds multiple DataFrames?
Comment From: bashtage
I can't think of a 1-liner since it is necessary to rename columns using the dict's key.
d = {'a': pd.DataFrame(np.random.randn(10, 2)),
     'b': pd.DataFrame(np.random.randn(10, 2))}
dfs = []
cols = []
for k, v in d.items():
    dfs.append(v.stack())
    cols.append(k)
df = pd.concat(dfs, axis=1)
df.columns = cols
Comment From: shoyer
You can do this with the keys argument to concat, e.g., pd.concat(d.values(), keys=d.keys()):
https://stackoverflow.com/questions/23600582/concatenate-pandas-columns-under-new-multi-index-level
Comment From: bashtage
Then I suppose one could have a not-totally-horrible one-liner:
df = pd.concat([v.stack() for v in d.values()], axis=1, keys=d.keys())
Comment From: jreback
closing as this is purely a user issue. if anyone feels the docs need enhancement pls feel free. the solutions above are quite adequate though.