Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np
data = np.random.rand(3,3,3)
major_axis = ['maj_ax_{}'.format(i) for i in range(3)]
minor_axis = ['min_ax_{}'.format(i) for i in range(3)]
items = ['item_{}'.format(i) for i in range(3)]
pnl = pd.Panel(
    data=data,
    major_axis=major_axis,
    minor_axis=minor_axis,
    items=items,
)
pnl.to_hdf('test.h5', key='df')
reread_data = pd.read_hdf('test.h5')
reread_data.to_excel('data.xlsx')
Problem description
With the Panel class deprecated and scheduled for removal in a future release, the ability to export and read 3D data to and from an HDF5 file will be lost, as will Excel I/O for 3D data. At the moment, xarray, which is proposed as a replacement for handling 3D data, does not provide this functionality.
Expected Output
An HDF5 file and an Excel file that contain the 3D data.
Output of pd.show_versions()
Comment From: jorisvandenbossche
cc @shoyer @MaximilianR
I think the read_excel part can be "rather easily" done manually by the user if he/she wants that (since it is just writing different parts of the frame to different sheets; of course it will be less convenient, but not impossible). The HDF5 side seems more problematic to overcome with the currently proposed alternatives.
Is there support for HDF5 planned in xarray?
Comment From: TomAugspurger
xarray already has netcdf support; is that sufficient?
Comment From: shsymoen
Is this possible with xarray and netCDF: appending a 2D DataFrame or Series to an already created HDF5 file?
import pandas as pd
import numpy as np
store = pd.HDFStore('test.h5')
data = np.random.rand(4,4)
df_raw_data = pd.DataFrame(data=data)
df_raw_data.to_hdf(store, key='raw_data', mode='a')
store.close()
Comment From: TomAugspurger
@shsymoen I'm not sure I understand your example. That's writing a DataFrame, which will continue to work.
Since panel is deprecated, in future versions of pandas you'll need to either
- use a MultiIndexed pandas object, and write that to HDF5
- Build an xarray object and serialize it as documented here
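A minimal sketch of both options, assuming xarray and a NetCDF backend (e.g. netCDF4) are installed; the file and axis names below are illustrative only:
import numpy as np
import pandas as pd
import xarray as xr

data = np.random.rand(3, 3, 3)  # (items, major_axis, minor_axis)

# Option 1: MultiIndexed DataFrame written to HDF5 (same layout as pnl.to_frame())
idx = pd.MultiIndex.from_product([range(3), range(3)], names=['major', 'minor'])
df = pd.DataFrame(
    data.transpose(1, 2, 0).reshape(9, 3),  # rows: (major, minor), columns: items
    index=idx,
    columns=['item_0', 'item_1', 'item_2'],
)
df.to_hdf('test_mi.h5', key='df')

# Option 2: xarray DataArray serialized to NetCDF
da = xr.DataArray(data, dims=['items', 'major_axis', 'minor_axis'])
da.to_netcdf('test.nc')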
Comment From: shsymoen
@TomAugspurger I realized my example was not really clear, so I added the line showing that it appends the DataFrame to the already created HDF5 file. For me, MultiIndexed pandas objects are less intuitive to use, and I don't know whether it is possible to append a DataFrame to a serialized xarray object (I have no experience with xarray or netCDF).
I will try to clarify my use case a bit better: I start from a single DataFrame called 'raw'. I perform some operations on the raw DataFrame, resulting in a new DataFrame with the same indices and column names: 'processed_1'. A different set of operations gives 'processed_2'.
I add the newly created DataFrames one by one to a dictionary. In the end I convert the dict to a Panel so I can export everything to a single HDF5 file with the DataFrames under different keys. I hope this explanation makes it clearer.
Comment From: TomAugspurger
Can you iterate over the key-value pairs of your dictionary and write each to the HDF5 file under a separate key?
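Something along these lines, with hypothetical key names; each frame gets its own key and can be read back individually:
import numpy as np
import pandas as pd

# Hypothetical dict of results keyed by processing step
frames = {
    'raw': pd.DataFrame(np.random.rand(4, 4)),
    'processed_1': pd.DataFrame(np.random.rand(4, 4)),
}

with pd.HDFStore('results.h5', mode='w') as store:
    for key, frame in frames.items():
        store.put(key, frame)  # each DataFrame lives under its own key

reread = pd.read_hdf('results.h5', 'processed_1')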
Comment From: shsymoen
In principle that's possible, but I think the resulting file will be larger because you are saving the same column names and indices for every DataFrame. Reading the data back from the HDF5 file is also more complicated due to the extra iteration step.
Comment From: max-sixty
That's a good case, thanks for sharing @shsymoen
I think the options are:
1. Build workarounds, such as a dictionary of dataframes
2. Force people to use tabular data, and use MultiIndexes
3. Add this functionality into xarray
4. Leave Panel hanging around in a deprecated state
5. Reverse the deprecation of Panel

I would imagine we can manage through with a combination of 1-3, depending on the case.
There is already some similar functionality in xarray, as @TomAugspurger mentioned re netCDF. That could be expanded to a more general HDF5 case. I think there's less overlap between the xarray and MS Excel communities, so 1 & 2 are likely more appropriate there.
Comment From: shoyer
Pandas uses a customized HDF5 format (based on PyTables) for serializing pandas.Panel. In principle, we could read/write it in xarray, but there are no plans to do so. Our primary file format is NetCDF, which is a different file format also based on HDF5 (when using NetCDF4, which is the default).
Xarray could be used for reading Excel data but I agree that it's not a great fit, in part because the flexibility of spreadsheets means they come in lots of different formats. My inclination would be to read Excel files with multiple sheets into a DataFrame with a MultiIndex of some sort. That would keep all Excel IO within pandas. If you really need 3D data structures, converting into xarray from pandas data structures with a MultiIndex is usually quite straightforward.
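For reference, a minimal sketch of that last step (a MultiIndex DataFrame converted to xarray), assuming xarray and a NetCDF backend are installed; the labels are illustrative:
import numpy as np
import pandas as pd

# Hypothetical MultiIndex frame standing in for the 3D data
idx = pd.MultiIndex.from_product(
    [['item_0', 'item_1'], ['maj_ax_0', 'maj_ax_1']], names=['item', 'major']
)
df = pd.DataFrame(np.random.rand(4, 2), index=idx, columns=['min_ax_0', 'min_ax_1'])

ds = df.to_xarray()      # xarray Dataset with dims ('item', 'major')
ds.to_netcdf('data.nc')  # serialize via xarray's NetCDF writer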
Comment From: bashtage
pnl.to_frame().to_hdf('test.h5', key='df')
works fine and is a valid representation for a 3D data set.
Of course one could also use
store = pd.HDFStore('test.h5', mode='w')
for item in pnl:
    store.put(item, pnl[item])
store.close()
as a 3D store for data.
Comment From: bashtage
To be clear, having recently worked with both Panel and MultiIndex DataFrames, there is a strong case for the MI DataFrame over Panel. In particular, in a Panel there is no way to correctly store variables like a Categorical. It is also easy to end up with different data types for the same variable, e.g. int for a column in one of a panel's items and float64 for the same variable for a different "entity" that has one or more missing values. Using a MI DataFrame avoids both of these, at the cost of learning a different access pattern.
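A small illustration of that point, with made-up entity/date labels: in a long-format MI DataFrame each column keeps its own dtype, including Categorical.
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['entity_a', 'entity_b'], pd.date_range('2018-01-01', periods=3)],
    names=['entity', 'date'],
)
df = pd.DataFrame(
    {
        'grade': pd.Categorical(['x', 'y', 'x', 'y', 'x', 'y']),
        'count': np.arange(6, dtype='int64'),
        'value': np.random.randn(6),
    },
    index=idx,
)
print(df.dtypes)  # category, int64, float64 -- no silent upcasting across entities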
Comment From: shsymoen
I agree with the solution of iterating over the key-value pairs of the dict for outputting to Excel, but I think you lose some functionality if one has to do the same for storing to HDF5. Is there a possibility to make a MultiIndex df of the dictionary that holds multiple DataFrames?
Comment From: bashtage
I can't think of a 1-liner since it is necessary to rename columns using the dict's key.
d = {'a': pd.DataFrame(np.random.randn(10, 2)),
     'b': pd.DataFrame(np.random.randn(10, 2))}
dfs = []
cols = []
for k, v in d.items():
    dfs.append(v.stack())
    cols.append(k)
df = pd.concat(dfs, axis=1)
df.columns = cols
Comment From: shoyer
You can do this with the keys argument to concat, e.g., pd.concat(d.values(), keys=d.keys()):
https://stackoverflow.com/questions/23600582/concatenate-pandas-columns-under-new-multi-index-level
Comment From: bashtage
Then I suppose one could have a not-totally-horrible one-liner:
df = pd.concat([v.stack() for v in d.values()], axis=1, keys=d.keys())
Comment From: jreback
closing as this is purely a user issue. if anyone feels the docs need enhancement pls feel free. the solutions above are quite adequate though.