Pandas BUG: Column-major DataFrames stored in HDFStore are returned as row-major

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': ['foo', 'bar', 'baz', 'qux'],
    'c': [5, 6, 7, 8]
})

print(df['a'].values.strides)

store = pd.HDFStore('example.h5')
store['df'] = df

print(store['df']['a'].values.strides)
(8,)
(16,)

Problem description

I ran across this when doing some benchmarking. This has some rather serious performance implications for large DataFrames. Is this the result of an underlying limitation in HDF5?
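
For readers unfamiliar with strides, the sketch below shows how a stride of 16 bytes can arise for an int64 column. It assumes, purely for illustration, that the two integer columns 'a' and 'c' share one 2-D block of shape (n_columns, n_rows); the variable names are hypothetical and no pandas internals are called.

```python
import numpy as np

# Assumption: model the int64 columns 'a' and 'c' as one 2-D block of
# shape (n_columns, n_rows). This is an illustration, not pandas code.
block_c = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.int64)  # C order
block_f = np.asfortranarray(block_c)  # same values, Fortran order

# Selecting the first row of the block corresponds to column 'a'.
col_a_c = block_c[0]  # elements of 'a' are adjacent in memory
col_a_f = block_f[0]  # elements of 'a' are interleaved with 'c'

print(block_c.strides, col_a_c.strides)  # (32, 8) (8,)
print(block_f.strides, col_a_f.strides)  # (8, 16) (16,)
print(col_a_c.flags.c_contiguous, col_a_f.flags.c_contiguous)  # True False
```

With two 8-byte columns interleaved, stepping from one element of 'a' to the next skips 16 bytes, which is consistent with the (16,) reported above.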

Comment From: gfyoung

> Is this the result of an underlying limitation in HDF5?

I wonder if it has to do with how we store the values. Tracing the code leads me to here:

https://github.com/pandas-dev/pandas/blob/f4330611ff5ac1cbb4a89c4a7dab3d0900f9e64a/pandas/io/pytables.py#L4163-L4165

Comment From: TannhauserGate42

I think this problem could even be more fundamental ...

Pandas copy and groupby-sum aggregations (and maybe other operations) change the memory order (row-major vs. column-major) of the underlying data of the returned object.

This has a huge impact on aggregation performance.

Pandas should not do that implicitly.
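
As a NumPy-level illustration of how a copy can silently change memory order, the sketch below relies on the fact that ndarray.copy() defaults to order='C', while order='K' keeps the source layout. Whether any particular pandas operation takes this path is not shown here; the array and variable names are hypothetical.

```python
import numpy as np

# A small Fortran-ordered 2-D array standing in for some block of data.
f_block = np.asfortranarray(np.arange(8, dtype=np.int64).reshape(2, 4))

default_copy = f_block.copy()        # ndarray.copy defaults to order='C'
keep_copy = f_block.copy(order='K')  # 'K' matches the source layout

print(f_block.strides)       # (8, 16)
print(default_copy.strides)  # (32, 8)
print(keep_copy.strides)     # (8, 16)
```

The values are identical in all three arrays; only the order in which they are laid out in memory differs, which is what affects how cheaply a row or column can be scanned.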

Comment From: mroeschke

This looks correct on master. I suppose it could use a test.

In [12]: import pandas as pd
    ...:
    ...: df = pd.DataFrame({
    ...:     'a': [1, 2, 3, 4],
    ...:     'b': ['foo', 'bar', 'baz', 'qux'],
    ...:     'c': [5, 6, 7, 8]
    ...: })
    ...:
    ...: print(df['a'].values.strides)
    ...:
    ...: store = pd.HDFStore('example.h5')
    ...: store['df'] = df
    ...:
    ...: print(store['df']['a'].values.strides)
(8,)
(8,)

Comment From: johnmantios

Are we looking for a unit test or a performance test here?

Comment From: mroeschke

> Are we looking for a unit test or a performance test here?

Unit test

Comment From: johnmantios

take

Comment From: jorisvandenbossche

I think the core issue isn't actually solved. The only reason we now get back a proper column-major DataFrame is that read() makes a copy of the data (this happens in concat, which copies its input by default):

https://github.com/pandas-dev/pandas/blob/a0071f9c9674b8ae24bbcaad95a9ba70dcdcd423/pandas/io/pytables.py#L3207-L3212

But at this point, before the concat call, the data is still in row-major format. So if we want to avoid this additional copy while preserving the column-major layout, we need to investigate why a column-major DataFrame's data gets stored / returned as row-major.
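
To make the trade-off concrete, here is a minimal NumPy sketch of a copy that hides the layout problem. It assumes, for illustration only, that the values read back arrive as a transposed view; np.ascontiguousarray stands in for the copying step and is not the code pandas runs.

```python
import numpy as np

# Assumption: the values read back are a transposed view, so the
# (n_columns, n_rows) block is not C-contiguous.
raw = np.arange(8, dtype=np.int64).reshape(4, 2)
block = raw.T  # view only, no data is moved

# An explicit copy produces the desired layout, at the cost of an allocation.
fixed = np.ascontiguousarray(block)
# If the data is already laid out correctly, no copy is needed.
already_ok = np.ascontiguousarray(fixed)

print(block[0].strides, fixed[0].strides)   # (16,) (8,)
print(np.shares_memory(block, fixed))       # False
print(np.shares_memory(fixed, already_ok))  # True
```

The second call returns without copying because its input is already contiguous, which is the situation the read path would need to reach in order to drop the extra copy.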