Code Sample, a copy-pastable example if possible
import pandas as pd
df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': ['foo', 'bar', 'baz', 'qux'],
    'c': [5, 6, 7, 8]
})
print(df['a'].values.strides)
store = pd.HDFStore('example.h5')
store['df'] = df
print(store['df']['a'].values.strides)
(8,)
(16,)
Problem description
I ran across this when doing some benchmarking. This has some rather serious performance implications for large DataFrames. Is this the result of an underlying limitation in HDF5?
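For readers unfamiliar with strides, here is a minimal NumPy-only sketch of what the two outputs above mean. It assumes the two int64 columns ('a' and 'c') share one 2D array, which is consistent with the 16-byte stride reported; the variable names are illustrative and nothing here touches pandas or HDF5.

```python
import numpy as np

# Two int64 columns of four rows each, as in the report.
a = [1, 2, 3, 4]
c = [5, 6, 7, 8]

# Column-major: the values of one column sit next to each other in memory.
col_major = np.asfortranarray(np.column_stack([a, c]).astype(np.int64))

# Row-major: the values of one row sit next to each other,
# so the two columns are interleaved.
row_major = np.ascontiguousarray(col_major)

# Same data, different distance in bytes between consecutive column values.
print(col_major[:, 0].strides)  # (8,)
print(row_major[:, 0].strides)  # (16,)
```

With a stride of 16, reading one column skips over every other value, which is where the performance cost on large frames would come from.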
Comment From: gfyoung
Is this the result of an underlying limitation in HDF5?
I wonder if it has to do with how we store the values. Tracing the code leads me to here:
https://github.com/pandas-dev/pandas/blob/f4330611ff5ac1cbb4a89c4a7dab3d0900f9e64a/pandas/io/pytables.py#L4163-L4165
Comment From: TannhauserGate42
I think this problem could even be more fundamental ...
Pandas copy and groupby-sum aggregations (and maybe other operations) change the major order of the underlying data in the returned object.
This has a huge impact on aggregation performance.
Pandas should not do that implicitly.
Comment From: mroeschke
This looks correct on master. I suppose it could use a test.
In [12]: import pandas as pd
...:
...: df = pd.DataFrame({
...: 'a': [1, 2, 3, 4],
...: 'b': ['foo', 'bar', 'baz', 'qux'],
...: 'c': [5, 6, 7, 8]
...: })
...:
...: print(df['a'].values.strides)
...:
...: store = pd.HDFStore('example.h5')
...: store['df'] = df
...:
...: print(store['df']['a'].values.strides)
(8,)
(8,)
Comment From: johnmantios
Are we looking for a unit test or a performance test here?
Comment From: mroeschke
Are we looking for a unit test or a performance test here?
Unit test
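A rough sketch of what such a unit test might look like, not the test that was eventually written. The helper name `is_contiguous_1d`, the test name, and the use of pytest's `tmp_path` fixture are all assumptions for illustration; the DataFrame and the HDFStore round trip are taken from the report above.

```python
import numpy as np


def is_contiguous_1d(values: np.ndarray) -> bool:
    """Return True if consecutive elements are exactly one itemsize apart."""
    return values.ndim == 1 and values.strides == (values.itemsize,)


def test_hdf_roundtrip_keeps_column_strides(tmp_path):
    # Imported lazily so the helper above works without pandas or PyTables.
    import pytest

    pd = pytest.importorskip("pandas")
    pytest.importorskip("tables")  # PyTables backs pd.HDFStore

    df = pd.DataFrame({
        'a': [1, 2, 3, 4],
        'b': ['foo', 'bar', 'baz', 'qux'],
        'c': [5, 6, 7, 8]
    })

    # Write the frame and read it back, as in the original report.
    with pd.HDFStore(str(tmp_path / "example.h5")) as store:
        store['df'] = df
        result = store['df']

    # Only the integer columns are checked; 'b' holds Python objects.
    for name in ['a', 'c']:
        assert is_contiguous_1d(df[name].values)
        assert is_contiguous_1d(result[name].values)
```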
Comment From: johnmantios
take
Comment From: jorisvandenbossche
I think the core issue isn't actually solved. The only reason we now get back a proper column-major dataframe is because read() makes a copy of the data (this happens in concat; by default it makes a copy of the input):
https://github.com/pandas-dev/pandas/blob/a0071f9c9674b8ae24bbcaad95a9ba70dcdcd423/pandas/io/pytables.py#L3207-L3212
But at this point, before the concat call, the data is still in row-major format. So if we want to avoid making this additional copy while preserving column-major layout, we need to investigate why a column-major dataframe's data gets stored / returned as row-major.
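To illustrate how a copy can hide the layout problem, here is a small NumPy-only sketch. It assumes a block is held as a 2D array with one row per DataFrame column, and it uses `ndarray.copy()` as a stand-in for the copy made during concat; it does not call the actual pandas read path, and the names `on_disk`, `block` and `copied` are hypothetical.

```python
import numpy as np

# Assumption: shape is (n_columns, n_rows), one row per DataFrame column.
# Taking the transpose of a C-ordered (n_rows, n_columns) array gives the
# interleaved layout described above without copying anything.
on_disk = np.arange(8, dtype=np.int64).reshape(4, 2)
block = on_disk.T  # view: shape (2, 4), strides (8, 16)

# ndarray.copy() defaults to order="C", so the copy is laid out
# with each DataFrame column contiguous.
copied = block.copy()

print(block[0].strides, np.shares_memory(block, on_disk))    # (16,) True
print(copied[0].strides, np.shares_memory(copied, on_disk))  # (8,) False
```

The copied array reports the expected 8-byte stride, but only at the cost of duplicating the data, which is the extra copy the comment above would like to avoid.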