In [2]: import pandas as pd
In [3]: pd.__version__
Out[3]: '1.4.0.dev0+143.g5675cd8ab2'
In [4]: s = pd.concat([pd.Series(list('abc'))] * 100_000)
In [5]: %timeit s.str.upper()
48.6 ms ± 420 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: s1 = pd.concat([pd.Series(list('abc'), dtype='string[pyarrow]')] * 100_000)
In [7]: %timeit s1.str.upper()
308 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: s2 = s.astype('string[pyarrow]')
In [11]: %timeit s2.str.upper()
1.38 ms ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [12]: s1._mgr.blocks[0].values._data.num_chunks
Out[12]: 100000
In [13]: s2._mgr.blocks[0].values._data.num_chunks
Out[13]: 1
If you naively concatenate Series, the pyarrow storage is not rechunked. This can lead to dramatic performance issues: e.g. you get 100,000 chunks above, vs. 1 from the astype operation.
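In the meantime, one possible workaround is to collapse the underlying pyarrow ChunkedArray yourself after the concat. The sketch below is illustrative only: the helper name is made up, and it relies on the private `._data` attribute used above and on the `pandas.arrays.ArrowStringArray` constructor accepting a pyarrow array, none of which is stable API.

```python
import pandas as pd
import pyarrow as pa

def rechunk_string_series(s: pd.Series) -> pd.Series:
    # Hypothetical helper relying on private internals: pull out the
    # underlying pyarrow ChunkedArray, copy it into a single contiguous
    # array, and rebuild the string[pyarrow] Series around it.
    chunked = s.array._data                  # pyarrow.ChunkedArray (private attribute)
    combined = chunked.combine_chunks()      # copies the data into one chunk
    return pd.Series(
        pd.arrays.ArrowStringArray(combined), index=s.index, name=s.name
    )

s1 = pd.concat([pd.Series(list("abc"), dtype="string[pyarrow]")] * 100_000)
s1 = rechunk_string_series(s1)
assert s1.array._data.num_chunks == 1
```

This copies the data once, but string operations afterwards should run on a single chunk, comparable to the astype path above.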
Comment From: jreback
cc @simonjayhawkins @jorisvandenbossche
Comment From: jorisvandenbossche
I think in general it is good that concat can be done without a copy of the data, but we should probably have some "rechunking" facilities and some logic to decide when to rechunk and when not.
Based on a minimum target number of elements for a single chunk, we could check after the concat whether there is any chunk below that number (or, more cheaply, whether the average is below that number, i.e. (_data.length() / _data.num_chunks) < MIN_ELEMENTS).
Pyarrow has a ChunkedArray.combine_chunks, but that will flatten into a single non-chunked array, which is not necessarily what we want for large data. So we will have to write something custom (although we could maybe also upstream it to pyarrow).
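To make that concrete, here is a rough sketch of what such custom rechunking logic could look like (the function name and the MIN_ELEMENTS value are made up for illustration, not a proposed API): it only rechunks when the average chunk size falls below the threshold, and it merges consecutive chunks up to roughly that size instead of flattening everything into one array.

```python
import pyarrow as pa

MIN_ELEMENTS = 10_000  # illustrative threshold only

def maybe_rechunk(data: pa.ChunkedArray, target: int = MIN_ELEMENTS) -> pa.ChunkedArray:
    # Cheap check described above: only rechunk when the average
    # chunk length falls below the target.
    if data.num_chunks == 0 or data.length() / data.num_chunks >= target:
        return data

    # Merge consecutive chunks until each group holds roughly ``target``
    # elements, so large data stays chunked rather than being flattened
    # into one huge array.
    new_chunks, buffer, buffered_len = [], [], 0
    for chunk in data.chunks:
        buffer.append(chunk)
        buffered_len += len(chunk)
        if buffered_len >= target:
            new_chunks.append(pa.concat_arrays(buffer))
            buffer, buffered_len = [], 0
    if buffer:
        new_chunks.append(pa.concat_arrays(buffer))
    return pa.chunked_array(new_chunks, type=data.type)
```

Applied to s1._mgr.blocks[0].values._data from the example above (300,000 elements in 100,000 chunks), this would produce about 30 chunks of roughly 10,000 elements each.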
Comment From: simonjayhawkins
changing milestone to 1.3.5
Comment From: simonjayhawkins
> I think in general it is good that concat can be done without a copy of the data, but we should probably have some "rechunking" facilities and some logic to decide when to rechunk and when not.
adding enhancement and needs discussion labels and moving off 1.3.5 milestone.