Comparing vanilla installs of Pandas 1.5.3 against 2.0.0 and 2.0.1, selecting columns from a DataFrame whose columns are a MultiIndex constructed via pd.concat is noticeably slower in 2.0.0+, unless consolidation is forced manually.
I was unable to fully trace the code differences between the Pandas versions, but it appears that in 1.5.3 the first select performs consolidation, whereas 2.0.0+ never consolidates automatically, so every subsequent select stays much slower.
Parts of my production code follow a pattern of creating DataFrames via concat and then looping over column subsets, so this introduces a ~100x slowdown in places (the DataFrames involved are very wide with multilevel columns).
import timeit

import numpy as np
import pandas as pd


def _debug(force_consolidate):
    df = _make_column_multilevel_df_via_concat(force_consolidate)
    level_a_cols = df.columns.unique('A')
    print(f'Running once {force_consolidate=}')
    print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))
    print(f'Running once again {force_consolidate=}')
    print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))
    print(f'Running ten times {force_consolidate=}')
    print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=10))


def _make_column_multilevel_df_via_concat(force_consolidate):
    # 16 x 50 = 800 float columns under a two-level ('A', 'B') column MultiIndex.
    a_values = list(range(16))
    b_values = list(range(50))
    idx = pd.bdate_range('1991-01-01', '2023-12-31', name='date')
    template_frame = pd.DataFrame(np.zeros((len(idx), len(b_values))), index=idx, columns=pd.Index(b_values, name='B'))
    df = pd.concat({a: template_frame for a in a_values}, axis=1, names=['A'])
    if force_consolidate:
        df._consolidate_inplace()  # private API: merge the per-frame blocks left by concat
    return df


def select_each_column(df, cols):
    # noinspection PyStatementEffect
    [df[c] for c in cols]


if __name__ == '__main__':
    pd.show_versions()

    _debug(False)
    print()
    _debug(True)
1.5.3
Running once force_consolidate=False
0.027367000002413988
Running once again force_consolidate=False
0.0026708999648690224
Running ten times force_consolidate=False
0.024182399967685342
Running once force_consolidate=True
0.0028768000192940235
Running once again force_consolidate=True
0.0026675001718103886
Running ten times force_consolidate=True
0.024797999998554587
2.0.1
Running once force_consolidate=False
0.1871038000099361
Running once again force_consolidate=False
0.18837619991973042
Running ten times force_consolidate=False
1.8612669999711215
Running once force_consolidate=True
0.002792700193822384
Running once again force_consolidate=True
0.002532100072130561
Running ten times force_consolidate=True
0.02379040000960231
You can see that force_consolidate=True makes no difference in 1.5.3 but does in 2.0.0+. Further, comparing "Running once" with "Running once again" in 1.5.3 shows the cost of the consolidation happening where it has not already been forced.
Note that the benchmark above ignores the cost of the consolidation itself in 2.0.0+, but I hope the difference in behaviour between versions is clear.
Reading the release notes, it is clear that a lot of work has gone into the block manager, and I can see that other users might not want the previous behaviour, since the cost of consolidation could outweigh the cost of selecting from the DataFrame.
I am thus unsure whether the change in behaviour is expected. If it is expected, is there a cleaner way to force consolidation other than calling the undocumented _consolidate/_consolidate_inplace?
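For reference, the workaround used in the repro above boils down to the following rough sketch, relying on private internals (_mgr.nblocks and _consolidate_inplace are not stable API):

# Sketch only: inspect and force block consolidation via private internals.
df = _make_column_multilevel_df_via_concat(force_consolidate=False)
print(df._mgr.nblocks)     # roughly one block per concatenated frame straight out of pd.concat
df._consolidate_inplace()  # merge same-dtype blocks; repeated column selects become cheap again
print(df._mgr.nblocks)     # 1 for this all-float frame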
I also feel obliged to say thank you for all of the excellent work in this project -- it is greatly appreciated!
Installed Versions
1.5.3
INSTALLED VERSIONS ------------------ commit : 2e218d10984e9919f0296931d92ea851c6a6faf5 python : 3.10.9.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 12, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : English_Canada.1252 pandas : 1.5.3 numpy : 1.24.3 pytz : 2023.3 dateutil : 2.8.2 setuptools : 65.5.0 pip : 22.3.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : None pandas_datareader: None bs4 : None bottleneck : None brotli : None fastparquet : None fsspec : None gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : None snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None zstandard : None tzdata : None
2.0.0
INSTALLED VERSIONS ------------------ commit : 478d340667831908b5b4bf09a2787a11a14560c9 python : 3.10.9.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 12, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : English_Canada.1252 pandas : 2.0.0 numpy : 1.24.3 pytz : 2023.3 dateutil : 2.8.2 setuptools : 65.5.0 pip : 22.3.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : None pandas_datareader: None bs4 : None bottleneck : None brotli : None fastparquet : None fsspec : None gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : None snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None
2.0.1
INSTALLED VERSIONS ------------------ commit : 37ea63d540fd27274cad6585082c91b1283f963d python : 3.10.9.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 12, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : English_Canada.1252 pandas : 2.0.1 numpy : 1.24.3 pytz : 2023.3 dateutil : 2.8.2 setuptools : 65.5.0 pip : 22.3.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : None pandas_datareader: None bs4 : None bottleneck : None brotli : None fastparquet : None fsspec : None gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : None snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None
Comment From: phofl
This does not happen in concat; the previous consolidation happened in a self.values call when selecting columns, which was removed. I think this was intended @jbrockmendel?
Comment From: jbrockmendel
Getting rid of silent consolidation was definitely intentional. Where does the self.values lookup occur? We could warn on .values if a consolidation would improve perf (would be ugly, but we could even check for repeated .values calls).
Using CoW gets a 3.5x speedup.
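(For reference, a minimal sketch of turning Copy-on-Write on in pandas 2.x; the option name mode.copy_on_write is the one used in the benchmark further down:)

import pandas as pd

# Enable Copy-on-Write globally for the session (pandas 2.x option)...
pd.set_option('mode.copy_on_write', True)

# ...or scope it to a block, as the modified benchmark below does:
with pd.option_context('mode.copy_on_write', True):
    df = pd.DataFrame({'x': [1.0, 2.0]})
    sub = df[['x']]  # under CoW this shares data with df until either side is modified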
Comment From: phofl
Yes, Copy-on-Write seems like a good solution here.
The self.values call happened under the hood in _getitem_multilevel, but we got rid of it since then; it dispatches to iloc now.
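(As a rough illustration only, not the actual internal code path: selecting one level-'A' key from the repro's frame now amounts to a positional take, roughly:)

# Illustrative sketch of what df[0] does on the MultiIndex-columned repro frame;
# the real logic lives in DataFrame._getitem_multilevel and the iloc machinery.
locs = df.columns.get_loc(0)              # slice or mask of columns whose 'A' key is 0
sub = df.iloc[:, locs]                    # positional take; cost depends on the block layout
sub.columns = sub.columns.droplevel('A')  # keep only the 'B' level, like df[0] returns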
Comment From: phofl
That being said, this is a very specialised case, e.g. only one dtype. If you add a string column, the non-consolidated frame is actually faster to select from, but both cases are considerably slower than the single-dtype one.
I'd recommend using CoW; performance is the same with multiple dtypes as with a single one!
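(A hypothetical tweak to the repro's template frame to hit that mixed-dtype case, if anyone wants to reproduce it:)

# Hypothetical variant of the repro: one object-dtype column per sub-frame means
# the concatenated frame can never consolidate down to a single float block.
template_frame = pd.DataFrame(np.zeros((len(idx), len(b_values))), index=idx,
                              columns=pd.Index(b_values, name='B'))
template_frame['label'] = 'x'  # string column alongside the floats
df = pd.concat({a: template_frame for a in a_values}, axis=1, names=['A'])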
Comment From: jbrockmendel
I'm guessing this goes through the _reindex_multi path. Two things to try out: 1) get rid of that path altogether, 2) only go through it in already-consolidated cases
Comment From: muttermachine
Thank you for the quick replies. One thing to note: extending the basic performance test shows worse performance with CoW enabled, unless manual consolidation is triggered.
I am no expert on the internals of Pandas, but if there is a useful (or even hacky) change I could try in order to see whether an approach is viable, I would be willing to try it out.
2.0.1
Running once force_consolidate=False cow=False
0.18940009991638362
Running once again force_consolidate=False cow=False
0.1838280998636037
Running ten times force_consolidate=False cow=False
1.8617813000455499
Running once force_consolidate=False cow=True
0.29346469999291
Running once again force_consolidate=False cow=True
0.29061580006964505
Running ten times force_consolidate=False cow=True
3.1367222000844777
Running once force_consolidate=True cow=False
0.002467299811542034
Running once again force_consolidate=True cow=False
0.002005200134590268
Running ten times force_consolidate=True cow=False
0.020947899902239442
Running once force_consolidate=True cow=True
0.002265599789097905
Running once again force_consolidate=True cow=True
0.002035900019109249
Running ten times force_consolidate=True cow=True
0.01990539999678731
Modified code:
import itertools
import timeit

import numpy as np
import pandas as pd


def _debug(force_consolidate, cow):
    # Run the whole benchmark under the requested Copy-on-Write setting.
    with pd.option_context('mode.copy_on_write', cow):
        df = _make_column_multilevel_df_via_concat(force_consolidate)
        level_a_cols = df.columns.unique('A')
        print(f'Running once {force_consolidate=} {cow=}')
        print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))
        print(f'Running once again {force_consolidate=} {cow=}')
        print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))
        print(f'Running ten times {force_consolidate=} {cow=}')
        print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=10))


def _make_column_multilevel_df_via_concat(force_consolidate):
    a_values = list(range(16))
    b_values = list(range(50))
    idx = pd.bdate_range('1991-01-01', '2023-12-31', name='date')
    template_frame = pd.DataFrame(np.zeros((len(idx), len(b_values))), index=idx, columns=pd.Index(b_values, name='B'))
    df = pd.concat({a: template_frame for a in a_values}, axis=1, names=['A'])
    if force_consolidate:
        df._consolidate_inplace()  # private API: merge the per-frame blocks left by concat
    return df


def select_each_column(df, cols):
    # noinspection PyStatementEffect
    [df[c] for c in cols]


if __name__ == '__main__':
    pd.show_versions()

    for (force_consolidate, cow) in itertools.product([False, True], [False, True]):
        _debug(force_consolidate=force_consolidate, cow=cow)
        print()
Comment From: phofl
Yikes, I will look into this.
This is what I am getting on main:
Running once force_consolidate=False cow=False
0.007546624983660877
Running once again force_consolidate=False cow=False
0.004861583933234215
Running ten times force_consolidate=False cow=False
0.05707850004546344
Running once force_consolidate=False cow=True
0.001773166935890913
Running once again force_consolidate=False cow=True
0.0014914589701220393
Running ten times force_consolidate=False cow=True
0.0145230001071468
Running once force_consolidate=True cow=False
0.0015935830306261778
Running once again force_consolidate=True cow=False
0.0013050410198047757
Running ten times force_consolidate=True cow=False
0.013885915977880359
Running once force_consolidate=True cow=True
0.0016619160305708647
Running once again force_consolidate=True cow=True
0.001402582973241806
Running ten times force_consolidate=True cow=True
0.012952166958712041
Results on 2.0.1 are similar to yours. We should check whether we can backport the commit that improves performance here...
Comment From: muttermachine
Well, the results on main do indeed look good under CoW! For my specific use case, given that the root change was intentional, I am fine with manually consolidating where I need to for now and spending the time on ensuring my codebase can switch to CoW safely.
I do not have a pandas dev environment set up, but is there a way I could help track down the commit in question to make a backport easier? I presume git bisect and the like get expensive, needing the full compile etc.?
Comment From: muttermachine
Ah I see you've already tracked it down. Thank you again for all the help, quick replies, and indeed the work on Pandas generally!
Comment From: phofl
Thanks for offering. Already found it. Bisects are expensive, yes, but it depends a bit on your hardware.
Comment From: phofl
I'll reopen until we do the actual backport. Thx
Comment From: phofl
The backport PR has been merged; this will be fixed in 2.0.2 for CoW.
Comment From: muttermachine
Much appreciated - thank you very much
Comment From: jbrockmendel
Do we need to do anything for the non-CoW case?