Comparing vanilla installs of Pandas 1.5.3 against 2.0.0 and 2.0.1, selecting columns from a DataFrame whose columns are a MultiIndex constructed via pd.concat is noticeably slower in 2.0.0+, unless consolidation is forced manually.
I was unable to fully trace the code differences between the Pandas versions, but it appears that in 1.5.3 the first select performs consolidation, whereas 2.0.0+ never consolidates automatically, so every subsequent select stays much slower.
Parts of my production code follow a pattern of creating DataFrames via concat and then looping over column subsets, so this introduces a ~100x slowdown in places (the DataFrames involved are very wide with multilevel columns).
import timeit

import numpy as np
import pandas as pd


def _debug(force_consolidate):
    df = _make_column_multilevel_df_via_concat(force_consolidate)
    level_a_cols = df.columns.unique('A')
    print(f'Running once {force_consolidate=}')
    print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))
    print(f'Running once again {force_consolidate=}')
    print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))
    print(f'Running ten times {force_consolidate=}')
    print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=10))


def _make_column_multilevel_df_via_concat(force_consolidate):
    # 16 x 50 = 800 float columns under a two-level ('A', 'B') column MultiIndex.
    a_values = list(range(16))
    b_values = list(range(50))
    idx = pd.bdate_range('1991-01-01', '2023-12-31', name='date')
    template_frame = pd.DataFrame(np.zeros((len(idx), len(b_values))), index=idx, columns=pd.Index(b_values, name='B'))
    df = pd.concat({a: template_frame for a in a_values}, axis=1, names=['A'])
    if force_consolidate:
        df._consolidate_inplace()  # private API: merge the per-frame blocks left by concat
    return df


def select_each_column(df, cols):
    # noinspection PyStatementEffect
    [df[c] for c in cols]


if __name__ == '__main__':
    pd.show_versions()

    _debug(False)
    print()
    _debug(True)
1.5.3
Running once force_consolidate=False
0.027367000002413988
Running once again force_consolidate=False
0.0026708999648690224
Running ten times force_consolidate=False
0.024182399967685342
Running once force_consolidate=True
0.0028768000192940235
Running once again force_consolidate=True
0.0026675001718103886
Running ten times force_consolidate=True
0.024797999998554587
2.0.1
Running once force_consolidate=False
0.1871038000099361
Running once again force_consolidate=False
0.18837619991973042
Running ten times force_consolidate=False
1.8612669999711215
Running once force_consolidate=True
0.002792700193822384
Running once again force_consolidate=True
0.002532100072130561
Running ten times force_consolidate=True
0.02379040000960231
You can see that force_consolidate=True makes no difference in 1.5.3 but does in 2.0.0+. Further, comparing "Running once" with "Running once again" in 1.5.3 shows the cost of the consolidation happening where it has not already been forced.
Note that the benchmark above ignores the cost of the consolidation itself in 2.0.0+, but I hope the difference in behaviour between versions is clear.
Reading the release notes, it is clear that a lot of work has gone into the block manager, and I can see that other users might not want the previous behaviour, since the cost of consolidation could outweigh the cost of selecting from the DataFrame.
I am thus unsure whether the change in behaviour is expected. If it is expected, is there a cleaner way to force consolidation other than calling the undocumented _consolidate/_consolidate_inplace?
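For reference, the workaround used in the repro above boils down to the following rough sketch, relying on private internals (_mgr.nblocks and _consolidate_inplace are not stable API):

# Sketch only: inspect and force block consolidation via private internals.
df = _make_column_multilevel_df_via_concat(force_consolidate=False)
print(df._mgr.nblocks)     # roughly one block per concatenated frame straight out of pd.concat
df._consolidate_inplace()  # merge same-dtype blocks; repeated column selects become cheap again
print(df._mgr.nblocks)     # 1 for this all-float frame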
I also feel obliged to say thank you for all of the excellent work in this project -- it is greatly appreciated!
Installed Versions
1.5.3
INSTALLED VERSIONS ------------------ commit : 2e218d10984e9919f0296931d92ea851c6a6faf5 python : 3.10.9.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 12, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : English_Canada.1252 pandas : 1.5.3 numpy : 1.24.3 pytz : 2023.3 dateutil : 2.8.2 setuptools : 65.5.0 pip : 22.3.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : None pandas_datareader: None bs4 : None bottleneck : None brotli : None fastparquet : None fsspec : None gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : None snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None zstandard : None tzdata : None
2.0.0
INSTALLED VERSIONS ------------------ commit : 478d340667831908b5b4bf09a2787a11a14560c9 python : 3.10.9.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 12, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : English_Canada.1252 pandas : 2.0.0 numpy : 1.24.3 pytz : 2023.3 dateutil : 2.8.2 setuptools : 65.5.0 pip : 22.3.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : None pandas_datareader: None bs4 : None bottleneck : None brotli : None fastparquet : None fsspec : None gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : None snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None
2.0.1
INSTALLED VERSIONS ------------------ commit : 37ea63d540fd27274cad6585082c91b1283f963d python : 3.10.9.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 12, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : English_Canada.1252 pandas : 2.0.1 numpy : 1.24.3 pytz : 2023.3 dateutil : 2.8.2 setuptools : 65.5.0 pip : 22.3.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : None pandas_datareader: None bs4 : None bottleneck : None brotli : None fastparquet : None fsspec : None gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : None snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None
Comment From: phofl
This does not happen in concat; the previous consolidation happened in a self.values call when selecting columns, which was removed. I think this was intended @jbrockmendel?
Comment From: jbrockmendel
Getting rid of silent consolidation was definitely intentional. Where does the self.values lookup occur? We could warn on .values if a consolidation would improve perf (would be ugly, but we could even check for repeated .values calls).
Using CoW gets a 3.5x speedup.
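(For reference, a minimal sketch of turning Copy-on-Write on in pandas 2.x; the option name mode.copy_on_write is the one used in the benchmark further down:)

import pandas as pd

# Enable Copy-on-Write globally for the session (pandas 2.x option)...
pd.set_option('mode.copy_on_write', True)

# ...or scope it to a block, as the modified benchmark below does:
with pd.option_context('mode.copy_on_write', True):
    df = pd.DataFrame({'x': [1.0, 2.0]})
    sub = df[['x']]  # under CoW this shares data with df until either side is modified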
Comment From: phofl
Yes, Copy-on-Write seems like a good solution here.
The self.values call happened under the hood in _getitem_multilevel, but we got rid of it since then; it dispatches to iloc now.
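(As a rough illustration only, not the actual internal code path: selecting one level-'A' key from the repro's frame now amounts to a positional take, roughly:)

# Illustrative sketch of what df[0] does on the MultiIndex-columned repro frame;
# the real logic lives in DataFrame._getitem_multilevel and the iloc machinery.
locs = df.columns.get_loc(0)              # slice or mask of columns whose 'A' key is 0
sub = df.iloc[:, locs]                    # positional take; cost depends on the block layout
sub.columns = sub.columns.droplevel('A')  # keep only the 'B' level, like df[0] returns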
Comment From: phofl
That being said, this is a very specialised case, e.g. only one dtype. If you add a string column, the non-consolidated frame is actually faster to select from, but both cases are considerably slower than the single-dtype one.
I'd recommend using CoW; performance is the same with multiple dtypes as with a single one!
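(A hypothetical tweak to the repro's template frame to hit that mixed-dtype case, if anyone wants to reproduce it:)

# Hypothetical variant of the repro: one object-dtype column per sub-frame means
# the concatenated frame can never consolidate down to a single float block.
template_frame = pd.DataFrame(np.zeros((len(idx), len(b_values))), index=idx,
                              columns=pd.Index(b_values, name='B'))
template_frame['label'] = 'x'  # string column alongside the floats
df = pd.concat({a: template_frame for a in a_values}, axis=1, names=['A'])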
Comment From: jbrockmendel
I'm guessing this goes through the _reindex_multi path. Two things to try out: 1) get rid of that path altogether, 2) only go through it in already-consolidated cases
Comment From: muttermachine
Thank you for the quick replies. One thing to note: extending the basic performance test shows worse performance with CoW enabled, unless manual consolidation is triggered.
I am no expert on the internals of Pandas, but if there is a useful (or even hacky) change I could try in order to see whether an approach is viable, I would be willing to try it out.
2.0.1
Running once force_consolidate=False cow=False
0.18940009991638362
Running once again force_consolidate=False cow=False
0.1838280998636037
Running ten times force_consolidate=False cow=False
1.8617813000455499
Running once force_consolidate=False cow=True
0.29346469999291
Running once again force_consolidate=False cow=True
0.29061580006964505
Running ten times force_consolidate=False cow=True
3.1367222000844777
Running once force_consolidate=True cow=False
0.002467299811542034
Running once again force_consolidate=True cow=False
0.002005200134590268
Running ten times force_consolidate=True cow=False
0.020947899902239442
Running once force_consolidate=True cow=True
0.002265599789097905
Running once again force_consolidate=True cow=True
0.002035900019109249
Running ten times force_consolidate=True cow=True
0.01990539999678731
Modified code:
import itertools
import timeit

import numpy as np
import pandas as pd


def _debug(force_consolidate, cow):
    # Run the whole benchmark under the requested Copy-on-Write setting.
    with pd.option_context('mode.copy_on_write', cow):
        df = _make_column_multilevel_df_via_concat(force_consolidate)
        level_a_cols = df.columns.unique('A')
        print(f'Running once {force_consolidate=} {cow=}')
        print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))
        print(f'Running once again {force_consolidate=} {cow=}')
        print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))
        print(f'Running ten times {force_consolidate=} {cow=}')
        print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=10))


def _make_column_multilevel_df_via_concat(force_consolidate):
    a_values = list(range(16))
    b_values = list(range(50))
    idx = pd.bdate_range('1991-01-01', '2023-12-31', name='date')
    template_frame = pd.DataFrame(np.zeros((len(idx), len(b_values))), index=idx, columns=pd.Index(b_values, name='B'))
    df = pd.concat({a: template_frame for a in a_values}, axis=1, names=['A'])
    if force_consolidate:
        df._consolidate_inplace()  # private API: merge the per-frame blocks left by concat
    return df


def select_each_column(df, cols):
    # noinspection PyStatementEffect
    [df[c] for c in cols]


if __name__ == '__main__':
    pd.show_versions()

    for (force_consolidate, cow) in itertools.product([False, True], [False, True]):
        _debug(force_consolidate=force_consolidate, cow=cow)
        print()
Comment From: phofl
Yikes, I will look into this.
This is what I am getting on main:
Running once force_consolidate=False cow=False
0.007546624983660877
Running once again force_consolidate=False cow=False
0.004861583933234215
Running ten times force_consolidate=False cow=False
0.05707850004546344
Running once force_consolidate=False cow=True
0.001773166935890913
Running once again force_consolidate=False cow=True
0.0014914589701220393
Running ten times force_consolidate=False cow=True
0.0145230001071468
Running once force_consolidate=True cow=False
0.0015935830306261778
Running once again force_consolidate=True cow=False
0.0013050410198047757
Running ten times force_consolidate=True cow=False
0.013885915977880359
Running once force_consolidate=True cow=True
0.0016619160305708647
Running once again force_consolidate=True cow=True
0.001402582973241806
Running ten times force_consolidate=True cow=True
0.012952166958712041
Results on 2.0.1 are similar to yours. We should check whether we can backport the commit that improves performance here...
Comment From: muttermachine
Well, the results on main do indeed look good under CoW! For my specific use case, given that the root change was intentional, I am fine with manually consolidating where I need to for now and spending the time on ensuring my codebase can switch to CoW safely.
I do not have a pandas dev environment set up, but is there a way I could help track down the commit in question to make a backport easier? I presume git bisect and the like get expensive, needing the full compile etc.?
Comment From: muttermachine
Ah I see you've already tracked it down. Thank you again for all the help, quick replies, and indeed the work on Pandas generally!
Comment From: phofl
Thanks for offering. Already found it. Bisects are expensive, yes, but it depends a bit on your hardware.
Comment From: phofl
I'll reopen until we do the actual backport. Thx
Comment From: phofl
The backport PR has been merged; this will be fixed in 2.0.2 for CoW.
Comment From: muttermachine
Much appreciated - thank you very much
Comment From: jbrockmendel
Do we need to do anything for the non-CoW case?