Pandas version checks

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this issue exists on the latest version of pandas.

  • [x] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1, 1_000_000))

mask = df > 0.5
%timeit _ = df.where(mask)
# 693 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Perf result taken from pyinstrument:

This issue seems to be related to these lines: https://github.com/pandas-dev/pandas/blob/d1ec1a4c9b58a9ebff482af2b918094e39d87893/pandas/core/generic.py#L9735-L9737

When the DataFrame is wide, the overhead of is_bool_dtype accumulates because it is called once per column. Would it be better to use cond.dtypes.unique() instead?
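A minimal sketch of the two checks (the helper names are illustrative, not pandas API; the per-column loop only paraphrases the linked lines):

import numpy as np
import pandas as pd
from pandas.api.types import is_bool_dtype

cond = pd.DataFrame(np.random.randn(1, 1_000_000)) > 0.5

def check_per_column(cond):
    # roughly what the linked lines do: one is_bool_dtype call per column
    for _dt in cond.dtypes:
        if not is_bool_dtype(_dt):
            raise TypeError(f"Boolean array expected for the condition, not {_dt}")

def check_unique_dtypes(cond):
    # suggested alternative: deduplicate dtypes first, so the number of
    # is_bool_dtype calls no longer scales with the number of columns
    for _dt in cond.dtypes.unique():
        if not is_bool_dtype(_dt):
            raise TypeError(f"Boolean array expected for the condition, not {_dt}")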

Installed Versions

INSTALLED VERSIONS
------------------
commit : 0691c5cf90477d3503834d983f69350f250a6ff7
python : 3.10.14
python-bits : 64
OS : Linux
OS-release : 5.15.0-122-generic
Version : #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.3
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
pip : 24.0
Cython : 3.0.7
sphinx : 7.3.7
IPython : 8.25.0
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.4
lxml.etree : None
matplotlib : 3.9.2
numba : 0.60.0
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : 2.9.9
pymysql : 1.4.6
pyarrow : 16.1.0
pyreadstat : None
pytest : 8.2.2
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.0
sqlalchemy : 2.0.31
tables : 3.9.2
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : 0.22.0
tzdata : 2024.1
qtpy : 2.4.1
pyqt5 : None

Prior Performance

No response

Comment From: samukweku

Hi @auderson, do you have a benchmark you are comparing against?

Comment From: auderson

@samukweku No, but I do find that the percentage of time spent in is_bool_dtype increases with the number of columns.
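A minimal way to see that with pyinstrument (assuming pyinstrument is installed; the setup mirrors the example above):

from pyinstrument import Profiler
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1, 1_000_000))
mask = df > 0.5

profiler = Profiler()
profiler.start()
_ = df.where(mask)
profiler.stop()
# the dtype-validation frames (is_bool_dtype) show up in this call tree
print(profiler.output_text(unicode=True, color=False))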

Comment From: samukweku

A crude benchmark, comparing against numpy:

np.random.seed(3)
df = pd.DataFrame(np.random.randn(1, 1_000_000))
mask = df > 0.5

%timeit pd.DataFrame(np.where(mask, df, np.nan), columns=df.columns)
573 μs ± 102 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit df.where(mask)
324 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pd.DataFrame(np.where(mask, df, np.nan), columns=df.columns).equals(df.where(mask))
Out[14]: True

Comment From: samukweku

@auderson I think there is an expense associated with .unique though, so I'm not sure it would offer better performance. Maybe you could test it locally?

Comment From: auderson

@samukweku

On my local machine the performance improves by about 10x when using .dtypes.unique().

[profiling screenshot]

But a better solution might be to get the dtype of each block instead of the dtype of each column; then .dtypes.unique() would not be needed, e.g. [blk.dtype for blk in df._mgr.blocks].

I'm not familiar with pandas internals, so I'm not sure if the above usage is OK.
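A rough sketch of that block-level check, for illustration only (check_blocks is an illustrative name, and _mgr.blocks is private internal API that can change between pandas versions):

from pandas.api.types import is_bool_dtype

def check_blocks(cond):
    # one dtype per block instead of one per column; a DataFrame built from a
    # single numpy array usually holds a single block
    for blk in cond._mgr.blocks:
        if not is_bool_dtype(blk.dtype):
            raise TypeError(f"Boolean array expected for the condition, not {blk.dtype}")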