Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this issue exists on the latest version of pandas.

  • [ ] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

df.to_dict("records") is slow compared to a pure-Python implementation. For example:

#!/usr/bin/env python
import numpy as np
import pandas as pd
import time


def to_dict_pandas(df):
    return df.to_dict("records")


def to_dict_custom(df):
    cols = list(df)
    col_arr_map = {col: df[col].astype(object).to_numpy() for col in cols}
    records = []
    for i in range(len(df)):
        record = {col: col_arr_map[col][i] for col in cols}
        records.append(record)
    return records


def main():
    f8_cols = "ABC"
    i8_cols = "DEF"
    str_cols = "GHI"

    n = 5_000_000

    rs = np.random.RandomState(42)

    df_data = {
        **{f8_col: rs.random(n) for f8_col in f8_cols},
        **{i8_col: rs.randint(-10**9, 10**9, n, dtype=np.int64) for i8_col in i8_cols},
        **{str_col: rs.choice(["LONG STRING" * 5, "SHORT STRING"], n) for str_col in str_cols},
    }
    df = pd.DataFrame(df_data)

    print(df)
    print(df.dtypes)

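    # perf_counter() is a monotonic, high-resolution clock intended for timing code
    t1 = time.perf_counter()
    records_pandas = to_dict_pandas(df)
    t2 = time.perf_counter()
    print(f"Pandas took: {t2-t1:,.2f}s")

    t1 = time.perf_counter()
    records_custom = to_dict_custom(df)
    t2 = time.perf_counter()

    # Spot-check that both approaches produce the same first record
    assert records_pandas[0] == records_custom[0]
    print(f"Custom took: {t2-t1:,.2f}s")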


if __name__ == "__main__":
    main()

I get

Pandas took: 34.32s
Custom took: 10.32s

It seems to spend most of its time in maybe_box_native, which could probably be avoided because we could determine the dtype of each column once at the start.
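One way to check where the time goes is to profile the call with the standard-library profiler. The following is a minimal sketch, not part of the original benchmark: the helper name profile_to_dict and the small two-column frame are made up for illustration, and whether maybe_box_native appears near the top of the report depends on the pandas version installed.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def profile_to_dict(df, top=10):
    """Profile df.to_dict("records") and return the stats report as text.

    Hypothetical helper for illustration; not part of pandas.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    df.to_dict("records")
    profiler.disable()

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


# A much smaller frame than the 5,000,000-row benchmark, just to see the call profile.
rs = np.random.RandomState(42)
small_df = pd.DataFrame({"A": rs.random(10_000), "D": rs.randint(0, 100, 10_000)})
report = profile_to_dict(small_df)
print(report)
```

Running this on the full benchmark frame takes noticeably longer, since profiling adds overhead to every function call.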

Installed Versions

INSTALLED VERSIONS
------------------
commit : 66e3805b8cabe977f40c05259cc3fcf7ead5687d
python : 3.7.13.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Mon Aug 30 06:12:21 PDT 2021; root:xnu-7195.141.6~3/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.3.5
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3 (dt dec pq3 ext lo64)
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.5.0
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 4.0.1
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : 1.4.32
tables : None
tabulate : 0.8.9
xarray : None
xlrd : 1.1.0
xlwt : None
numba : 0.53.1

Prior Performance

No response

Comment From: rhshadrach

Thanks for the report. I've confirmed the same behavior on main. It sounds like you may be onto the source of the issue. Would you be interested in submitting a PR to fix it, @RogerThomas?

Comment From: RogerThomas

@rhshadrach sure, I'll give it a whirl

Comment From: phofl

I think this is a bit more complicated. An object-dtype column can hold values of many different types, so you still have to check each value there. But you might be able to exclude all numeric columns from the cast.
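To illustrate the distinction, here is a rough sketch of a per-column dispatch. It is not taken from pandas or from any PR: to_records_by_dtype and box_value are hypothetical names, and the sketch assumes unique column names. NumPy-backed numeric and boolean columns are converted in bulk, while every other column keeps a per-value check.

```python
import numpy as np
import pandas as pd


def box_value(value):
    """Convert a single NumPy scalar to its Python equivalent; leave anything else alone."""
    if isinstance(value, np.generic):
        return value.item()
    return value


def to_records_by_dtype(df):
    """Hypothetical helper: build row dicts, checking individual values only where needed.

    Assumes column names are unique.
    """
    columns = {}
    for col in df.columns:
        series = df[col]
        dtype = series.dtype
        if isinstance(dtype, np.dtype) and dtype.kind in "biuf":
            # Bool/int/uint/float NumPy columns: tolist() yields Python scalars in one pass.
            columns[col] = series.to_numpy().tolist()
        else:
            # Object and other columns may mix types, so each value gets its own check.
            columns[col] = [box_value(v) for v in series]
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


df = pd.DataFrame({"A": [1.5, 2.5], "D": [1, 2], "G": ["x", "y"]})
records = to_records_by_dtype(df)
```

The dtype test runs once per column, not once per value, so its cost does not grow with the number of rows.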

Comment From: RogerThomas

@phofl I think my PR basically does that. Would you have time to take a look?

Comment From: phofl

Could you link it here? E.g. by setting the "closes" field when opening the PR.

Comment From: RogerThomas

Sure,

https://github.com/pandas-dev/pandas/pull/46487/files