Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
df.to_dict("records") is slow compared to a pure Python implementation. For example:
#!/usr/bin/env python
import numpy as np
import pandas as pd
import time


def to_dict_pandas(df):
    return df.to_dict("records")


def to_dict_custom(df):
    cols = list(df)
    # Convert each column to an object array once, up front.
    col_arr_map = {col: df[col].astype(object).to_numpy() for col in cols}
    records = []
    for i in range(len(df)):
        record = {col: col_arr_map[col][i] for col in cols}
        records.append(record)
    return records


def main():
    f8_cols = "ABC"
    i8_cols = "DEF"
    str_cols = "GHI"
    n = 5_000_000
    rs = np.random.RandomState(42)
    df_data = {
        **{f8_col: rs.random(n) for f8_col in f8_cols},
        **{i8_col: rs.randint(-1e9, 1e9, n) for i8_col in i8_cols},
        **{str_col: rs.choice(["LONG STRING" * 5, "SHORT STRING"], n) for str_col in str_cols},
    }
    df = pd.DataFrame(df_data)
    print(df)
    print(df.dtypes)
    t1 = time.time()
    records_pandas = to_dict_pandas(df)
    t2 = time.time()
    print(f"Pandas took: {t2-t1:,.2f}s")
    t1 = time.time()
    records_custom = to_dict_custom(df)
    t2 = time.time()
    assert records_pandas[0] == records_custom[0]
    print(f"Custom took: {t2-t1:,.2f}s")


if __name__ == "__main__":
    main()
I get
Pandas took: 34.32s
Custom took: 10.32s
It seems to spend most of its time in maybe_box_native, which could probably be avoided because we could determine the dtype of each column once at the start.
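To illustrate the idea, here is a minimal sketch of per-value boxing versus a decision made once per array. `box_native` is a hypothetical stand-in written for this example, not pandas' actual `maybe_box_native`, and the set of dtype kinds treated as "plain numeric" is an assumption.

```python
import numpy as np


def box_native(value):
    """Hypothetical stand-in: turn a NumPy scalar into a Python scalar."""
    if isinstance(value, np.generic):
        return value.item()
    return value


def box_per_value(arr):
    # Runs the isinstance check once for every element.
    return [box_native(v) for v in arr]


def box_once_per_array(arr):
    # For plain numeric/bool dtypes (assumed: kinds i, u, f, b) the dtype
    # already tells us every element needs the same conversion, and
    # ndarray.tolist() performs it in one call.
    if arr.dtype.kind in "iufb":
        return arr.tolist()
    # Anything else falls back to the per-value check.
    return [box_native(v) for v in arr]


if __name__ == "__main__":
    arr = np.arange(5, dtype="int64")
    assert box_per_value(arr) == box_once_per_array(arr)
    print(type(box_once_per_array(arr)[0]))
```

Both functions return lists of Python scalars; the second simply skips the per-element type check when the dtype makes it unnecessary.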
Installed Versions
Prior Performance
No response
Comment From: rhshadrach
Thanks for the report, I've confirmed the same behavior on main. It sounds like you may be onto the source of the issue. Would you be interested in submitting a PR to fix it, @RogerThomas?
Comment From: RogerThomas
@rhshadrach sure, I'll give it a whirl
Comment From: phofl
I think this is a bit more complicated. Object dtype columns can hold values of many different types, so you have to check each value there regardless. But you might be able to exclude all numeric columns from the cast.
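A rough sketch of that split, assuming unique column names: object columns keep a per-value check, while every other column is converted in one call. `to_records_sketch` is a hypothetical name for illustration and is not the code from any pandas PR.

```python
import numpy as np
import pandas as pd


def to_records_sketch(df):
    """Hypothetical helper: build row dicts, checking values only in object columns."""
    cols = list(df.columns)  # assumes column names are unique
    converted = []
    for col in cols:
        ser = df[col]
        if ser.dtype == object:
            # Object columns may mix types, so each value is inspected.
            converted.append(
                [v.item() if isinstance(v, np.generic) else v for v in ser.to_numpy()]
            )
        else:
            # Non-object columns have a single dtype; Series.tolist()
            # returns Python scalars for numeric data.
            converted.append(ser.tolist())
    # Transpose the per-column lists into one dict per row.
    return [dict(zip(cols, row)) for row in zip(*converted)]


if __name__ == "__main__":
    df = pd.DataFrame({"A": [1.5, 2.5], "D": [1, 2], "G": ["x", "y"]})
    print(to_records_sketch(df))
```

Whether this is actually faster than `df.to_dict("records")` depends on the pandas version and the column mix, so it should be timed on the data in question rather than assumed.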
Comment From: RogerThomas
@phofl I think my PR basically does that. Would you have time to take a look?
Comment From: phofl
Could you link it here? E.g. by setting the closes field when opening the PR.
Comment From: RogerThomas
Sure,
https://github.com/pandas-dev/pandas/pull/46487/files