Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
import numpy as np
import pandas as pd
from tqdm import tqdm
df = pd.DataFrame(np.random.random((5000,15000)))
df.columns = df.columns.astype("str")
col_0_np = df.iloc[:, 0].to_numpy()
for idx in tqdm(df.columns):
    df[idx] = col_0_np
0%| | 20/15000 [00:04<57:50, 4.32it/s]
Installed Versions
Prior Performance
On pandas 1.3.5, the same loop completes within seconds.
Comment From: Hubedge
Works fine on both 1.4.1 and 1.3.5 with df.loc[:, idx] = col_0_np:
import numpy as np
import pandas as pd
from tqdm import tqdm
df = pd.DataFrame(np.random.random((5000,15000)))
df.columns = df.columns.astype("str")
col_0_np = df.iloc[:, 0].to_numpy()
for idx in tqdm(df.columns):
    df.loc[:, idx] = col_0_np
100%|██████████| 15000/15000 [00:04<00:00, 3246.66it/s]
Comment From: phofl
The performance regression is caused by a change in behavior under the hood. Before 1.4, the setitem operation wrote into the underlying numpy array in place. For example, if you do
na = np.random.random((2,15))
df = pd.DataFrame(na)
df.columns = df.columns.astype("str")
col_0_np = df.iloc[:, 1].to_numpy()
df[df.columns[0]] = col_0_np
then na was changed after the setitem call. Setitem now makes a copy of the underlying array instead, hence the slowdown. If writing through to the original array is not relevant for you, I would suggest using loc. I am not sure whether we can speed this up easily.
cc @jbrockmendel
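The behavior change described above can be sketched as follows. This is a minimal illustration, assuming pandas >= 1.4; on earlier versions the setitem call would have written into na in place.

```python
import numpy as np
import pandas as pd

# Build a frame on top of an existing numpy array.
na = np.zeros((2, 3))
df = pd.DataFrame(na)
df.columns = df.columns.astype("str")

new_col = np.array([7.0, 8.0])

# Since 1.4, column-setitem replaces the column with a copy of `new_col`
# instead of writing into `na` in place (the pre-1.4 behavior).
df["0"] = new_col
```

Under the pre-1.4 behavior, na[:, 0] would now hold 7.0 and 8.0; since 1.4 it is left untouched, and that extra copy is exactly what is behind the slowdown.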
Comment From: Hubedge
@phofl
In this case it does not matter whether the original na is modified, but I'm curious about the motivation for the change: I would assume that people are more used to setitem than to loc, and that they expect setitem not to copy.
Can you share the link to the discussion, if any?
Comment From: phofl
That might have been misleading on my side.
Setitem was supposed to make a copy, while loc was supposed to operate in place. Before 1.4 this was inconsistent, depending on which code path was taken. It should be more consistent now, hence the change above. Technically speaking, the fact that setitem modified the underlying array before 1.4 was a bug.
Comment From: Hubedge
That's fine as long as it's by design (and it would be even better if this difference were well documented).
Let's see if @jbrockmendel has anything to comment on improving performance. Otherwise, feel free to close the issue.
Comment From: jbrockmendel
https://github.com/pandas-dev/pandas/blob/main/doc/source/whatsnew/v1.3.0.rst#never-operate-inplace-when-setting-framekeys--values
df.loc[:, foo] = bar tries to write into the existing array(s), so it should be fast (though we're still not totally consistent about this; xref #45333). df[foo] = bar will never write into the existing array(s).
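This distinction can be observed with np.shares_memory. A small sketch: whether `before` below is a view of the column's backing block or a copy depends on the pandas version, but the check holds either way, because setitem always installs a freshly allocated array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 2)), columns=["a", "b"])

# Depending on the version, this is a view of, or a copy of, the block behind "a".
before = df["a"].to_numpy()

# df[foo] = bar never writes into the existing array:
# the column ends up backed by fresh memory.
df["a"] = np.ones(3)
after = df["a"].to_numpy()
```

By contrast, df.loc[:, "a"] = ... would (pre-Copy-on-Write) try to write into the existing block, which is why the loc variants in this thread stay fast.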
I think there may be a perf improvement available by changing Block.delete to not call np.delete. Not sure if that'd be relevant here.
Comment From: phofl
Yeah that would help. Most of the time is spent in np.delete
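For context on why Block.delete is expensive: np.delete always allocates a new array rather than returning a view, so removing a single column from a wide block copies the entire block. A minimal illustration:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# np.delete returns a freshly allocated array, never a view:
# the cost scales with the whole block, not with the deleted column.
b = np.delete(a, 1, axis=1)
```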
Comment From: sappersapper
Another example of the performance regression, when assigning values to df[col_names]:
import time
import pandas as pd
import numpy as np
n = 2000
columns = list(range(n))
df = pd.DataFrame(np.ones([n, n]), columns=columns)
start = time.time()
df[columns] = df[columns]
print(time.time() - start)
pandas 1.4.2: about 7.6 s
pandas 1.3.4: about 0.2 s
It is ok when assigning with loc (about 0.05s):
start = time.time()
df.loc[:, columns] = df[columns]
print(time.time() - start)
Comment From: jtilly
I would like to point out that .assign(...) has seen a performance regression in 1.4.0 that seems to go beyond what the changes to setitem can explain:
import pandas as pd
import numpy as np
from time import perf_counter
k = 500 # number of columns
n = 10000 # number of rows
r = 10 # number of repetitions
data = {f"col{key}": np.random.randint(0, 5, size=n) for key in range(k)}
df = pd.DataFrame(data)
print(f"{pd.__version__=}")
# use []
t0 = perf_counter()
for _ in range(r):
    for key in data:
        df[key] = data[key]
t1 = perf_counter()
print(f"[] {(t1-t0)/r:.4f}")
# use copy + .loc
t0 = perf_counter()
for _ in range(r):
    for key in data:
        df.loc[:, key] = df.loc[:, key].copy()
        df.loc[:, key] = data[key]
t1 = perf_counter()
print(f"copy + .loc {(t1-t0)/r:.4f}")
# use assign
t0 = perf_counter()
for _ in range(r):
    df = df.assign(**data)
t1 = perf_counter()
print(f"assign {(t1-t0)/r:.4f}")
Output:
pd.__version__='1.3.5'
[] 0.0203
copy + .loc 0.2112
assign 0.0329
pd.__version__='1.4.0'
[] 0.2501
copy + .loc 0.1436
assign 1.9943
pd.__version__='1.4.1'
[] 0.1976
copy + .loc 0.1464
assign 1.9205
pd.__version__='1.4.2'
[] 0.1916
copy + .loc 0.1519
assign 2.0024
pd.__version__='1.4.3'
[] 0.2191
copy + .loc 0.1347
assign 1.9111
pd.__version__='1.5.0.dev0+1030.g7d2f9b8d59'
[] 0.2060
copy + .loc 0.1269
assign 1.8481
Comment From: lbittarello
It's also worth noting that .loc is difficult to use when chaining operations, unlike assign, so it can be a lot clunkier.
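To illustrate the point, a small sketch comparing the two styles (the column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4]})

# assign slots straight into a method chain ...
out = (
    df.assign(b=lambda d: d["a"] * 2)
      .query("b > 4")
)

# ... while .loc needs an intermediate variable and separate statements.
tmp = df.copy()
tmp["b"] = 0  # .loc can only write into an existing column
tmp.loc[:, "b"] = tmp["a"] * 2
tmp = tmp.query("b > 4")
```

Both produce the same result, but only the assign version composes with further chained calls.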
Comment From: simonjayhawkins
removing from 1.4.x milestone