Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
import numpy as np
import pandas as pd

N = 1500
slow_df = pd.DataFrame({'a': np.arange(N)}, index=[pd.NA] * N)
%timeit slow_df['a'].nlargest()
fast_df = pd.DataFrame({'a': np.arange(N)})
%timeit fast_df['a'].nlargest()
results in
448 ms ± 9.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
105 µs ± 850 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
so roughly 4000 times slower with a non-unique index. This was very surprising.
Installed Versions
Prior Performance
No response
Comment From: paddymul
Some additional examples I tried. It's not just pd.NA that causes the problem. A random array isn't significantly slower than an ordered array. Also, the performance scales badly (big-O unknown) with the number of duplicates, not the total size (look at med_df).
import numpy as np
import pandas as pd

N = 1500
N_DOUBLE = 3000
N_HALF = 750
slow_df = pd.DataFrame({'a': np.arange(N)}, index=[pd.NA] * N)
print("slow_df")
%timeit slow_df['a'].nlargest()
slow_df2 = pd.DataFrame({'a': np.arange(N_DOUBLE)}, index=np.concatenate([[pd.NA] * N, np.arange(N)]))
print("slow_df2")
%timeit slow_df2['a'].nlargest()
slow_df3 = pd.DataFrame({'a': np.random.rand(N_DOUBLE)}, index=np.concatenate([[pd.NA] * N, np.arange(N)]))
print("slow_df3")
%timeit slow_df3['a'].nlargest()
slow_df4 = pd.DataFrame({'a': np.random.rand(N_DOUBLE)}, index=np.concatenate([[1] * N, np.arange(N)]))
print("slow_df4")
%timeit slow_df4['a'].nlargest()
med_df = pd.DataFrame({'a': np.random.rand(N)}, index=np.concatenate([[1] * N_HALF, np.arange(N_HALF)]))
print("med_df")
%timeit med_df['a'].nlargest()
fast_df = pd.DataFrame({'a': np.arange(N)})
print("fast_df")
%timeit fast_df['a'].nlargest()
fast_df2 = pd.DataFrame({'a': np.random.rand(N)})
print("fast_df2")
%timeit fast_df2['a'].nlargest()
results in
slow_df
445 ms ± 20.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
slow_df2
444 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
slow_df3
439 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
slow_df4
449 ms ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
med_df
18.4 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fast_df
105 µs ± 616 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
fast_df2
121 µs ± 553 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
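To pin down how the cost scales, one could sweep the number of duplicated labels while holding the total size fixed. A minimal sketch (the sizes are illustrative, using the stdlib timeit in place of %timeit):

import timeit
import numpy as np
import pandas as pd

TOTAL = 2000
for n_dup in [250, 500, 1000, 2000]:
    # n_dup copies of the label 1, the rest unique
    idx = np.concatenate([[1] * n_dup, np.arange(TOTAL - n_dup)])
    s = pd.Series(np.random.rand(TOTAL), index=idx)
    # average over a few runs of the default nlargest(n=5)
    t = timeit.timeit(s.nlargest, number=5) / 5
    print(f"{n_dup:>5} duplicates: {t * 1e3:8.2f} ms")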
Comment From: paddymul
Profiling output:
import numpy as np
import pandas as pd

N = 1500
slow_df = pd.DataFrame({'a': np.arange(N)}, index=[pd.NA] * N)
slow_ser = slow_df['a']
%prun -s cumtime slow_ser.nlargest()
results in
4063 function calls (4053 primitive calls) in 0.669 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.669 0.669 {built-in method builtins.exec}
1 0.000 0.000 0.669 0.669 <string>:1(<module>)
1 0.000 0.000 0.669 0.669 series.py:4006(nlargest)
1 0.000 0.000 0.669 0.669 selectn.py:55(nlargest)
1 0.000 0.000 0.669 0.669 selectn.py:90(compute)
1 0.000 0.000 0.668 0.668 series.py:5047(drop)
1 0.000 0.000 0.668 0.668 generic.py:4680(drop)
1 0.004 0.004 0.668 0.668 generic.py:4719(_drop_axis)
1 0.000 0.000 0.658 0.658 base.py:6076(get_indexer_for)
1 0.000 0.000 0.658 0.658 base.py:6036(get_indexer_non_unique)
1 0.181 0.181 0.657 0.657 {method 'get_indexer_non_unique' of 'pandas._libs.index.IndexEngine' objects}
225 0.475 0.002 0.476 0.002 fromnumeric.py:1407(resize)
1 0.000 0.000 0.003 0.003 common.py:261(index_labels_to_array)
3 0.000 0.000 0.003 0.001 common.py:228(asarray_tuplesafe)
10 0.003 0.000 0.003 0.000 {built-in method numpy.asarray}
1 0.000 0.000 0.001 0.001 base.py:6467(isin)
2 0.000 0.000 0.001 0.000 base.py:477(__new__)
1 0.001 0.001 0.001 0.001 algorithms.py:457(isin)
2 0.000 0.000 0.001 0.000 base.py:7512(ensure_index)
225 0.000 0.000 0.001 0.000 fromnumeric.py:200(reshape)
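Nearly all of the time is spent in IndexEngine.get_indexer_non_unique (and the numpy resize calls it makes). The hot call can be isolated directly on the index; a sketch, assuming the object-dtype all-NA index from the reproducer:

import pandas as pd

N = 1500
idx = pd.Index([pd.NA] * N)
# every one of the N targets matches all N positions, so the result
# blows up to N*N entries -- this call dominates the profile above
%timeit idx.get_indexer_non_unique(idx)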
Comment From: paddymul
https://github.com/pandas-dev/pandas/blob/main/pandas/core/methods/selectn.py#L101-L102

dropped = self.obj.dropna()
nan_index = self.obj.drop(dropped.index)

are the offending lines, particularly nan_index.

Would it work (and be faster) to do this, as sketched below:
1. compute a temporary integer-based index,
2. run nsmallest/nlargest on the new series,
3. grab the actual index values corresponding to the integer-based index,
4. use those index values in the returned series?
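A minimal sketch of that idea for nlargest (a hypothetical helper, not the selectn.py implementation):

import pandas as pd

def nlargest_positional(s: pd.Series, n: int = 5) -> pd.Series:
    # 1. swap in a temporary RangeIndex so all lookups are positional
    tmp = s.reset_index(drop=True)
    # 2. run nlargest on the positionally indexed series
    top = tmp.nlargest(n)
    # 3. map the integer positions back to the original labels
    labels = s.index.take(top.index)
    # 4. return the values with the original labels restored
    return pd.Series(top.to_numpy(), index=labels, name=s.name)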
Comment From: lukemanley
Thanks for the report. I think you could probably change those two lines to something like this to avoid the bottleneck:
mask = self.obj.isna()
dropped = self.obj[~mask]
nan_index = self.obj[mask]
If you would like to open a PR go for it. Otherwise I can give this a try.
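A quick way to check that the mask-based split sidesteps the non-unique label lookup (a sketch on the original reproducer, outside of selectn.py):

import numpy as np
import pandas as pd

N = 1500
s = pd.Series(np.arange(N), index=[pd.NA] * N, name="a")

# current path: label-based drop hits IndexEngine.get_indexer_non_unique
dropped = s.dropna()
%timeit s.drop(dropped.index)

# proposed path: boolean masking never touches the index engine
mask = s.isna()
%timeit (s[~mask], s[mask])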
Comment From: paddymul
I would like to tackle this. I'll try to get to it early in the coming week.
Comment From: Jeffrharr
It's been a while since @paddymul last replied, so I'd like to give this one a shot.
I've got a fix and can have tests written and the PR submitted on Monday or Tuesday.
Comment From: chilin0525
@Jeffrharr Hi, you can assign the issue to yourself by adding a comment containing only the word "take".
Comment From: Jeffrharr
take
Comment From: Jeffrharr
@chilin0525 thanks for the information. Is there anything I should do to find a reviewer? This is my first PR in pandas.