Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this issue exists on the latest version of pandas.

  • [ ] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

N = 1500
slow_df = pd.DataFrame({'a': np.arange(N)}, index=[pd.NA] * N)
%timeit slow_df['a'].nlargest()

fast_df = pd.DataFrame({'a': np.arange(N)})
%timeit fast_df['a'].nlargest()

results in

448 ms ± 9.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
105 µs ± 850 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

so roughly 4,000 times slower with a non-unique index. This was very surprising.

Installed Versions

INSTALLED VERSIONS
------------------
commit : e86ed377639948c64c429059127bcf5b359ab6be
python : 3.11.5.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Fri Dec 16 00:34:59 PST 2022; root:xnu-7195.141.49~1/RELEASE_ARM64_T8101
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.1
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : None
pytest : 7.4.2
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.16.1
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

No response

Comment From: paddymul

Some additional examples I tried. It's not just pd.NA that causes the problem: a random value array isn't significantly slower than an ordered one. Also, the performance scales badly (big-O unknown) with the number of duplicate index entries, not the total size (compare med_df).

import numpy as np
import pandas as pd

N = 1500
N_DOUBLE = 3000
N_HALF = 750
slow_df = pd.DataFrame({'a': np.arange(N)}, index=[pd.NA] * N)
print("slow_df")
%timeit slow_df['a'].nlargest()

slow_df2 = pd.DataFrame({'a': np.arange(N_DOUBLE)}, index=np.concatenate([[pd.NA] * N, np.arange(N)]))
print("slow_df2")
%timeit slow_df2['a'].nlargest()

slow_df3 = pd.DataFrame({'a': np.random.rand(N_DOUBLE)}, index=np.concatenate([[pd.NA] * N, np.arange(N)]))
print("slow_df3")
%timeit slow_df3['a'].nlargest()

slow_df4 = pd.DataFrame({'a': np.random.rand(N_DOUBLE)}, index=np.concatenate([[1] * N, np.arange(N)]))
print("slow_df4")
%timeit slow_df4['a'].nlargest()

med_df = pd.DataFrame({'a': np.random.rand(N)}, index=np.concatenate([[1] * N_HALF, np.arange(N_HALF)]))
print("med_df")
%timeit med_df['a'].nlargest()

fast_df = pd.DataFrame({'a': np.arange(N)})
print("fast_df")
%timeit fast_df['a'].nlargest()

fast_df2 = pd.DataFrame({'a': np.random.rand(N)})
print("fast_df2")
%timeit fast_df2['a'].nlargest()

results in

slow_df
445 ms ± 20.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
slow_df2
444 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
slow_df3
439 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
slow_df4
449 ms ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
med_df
18.4 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fast_df
105 µs ± 616 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
fast_df2
121 µs ± 553 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Comment From: paddymul

profiling output

import numpy as np
import pandas as pd

N = 1500
slow_df = pd.DataFrame({'a': np.arange(N)}, index=[pd.NA] * N)
slow_ser = slow_df['a']
%prun -s cumtime slow_ser.nlargest()

results in



         4063 function calls (4053 primitive calls) in 0.669 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.669    0.669 {built-in method builtins.exec}
        1    0.000    0.000    0.669    0.669 <string>:1(<module>)
        1    0.000    0.000    0.669    0.669 series.py:4006(nlargest)
        1    0.000    0.000    0.669    0.669 selectn.py:55(nlargest)
        1    0.000    0.000    0.669    0.669 selectn.py:90(compute)
        1    0.000    0.000    0.668    0.668 series.py:5047(drop)
        1    0.000    0.000    0.668    0.668 generic.py:4680(drop)
        1    0.004    0.004    0.668    0.668 generic.py:4719(_drop_axis)
        1    0.000    0.000    0.658    0.658 base.py:6076(get_indexer_for)
        1    0.000    0.000    0.658    0.658 base.py:6036(get_indexer_non_unique)
        1    0.181    0.181    0.657    0.657 {method 'get_indexer_non_unique' of 'pandas._libs.index.IndexEngine' objects}
      225    0.475    0.002    0.476    0.002 fromnumeric.py:1407(resize)
        1    0.000    0.000    0.003    0.003 common.py:261(index_labels_to_array)
        3    0.000    0.000    0.003    0.001 common.py:228(asarray_tuplesafe)
       10    0.003    0.000    0.003    0.000 {built-in method numpy.asarray}
        1    0.000    0.000    0.001    0.001 base.py:6467(isin)
        2    0.000    0.000    0.001    0.000 base.py:477(__new__)
        1    0.001    0.001    0.001    0.001 algorithms.py:457(isin)
        2    0.000    0.000    0.001    0.000 base.py:7512(ensure_index)
      225    0.000    0.000    0.001    0.000 fromnumeric.py:200(reshape)
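
Nearly all of the cumulative time sits under Series.drop and get_indexer_non_unique, so the drop against the duplicated index can be timed in isolation. A minimal check of that reading (equivalent setup to the reproducer above):

import numpy as np
import pandas as pd

N = 1500
slow_ser = pd.Series(np.arange(N), index=[pd.NA] * N)
dropped = slow_ser.dropna()           # there are no NaN values, so this keeps every row
%timeit slow_ser.drop(dropped.index)  # should account for essentially the full nlargest cost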

Comment From: paddymul

pandas/core/methods/selectn.py#L101-L102

        dropped = self.obj.dropna()
        nan_index = self.obj.drop(dropped.index)

are the offending lines, particularly the computation of nan_index.

Would it work / be faster to do the following instead (a rough sketch follows the link below)?

1. compute a temporary integer-based index
2. run nsmallest/nlargest on the re-indexed series
3. grab the actual index values corresponding to the integer-based index
4. use those index values in the returned series

https://github.com/pandas-dev/pandas/blob/main/pandas/core/methods/selectn.py#L101-L102
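
A rough sketch of that idea, outside the pandas internals (nlargest_via_positions is a hypothetical helper name, not existing pandas API):

import numpy as np
import pandas as pd

def nlargest_via_positions(ser: pd.Series, n: int = 5) -> pd.Series:
    # 1. swap the original labels for a cheap, unique RangeIndex
    tmp = ser.reset_index(drop=True)
    # 2. nlargest on the unique integer index avoids the non-unique drop path
    top = tmp.nlargest(n)
    # 3. the temporary labels are positions into the original index
    positions = top.index.to_numpy()
    # 4. restore the original labels on the returned series
    return pd.Series(top.to_numpy(), index=ser.index[positions], name=ser.name)

N = 1500
slow_ser = pd.Series(np.arange(N), index=[pd.NA] * N, name='a')
print(nlargest_via_positions(slow_ser))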

Comment From: lukemanley

Thanks for the report. I think you could probably change those two lines to something like this to avoid the bottleneck:

mask = self.obj.isna()
dropped = self.obj[~mask]
nan_index = self.obj[mask]

If you would like to open a PR go for it. Otherwise I can give this a try.
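
For a quick sanity check of that suggestion on the original reproducer, the mask-based split can be timed outside the internals (a sketch assuming no NaN values, as in the report; masked_split is a hypothetical stand-in for the two lines in selectn.py):

import numpy as np
import pandas as pd

N = 1500
obj = pd.Series(np.arange(N), index=[pd.NA] * N)

def masked_split(s: pd.Series):
    # same outputs as dropna()/drop(dropped.index) for this case, but the
    # boolean mask never has to search the duplicated index
    mask = s.isna()
    return s[~mask], s[mask]

%timeit masked_split(obj)  # should take microseconds rather than hundreds of ms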

Comment From: paddymul

I'd like to tackle this. I'll try to get to it early in the coming week.

Comment From: Jeffrharr

It's been a while since @paddymul replied, so I'd like to give this one a shot.

I've got a fix and can add tests and submit the PR on Monday or Tuesday.

Comment From: chilin0525

@Jeffrharr Hi, you can assign the issue to yourself by adding a comment containing only the word "take".

Comment From: Jeffrharr

take

Comment From: Jeffrharr

@chilin0525 thanks for the information. Is there anything I should do to find a reviewer? This is my first PR to pandas.