groupby.first() is much slower than groupby.nth(0) when the dataframe contains a categorical column, and only under specific conditions.
Consider the example below, where a dataframe has two columns, c1 and c2. The number of unique values in c1 (regardless of its dtype) increases the runtime of first if and only if c2 is a categorical column.
import timeit

import numpy as np
import pandas as pd


def test(N_CATEGORIES, cat_cols=['c2']):
    # Creates dataframe df with grouping column c1 (N_CATEGORIES unique values)
    # and column c2; every column listed in cat_cols is cast to categorical.
    # Times how long grouping by c1 and calling nth(0) and first takes.
    global df  # timeit's setup string imports df from __main__
    print(N_CATEGORIES)
    df = pd.DataFrame({
        'c1': np.arange(0, 10000) % N_CATEGORIES,
        'c2': np.arange(0, 10000)
    })
    for col in cat_cols:
        df[col] = df[col].astype('category')
    t_nth = timeit.timeit("x1 = df.groupby(['c1']).nth(0, dropna='all')", setup="from __main__ import df", number=1)
    t_first = timeit.timeit("x2 = df.groupby(['c1']).first()", setup="from __main__ import df", number=1)
    return t_nth, t_first

test_N_categories = [1, 10, 100, 1000, 10000]
# Test when column c2 is categorical
results_c2_cat = pd.DataFrame([test(N, cat_cols=['c2']) for N in test_N_categories],
                              index=test_N_categories,
                              columns=['nth', 'first'])
# Test when column c2 is not categorical
results_c2_not_cat = pd.DataFrame([test(N, cat_cols=[]) for N in test_N_categories],
                                  index=test_N_categories,
                                  columns=['nth', 'first'])
print(results_c2_cat)
print(results_c2_not_cat)
The results (in seconds; the index is the number of unique values in c1) are:
>>> print(results_c2_cat)
            nth     first
1      0.004204  0.010677
10     0.005890  0.015701
100    0.005305  0.052130
1000   0.004365  0.581036
10000  0.004358  2.847709
>>> print(results_c2_not_cat)
            nth     first
1      0.003479  0.001110
10     0.003027  0.000993
100    0.003297  0.001089
1000   0.003297  0.001057
10000  0.003952  0.001382
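As shown, when c2 is categorical, the runtime of first grows rapidly with the number of unique values in c1 (and thus the "width" of the groupby): from about 0.01 s with 1 group to about 2.8 s with 10000 groups, while nth stays near 0.005 s. When c2 is not categorical, first stays near 0.001 s throughout.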
first and last both suffer from this problem.
(see also https://github.com/pandas-dev/pandas/issues/19283 and https://github.com/pandas-dev/pandas/issues/19598)
Comment From: jreback
first/last are implemented in Cython; categoricals are converted to object, hence the slowdown. nth is basically an analytical solution, so it will scale better. (Also, first/last don't fully handle all dtypes.)
Not averse to having first/last just call nth.
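To illustrate the explanation above: if the cost comes from categoricals being converted to object, one way to sidestep it on affected versions is to run first on the integer codes and rebuild the categorical afterwards. The following is only a sketch under that assumption; `first_via_codes` is a hypothetical helper, not a pandas API, it assumes the grouping column is not itself one of the categorical value columns, and whether it is actually faster should be measured on the pandas version in use.

```python
import numpy as np
import pandas as pd


def first_via_codes(df, key):
    """Hypothetical helper: groupby(key).first() computed on category codes."""
    cat_cols = [c for c in df.columns
                if c != key and isinstance(df[c].dtype, pd.CategoricalDtype)]
    work = df.copy()
    for c in cat_cols:
        # Missing categorical values have code -1; turn them into NaN so
        # that first() skips them, as it does for ordinary missing values.
        codes = work[c].cat.codes.astype("float64")
        work[c] = codes.where(codes >= 0)
    out = work.groupby(key).first()
    for c in cat_cols:
        # Groups with no valid value come back as NaN; map them to code -1
        # and rebuild the column with the original categorical dtype.
        codes = out[c].fillna(-1).astype("int64").to_numpy()
        out[c] = pd.Categorical.from_codes(codes, dtype=df[c].dtype)
    return out


example = pd.DataFrame({
    "c1": [0, 0, 1, 1, 2],
    "c2": pd.Categorical(["a", "b", None, "c", None]),
})
result = first_via_codes(example, "c1")
print(result)
```

In the small example, group 1 starts with a missing value, so its result is the next valid entry, and group 2 has no valid entry and stays missing.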
Comment From: joseortiz3
nth has problems handling missing values (#26011)
As a workaround, here is a function that you can rely on to get the job done fast and correctly:
def groupby_first_fast(df, groupby):
    """Pandas groupby.first() is slow due to a conversion.
    Meanwhile groupby.nth(0) has a bug with missing values.
    This takes care of both problems.
    Issues #25397 and #26011"""
    # Get rid of rows with missing group keys, which break nth.
    # dropna(subset=...) works for a single column name or a list of names.
    keys = list(groupby) if pd.api.types.is_list_like(groupby) else [groupby]
    df = df.dropna(subset=keys)
    # Use nth instead of first for speed.
    return df.groupby(groupby).nth(0)
Edit: Didn't mean to push "close" button.
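One caveat about the workaround above, shown as a small sketch with made-up data: first() returns the first non-missing value of each column within a group, whereas nth(0) returns the first row of each group as it is. The two therefore agree only when the value columns have no missing entries at the start of a group. The column names `key` and `val` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "val": [np.nan, 2.0, 3.0],
})

# first(): per column, the first non-missing value in each group.
first_vals = df.groupby("key").first()["val"].tolist()

# nth(0): the first row of each group, missing values included.
nth_vals = df.groupby("key").nth(0)["val"].tolist()

print(first_vals)  # [2.0, 3.0]
print(nth_vals)    # [nan, 3.0]
```

Only the `val` column is compared here because the shape and index of the nth(0) result differ between pandas versions.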
Comment From: jbrockmendel
Looks like this got fixed by #52120, closing.
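To check whether a given installation still shows the slowdown, a compact version of the original benchmark can be rerun. This is a minimal sketch, not part of the original report: `time_first` is a hypothetical helper, it times a single call (so results are noisy), and it passes a callable to timeit instead of relying on a global df.

```python
import timeit

import numpy as np
import pandas as pd


def time_first(n_groups, categorical_c2, n_rows=10_000):
    """Seconds taken by one groupby('c1').first() call (hypothetical helper)."""
    df = pd.DataFrame({
        "c1": np.arange(n_rows) % n_groups,
        "c2": np.arange(n_rows),
    })
    if categorical_c2:
        df["c2"] = df["c2"].astype("category")
    return timeit.timeit(lambda: df.groupby("c1").first(), number=1)


def compare(group_counts=(1, 100, 10_000)):
    """Table of timings with c2 categorical vs. not categorical."""
    rows = [(time_first(n, True), time_first(n, False)) for n in group_counts]
    return pd.DataFrame(rows, index=list(group_counts),
                        columns=["first_c2_cat", "first_c2_not_cat"])


if __name__ == "__main__":
    print(pd.__version__)
    print(compare())
```

If the two columns stay within the same order of magnitude as the number of groups grows, the installation is not affected in the way the original timings show.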