Pandas DataFrame groupby.first() is much slower than groupby.nth(0) on categorical dtypes

groupby.first() is much slower than groupby.nth(0) for categorical columns in a very specific way.

Consider carefully the example below, where a dataframe has two columns c1 and c2. The number of unique values in the c1 column (regardless of its datatype) increases the runtime of first when and only when c2 is a categorical column.

import pandas as pd, numpy as np, timeit

def test(N_CATEGORIES, cat_cols = ['c2']):
    # creates dataframe df with categorical column c1, optional categorical column c2.
    # Times how long grouping by c1 and calling nth(0) and first takes
    global df 
    print(N_CATEGORIES)
    df = pd.DataFrame({
        'c1':np.arange(0,10000)%N_CATEGORIES,
        'c2':np.arange(0,10000)
        })
    for col in cat_cols:
        df[col] = df[col].astype('category')
    t_nth = timeit.timeit("x1 =df.groupby(['c1']).nth(0, dropna='all')", setup="from __main__ import df", number=1)
    t_first = timeit.timeit("x2 = df.groupby(['c1']).first()", setup="from __main__ import df", number=1)
    return t_nth, t_first

test_N_categories = [1,10,100,1000,10000]

# Test when column c2 is categorical
results_c2_cat = pd.DataFrame([test(N, cat_cols=['c2']) for N in test_N_categories], 
                       index=test_N_categories, 
                       columns=['nth','first'])

# Test when column c2 is not categorical
results_c2_not_cat = pd.DataFrame([test(N, cat_cols=[]) for N in test_N_categories], 
                       index=test_N_categories, 
                       columns=['nth','first'])

print(results_c2_cat)
print(results_c2_not_cat)

The results are

>>> print(results_c2_cat)
            nth     first
1      0.004204  0.010677
10     0.005890  0.015701
100    0.005305  0.052130
1000   0.004365  0.581036
10000  0.004358  2.847709
>>> print(results_c2_not_cat)
            nth     first
1      0.003479  0.001110
10     0.003027  0.000993
100    0.003297  0.001089
1000   0.003297  0.001057
10000  0.003952  0.001382

As shown, when the column c2 is categorical, the runtime of first grows very rapidly as a function of the number of unique values in the column c1 (and thus the "width" of the groupby).

first and last both suffer from this problem.

(see also https://github.com/pandas-dev/pandas/issues/19283 and https://github.com/pandas-dev/pandas/issues/19598)

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 3.7.1.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None pandas: 0.24.1 pytest: None pip: 19.0.1 setuptools: 39.0.1 Cython: None numpy: 1.15.3 scipy: 1.1.0 pyarrow: 0.12.0 xarray: 0.11.0 IPython: None sphinx: None patsy: 0.5.1 dateutil: 2.7.5 pytz: 2018.6 blosc: None bottleneck: None tables: 3.4.4 numexpr: 2.6.8 feather: 0.4.0 matplotlib: 3.0.1 openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml.etree: 4.2.6 bs4: 4.6.3 html5lib: None sqlalchemy: 1.2.17 pymysql: None psycopg2: 2.7.7 (dt dec pq3 ext lo64) jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None gcsfs: None

Comment From: jreback

first/last are implemented in cython; cats are converted to object hence the slowdown nth is basically an analytical solution so it will scale better (also first/last) don’t fully handle all dtypes

not averse to just having first/last just call nth

Comment From: joseortiz3

nth has problems handling missing values (#26011)

As a workaround, here is a function that you can rely on to get the job done fast and correctly:

def groupby_first_fast(df, groupby):
    """Pandas groupby.first() is slow due to a conversion.
    Meanwhile groupby.nth(0) has a bug with missing values. 
    This takes care of both problems.
    Issues #25397 and #26011"""
    # Get rid of missing values that break nth.
    df = df[~(df[groupby].isnull())]
    # Use nth instead of first for speed.
    return df.groupby(groupby).nth(0)

Edit: Didn't mean to push "close" button.

Comment From: jbrockmendel

Looks like this got fixed by #52120, closing.

Pandas DataFrame groupby.first() is much slower than groupby.nth(0) on categorical dtypes

Output of pd.show_versions()

Output of `pd.show_versions()`