Hi,
I came across a groupby performance behaviour that was unknown to me
(I was advised to post the question here after a thread on Stack Overflow).
import pandas as pd
import numpy as np
a = np.random.choice(['a', 'A'], size=100000)
b = np.random.choice(range(10000), size=100000)
df = pd.DataFrame(data=dict(A=a, B=b))
The performance of the genuine nunique
function and the 'lambda version' is totally different:
%%timeit
df_ = df.groupby(by='B').A.nunique()
10 loops, best of 3: 60.9 ms per loop
%%timeit
df_ = df.groupby(by='B').agg(
    dict(
        A=lambda s: s.nunique()
    )
)
1 loop, best of 3: 981 ms per loop
Is this expected behaviour?
Best regards, F.
Comment From: jorisvandenbossche
I think this performance difference is expected (of course, it can always be that either implementation could be optimized further).
nunique is a specific method that is optimized to do exactly what it has to do, while passing a function to agg is a very generic mechanism that has to deal with all kinds of functions.
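As a minimal sketch of the two paths, assuming the setup from the question: in recent pandas you can also pass the aggregation by name, which should dispatch to the optimized implementation rather than the generic per-group path.

import pandas as pd
import numpy as np

# Setup from the question.
a = np.random.choice(['a', 'A'], size=100000)
b = np.random.choice(range(10000), size=100000)
df = pd.DataFrame(data=dict(A=a, B=b))

# Passing the aggregation by name should dispatch to the optimized
# nunique implementation instead of calling Python code per group.
fast = df.groupby(by='B').A.agg('nunique')

# A lambda forces the generic per-group path and is much slower.
slow = df.groupby(by='B').A.agg(lambda s: s.nunique())

assert fast.equals(slow)  # same result, very different runtime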
Comment From: jorisvandenbossche
What is maybe surprising is the large difference when applying a generic function in agg depending on whether the column is object or categorical, which you report in the SO question.
Comment From: jreback
this is not surprising at all :)
when you agg/apply with a lambda, pandas must iterate over the individual groups, so you have O(num_of_groups) iterations, each with O(function) complexity.
.nunique has a particularly optimized implementation for an entire pandas object, as this is part of calculating the groupby in the first place, so the difference here is even more stark.
Using an explicit function is ALWAYS, full-stop, preferred.
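To make the iteration cost concrete, here is a rough sketch (not the actual implementation) of what agg with a lambda effectively does:

import pandas as pd
import numpy as np

# Same df as in the question.
df = pd.DataFrame(data=dict(A=np.random.choice(['a', 'A'], size=100000),
                            B=np.random.choice(range(10000), size=100000)))

# A Python-level loop over every group, calling the function once per
# group -- roughly what the generic agg path does with a lambda.
results = {}
for key, group in df.groupby(by='B'):
    results[key] = group['A'].nunique()
slow_equivalent = pd.Series(results)

# With ~10000 groups that is ~10000 Python-level calls, versus a single
# optimized pass for the built-in groupby .nunique().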
Comment From: FlavienMakesInfographics
@jreback Would you also have an explanation for the difference in performance with lambda functions when applied to object vs. categorical columns, as I was asking on Stack Overflow? Thanks.
Comment From: jreback
that is one of the benefits of a categorical :)
we already know the uniques (the categories), so counting them is easy, compared to first finding and then comparing them for object dtype
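For intuition, a small sketch of what a categorical stores (illustrative values only):

import pandas as pd

# A categorical stores the distinct values once (the categories) and
# represents the data as an array of small integer codes.
s = pd.Series(['a', 'A', 'a'], dtype='category')
print(s.cat.categories)    # Index(['A', 'a'], dtype='object') -- uniques known up front
print(s.cat.codes.values)  # [1 0 1] -- cheap integer comparisons
print(s.nunique())         # 2, countable from the codes without hashing strings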
Comment From: jorisvandenbossche
@jreback the categorical is a lot slower in this case ... (although nunique itself is faster on a categorical, applying it in a lambda and combining the results in a groupby for a categorical column is slower)
Comment From: jorisvandenbossche
The example from SO:
In [48]: a = np.random.choice(['a', 'A'], size=100000)
...: b = np.random.choice(range(10000), size=100000)
...:
...: df = pd.DataFrame(data=dict(A1=a, A2=a, B=b))
...: df['A2'] = df.A2.astype('category')
In [51]: %timeit df['A1'].nunique()
100 loops, best of 3: 4.71 ms per loop
In [52]: %timeit df['A2'].nunique()
1000 loops, best of 3: 1.32 ms per loop
In [53]: %timeit df.groupby(by='B').agg(dict(A1=lambda s: s.nunique()))
1 loop, best of 3: 1.7 s per loop
In [54]: %timeit df.groupby(by='B').agg(dict(A2=lambda s: s.nunique()))
1 loop, best of 3: 5.52 s per loop
Comment From: jorisvandenbossche
Which is probably caused by the fact that each of the groups is rather small, and for smaller frames a categorical nunique is not faster but shows some constant overhead:
In [58]: group1 = df.groupby(by='B').get_group(1)
In [59]: group1
Out[59]:
A1 A2 B
2618 a a 1
11259 a a 1
15889 a a 1
47883 a a 1
60714 a a 1
82558 a a 1
98286 A A 1
In [60]: %timeit group1['A1'].nunique()
10000 loops, best of 3: 69.8 µs per loop
In [61]: %timeit group1['A2'].nunique()
1000 loops, best of 3: 361 µs per loop
@FlavienMakesInfographics so that is the reason it is slower. But in any case, you shouldn't use agg with a lambda function here.
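For completeness, a sketch of the recommended call, reusing the df from the SO example above (A1 object, A2 categorical, B the grouping keys):

# Preferred: one optimized groupby nunique over the whole column,
# instead of a Python-level lambda call per group; this works for both
# the object column A1 and the categorical column A2.
res_obj = df.groupby(by='B').A1.nunique()
res_cat = df.groupby(by='B').A2.nunique()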
Comment From: FlavienMakesInfographics
@jorisvandenbossche Thank you very much for the explanation and the piece of advice.