Hi,
I came across a groupby performance behaviour that was unknown to me
(I was advised to post the question here after a thread on Stack Overflow).
import pandas as pd
import numpy as np
a = np.random.choice(['a', 'A'], size=100000)
b = np.random.choice(range(10000), size=100000)
df = pd.DataFrame(data=dict(A=a, B=b))
The performance of the genuine nunique
function and the 'lambda version' is totally different:
%%timeit
df_ = df.groupby(by='B').A.nunique()
10 loops, best of 3: 60.9 ms per loop
%%timeit
df_ = df.groupby(by='B').agg(
    dict(
        A=lambda s: s.nunique()
    )
)
1 loop, best of 3: 981 ms per loop
Is this expected behaviour?
Best regards, F.
Comment From: jorisvandenbossche
I think this performance difference is expected (of course, it can always be that either implementation could be optimized further).
nunique is a specific method that is optimized to do exactly what it has to do, while passing a function to agg is a very generic mechanism that has to deal with all kinds of functions.
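As a minimal sketch of the two paths, assuming the setup from the question: in recent pandas you can also pass the aggregation by name, which should dispatch to the optimized implementation rather than the generic per-group path.

import pandas as pd
import numpy as np

# Setup from the question.
a = np.random.choice(['a', 'A'], size=100000)
b = np.random.choice(range(10000), size=100000)
df = pd.DataFrame(data=dict(A=a, B=b))

# Passing the aggregation by name should dispatch to the optimized
# nunique implementation instead of calling Python code per group.
fast = df.groupby(by='B').A.agg('nunique')

# A lambda forces the generic per-group path and is much slower.
slow = df.groupby(by='B').A.agg(lambda s: s.nunique())

assert fast.equals(slow)  # same result, very different runtime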
Comment From: jorisvandenbossche
What is maybe surprising is the large difference when applying a generic function in agg depending on whether the column is object or categorical, which you report in the SO question.
Comment From: jreback
this is not surprising at all :)
when you agg/apply with a lambda, pandas must iterate over the individual groups, so you have O(num_of_groups) iterations, each with O(function) complexity.
.nunique has a particularly optimized implementation for an entire pandas object, as this is part of calculating the groupby in the first place, so the difference here is even more stark.
Using an explicit function is ALWAYS, full-stop, preferred.
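To make the iteration cost concrete, here is a rough sketch (not the actual implementation) of what agg with a lambda effectively does:

import pandas as pd
import numpy as np

# Same df as in the question.
df = pd.DataFrame(data=dict(A=np.random.choice(['a', 'A'], size=100000),
                            B=np.random.choice(range(10000), size=100000)))

# A Python-level loop over every group, calling the function once per
# group -- roughly what the generic agg path does with a lambda.
results = {}
for key, group in df.groupby(by='B'):
    results[key] = group['A'].nunique()
slow_equivalent = pd.Series(results)

# With ~10000 groups that is ~10000 Python-level calls, versus a single
# optimized pass for the built-in groupby .nunique().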
Comment From: FlavienMakesInfographics
@jreback Would you also have an explanation for the difference in performance with lambda functions when applied to object vs. categorical columns, as I was asking on Stack Overflow? Thanks.
Comment From: jreback
that is one of the benefits of a categorical :)
we already know the uniques (the categories), so counting them is easy, compared to first finding and then comparing them for object dtype
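For intuition, a small sketch of what a categorical stores (illustrative values only):

import pandas as pd

# A categorical stores the distinct values once (the categories) and
# represents the data as an array of small integer codes.
s = pd.Series(['a', 'A', 'a'], dtype='category')
print(s.cat.categories)    # Index(['A', 'a'], dtype='object') -- uniques known up front
print(s.cat.codes.values)  # [1 0 1] -- cheap integer comparisons
print(s.nunique())         # 2, countable from the codes without hashing strings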
Comment From: jorisvandenbossche
@jreback the categorical is a lot slower in this case ... (although nunique itself is faster on a categorical, applying it in a lambda and combining the results in a groupby for a categorical column is slower)
Comment From: jorisvandenbossche
The example from SO:
In [48]: a = np.random.choice(['a', 'A'], size=100000)
...: b = np.random.choice(range(10000), size=100000)
...:
...: df = pd.DataFrame(data=dict(A1=a, A2=a, B=b))
...: df['A2'] = df.A2.astype('category')
In [51]: %timeit df['A1'].nunique()
100 loops, best of 3: 4.71 ms per loop
In [52]: %timeit df['A2'].nunique()
1000 loops, best of 3: 1.32 ms per loop
In [53]: %timeit df.groupby(by='B').agg(dict(A1=lambda s: s.nunique()))
1 loop, best of 3: 1.7 s per loop
In [54]: %timeit df.groupby(by='B').agg(dict(A2=lambda s: s.nunique()))
1 loop, best of 3: 5.52 s per loop
Comment From: jorisvandenbossche
Which is probably caused by the fact that each of the groups is rather small, and for smaller frames a categorical nunique is not faster but shows some constant overhead:
In [58]: group1 = df.groupby(by='B').get_group(1)
In [59]: group1
Out[59]:
A1 A2 B
2618 a a 1
11259 a a 1
15889 a a 1
47883 a a 1
60714 a a 1
82558 a a 1
98286 A A 1
In [60]: %timeit group1['A1'].nunique()
10000 loops, best of 3: 69.8 µs per loop
In [61]: %timeit group1['A2'].nunique()
1000 loops, best of 3: 361 µs per loop
@FlavienMakesInfographics so that is the reason it is slower. But in any case, you shouldn't use agg with a lambda function here.
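For completeness, a sketch of the recommended call, reusing the df from the SO example above (A1 object, A2 categorical, B the grouping keys):

# Preferred: one optimized groupby nunique over the whole column,
# instead of a Python-level lambda call per group; this works for both
# the object column A1 and the categorical column A2.
res_obj = df.groupby(by='B').A1.nunique()
res_cat = df.groupby(by='B').A2.nunique()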
Comment From: FlavienMakesInfographics
@jorisvandenbossche Thank you very much for the explanation and the piece of advice.