Allow using 'size' in groupby's aggregate, so you can do:

df.groupby(..).agg('size')
df.groupby(..).agg(['mean', 'size'])

http://stackoverflow.com/questions/21660686/pandas-groupby-straight-forward-counting-the-number-of-elements-in-each-group-i - ~~count should directly implement size (enh)~~ - count/size should be allowed in an aggregation list (the bug)
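For concreteness, a minimal sketch of the requested behaviour (the frame, the 'key'/'val' names and the data are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, np.nan, 3.0]})

# desired: 'size' accepted as a string aggregation, alongside other reducers
df.groupby('key')['val'].agg(['mean', 'size'])

# today the group sizes require a separate call
df.groupby('key')['val'].size()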

Comment From: eldad-a

Please note that the current (slow) count is already allowed in the aggregation list.

Comment From: hayd

So count counts non-null values, which goes some way to explaining why it is slower.

Comment From: jreback

yep... this is a very easy fix (just alias count to size) as it's already computed by the group indexer

Comment From: hayd

What I mean is, count is a different operation to size: size just cares about the result_index, whilst count cares about whether values are non-null in the columns... (same with value_counts; sometimes a user may want to count values in another column).
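To make the distinction concrete, a small example (made-up data) where the two disagree on a group containing a NaN:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, np.nan, 3.0]})
g = df.groupby('key')['val']

g.size()   # counts all rows per group, NaN included -> a: 2, b: 1
g.count()  # counts only non-null values             -> a: 1, b: 1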

Comment From: jorisvandenbossche

@jreback This issue is not really clear to me, as size already exists for groupby? Or do you mean that you want to be able to do df.groupby(..).agg('size') instead of df.groupby(..).size() (and therefore be able to do agg(['mean', 'size']))?

And I think we don't want to "alias count to size", as size does something different than count.

Comment From: jreback

no, I think we just need to alias size (like we do mean), i.e. add it to the cython table (this might work now)
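If 'size' goes through the same string lookup as 'mean', the aliased form should just dispatch to the existing groupby size, i.e. roughly (made-up data, written as a check of the expected behaviour):

import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 3]})
g = df.groupby('key')['val']

# once 'size' is recognised as an aggregation name, the string form
# should produce the same result as the method call
assert g.agg('size').equals(g.size())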

Comment From: jorisvandenbossche

@jreback updated top post to clarify the issue

Comment From: jreback

I'll note that we should look at count perf as well (maybe create another issue); it may have been fixed since this issue was opened.

Comment From: jreback

In [7]: df = pd.DataFrame({'x': np.random.randn(50000),  ## produce the demo DataFrame
   ...:                    'y': np.random.randn(50000),
   ...:                    'z': np.random.randn(50000)})

In [4]: buckets = {col: np.arange(int(df[col].min()), int(df[col].max()) + 2)
   ...:            for col in df.columns}  ## produce the unit bins

In [5]: cats = [pd.cut(df[col], bucket) for col, bucket in buckets.items()]

In [6]: grouped = df.groupby(cats)  # group by the binned x, y, z

In [20]: %timeit -n1 -r1 grouped.x.size()
1 loop, best of 1: 642 µs per loop

In [9]: %timeit -n1 -r1 grouped.x.mean()
1 loop, best of 1: 2.98 ms per loop

In [10]: %timeit -n1 -r1 grouped.x.count()
1 loop, best of 1: 696 µs per loop

In [19]: %timeit -n1 -r1 grouped.x.agg(['mean','size'])
1 loop, best of 1: 1.62 ms per loop

this is actually implemented.