Currently the group-by aggregation in pandas creates MultiIndex columns when there are multiple operations on the same column. However, this introduces friction: the column names must be reset before fast filtering and joining. (If all operations could be chained together, analytics would be smoother.)

import pandas as pd

df = pd.DataFrame([
        ['A', 1],
        ['A', 2],
        ['A', 3],
        ['B', 4],
        ['B', 5],
        ['B', 6]
    ], columns=['Key', 'Value'])

df1 = df.groupby('Key') \
    .agg({
            'Value': {'V1': 'sum', 'V2': 'count'}
        })

df1.columns = df1.columns.droplevel()  # this line introduces friction
df1.query('V1 > V2')
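For readers on pandas 0.25 or later: named aggregation was added after this discussion and yields flat column names directly, so no `droplevel` step is needed. A minimal sketch of the same example:

```python
import pandas as pd

df = pd.DataFrame([
        ['A', 1],
        ['A', 2],
        ['A', 3],
        ['B', 4],
        ['B', 5],
        ['B', 6]
    ], columns=['Key', 'Value'])

# Named aggregation (pandas 0.25+): output column name = (input column, function)
df1 = df.groupby('Key').agg(
    V1=('Value', 'sum'),
    V2=('Value', 'count'),
)

# Columns are already flat, so query works immediately
result = df1.query('V1 > V2')
```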

Expected Output

It would be great if there were a simple alias function for columns (like PySpark's implementation), such as:


# Just one approach, there may be others more appropriate
df.groupby('Key') \
    .agg(
            pd.Series.sum('Value').alias('V1'),
            pd.Series.count('Value').alias('V2')
        ) \
    .query('V1 > V2')

Comment From: jreback

In [14]: df.groupby('Key').Value.agg({'V1': 'sum', 'V2': 'count'}).query('V1 > V2')
Out[14]: 
     V2  V1
Key        
A     3   6
B     3  15

You simply need to specify an aggregation directly on the Series.

PySpark in general is NOT an inspiration; to be honest, things there are much clunkier.

Comment From: PanWu

Thanks @jreback for the quick response. My previous example was oversimplified; what about the following example?

df = pd.DataFrame([
        ['A', 1, 85],
        ['A', 2, 45],
        ['A', 3, 23],
        ['B', 4, 76],
        ['B', 5, 43],
        ['B', 6, 56]
    ], columns=['Key', 'Value', 'Age'])

df1 = df.groupby('Key') \
    .agg({
            'Value': {'V1': 'sum', 'V2': 'count'},
            'Age': {'AvgAge': 'mean', 'StdAge': 'std'}
        })

Is there a convenient way to handle aggregation over multiple columns, each with multiple aggregation methods?

Comment From: jreback

http://pandas.pydata.org/pandas-docs/stable/groupby.html#aggregation
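The linked docs cover applying multiple functions to multiple columns, which produces MultiIndex columns; a common follow-up pattern (not part of the docs' example) is to flatten those columns by joining the two levels. A sketch:

```python
import pandas as pd

df = pd.DataFrame([
        ['A', 1, 85],
        ['A', 2, 45],
        ['A', 3, 23],
        ['B', 4, 76],
        ['B', 5, 43],
        ['B', 6, 56]
    ], columns=['Key', 'Value', 'Age'])

# Multiple functions per column -> MultiIndex columns like ('Value', 'sum')
df1 = df.groupby('Key').agg({'Value': ['sum', 'count'],
                             'Age': ['mean', 'std']})

# Flatten the MultiIndex into single-level names like 'Value_sum'
df1.columns = ['_'.join(col) for col in df1.columns]
```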

Comment From: wesm

FWIW I would also like to see a new groupby API entry point (not requiring keyword args) that does not yield a row index.

Comment From: PanWu

@jreback thanks, eventually figured out that the syntax is:

df1 = df.groupby('Key', as_index=False) \
    [['Value', 'Age']] \
    .agg({
            'Value': {'V1': 'sum', 'V2': 'count'},
            'Age': {'AvgAge': 'mean', 'StdAge': 'std'}
        })

@wesm that would be great! +1 on that