Currently the group-by-aggregation in pandas will create MultiIndex columns if there are multiple operation on the same column. However, this introduces some friction to reset the column names for fast filter and join. (If all operations could be chained together, analytics would be smoother)
df = pd.DataFrame([
['A', 1],
['A', 2],
['A', 3],
['B', 4],
['B', 5],
['B', 6]
], columns=['Key', 'Value'])
df1 = df.groupby('Key') \
.agg({
'Value': {'V1': 'sum', 'V2': 'count'}
})
df1.columns = df1.columns.droplevel() # This line introduce friction
df1.query('V1 > V2')
Expected Output
It would be great if there is a simple alias function for columns (like the pyspark's implementation), such as
# Just one approach, there may be others more appropriate
df.groupby('Key') \
.agg(
pd.Series.sum('Value').alias('V1'),
pd.Series.count('Value').alias('V2')
) \
.query('V1 > V2')
Comment From: jreback
In [14]: df.groupby('Key').Value.agg({'V1': 'sum', 'V2': 'count'}).query('V1 > V2')
Out[14]:
V2 V1
Key
A 3 6
B 3 15
You simply need to specify an aggregation directly on the Series.
PySpark in general is NOT an inspiration as things to be honest are much clunkier.
Comment From: PanWu
Thanks @jreback for the quick response. My previous example is over simplified, what about the following example
df = pd.DataFrame([
['A', 1, 85],
['A', 2, 45],
['A', 3, 23],
['B', 4, 76],
['B', 5, 43],
['B', 6, 56]
], columns=['Key', 'Value', 'Age'])
df1 = df.groupby('Key') \
.agg({
'Value': {'V1': 'sum', 'V2': 'count'},
'Age': {'AvgAge': 'mean', 'StdAge': 'std'}
})
Is there a way to handle multiple columns aggregation with multiple aggregation method in a convenient way?
Comment From: jreback
http://pandas.pydata.org/pandas-docs/stable/groupby.html#aggregation
Comment From: wesm
FWIW I would also like to see a new groupby API entry point (not requiring keyword args) that does not yield a row index.
Comment From: PanWu
@jreback thanks, eventually figure out the schema is:
df1 = df.groupby('Key', as_index=False) \
['Value', 'Age'] \
.agg({
'Value': {'V1': 'sum', 'V2': 'count'},
'Age': {'AvgAge': 'mean', 'StdAge': 'std'}
})
@wesm that would be great! +1 on that