Code Sample, a copy-pastable example if possible
import pandas as pd

df = pd.DataFrame({'City': ['LA', 'NYC', 'NYC', 'LA', 'Chicago', 'NYC'],
                   'isFraud': [1, 0, 0, 1, 0, 1]})
df.groupby(df['City'])['isFraud'].agg({'Fraud': sum, 'Non-Fraud': lambda x: len(x) - sum(x), 'Squared': lambda x: (sum(x))**2})
Problem description
I just learnt that using a dictionary for renaming in agg is deprecated in the latest version. What is the alternative for achieving the above, i.e. using multiple lambda functions within agg?
Expected Output
         Non-Fraud  Fraud  Squared
City
Chicago          1      0        0
LA               0      2        4
NYC              2      1        1
Output of pd.show_versions()
Comment From: jorisvandenbossche
See the what's new docs for some pointers as alternatives using renaming afterwards: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#deprecate-groupby-agg-with-a-dictionary-when-renaming (although it will be a bit more convoluted)
Another option is to use named functions instead of lambdas:
In [14]: def Fraud(group):
    ...:     return group.sum()
    ...:
In [15]: def NonFraud(group):
    ...:     return len(group) - sum(group)
    ...:
In [16]: def Squared(group):
    ...:     return (sum(group))**2
    ...:
In [20]: df.groupby(df['City'])['isFraud'].agg([Fraud, NonFraud, Squared])
Out[20]:
         Fraud  NonFraud  Squared
City
Chicago      0         1        0
LA           2         0        4
NYC          1         2        1
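To make the "rename afterwards" idea concrete, here is a minimal sketch that combines it with named functions. The helper names (fraud, non_fraud, squared) are made up for illustration; the rename step is only needed for labels such as 'Non-Fraud' that cannot be Python function names.

```python
import pandas as pd

df = pd.DataFrame({'City': ['LA', 'NYC', 'NYC', 'LA', 'Chicago', 'NYC'],
                   'isFraud': [1, 0, 0, 1, 0, 1]})

# Hypothetical helper names; each output column is labelled with the function's __name__.
def fraud(s):
    return s.sum()

def non_fraud(s):
    return len(s) - s.sum()

def squared(s):
    return s.sum() ** 2

result = (
    df.groupby('City')['isFraud']
      .agg([fraud, non_fraud, squared])
      # Rename after aggregating to get labels that are not valid identifiers.
      .rename(columns={'fraud': 'Fraud',
                       'non_fraud': 'Non-Fraud',
                       'squared': 'Squared'})
)
print(result)
```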
Comment From: jreback
Another way, which I personally would use (it will also be more performant):
In [18]: df = pd.DataFrame({'City': ['LA', 'NYC', 'NYC', 'LA', 'Chicago', 'NYC'],
    ...:                    'isFraud': [1, 0, 0, 1, 0, 1]})
    ...:
    ...: result = df.groupby(df['City'])['isFraud'].agg(['sum', 'size'])
    ...: result = pd.DataFrame({'Fraud': result['sum'],
    ...:                        'NonFraud': result['size'] - result['sum'],
    ...:                        'Squared': result['sum']**2})
    ...: result
    ...:
Out[18]:
         Fraud  NonFraud  Squared
City
Chicago      0         1        0
LA           2         0        4
NYC          1         2        1
Comment From: allen-q
Thanks @jorisvandenbossche and @jreback for your answers. It seems there is no short one-line alternative. It's sad to see such a handy feature removed.
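For what it's worth, the built-in-aggregation approach can be written as a single chained expression. This is only a sketch of one possible spelling, not an official replacement for the dict syntax; the intermediate 'sum' and 'size' columns are dropped or renamed at the end.

```python
import pandas as pd

df = pd.DataFrame({'City': ['LA', 'NYC', 'NYC', 'LA', 'Chicago', 'NYC'],
                   'isFraud': [1, 0, 0, 1, 0, 1]})

result = (
    df.groupby('City')['isFraud']
      .agg(['sum', 'size'])                        # fast built-in aggregations
      .assign(NonFraud=lambda r: r['size'] - r['sum'],
              Squared=lambda r: r['sum'] ** 2)     # derived columns, computed vectorized
      .rename(columns={'sum': 'Fraud'})
      .drop(columns='size')
)
print(result)
```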
Comment From: bridwell
So it looks like using a list of tuples, rather than a dict, is still supported?
df.groupby(df['City'])['isFraud'].agg([
    ('Fraud', sum),
    ('Non-Fraud', lambda x: len(x) - sum(x)),
    ('Squared', lambda x: (sum(x))**2)
])
         Fraud  Non-Fraud  Squared
City
Chicago      0          1        0
LA           2          0        4
NYC          1          2        1
This even looks to be supported when applying different functions to different columns, similar to the dict approach:
import numpy as np

df['severity'] = np.arange(len(df))
df.groupby(df['City']).agg({
    'isFraud': [
        ('Fraud', sum),
        ('Non-Fraud', lambda x: len(x) - sum(x)),
        ('Squared', lambda x: (sum(x))**2)
    ],
    'severity': [('avg severity', 'mean')]
})
         isFraud                   severity
           Fraud Non-Fraud Squared avg severity
City
Chicago        0         1       0     4.000000
LA             2         0       4     1.500000
NYC            1         2       1     2.666667
Will that continue to be supported?
Comment From: randomgambit
^^ This is even what is recommended in Wes's last pandas book. A bit confusing indeed.
Comment From: GuiMarthe
What is the deprecation status of the tuple naming method that @bridwell mentions? Will it continue to be supported?
Comment From: eliadl
How does the documentation account for @bridwell's non-warning example? Please clarify in the documentation and add more examples as needed.
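For readers arriving later: a minimal sketch of keyword-based "named aggregation", assuming a pandas release that supports it (the feature arrived in 0.25; using several lambdas on the same column may need a newer release than that, which is an assumption worth checking against your installed version). Output names are given as keywords, and a label that is not a valid Python identifier, such as 'Non-Fraud', can be passed by unpacking a dict.

```python
import pandas as pd

df = pd.DataFrame({'City': ['LA', 'NYC', 'NYC', 'LA', 'Chicago', 'NYC'],
                   'isFraud': [1, 0, 0, 1, 0, 1]})

# Each entry maps an output column name to (input column, aggregation).
# A dict is unpacked because 'Non-Fraud' cannot be written as a keyword argument.
result = df.groupby('City').agg(**{
    'Fraud': ('isFraud', 'sum'),
    'Non-Fraud': ('isFraud', lambda s: len(s) - s.sum()),
    'Squared': ('isFraud', lambda s: s.sum() ** 2),
})
print(result)
```

When every output name is a valid identifier, the same call can be written directly with keywords, e.g. agg(Fraud=('isFraud', 'sum')).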