This code:
import pandas as pd
from numpy.random import randn
df = pd.DataFrame(randn(8, 4), columns=['A', 'B', 'C', 'D'])
def function(group):
    group.sort(columns='C', inplace=True)
    group = group[['A', 'B', 'C', 'D']]
    return group
df2 = df.groupby(['A']).apply(function)
produces this error:
ValueError: cannot reindex from a duplicate axis
It's the combination of the sort and the column selection inside a grouped apply that causes the problem. I'm happy to give more detail on why I want to do this, but this is the simplest, most stripped-down code that produces the error.
Comment From: jreback
Is there a reason you are not just doing
In [10]: df.sort('C')
Out[10]:
          A         B         C         D
7 -0.065432  0.476895 -1.933456 -0.225273
5  0.364656  1.510392 -0.552039  0.927939
0  0.144173 -1.230840 -0.551998 -0.103711
6  1.046028  0.906485 -0.449859 -0.185228
1 -0.467742  0.965226  0.546713 -1.300566
3 -0.687709  1.468811  1.031457 -0.760951
2  0.221976 -1.374526  1.753068  0.026533
4 -0.997729 -0.996212  2.454797 -1.431332
What are you expecting this to do?
Comment From: TDaltonC
Yes. In trying to make the smallest piece of code that would still produce the error, I made a script that doesn't actually do anything. What I'm actually trying to accomplish looks more like:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'bar'],
                   'B': [0, 0, 3, 3, 2, 2, 1, 1]})
def function(group):
    group.sort(columns='B', inplace=True)
    group['C'] = group['B'].diff()
    trimedGroup = group[['A', 'C']]
    return trimedGroup
df2 = df.groupby(['A']).apply(function)
And I expect to get back:
In [28]: df2
Out[28]:
     A    C
0  foo  NaN
1  bar  NaN
2  foo    1
3  bar    1
4  foo    1
5  bar    1
6  foo    1
7  bar    1
I want to take the diff() of sort()ed groups.
Comment From: jreback
I think .diff needs a slightly different definition, as
df.sort('B').groupby('A', as_index=False).B.diff()
should work but raises a TypeError.
In [42]: pd.concat([df,df.sort('B').groupby('A').B.diff()],axis=1)
Out[42]:
     A  B   0
0  foo  0 NaN
1  bar  0 NaN
2  foo  3   1
3  bar  3   1
4  foo  2   1
5  bar  2   1
6  foo  1   1
7  bar  1   1
Comment From: jreback
You NEVER want to sort within a group; simply sort the entire frame beforehand. In fact, you always want to do as much work as possible in a vectorized fashion.
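For readers on a current pandas release, the sort-first approach might look roughly like the sketch below. It assumes `sort_values` in place of the older `sort` used earlier in this thread, and the `result` name is only illustrative. The frame is sorted once, the per-group `diff` is taken on the sorted rows, and the assignment back to `df` aligns on the original index labels.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "bar"],
    "B": [0, 0, 3, 3, 2, 2, 1, 1],
})

# Sort the whole frame once, then take the difference within each group.
# The resulting Series keeps the original row labels, so assigning it
# back to df aligns each value with the row it came from.
df["C"] = df.sort_values("B").groupby("A")["B"].diff()

# Keep only the columns the original function returned.
result = df[["A", "C"]]
print(result)
```

No `apply` or in-place mutation of a group is needed here, which sidesteps the reindexing step that raised the error in the original report.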
Comment From: TomAugspurger
Duplicate of https://github.com/pandas-dev/pandas/issues/19437