Pandas Sort()ing then selecting columns in a function apply()d to a grouped DataFrame

This code:

import pandas as pd
from numpy.random import randn


df = pd.DataFrame(randn(8, 4), columns=['A', 'B', 'C', 'D'])

def function(group):
    group.sort(columns = 'C', inplace=True)
    group = group[['A', 'B', 'C', 'D']]
    return group    

df2 = df.groupby(['A']).apply(function)

produces this error:

ValueError: cannot reindex from a duplicate axis

It's the combination of the sort and the column selection inside a grouped apply that causes the problem. I'm happy to give more detail on why I want to do this, but this is the simplest, most stripped-down code that produces the error.

Comment From: jreback

Is there a reason you are not just doing

In [10]: df.sort('C')
Out[10]: 
          A         B         C         D
7 -0.065432  0.476895 -1.933456 -0.225273
5  0.364656  1.510392 -0.552039  0.927939
0  0.144173 -1.230840 -0.551998 -0.103711
6  1.046028  0.906485 -0.449859 -0.185228
1 -0.467742  0.965226  0.546713 -1.300566
3 -0.687709  1.468811  1.031457 -0.760951
2  0.221976 -1.374526  1.753068  0.026533
4 -0.997729 -0.996212  2.454797 -1.431332

what are you expecting this to do?

Comment From: TDaltonC

Yes. In trying to make the smallest piece of code that would still produce the error, I made a script that doesn't actually do anything. What I'm actually trying to accomplish looks more like:

import pandas as pd
import numpy as np


df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'bar'],
                   'B' : [0,0,3,3,2,2,1,1]})

def function(group):
    group.sort(columns = 'B', inplace=True)
    group['C'] = group['B'].diff()
    trimedGroup = group[['A', 'C']]
    return trimedGroup

df2 = df.groupby(['A']).apply(function)

And I expect to get back:

In[28]:df2
Out[28]: 
     A   C
0  foo NaN
1  bar NaN
2  foo   1
3  bar   1
4  foo   1
5  bar   1
6  foo   1
7  bar   1

I want to take the diff() of sort()ed groups.

Comment From: jreback

I think .diff needs a slightly different definition, as df.sort('B').groupby('A',as_index=False).B.diff() should work but raises a TypeError.

In [42]: pd.concat([df,df.sort('B').groupby('A').B.diff()],axis=1)
Out[42]: 
     A  B   0
0  foo  0 NaN
1  bar  0 NaN
2  foo  3   1
3  bar  3   1
4  foo  2   1
5  bar  2   1
6  foo  1   1
7  bar  1   1

Comment From: jreback

You NEVER want to sort within a group; simply sort the entire frame beforehand. In fact, you always want to do as much work as possible in a vectorized fashion.
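
For reference, here is a minimal sketch of that approach using the sample frame from earlier in the thread. It assumes a pandas version where `sort_values` replaces the older `sort` method; the names `sorted_df`, `diffs`, and `result` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'bar'],
                   'B': [0, 0, 3, 3, 2, 2, 1, 1]})

# Sort the whole frame once, outside of any per-group function.
# Assumption: sort_values is available (it replaced DataFrame.sort).
sorted_df = df.sort_values('B')

# Per-group diff on the sorted frame; the result keeps the original row labels.
diffs = sorted_df.groupby('A')['B'].diff()

# assign aligns on the index, so each diff lands back on its original row.
result = df[['A']].assign(C=diffs)
print(result)
```

Because the diffs are aligned by index rather than by position, the rows of `result` stay in the original order even though the diff was computed on the sorted frame.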

Comment From: TomAugspurger

Duplicate of https://github.com/pandas-dev/pandas/issues/19437