Pandas GroupBy key does not get returned to output after transform operation on entire dataframe

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np


def processCols(x):
    x_dtype = str(x.dtype)
    if x_dtype=='float64':
        y = x.fillna(x.mean())  # replace nan in col with mean
        return y
    elif x_dtype=='category':
        x_top = x.describe().top
        y = x.fillna(str(x_top))  # replace nan in col with most frequently occurring category
        return y
    elif x_dtype=='int64': #do nothing for now
        return x
    else:  # do nothing
        return x


# Test snippet

#pid,typeid,bnum,bcat
#41,A,1.0,3D
#42,A,2.0,3D
#43,B,4.5,3E
#44,A,4.0,3D
#45,A,5.0,4D
#46,B,3.0,3E
#47,A,1.1,4D
#48,B,2.0,3F
#49,A,,
#50,A,,
#51,B,,
#52,B,,

rawData = pd.DataFrame({'pid' : np.array([41,42,43,44,45,46,47,48,49,50,51,52], dtype='int64'),
                        'typeid' : pd.Categorical(["A","A","B","A","A","B","A","B","A","A","B","B"]),
                        'bnum' : np.array([1,2,4.5,4,5,3,1.1,2,np.nan,np.nan,np.nan,np.nan],dtype='float64'),
                        'bcat' : pd.Categorical(["3D","3D","3E","3D","4D","3E","4D","3F",np.nan,np.nan,np.nan,np.nan])
                       })


groupedBy = rawData.groupby('typeid')

for name, group in groupedBy:
    print('By Groups:')
    print(name)
    print('Before nan correction')
    print(group)
    transformed = group.transform(processCols)
    print()
    print('After nan correction')
    print(transformed)
    print()

# Now process all at once to compare 
print()     
print('transformed all at once')  
transformed_all_at_once = groupedBy.transform(processCols)
print(transformed_all_at_once)

Problem description

When groupby.transform is applied with a custom function to a full data frame, the groupby key is not returned in the final glued-together output. However, if the same transform is applied while iterating over the groups, the individual data frames do include the groupby key. Because the key is removed in the all-at-once case, an external join is required to reintroduce it. The individual replacements for the nan values (which is what the custom function does) are correct in both cases. Thanks.

Expected Output

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.2
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Comment From: gfyoung

@td2014 : Thanks for reporting this! Could you also post the actual output that you're seeing and the output that you expect?

Comment From: td2014

@gfyoung : Here is the output as the code stands. You'll notice that the output of the variable 'transformed all at once' (near the bottom of the printout) is missing the groupby key, which is 'typeid'. However, when iterating over the groups in the loop, the dataframe returned from transform includes 'typeid'. The output I expect from the 'transformed all at once' section should also include the 'typeid' column, which was the groupby key. Is this not the correct expectation? Thanks.

```
By Groups:
A
Before nan correction
  bcat  bnum  pid typeid
0   3D   1.0   41      A
1   3D   2.0   42      A
3   3D   4.0   44      A
4   4D   5.0   45      A
6   4D   1.1   47      A
8  NaN   NaN   49      A
9  NaN   NaN   50      A

After nan correction
  bcat  bnum  pid typeid
0   3D  1.00   41      A
1   3D  2.00   42      A
3   3D  4.00   44      A
4   4D  5.00   45      A
6   4D  1.10   47      A
8   3D  2.62   49      A
9   3D  2.62   50      A

By Groups:
B
Before nan correction
   bcat  bnum  pid typeid
2    3E   4.5   43      B
5    3E   3.0   46      B
7    3F   2.0   48      B
10  NaN   NaN   51      B
11  NaN   NaN   52      B

After nan correction
   bcat      bnum  pid typeid
2    3E  4.500000   43      B
5    3E  3.000000   46      B
7    3F  2.000000   48      B
10   3E  3.166667   51      B
11   3E  3.166667   52      B


transformed all at once
   bcat      bnum  pid
0    3D  1.000000   41
1    3D  2.000000   42
2    3E  4.500000   43
3    3D  4.000000   44
4    4D  5.000000   45
5    3E  3.000000   46
6    4D  1.100000   47
7    3F  2.000000   48
8    3D  2.620000   49
9    3D  2.620000   50
10   3E  3.166667   51
11   3E  3.166667   52
```

Comment From: gfyoung

@td2014 : Thanks for doing this! It does indeed look a little weird when you break it down like that. Do note that you are calling .transform on different types of objects in the loop than when you do the all-at-once. Let's ping @jreback and @jorisvandenbossche to see what they think.
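
To make the "different types of objects" point concrete, here is a minimal sketch. It uses a reduced, hypothetical frame (not the reporter's full data) and an identity function, so that the only visible difference is which columns come back. Inside the loop each `group` is an ordinary DataFrame, so `DataFrame.transform` is called; outside the loop the call goes to the groupby object's own `transform`.

```python
import pandas as pd

# Reduced, hypothetical data: one key column and one value column.
df = pd.DataFrame({
    "typeid": ["A", "A", "B"],
    "bnum": [1.0, None, 4.5],
})

grouped = df.groupby("typeid")

# In the loop, each group is a plain DataFrame that still carries the key
# column, so this is DataFrame.transform.
per_group = [group.transform(lambda col: col) for _, group in grouped]

# Here the method belongs to the groupby object, which operates on the
# non-key columns only.
all_at_once = grouped.transform(lambda col: col)

print(type(grouped).__name__)       # the groupby object's class
print(list(per_group[0].columns))   # key column present
print(list(all_at_once.columns))    # key column absent
```

So the two code paths are not the same method applied twice, which is consistent with the difference in the printed columns.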

Comment From: mdziel

Indeed, I also noticed this behavior and it seems both weird and inconsistent with previous versions of pandas. Previously, groupby('groupvar').fillna() would add groupvar to the index of the resulting dataframe but now in pandas 1.3.3 groupvar is simply gone from the output.

Comment From: rhshadrach

This is by design, see https://pandas.pydata.org/pandas-docs/dev/user_guide/groupby.html#transformation
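
Since the transformed result keeps the same index as the input, one possible workaround (a sketch under that assumption, not an official recommendation) is to transform only the value columns and assign them back onto the original frame, which avoids an external join. The frame and the lambda below are illustrative and much smaller than the reporter's example.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; column names mirror the issue, values are made up.
df = pd.DataFrame({
    "pid": [41, 42, 43, 44],
    "typeid": ["A", "A", "B", "B"],
    "bnum": [1.0, np.nan, 4.5, np.nan],
})

# Fill NaN with the per-group mean; the result is indexed like df.
filled = df.groupby("typeid")["bnum"].transform(lambda s: s.fillna(s.mean()))

# The index is preserved, so the column can be assigned straight back and
# the grouping key stays in place.
result = df.assign(bnum=filled)
print(result)
```

For several value columns, the same idea should carry over by selecting a list of columns before calling `transform` and assigning the resulting frame back column by column.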
