Pandas as_index =True for groupby transform inconsistent with agg

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
index = pd.date_range('10/1/1999', periods=1100, name='Date')

ts = pd.DataFrame({'Entry': np.random.normal(300, 50, 1100), 
                                'Exit':  np.random.normal(300, 50, 1100)}, index)

ts['Year'] = ts.index.year
ts['Day'] = ts.index.weekday_name

zscore = lambda x: (x - x.mean()) / x.std()
ts.groupby(['Year', 'Day'],  as_index=True).transform(lambda g:g).head(5)

Problem description

Unlike agg, when using transform 1. as_index =True does not set the index to the groupby by (i.e ['Year', 'Day']). 2. as_index = False does not prepend the groupby by column to the output of the transform as is done with agg.

I found this inconsistency very confusing and hard to work with.

This is also noted in ##5755

Expected Output

If as_index =True, the index of the result should the groupby by (i.e ['Year', 'Day']).

if as_index =False, the output should have the the groupby by (i.e ['Year', 'Day']) prepended to the as is done with agg.

I've really struggled using transform, partly as the semantics of transform were unclear to me. Specifically : 1. Which columns is the transform applied to. Is it all columns of the df or only those not in the groupby by? It looks like its the latter. This seems to be true with agg too. It would have been helpful to me if this was clearer in the documentation

Is the transform passed a df of columns or is it applied column by column? My guess is it's the former if the transform takes a df otherwise it's the latter. This seems to be hinted at in some parts of the documentation but it could be made more explicit.

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 3.5.2.final.0 python-bits: 64 OS: Linux OS-release: 4.4.41-36.55.amzn1.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.19.2 nose: 1.3.7 pip: 9.0.1 setuptools: 27.2.0 Cython: 0.25.2 numpy: 1.11.3 scipy: 0.18.1 statsmodels: 0.6.1 xarray: None IPython: 5.1.0 sphinx: 1.5.1 patsy: 0.4.1 dateutil: 2.6.0 pytz: 2016.10 blosc: None bottleneck: 1.2.0 tables: 3.3.0 numexpr: 2.6.1 matplotlib: 2.0.0 openpyxl: 2.4.1 xlrd: 1.0.0 xlwt: 1.2.0 xlsxwriter: 0.9.6 lxml: 3.7.2 bs4: 4.5.3 html5lib: None httplib2: None apiclient: None sqlalchemy: 1.1.5 pymysql: None psycopg2: 2.6.2 (dt dec pq3 ext lo64) jinja2: 2.9.4 boto: 2.45.0 pandas_datareader: None

Comment From: jreback

I think you are confused as to the purpose of .transform. This is a reductive operation on a group to a scalar, which is then broadcast back to the original shape.

So I would say that what you are doing should actually raise a ValueError; we don't really check this fully.

.apply is a more general operator that will infer what you are doing / want.

Further I think we completely ignore as_index with transform as it makes no sense. Since it defaults to True, I think we could/should raise if its not True.

I will mark this as a better error reporting

Comment From: Kevin-McIsaac

I concur that I'm confused, however why should transform drop the groupby columns? Why is this more useful than returning the transformed columns with the groupby columns prepended as is done by agg?

Comment From: jreback

I think the docs are pretty clear: http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation

Comment From: Kevin-McIsaac

The docs states "The transform method returns an object that is indexed the same", so it is clear that the index labels are the same, however it is silent on 1. Which columns are transformed. Excluding the "by" columns makes sense and is the same as agg. 2. What happens to the "by" columns and why dropping them is the right choice where as with agg they are retained.

Comment From: jreback

well I think that should be clear, the semantics are exactly the same as .agg, IOW on a DataFrameGroupby you get all columns but the groupers, and on a SeriesGrouppBy you get that column.

.transform is essentially a reductive operation on each group which is then broadcast back to the original size.

Comment From: Kevin-McIsaac

Yes the semantics are the same, making point 1 explicit (both for agg and transform) in the docs would help new users like myself.

What about point 2, why is it the right thing to drop the groupers columns, which is not how .agg works?

Comment From: jreback

well, if you want to a pull-request would be helpful.

its a transform, by-definition you are only affecting the non-groupers. you are returning the original just with replaced data for the non-groupers.

its rarely makes sense to perform an operation on grouped columns. for example normalizing doesn't make any sense on datetimes or strings, sure it makes sense on intergers (talking about groupers here), but that would be just weird.

Comment From: Kevin-McIsaac

I'm not suggesting transforming the groupers just prepending them to the transformed columns, which is the same as the .agg semantics.

The transformed DataFrame then has the same shape as the original, with the same indexes and same columns. I think that makes more sense.

It also enable as_index = True to have a sensible interpretation.

Comment From: jreback

that's not very useful. The point is to align the index on assign back. IOW a very common thing to do is.

df = df.assign(transformed=df.groupby(...).foo.transform(...))

which only makes sense if the returned column has the same index as the original.

Comment From: Kevin-McIsaac

BTW, thank you for persisting with me on this.

Agreed this is common, however if the groupers were retained (i.e., prepended as with .agg) in what I suspect is the common use case there isn't a need to do an assign.

Using my example you propose writing:

ts[['Entry', 'Exit']] = ts.groupby(['Year', 'Day']).transform(zscore)
````
using assignment to update ts as a side-effect. This works if the index is unique, and is a common approach, but fails if the index is not unique. In my case, the index is only unique within each group.

I'm proposing the following give the same result by pre-pended the groupers, as is done with .agg 
```python
ts.groupby(['Year', 'Day']).transform(zscore)

This has the advantages of: 1. being shorter, 2. the same behaviour as .agg 3. allows an interpretation of as_index 4. being side effect free.

Of course if I only want to apply the transform just Exit, then there is a problem and I'd need to write

ts[['Exit']] = ts.groupby(['Year', 'Day']).transform(zscore).Exit
````

**Comment From: jreback**

you can just use ``.apply`` to get what you want. ``.transform`` has a specific function. Please really learn why things are before trying to change them. 

**Comment From: Kevin-McIsaac**

Fair enough.  I did spend about 6 hrs on this before raising the issue. Taking your advice I tried apply in the obvious way:

```python
zscore = lambda x: (x - x.mean()) / x.std()
ts.groupby(['Year', 'Day'],  as_index=True).apply(zscore)

but this gives

ValueError: can only convert an array of size 1 to a Python scalar

so I exclude the groupers

ts.groupby(['Year', 'Day'],  as_index=True).apply(lambda g: zscore(g[['Exit', 'Entry']]))

and then I get the same result as transform. So is this the semantics of transform, i.e., apply excluding groupers.

To get the result I've proposed I have to write

def my_transform(group, by, fn):
    transformed = fn(group.drop(by, axis='columns'))
    return pd.concat([group[by], transformed], axis='columns')

ts.groupby(['Year', 'Day']).apply(lambda g:my_transform(g, ['Year', 'Day'], zscore))

Comment From: rhshadrach

Seems to me too much is attempted being done within one line.

g = ts.groupby(['Year', 'Day'])
mean, std = g.transform('mean'), g.transform('std')
(ts[['Entry', 'Exit']] - mean)/std

This also has the benefit of not using apply, it should also be more performant but I haven't tested this.

I think this issue can be closed.

Pandas as_index =True for groupby transform inconsistent with agg

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

Output of `pd.show_versions()`