Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np
index = pd.date_range('10/1/1999', periods=1100, name='Date')
ts = pd.DataFrame({'Entry': np.random.normal(300, 50, 1100),
'Exit': np.random.normal(300, 50, 1100)}, index)
ts['Year'] = ts.index.year
ts['Day'] = ts.index.weekday_name
zscore = lambda x: (x - x.mean()) / x.std()
ts.groupby(['Year', 'Day'], as_index=True).transform(lambda g:g).head(5)
Problem description
Unlike agg, when using transform 1. as_index =True does not set the index to the groupby by (i.e ['Year', 'Day']). 2. as_index = False does not prepend the groupby by column to the output of the transform as is done with agg.
I found this inconsistency very confusing and hard to work with.
This is also noted in ##5755
Expected Output
If as_index =True, the index of the result should the groupby by (i.e ['Year', 'Day']).
if as_index =False, the output should have the the groupby by (i.e ['Year', 'Day']) prepended to the as is done with agg.
I've really struggled using transform, partly as the semantics of transform were unclear to me. Specifically : 1. Which columns is the transform applied to. Is it all columns of the df or only those not in the groupby by? It looks like its the latter. This seems to be true with agg too. It would have been helpful to me if this was clearer in the documentation
- Is the transform passed a df of columns or is it applied column by column? My guess is it's the former if the transform takes a df otherwise it's the latter. This seems to be hinted at in some parts of the documentation but it could be made more explicit.
Output of pd.show_versions()
Comment From: jreback
I think you are confused as to the purpose of .transform
. This is a reductive operation on a group to a scalar, which is then broadcast back to the original shape.
So I would say that what you are doing should actually raise a ValueError
; we don't really check this fully.
.apply
is a more general operator that will infer what you are doing / want.
Further I think we completely ignore as_index
with transform as it makes no sense. Since it defaults to True
, I think we could/should raise if its not True
.
I will mark this as a better error reporting
Comment From: Kevin-McIsaac
I concur that I'm confused, however why should transform drop the groupby columns? Why is this more useful than returning the transformed columns with the groupby columns prepended as is done by agg?
Comment From: jreback
I think the docs are pretty clear: http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation
Comment From: Kevin-McIsaac
The docs states "The transform method returns an object that is indexed the same", so it is clear that the index labels are the same, however it is silent on 1. Which columns are transformed. Excluding the "by" columns makes sense and is the same as agg. 2. What happens to the "by" columns and why dropping them is the right choice where as with agg they are retained.
Comment From: jreback
well I think that should be clear, the semantics are exactly the same as .agg
, IOW on a DataFrameGroupby
you get all columns but the groupers, and on a SeriesGrouppBy
you get that column.
.transform
is essentially a reductive operation on each group which is then broadcast back to the original size.
Comment From: Kevin-McIsaac
Yes the semantics are the same, making point 1 explicit (both for agg and transform) in the docs would help new users like myself.
What about point 2, why is it the right thing to drop the groupers columns, which is not how .agg works?
Comment From: jreback
well, if you want to a pull-request would be helpful.
its a transform, by-definition you are only affecting the non-groupers. you are returning the original just with replaced data for the non-groupers.
its rarely makes sense to perform an operation on grouped columns. for example normalizing doesn't make any sense on datetimes or strings, sure it makes sense on intergers (talking about groupers here), but that would be just weird.
Comment From: Kevin-McIsaac
I'm not suggesting transforming the groupers just prepending them to the transformed columns, which is the same as the .agg semantics.
The transformed DataFrame then has the same shape as the original, with the same indexes and same columns. I think that makes more sense.
It also enable as_index = True to have a sensible interpretation.
Comment From: jreback
that's not very useful. The point is to align the index on assign back. IOW a very common thing to do is.
df = df.assign(transformed=df.groupby(...).foo.transform(...))
which only makes sense if the returned column has the same index as the original.
Comment From: Kevin-McIsaac
BTW, thank you for persisting with me on this.
Agreed this is common, however if the groupers were retained (i.e., prepended as with .agg) in what I suspect is the common use case there isn't a need to do an assign.
Using my example you propose writing:
ts[['Entry', 'Exit']] = ts.groupby(['Year', 'Day']).transform(zscore)
````
using assignment to update ts as a side-effect. This works if the index is unique, and is a common approach, but fails if the index is not unique. In my case, the index is only unique within each group.
I'm proposing the following give the same result by pre-pended the groupers, as is done with .agg
```python
ts.groupby(['Year', 'Day']).transform(zscore)
This has the advantages of: 1. being shorter, 2. the same behaviour as .agg 3. allows an interpretation of as_index 4. being side effect free.
Of course if I only want to apply the transform just Exit, then there is a problem and I'd need to write
ts[['Exit']] = ts.groupby(['Year', 'Day']).transform(zscore).Exit
````
**Comment From: jreback**
you can just use ``.apply`` to get what you want. ``.transform`` has a specific function. Please really learn why things are before trying to change them.
**Comment From: Kevin-McIsaac**
Fair enough. I did spend about 6 hrs on this before raising the issue. Taking your advice I tried apply in the obvious way:
```python
zscore = lambda x: (x - x.mean()) / x.std()
ts.groupby(['Year', 'Day'], as_index=True).apply(zscore)
but this gives
ValueError: can only convert an array of size 1 to a Python scalar
so I exclude the groupers
ts.groupby(['Year', 'Day'], as_index=True).apply(lambda g: zscore(g[['Exit', 'Entry']]))
and then I get the same result as transform. So is this the semantics of transform, i.e., apply excluding groupers.
To get the result I've proposed I have to write
def my_transform(group, by, fn):
transformed = fn(group.drop(by, axis='columns'))
return pd.concat([group[by], transformed], axis='columns')
ts.groupby(['Year', 'Day']).apply(lambda g:my_transform(g, ['Year', 'Day'], zscore))
Comment From: rhshadrach
Seems to me too much is attempted being done within one line.
g = ts.groupby(['Year', 'Day'])
mean, std = g.transform('mean'), g.transform('std')
(ts[['Entry', 'Exit']] - mean)/std
This also has the benefit of not using apply
, it should also be more performant but I haven't tested this.
I think this issue can be closed.