http://stackoverflow.com/questions/25372877/python-pandas-groupby-and-qcut-doesnt-work-in-0-14-1

import numpy as np, pandas as pd
df = pd.DataFrame({'x': np.random.rand(20), 'grp': ['a'] * 10 + ['b'] * 10})

This worked in 0.13 (though it returned a 2-element Series whose values were lists):

df.groupby('grp')['x'].transform(pd.qcut, 3)

This workaround is much cleaner:

df.groupby('grp')['x'].apply(lambda x: pd.Series(pd.qcut(x,3), index=x.index))

Comment From: jreback

cc @JanSchulz: we may need some inference on the returned categoricals.

Comment From: jankatins

OK, there are multiple issues here:

- qcut should return an ordered categorical.
- df.groupby('grp')['x'].transform(pd.qcut, 3) should (almost) never work, because the returned categoricals have different levels per group and so can't be concatenated into one Series (see the sketch below).
- The error message is wrong: it should say something like "Categoricals with different levels can't be merged into one Series", not that something can't be converted to float.
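
To make the second point concrete, here is a small sketch on the example frame from the top of this issue (the seed is only there to make it reproducible); with current pandas the categories print as Intervals, but the point is the same: each group gets its own levels, so the pieces can't be stitched back into one Series.

    import numpy as np
    import pandas as pd

    np.random.seed(0)
    df = pd.DataFrame({'x': np.random.rand(20), 'grp': ['a'] * 10 + ['b'] * 10})

    # qcut computes the quantile edges per group, so each group's
    # categorical ends up with its own set of levels
    for name, group in df.groupby('grp')['x']:
        print(name, pd.qcut(group, 3).cat.categories)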

One problem is that in the case of a categorical, the astype(common_type) step doesn't work: it never sees a categorical, because common_type is constructed from np.asarray(res) and is therefore not a categorical anymore. On the other hand, things go wrong a lot earlier: result = self._selected_obj.values.copy() leads to trouble, because the result type then has nothing to do with what the function returns, so categoricals would have to be special-cased here :-(

IMO the best way would be to construct a Series from the first res, but with the index of the full input. I'm not sure that actually works, though (different lengths: the input index covers all groups, while res only covers one group).
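
Roughly what I mean, as a minimal sketch on the example frame from the top of this issue (this is not the actual groupby internals, just the reindexing idea):

    import numpy as np
    import pandas as pd

    np.random.seed(0)
    df = pd.DataFrame({'x': np.random.rand(20), 'grp': ['a'] * 10 + ['b'] * 10})

    # take the first group's result and reindex it to the full input index;
    # the categorical dtype is preserved and the other group's rows are NaN
    first_res = pd.qcut(df.loc[df['grp'] == 'a', 'x'], 3)
    seeded = first_res.reindex(df.index)
    print(seeded.dtype)  # category

    # the open question: filling in the 'b' rows afterwards hits the
    # different-categories problem again, so this alone is not a fix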

@jreback: can you merge the fixup PR, then I can submit a new PR for this one without the need to rebase constantly the whole PR.

Comment From: jreback

@JanSchulz ok, fixups merged.

Yeah, the transform(pd.qcut, 3) is a little bogus; I'm not saying it should work, just that it needs a better error message.

Comment From: jreback

@JanSchulz if you get to this before 0.15, OK; otherwise we'll push it back (I may take a look as well).

Comment From: jankatins

Let's see if I find time tomorrow; if not, it will have to wait until after my holidays :-/ The first point is addressed (qcut now returns an ordered categorical).
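
For reference, a quick sanity check of that first point (this reflects current behaviour):

    import numpy as np
    import pandas as pd

    cats = pd.qcut(np.random.rand(20), 3)
    print(type(cats))    # a Categorical
    print(cats.ordered)  # True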

Comment From: jankatins

This is not easy; this is the current code:

        dtype = self._selected_obj.dtype
        result = self._selected_obj.values.copy()

        wrapper = lambda x: func(x, *args, **kwargs)
        for i, (name, group) in enumerate(self):

            object.__setattr__(group, 'name', name)
            res = wrapper(group)

            if hasattr(res, 'values'):
                res = res.values

            # may need to astype
            try:
                common_type = np.common_type(np.array(res), result)
                if common_type != result.dtype:
                    result = result.astype(common_type)
            except:
                pass

            indexer = self._get_index(name)
            result[indexer] = res

        result = _possibly_downcast_to_dtype(result, dtype)

But this is not going to work with categoricals, because it stuffs the new values in together with the old ones, which will always convert the categoricals to a numpy type. There are three possible ways forward:

- Rewrite the function to use concat if res is a Categorical (which will probably not be performant...).
- Simply throw a warning ("categorical converted to string/number") or an error ("categoricals can't be used in transform/...").
- Document that returning categorical data from transforms is "undefined behaviour" and probably not going to work.

The "throw an error" variant is not always "correct", because one could return a categorical with the same categories for all groups and this should actually succeed.

Comment From: jankatins

@jreback can you comment here?

Comment From: jreback

@JanSchulz let me take a look.

Comment From: jreback

@JanSchulz this is related to #7883

The issue is that I took a short-cut to make the Series transform path fast, but that didn't handle multiple dtypes well.

So I have to defer this (you can use the work-around I suggested in this issue in the meantime). We can fix it for 0.15.1.

Comment From: TomAugspurger

This now returns an object series of Intervals (soon to be an IntervalArray).

I think the original issue is fixed.

In [140]: df.groupby('grp')['x'].transform(pd.qcut, 3)
Out[140]:
0       (0.796, 0.945]
1       (0.571, 0.796]
2       (0.571, 0.796]