Context

For methods that accept UDFs, I've been looking at a number of issues that would involve long-standing behavior changes, some of which I feel are quite significant:

  • .agg should always reduce #35725
  • Series.transform and DataFrame.transform should broadcast reduces (like groupby does) #18103
  • Series.apply and DataFrame.apply use agg for list/dict arguments
  • Series.agg and DataFrame.agg use apply if argument is a UDF
  • UDF method arguments are inconsistent #40112
  • Make groupby.apply dumb #39209
  • groupby.apply includes grouping column(s) in computation unlike groupby.agg/groupby.transform

For some of these, it is difficult to emit a FutureWarning. For example, I'm not sure how we could emit a warning about changing the behavior of agg for UDFs. I've been working toward consolidating the methods for apply/agg/transform et al. in pandas.core.apply, resolving inconsistencies as they are encountered. This is very much a work in progress, and I expect to encounter more.

Proposal

Add a new submodule pandas.core.homs, with HOM standing for "Higher Order Method": a method that takes a callable function as its primary argument. Also add a new, experimental implementation behind the option use_homs_api. Whenever possible, any changes will be contained within the pandas.core.homs module. Progress can then be made without worrying about deprecation warnings and changing behaviors. When it's ready, we can then proceed as follows:

  • Add docs on the option classifying it as experimental.
  • Classify the option as stable.
  • Change the default of the option to no-default, emitting a FutureWarning when it is not set that the behavior of many methods will change and that users should set it to True.
  • Change the default of the option to True, keeping the option to allow users to change back if necessary.
  • Remove the option and old implementation.

Goals

  • Consistent behaviors between different Higher Order Methods
  • Consistent API between Higher Order Methods
  • Excisability: Should the project die out, it needs to be easy to excise the experimental code. All changes need to be behind if get_option('use_homs_api'):.

Comment From: jbrockmendel

Might fit on this list: #12653 (API: ban mutation within groupby.apply)

Comment From: rhshadrach

Thanks @jbrockmendel - agreed and added.

Comment From: rhshadrach

Mentioned on the call today was how agg and apply use each other (e.g. #42833), and some examples where the results differ were requested. After looking again through the code, they end up with the same result in more cases than I had originally thought, but here are two differences I found. I believe there are more (and I'm happy to find some if these are deemed insignificant).

ser = pd.Series([], dtype=float)
print(ser.agg('sum'))  # Gives: 0.0
print(ser.apply('sum'))  # Gives: Series([], dtype: float64)

ser = pd.Series([1, 2, 3])
print(ser.agg(lambda x: x.sum()))  # Gives: 6
print(ser.apply(lambda x: x.sum()))  # Raises AttributeError

cc @jbrockmendel

Comment From: jbrockmendel

Thanks for finding the example. So if you made the changes discussed on the call, which of the behaviors in that example would change?

Comment From: rhshadrach

This got a bit long, so first the short version: I think the desired behavior of agg is pretty clear, but apply's sometimes isn't. I'm working on a PoC, but it probably won't be ready for some time. The long version is...

I don't think any of these should change. The problem is mostly with list-likes.

ser = pd.Series([1, 2, 3])
print(ser.apply([lambda x: x.sum()]))

gives

<lambda>    6
dtype: int64

which is inconsistent with the example above. This is because apply on lists is implemented by just calling agg.

So we need to implement apply on lists. What we can't do is "just do the same thing as agg on lists, but use apply with each element of the list instead". The way agg currently works on lists is to break a DataFrame up into its individual Series, and then use agg with each element of the list on each Series. This poses a problem because apply and agg behave differently on DataFrames versus Series. agg (when provided with a reducer) always reduces dimension: DataFrame -> Series -> scalar. On the other hand, while df.apply(foo) (when provided with a reducer) behaves just like agg (applying foo to each column), ser.apply(foo) does not (in most cases): it attempts to apply foo to each element of the Series individually.

The upshot is that if apply with lists were implemented the same way agg with lists is, df.apply([foo]) would not behave at all like df.apply(foo). The former would do ser.apply(foo) on each Series that makes up df, which in turn passes each cell of df to foo individually, whereas df.apply(foo) passes each whole Series to foo.
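The asymmetry can be seen directly. A minimal illustration, assuming a reasonably recent pandas (the reducer is swapped for an elementwise function on the Series side, since a reducer there never sees the whole Series):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
ser = df["a"]

# DataFrame.apply with a reducer behaves like agg: the function sees each
# column as a Series, and the result reduces to one scalar per column.
col_sums = df.apply(lambda x: x.sum())
print(col_sums)  # a -> 6, b -> 15

# Series.apply instead maps over elements: the function sees each scalar
# individually, so a reducer like x.sum() is never given the whole Series.
doubled = ser.apply(lambda x: x * 2)
print(doubled)  # 2, 4, 6
```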

With this, I think the right thing to do is to implement apply with lists as concatenating [df.apply(a) for a in arg] (where arg is the list). But then you also want to do the same for agg; otherwise the MultiIndex columns produced by agg and apply will have their levels in different orders (aggregator names in level 0 vs. level 1), and even if you swap the levels, the ordering within the levels is not the same. I've looked into this and am convinced that if we allow partial failure and duplicate column names, it is not possible to reorder reliably: the information you need, namely which columns have disappeared due to partial failure, just isn't there.
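A sketch of that concatenation approach (the helper name apply_list is mine, not a pandas API, and using function names as column labels is one possible labeling choice):

```python
import pandas as pd

def apply_list(df, funcs):
    # Proposed apply-on-lists: apply each function in turn and concatenate,
    # instead of routing through agg. Each df.apply(f) yields one value per
    # column; concatenating along axis=1 gives one result column per function.
    names = [getattr(f, "__name__", str(f)) for f in funcs]
    return pd.concat([df.apply(f) for f in funcs], axis=1, keys=names)

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

def total(x):
    return x.sum()

def largest(x):
    return x.max()

print(apply_list(df, [total, largest]))
#    total  largest
# a      6        3
# b     15        6
```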

Changing gears, the current problem with Series.agg is that it first tries apply, and then falls back to applying the given UDF to the entire Series. For numeric dtypes, when the UDF is an aggregator, I can't come up with any examples where the "try apply first" step actually succeeds. With object dtypes it's easy, though:

ser = pd.Series([[1], [2], [3]])
print(ser.agg(lambda x: sum(x)))

This results in

0    1
1    2
2    3
dtype: int64

where I think the right result is the list [1, 2, 3]. This is easy to fix - just don't use apply at all in Series.agg.
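For contrast, a sketch of both behaviors. Note that the builtin sum raises a TypeError when handed the whole Series here (0 + [1] is not defined), so ser.sum() stands in below for a reducer written against the Series API; on object dtype it reduces with +, which concatenates the lists:

```python
import pandas as pd

ser = pd.Series([[1], [2], [3]])

# Today: agg tries apply first, so the UDF sees each list element and the
# builtin sum collapses each one-element list to a scalar.
elementwise = ser.apply(lambda x: sum(x))
print(elementwise.tolist())  # [1, 2, 3] as a Series of scalars

# Intended aggregation: the reducer sees the whole Series. Series.sum on
# object dtype reduces with +, concatenating the lists into one value.
whole = ser.sum()
print(whole)  # [1, 2, 3] as a single list
```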

Finally, the examples:

ser = pd.Series([], dtype=float)
print(ser.agg('sum'))  # Gives: 0.0
print(ser.apply('sum'))  # Gives: Series([], dtype: float64)

appear to me to be correct, but they are inconsistent with ser.apply('sum') on a non-empty Series, where apply is identical to agg. To me it is also inconsistent that ser.apply(lambda x: x.sum()) applies the function to each element while ser.apply('sum') does not. But what should ser.apply('sum') do? The only answer I have is for it to be equivalent to ser.apply(lambda x: x.sum()), but that seems quite useless. Maybe that's okay, though?