Feature Type

  • [ ] Adding new functionality to pandas

  • [X] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

pandas.core.groupby.SeriesGroupBy.apply and pandas.core.groupby.DataFrameGroupBy.apply require a callable that takes a pandas object (a Series or DataFrame) as its first argument, and they are usually quite slow.

Other apply methods, such as pandas.core.window.rolling.Rolling.apply and pandas.DataFrame.apply, have a raw argument that passes an ndarray to the callable instead, which makes apply faster.
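To make the difference concrete, here is a minimal sketch of what the callable receives from DataFrame.apply with and without raw=True. The frame and the two recording functions are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Record the type of object each callable receives.
seen_default = []
seen_raw = []

def record_default(col):
    seen_default.append(type(col))
    return col.sum()

def record_raw(col):
    seen_raw.append(type(col))
    return col.sum()

res_default = df.apply(record_default)    # callable is handed a Series per column
res_raw = df.apply(record_raw, raw=True)  # callable is handed an ndarray per column

print(seen_default, seen_raw)
```

Both calls return the same column sums; only the type handed to the callable changes.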

Feature Description

Give the groupby apply methods a signature like this:

def apply(func, raw=False, engine=None, engine_kwargs=None, args=None, kwargs=None):
    ...
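Until such an argument exists, the requested behaviour can be approximated by hand. The sketch below is only an illustration: groupby_apply_raw is a hypothetical helper, not part of pandas. It assumes a Series input and keys without missing values, and it uses the existing GroupBy.indices mapping to slice the underlying ndarray per group.

```python
import numpy as np
import pandas as pd

def groupby_apply_raw(series, keys, func):
    """Hypothetical helper: call func on each group's values as an ndarray."""
    values = series.to_numpy()
    # GroupBy.indices maps each group label to the integer positions of its rows.
    indices = series.groupby(keys).indices
    return pd.Series({name: func(values[pos]) for name, pos in indices.items()})

s = pd.Series([1.0, 5.0, 3.0, 4.0])
keys = pd.Series(["a", "b", "a", "b"])

# Same reduction, once on ndarrays and once through the existing Series-based apply.
raw_result = groupby_apply_raw(s, keys, lambda arr: arr.max() - arr.min())
series_result = s.groupby(keys).apply(lambda g: g.max() - g.min())
```

A built-in raw=True would presumably avoid constructing a Series per group in a similar way, but how it would be implemented is not settled in this issue.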


Alternative Solutions

No

Additional Context

No response

Comment From: topper-123

I think it sounds reasonable, and I'd like the API to be as similar as possible across the apply methods. Performance examples showing improvements will also be needed to show this is worth adding.

Comment From: PaleNeutron

A simple performance test on raw mode:

import pandas as pd
import numpy as np
s = pd.Series(np.random.rand(10000))
%%timeit
s.rolling(60).apply(lambda s: np.all(s[-1] >= s), raw=True) # with raw

34.1 ms ± 1.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
s.rolling(60).apply(lambda s: (s.iloc[-1] >= s).all()) # without raw

1.03 s ± 8.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
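For anyone who wants to reproduce the comparison outside IPython, the following is a rough standalone sketch using time.perf_counter instead of %%timeit. The smaller series size and the float conversion of the return values are choices made here, not part of the measurement above, and the absolute numbers will differ by machine.

```python
import time

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(2000))

def is_window_max_raw(w):
    # w is an ndarray when raw=True
    return float(np.all(w[-1] >= w))

def is_window_max_series(w):
    # w is a Series when raw=False (the default)
    return float((w.iloc[-1] >= w).all())

start = time.perf_counter()
raw_result = s.rolling(60).apply(is_window_max_raw, raw=True)
raw_seconds = time.perf_counter() - start

start = time.perf_counter()
series_result = s.rolling(60).apply(is_window_max_series)
series_seconds = time.perf_counter() - start

print(f"raw=True: {raw_seconds:.4f} s, raw=False: {series_seconds:.4f} s")
```

Both variants should produce the same values, so the comparison only measures the overhead of building a Series for every window.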

Comment From: topper-123

OK, I think this would be a good idea to include then. I think this should be the same for all apply methods though, so also Series.apply and DataFrameGroupBy.apply (there may be some implementation differences between those methods, so achieving that may not be as simple as it sounds).

Comment From: rhshadrach

For DataFrame.apply, raw=True currently isn't fully supported for extension arrays (EAs), e.g. masked arrays with a null value. If we were to move to these as the default (which I think is the direction we're heading), it's not clear to me what would happen to the raw argument. Perhaps this is still worth adding for NumPy dtypes in the meantime?
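One way to see why masked arrays are awkward for a raw mode is to look at what a plain NumPy conversion produces. This is only a sketch of the conversion step, on the assumption that a raw path would hand over something like the result of to_numpy(). It does not show what DataFrame.apply does internally.

```python
import numpy as np
import pandas as pd

masked = pd.Series([1, 2, None], dtype="Int64")

# Default conversion: the missing value forces an object-dtype array holding pd.NA.
as_object = masked.to_numpy()

# Explicit conversion: the caller has to choose a dtype and a fill value.
as_float = masked.to_numpy(dtype="float64", na_value=np.nan)

print(as_object.dtype, as_float.dtype)
```

An object array gives up most of the speed that raw mode is meant to provide, while the float conversion changes the dtype the callable sees, so neither is an obvious default.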