When we do a groupby transform/reduce that requires operating group-by-group, we construct a sorted (DataFrame|Series) so that we can iterate over it efficiently. That construction is cached within a DataSplitter class, but the splitter itself is not cached. IIUC we can get some mileage by caching the DataSplitter, at the possible cost of having a copy hang around longer than we might want.
Also, we have a separate construct-a-sorted-object path in _numba_prep that might be able to reuse some of this code.
Final thought: we could check in DataSplitter.sorted_data whether _sort_idx is monotonic, in which case the (DataFrame|Series) is already sorted and we don't need to make a copy.
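For concreteness, here is a minimal sketch of both ideas together, with hypothetical names (this `DataSplitter` is a stand-in, not the actual pandas class): the sorted object is cached on the splitter, and a monotonic `_sort_idx` short-circuits the copy.

```python
from functools import cached_property

import numpy as np
import pandas as pd


class DataSplitter:
    """Stand-in sketch: caches the sorted object it builds."""

    def __init__(self, data: pd.DataFrame, sort_idx: np.ndarray) -> None:
        self.data = data
        self._sort_idx = sort_idx  # argsort of the group labels

    @cached_property
    def sorted_data(self) -> pd.DataFrame:
        # Fast path: a monotonic-increasing argsort means the data is
        # already in group order, so no take (and no copy) is needed.
        if (np.diff(self._sort_idx) >= 0).all():
            return self.data
        return self.data.take(self._sort_idx)
```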
Comment From: jbrockmendel
if we iron out selected_obj vs obj_with_exclusions, we might be able to go further and cache sorted_data. there are a few other arguments that can be passed to get_splitter, but i think we can plausibly get that down to just one.
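A rough sketch of what caching the splitter itself might look like, assuming get_splitter's arguments really do collapse to a single object and reusing the hypothetical `DataSplitter` sketch above (note `lru_cache` won't work directly here, since DataFrames aren't hashable):

```python
from functools import cached_property

import numpy as np
import pandas as pd


class GroupByLike:
    # Stand-in for the groupby object; assumes selected_obj vs
    # obj_with_exclusions has been unified into a single attribute.
    def __init__(self, obj: pd.DataFrame, sort_idx: np.ndarray) -> None:
        self._obj_with_exclusions = obj
        self._sort_idx = sort_idx

    @cached_property
    def _splitter(self) -> DataSplitter:
        # Built once and reused across group-wise ops; the trade-off is
        # that the splitter (and its sorted copy) now outlives one call.
        return DataSplitter(self._obj_with_exclusions, self._sort_idx)
```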
Comment From: lithomas1
@jbrockmendel
Tangentially related, but do you think we might be able to avoid materializing the entire sorted DF in DataSplitter.__iter__?
IIUC, since DataSplitter takes in the argsorted idxs, instead of materializing the whole array with take in __iter__, can we slice the argsort idxs with the current start/end and do a take from the DataFrame with those indexes?
I could take a closer look at this if you think it's the right approach.
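A sketch of that lazy alternative (hypothetical helper, not the current pandas code): keep the original frame, and for each group take only the rows whose positions fall in that group's slice of the argsort.

```python
import numpy as np
import pandas as pd


def iter_groups_lazily(data, sort_idx, starts, ends):
    """Yield each group's rows without building a fully sorted frame."""
    for start, end in zip(starts, ends):
        # sort_idx[start:end] holds this group's row positions in the
        # original frame, so one small take per group replaces one big
        # take (and copy) over the whole frame.
        yield data.take(sort_idx[start:end])


df = pd.DataFrame({"key": ["b", "a", "b", "a"], "val": [1, 2, 3, 4]})
sort_idx = np.argsort(df["key"].to_numpy(), kind="stable")  # [1, 3, 0, 2]
starts, ends = np.array([0, 2]), np.array([2, 4])  # group boundaries: a, b
for group in iter_groups_lazily(df, sort_idx, starts, ends):
    print(group)
```

The peak-memory win is that only one group's rows are materialized at a time, at the cost of several small takes instead of one big one.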
Comment From: jbrockmendel
Worth a shot.
BTW I think we can avoid a take altogether in already-sorted cases by checking if sort_idx is monotonic.