itertools.groupby groups things contiguously -- great for run-length encoding, not so great for partitioning. That necessitates the groupby(sorted(items, key=keyfn), keyfn) dance if you want to separate the groups, and sorting isn't always what you want either, so you wind up writing

def partition(seq, keyfn):
    """Group seq by keyfn(x), preserving order within each group."""
    d = {}
    for x in seq:
        d.setdefault(keyfn(x), []).append(x)
    return d

and so on.
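
For concreteness, the contrast (standard library only, using the partition above):

>>> from itertools import groupby
>>> [(k, list(g)) for k, g in groupby([1, 1, 2, 1])]
[(1, [1, 1]), (2, [2]), (1, [1])]
>>> partition([1, 1, 2, 1], lambda x: x)
{1: [1, 1, 1], 2: [2]}

The two runs of 1 stay separate under itertools.groupby but get pooled by partition.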

DataFrame.groupby is great for data partitioning, but it merges discontiguous groups. Clustering a timeseries -- first x since the last y, etc. -- is a common task. With some cumsum hacks you can do it, but "get a boolean series, see where it differs from its shifted value to find the transitions, take advantage of the fact that False == 0 and True == 1 to cumsum that into something which grows by one for each cluster, and then groupby on that" is maybe a little more than I'd expect a beginner to have to do to get back what itertools.groupby does naturally. And if there's an easier way, then we should at least make it more obvious. Spelled out, the dance looks like the sketch below.
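
Concretely, with a throwaway series (nothing here beyond what the paragraph above describes):

>>> import pandas as pd
>>> s = pd.Series([1, 1, 2, 2, 1])
>>> changed = s != s.shift()     # True at each transition (first row compares against NaN, so it's True too)
>>> clusters = changed.cumsum()  # False == 0, True == 1, so this grows by one per run
>>> clusters.tolist()
[1, 1, 2, 2, 3]
>>> s.groupby(clusters).sum()
1    2
2    4
3    1
dtype: int64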

I'm not sure what the best way to proceed is, but I've answered variants of this several times on SO, and a cumsum/cumprod-with-reset is a pretty common numpy request.
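
(For the numpy half of that request, the usual vectorized trick is the same idea in disguise: build segment ids with a cumsum and subtract each segment's starting offset. A sketch, not a proposed API -- cumsum_with_reset is just an illustrative name:)

import numpy as np

def cumsum_with_reset(x, reset):
    """Cumulative sum of x that starts over wherever reset is True."""
    x = np.asarray(x)
    cs = np.cumsum(x)
    seg = np.cumsum(reset)            # segment id for each element
    starts = np.flatnonzero(reset)    # positions where a new segment begins
    # running total just before each segment start; 0 for the leading segment
    offsets = np.concatenate(([0], cs[starts] - x[starts]))
    return cs - offsets[seg]

# cumsum_with_reset([1, 2, 3, 4, 5], [False, False, True, False, False])
# -> array([ 1,  3,  3,  7, 12])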

Comment From: cpcloud

Big +1 here. I often wish I could keep groups discontiguous. Maybe a merge_groups=True keyword?

Comment From: jreback

@dsm054 can you put up a simple example (and use the cumsum soln) so we can see what this looks like?

Comment From: dsm054

@jreback: I often do something like

>>> df = pd.DataFrame({"A": [1,1,2,3,2,2,3], "B": [1]*7})
>>> df
   A  B
0  1  1
1  1  1
2  2  1
3  3  1
4  2  1
5  2  1
6  3  1

[7 rows x 2 columns]
>>> df.groupby("A")["B"].sum()
A
1    2
2    3
3    2
Name: B, dtype: int64
>>> df.groupby((df.A != df.A.shift()).cumsum())["B"].sum()
A
1    2
2    1
3    1
4    2
5    1
Name: B, dtype: int64

which seems obvious now, but I remember it not being at all obvious the first time I did it. There's also the "new groups start at delimiters" variant, (df.A == header).cumsum(), sketched below.
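
For example, with 0 standing in for whatever header/delimiter value starts a new block:

>>> s = pd.Series([0, 1, 2, 0, 3], name="A")
>>> header = 0
>>> s.groupby((s == header).cumsum()).sum()
A
1    3
2    3
Name: A, dtype: int64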

Maybe this should be closed in favour of #4059 which seems broader in scope.

Comment From: jreback

ok...do you want to contribute that as a cookbook recipe and in groupby.rst (in the examples section at the end)?...

i'll change this issue to a doc issue then

Comment From: jreback

though...not averse to a partition function as well?
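
(If it helps the discussion: a dict-of-groups partition, mirroring the pure-Python version at the top, could be as thin as the sketch below -- a hypothetical helper, not an existing pandas API, and it still merges discontiguous groups just like groupby does.)

def partition(df, by):
    # hypothetical helper, not part of pandas: map each group label to its sub-frame
    return {label: group for label, group in df.groupby(by)}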

Comment From: shumpohl

I implemented a little helper for this, since I need it quite often and the performance of the cumsum workaround was not sufficient:

from typing import Any, Hashable, Iterator, List, Tuple, Union

import numpy as np
import pandas as pd


def consecutive_groupby(df: pd.DataFrame,
                        columns: Union[Hashable, List[Hashable]]) -> Iterator[Tuple[Any, pd.DataFrame]]:
    """Yield (group value, sub-frame) pairs for consecutive runs of equal values.

    Assumes a non-empty frame.
    """
    group_vals = df[columns]

    # True wherever a row differs from the one before it, i.e. where a new run starts
    splits = np.not_equal(group_vals.values[1:, ...], group_vals.values[:-1, ...])
    if splits.ndim > 1:
        # multiple columns: a new run starts when any of them changes
        splits = splits.any(axis=1)
        def get_group_val(i): return tuple(group_vals.values[i])
    else:
        get_group_val = group_vals.values.__getitem__

    # shift by one because splits[i] compares row i + 1 against row i
    split_idx = np.flatnonzero(splits)
    split_idx += 1

    start_idx = 0
    for idx in split_idx:
        group_val = get_group_val(start_idx)
        yield group_val, df.iloc[start_idx:idx, :]
        start_idx = idx

    # the last run extends to the end of the frame
    group_val = get_group_val(start_idx)
    yield group_val, df.iloc[start_idx:, :]
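
On the example frame from earlier in the thread, this yields the runs (and per-run sums) you'd expect:

>>> df = pd.DataFrame({"A": [1, 1, 2, 3, 2, 2, 3], "B": [1] * 7})
>>> for val, group in consecutive_groupby(df, "A"):
...     print(val, group["B"].sum())
1 2
2 1
3 1
2 2
3 1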