http://stackoverflow.com/questions/17315737/split-a-large-pandas-dataframe

related #414

Comment From: jreback

related #3066

Comment From: cpcloud

groupby in the backend?

In [4]: from pandas import DataFrame; from numpy.random import randn

In [5]: df = DataFrame(randn(10, 10))

In [6]: gb = df.groupby(lambda x: x < 5, axis=0)  # group rows by whether their label is < 5

In [7]: [v for _, v in gb]
Out[7]:
[       0      1      2      3      4      5      6      7      8      9
5 -0.047  0.813 -0.253 -1.443 -0.683  0.116 -0.155  0.159  0.359  0.497
6 -1.626  0.496  1.572 -1.056  0.579  0.312 -1.139  1.367 -0.158  1.679
7 -0.029  0.541  1.299  0.513 -0.562  0.489  0.408 -0.305  0.824 -0.200
8  0.318 -0.764  1.497 -1.704 -0.540  1.045  0.143 -0.457 -2.026 -0.795
9 -0.082 -1.585  0.623  0.251 -0.528 -0.270  0.874 -1.674 -0.711 -0.110,
        0      1      2      3      4      5      6      7      8      9
0 -0.736  0.413  0.837 -1.141 -0.112  1.974 -0.861 -0.795  0.487  1.169
1 -1.150  0.914 -0.847 -0.009  1.028 -1.988 -1.140 -0.515  0.080  0.094
2 -1.013  0.546 -0.603  0.874  1.123  0.950  0.710 -2.143 -1.726 -1.555
3 -0.824 -0.051 -1.438 -0.821 -0.541 -0.851 -0.135 -0.331 -1.607 -0.250
4 -1.309 -0.197 -0.042  0.909  0.695  0.364  0.364  0.860 -1.074  1.805]
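
The same idiom covers the fixed-size chunking the linked SO question asks for; a minimal sketch with explicit imports, grouping rows on their floor-divided positions:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 10))

# chunk rows into frames of 5 each by grouping on position // chunk_size
chunks = [g for _, g in df.groupby(np.arange(len(df)) // 5)]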

Comment From: ghost

In retrospect, #3066 actually points out two operations missing from the API: split_by and partition.

>>> [1 1 2 2 1 1].groupby(identity)
[(1 1 1 1) (2 2)]
>>> [1 1 2 2 1 1].partition(identity)
[(1 1) (2 2) (1 1)]
>>> [1 1 2 2 1 1].split_by(is_2)
[(1 1 2) (2) (1 1)]

partition and split_by can be thought of as the same operation with exclusive and inclusive edge semantics, respectively: partition starts a new group at a boundary element, while split_by keeps the boundary element in the group it closes.
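
A rough sketch of the two semantics on plain Python lists (partition, split_by, and is_2 are hypothetical names from the pseudocode above, not existing API):

from itertools import groupby

def partition(seq, keyfunc):
    # exclusive edges: a key change starts a new group
    return [tuple(g) for _, g in groupby(seq, keyfunc)]

def split_by(seq, pred):
    # inclusive edges: the element that triggers the split
    # closes the group it belongs to
    out, cur = [], []
    for x in seq:
        cur.append(x)
        if pred(x):
            out.append(tuple(cur))
            cur = []
    if cur:
        out.append(tuple(cur))
    return out

is_2 = lambda x: x == 2

partition([1, 1, 2, 2, 1, 1], lambda x: x)  # [(1, 1), (2, 2), (1, 1)]
split_by([1, 1, 2, 2, 1, 1], is_2)          # [(1, 1, 2), (2,), (1, 1)]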

Should probably return a groupby-like object rather than the collection of frames the SO question asked for directly; it's easy to recover the frames from that. Though a map won't do, since keys may not be unique; just the per-group operations provided by the container class.

Update:

>>> [1 2 3 4 5].partition(3, 2)
[(1 2 3) (3 4 5)]
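
A sketch of that overloaded signature, the same hypothetical name now taking (size, step) and yielding overlapping fixed-width windows:

def partition(seq, size, step):
    # start a new window every `step` elements; keep only full windows
    return [tuple(seq[i:i + size])
            for i in range(0, len(seq) - size + 1, step)]

partition([1, 2, 3, 4, 5], 3, 2)  # [(1, 2, 3), (3, 4, 5)]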

related https://github.com/pydata/pandas/issues/5494, https://github.com/pydata/pandas/issues/936

Comment From: TomAugspurger

Another example where y-p's split_at could be useful. In that case something like df.split_at(pd.isnull) would partition into the contiguous groups of valid points. From there it would be .apply(lambda x: [x.head(1)['high'], x.tail(10)['low']]) or something like that.
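
split_at doesn't exist, but that particular case can already be emulated with groupby by counting the nulls seen so far; a sketch on made-up data:

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, np.nan, 8.0])

# each null bumps the group id; the nulls themselves are dropped
group_id = s.isnull().cumsum()
mask = s.notnull()
runs = [g for _, g in s[mask].groupby(group_id[mask])]
# runs: the contiguous valid stretches [1, 2], [4, 5], [8]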

Comment From: ghost

I think the groupby idiom can be usefully generalized to support different types of partitioning/splitting semantics.

One kink is that, in general, group keys may not be distinct (result keys may look like [1 2 1]). That's not a problem for the apply step, which iterates over all the groups anyway. But we'll have to break away from groupby's dict mechanism in favor of an ordered list of groups and a multiset mapping keys to positions in the group list.
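
A sketch of that container shape (just the data layout, nothing here is existing API): an ordered list of (key, group) pairs plus a key-to-positions multimap:

from collections import defaultdict

# ordered groups whose keys repeat, e.g. keys [1, 2, 1]
groups = [(1, ["a", "b"]), (2, ["c"]), (1, ["d"])]

positions = defaultdict(list)
for i, (key, _) in enumerate(groups):
    positions[key].append(i)
# positions: {1: [0, 2], 2: [1]} -- keys map to one or more slots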

The different kinds of split/partition/group semantics possible, such as inclusive/exclusive splitting, may require a keyfunc that consumes a pair of (or n) rows (examples: split when delta_foo > 0.3, or split on delta_moving_average(nwin) > 0.2), and I haven't come up with a good way to do that without getting baroque.
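
One way to express a pairwise keyfunc with existing primitives, as a sketch (the 0.3 threshold comes from the delta_foo example above; the data is made up):

import pandas as pd

s = pd.Series([1.00, 1.05, 1.10, 1.60, 1.62, 2.10, 2.15])

# compare each row to its predecessor; a jump over 0.3 starts a new group
breaks = s.diff().abs() > 0.3
groups = [g for _, g in s.groupby(breaks.cumsum())]
# groups: [1.00, 1.05, 1.10], [1.60, 1.62], [2.10, 2.15]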

Allowing overlapping groups is another twist.

Should trim fluff features before attempting implementation.

Comment From: MarcoGorelli

closing as there's been no activity in about a decade. If there's a need for this feature, I presume someone will comment or open a new issue (though at this point, in 2023, I doubt it would be accepted)