http://stackoverflow.com/questions/17315737/split-a-large-pandas-dataframe
related #414
Comment From: jreback
related #3066
Comment From: cpcloud
groupby in the backend?
In [5]: df = DataFrame(randn(10,10))
In [6]: gb = df.groupby(lambda x: x < 5, axis=0)
In [7]: [v for _, v in gb]
Out[7]:
[ 0 1 2 3 4 5 6 7 8 9
5 -0.047 0.813 -0.253 -1.443 -0.683 0.116 -0.155 0.159 0.359 0.497
6 -1.626 0.496 1.572 -1.056 0.579 0.312 -1.139 1.367 -0.158 1.679
7 -0.029 0.541 1.299 0.513 -0.562 0.489 0.408 -0.305 0.824 -0.200
8 0.318 -0.764 1.497 -1.704 -0.540 1.045 0.143 -0.457 -2.026 -0.795
9 -0.082 -1.585 0.623 0.251 -0.528 -0.270 0.874 -1.674 -0.711 -0.110,
0 1 2 3 4 5 6 7 8 9
0 -0.736 0.413 0.837 -1.141 -0.112 1.974 -0.861 -0.795 0.487 1.169
1 -1.150 0.914 -0.847 -0.009 1.028 -1.988 -1.140 -0.515 0.080 0.094
2 -1.013 0.546 -0.603 0.874 1.123 0.950 0.710 -2.143 -1.726 -1.555
3 -0.824 -0.051 -1.438 -0.821 -0.541 -0.851 -0.135 -0.331 -1.607 -0.250
4 -1.309 -0.197 -0.042 0.909 0.695 0.364 0.364 0.860 -1.074 1.805]
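A self-contained version of the session above (the transcript omits imports; `pd`/`np` aliases are assumed). Note that with `sort=True` (the default) the `False` group comes first, which is why rows 5-9 appear first in `Out[7]`:

```python
import numpy as np
import pandas as pd

# Split a DataFrame in two on a boolean key computed from the index:
# rows with index < 5 vs. the rest (groupby on axis 0 is the default).
df = pd.DataFrame(np.random.randn(10, 10))
gb = df.groupby(lambda x: x < 5)
parts = [v for _, v in gb]

# Keys sort as False < True, so parts[0] holds rows 5-9 and parts[1] rows 0-4.
print([len(p) for p in parts])  # [5, 5]
```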
Comment From: ghost
In retrospect, #3066 actually points out two missing operations from the API: split_by and partition.
>>> [1 1 2 2 1 1].groupby(identity)
[(1,1,1,1) (2,2)]
>>> [1 1 2 2 1 1].partition(identity)
[(1,1) (2,2) (1,1)]
>>> [1 1 2 2 1 1].split_by(is_2)
[(1 1 2) (2) (1 1)]
partition and split_by can be thought of as the same operation with edge exclusive/inclusive semantics respectively.
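The pseudocode above translates to a short sketch in plain Python (`partition`, `split_by`, and `is_2` are the hypothetical names from the examples, not pandas API): `partition` is contiguous-run grouping, which `itertools.groupby` already does, while `split_by` keeps the boundary element at the end of its group.

```python
from itertools import groupby

def partition(seq, keyfunc):
    # Contiguous runs with equal key: "edge exclusive" splitting.
    return [tuple(run) for _, run in groupby(seq, keyfunc)]

def split_by(seq, pred):
    # Close the current group after each element matching pred:
    # "edge inclusive" splitting, the boundary element stays in its group.
    out, cur = [], []
    for x in seq:
        cur.append(x)
        if pred(x):
            out.append(tuple(cur))
            cur = []
    if cur:
        out.append(tuple(cur))
    return out

data = [1, 1, 2, 2, 1, 1]
print(partition(data, lambda x: x))      # [(1, 1), (2, 2), (1, 1)]
print(split_by(data, lambda x: x == 2))  # [(1, 1, 2), (2,), (1, 1)]
```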
This should probably return a groupby-like object rather than a collection of frames directly, as the SO question wanted; it's easy to recover the frames from that. A plain map won't do, though, since keys may not be unique, so the container class would just provide the per-group operations.
Update:
>>> [1 2 3 4 5].partition(3, 2)
[(1, 2, 3) (3, 4, 5)]
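This windowed variant of `partition(size, step)` (the name and signature follow the pseudocode above, they are not pandas API) is a one-liner over slices; a step smaller than the window size yields overlapping groups:

```python
def partition(seq, size, step=None):
    # Fixed-size windows advancing by `step`; step < size gives overlap.
    step = size if step is None else step
    return [tuple(seq[i:i + size])
            for i in range(0, len(seq) - size + 1, step)]

print(partition([1, 2, 3, 4, 5], 3, 2))  # [(1, 2, 3), (3, 4, 5)]
```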
related https://github.com/pydata/pandas/issues/5494, https://github.com/pydata/pandas/issues/936
Comment From: TomAugspurger
Another example where y-p's split_at could be useful. In that case something like df.split_at(pd.isnull) would partition into the contiguous groups of valid points. From there it would be .apply(lambda x: [x.head(1)['high'], x.tail(10)['low']]) or something like that.
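Absent a split_at method, the contiguous valid runs can be recovered today with the cumsum-of-nulls idiom. A hedged sketch (the 'high'/'low' column names follow the comment above; the data is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "high": [3.0, 4.0, np.nan, 6.0, 7.0],
    "low":  [1.0, 2.0, np.nan, 4.0, 5.0],
})

# Label each contiguous run of valid rows by the running count of NaNs
# seen so far, then group on that label, dropping the NaN rows themselves.
mask = df["high"].isna()
run_id = mask.cumsum()
parts = [g for _, g in df[~mask].groupby(run_id[~mask])]

# Per-group summary akin to the .apply above: first high, last low of each run.
summary = [(g["high"].iloc[0], g["low"].iloc[-1]) for g in parts]
print(summary)  # [(3.0, 2.0), (6.0, 5.0)]
```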
Comment From: ghost
I think the groupby idiom can be usefully generalized to support different types of partitioning/splitting semantics.
One kink is that, in general, group keys may not be distinct (result keys may look like [1 2 1]). That's not a problem for the apply step, which iterates over all the groups anyway, but we'd have to break away from groupby's dict mechanism in favor of an ordered list of groups and a multiset mapping keys to positions in the group list.
The different kinds of split/partition/group semantics possible, such as inclusive/exclusive splitting, may require a keyfunc that consumes a pair of (or n) rows (examples: split when delta_foo > 0.3, or split when delta_moving_average(nwin) > 0.2), and I haven't come up with a good way to do that without getting baroque.
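The pairwise-keyfunc case can at least be sketched for plain sequences (function name and threshold semantics are hypothetical, following the delta_foo example above): start a new group whenever the jump between consecutive elements exceeds a threshold.

```python
def split_on_delta(seq, threshold):
    # Pairwise keyfunc sketch: a new group begins wherever
    # |x[i+1] - x[i]| exceeds `threshold`.
    if not seq:
        return []
    out, cur = [], [seq[0]]
    for prev, x in zip(seq, seq[1:]):
        if abs(x - prev) > threshold:
            out.append(tuple(cur))
            cur = []
        cur.append(x)
    out.append(tuple(cur))
    return out

print(split_on_delta([1.0, 1.1, 1.2, 2.0, 2.1, 5.0], 0.3))
# [(1.0, 1.1, 1.2), (2.0, 2.1), (5.0,)]
```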
Allowing overlapping groups is another twist.
Should trim fluff features before attempting implementation.
Comment From: MarcoGorelli
Closing as there's been no activity in about a decade. If there's a need for this feature, I presume someone will comment or open a new issue (though at this point, in 2023, I doubt it would be accepted).