Hello. When using the mighty `groupby` function, I found myself quite disappointed by something: it is very useful for operations, but one of the most basic ones is not mathematical. It is about getting the values associated with a column (or set of columns) per group.
Maybe I missed something about the split-apply-combine strategy*, but I don't see how to simply retrieve the contents of some rows for all or part of the groups. `get_group()` does this beautifully for one group but not for many (I tried passing sets and lists, to no effect). I can easily come up with something like:

```python
groups_dict = {k: list(grouped.get_group(k).loc[:, idcolumn])
               for k in grouped.groups.keys()}
```
But I guess this is not computationally efficient, and it has to be repeated for every column you want to aggregate.
Is there a better method, or would it be relevant to have a setting that outputs the actual values instead of an index?
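(For the record, one way to get the actual values per group in a single pass is `agg(list)`; a minimal sketch with a toy frame, not the exact frame from this issue:)

```python
import pandas as pd

# Toy frame: collect the values of "val" per "key" group.
df = pd.DataFrame({"key": ["x", "x", "y"], "val": ["a", "b", "c"]})

# agg(list) gathers the actual values per group in one vectorised pass,
# with no Python-level loop over get_group().
per_group = df.groupby("key")["val"].agg(list)
print(per_group.to_dict())  # {'x': ['a', 'b'], 'y': ['c']}
```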
I actually came up with another solution (`agg_df` being the original dataframe; you may need to reset the df index in case you used some exploding strategy):

```python
groups_dict2 = {k: list(agg_df2.loc[v, 'n']) for k, v in grouped.indices.items()}
```
It feels slower, but maybe it is more error-proof?
Anyway, a function using all the hidden machinery in Pandas would be far better, I guess.
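(One caveat with that variant: `.groups` maps each label to *index labels*, while `.indices` maps it to *integer positions*, so they pair with `.loc` and `.iloc` respectively. A small sketch with made-up data showing the two access patterns:)

```python
import pandas as pd

df = pd.DataFrame({"key": ["x", "y", "x"], "n": ["n0", "n1", "n2"]},
                  index=[5, 6, 7])
g = df.groupby("key")

# .groups maps each group label to *index labels* (pair with .loc),
# while .indices maps it to *integer positions* (pair with .iloc).
by_label = {k: list(df.loc[v, "n"]) for k, v in g.groups.items()}
by_position = {k: list(df["n"].iloc[v]) for k, v in g.indices.items()}
print(by_label)  # {'x': ['n0', 'n2'], 'y': ['n1']}
```

With a freshly reset `RangeIndex` the two coincide, which is why the `.loc`-on-`.indices` version happens to work after a reset.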
Best Regards
* (I am sorry, I do try my best reading the docs, but Pandas is definitely not trivial to handle: the docs are huge, and when you think a problem's solution might be found in one section or function domain, it can actually be something hugely different.)
Comment From: jreback
pls show an explicit example with code of your input and output. not really sure what you are after.
Comment From: BLMeltdown
Well, that's part of a longer question in two parts.
Let's say my df is:
```
index  "A"   "B"
0      A1    "B1,B2,B3"
1      A2    "B2,B4,B3"
2      A3    "B2,B3,B5"
```
and I want to do `magical_function(df)` to get:
```
index  "B'"  "A''"
0      B1    "A1"
1      B2    "A1,A2,A3"
2      B3    "A1,A2,A3"
3      B4    "A2"
4      B5    "A3"
```
So I used an exploding strategy: I thus `dropna` first to avoid mistakes, then I make a Series from the column to split, explode and stack it, and then the join magic with the same index duplicates the "A" column's values where needed:

```python
idcolumn = "A"
col = "B"
current_wdf = df[[idcolumn, col]].dropna()
current_col = current_wdf.loc[:, col]
# Much slower, but keeps the index. I could have used substitution
# with enumerate after dropping the level.
exploded_df = current_col.str.split(',').apply(pd.Series, 1).stack()
exploded_df.index = exploded_df.index.droplevel(-1)
exploded_df.name = col
agg_df = pd.DataFrame(current_wdf.loc[:, idcolumn]).join(exploded_df)
grouped = agg_df.groupby([col])
```
After which I have:

```
index  "B'"  "A''"
0      B1    "A1"
1      B2    "A1"
1      B2    "A2"
1      B2    "A3"
2      B3    "A1"
2      B3    "A2"
2      B3    "A3"
3      B4    "A2"
4      B5    "A3"
```
Then `grouped.groups` gives a dict:

```python
{'B1': Int64Index([0], dtype='int64'),
 'B2': Int64Index([1, 1, 1], dtype='int64'),
 'B3': Int64Index([2, 2, 2], dtype='int64'),
 'B4': Int64Index([3], dtype='int64'),
 'B5': Int64Index([4], dtype='int64')}
```
To get the dataframe I desire, I need to get past the limitation of `groups` only exposing the indexes, by doing:

```python
groups_dict = {k: list(grouped.get_group(k).loc[:, idcolumn])
               for k in grouped.groups}
```
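(A loop-free variant of that last step: iterating a `GroupBy` directly yields `(label, sub-frame)` pairs, so no per-key `get_group()` call is needed. A sketch on a frame mirroring the exploded one above, with the column names assumed:)

```python
import pandas as pd

# Mirrors the exploded frame from the thread (column names assumed).
agg_df = pd.DataFrame({"A": ["A1", "A1", "A2", "A3"],
                       "B": ["B1", "B2", "B2", "B2"]})
grouped = agg_df.groupby("B")

# Iterating a GroupBy yields (label, sub-frame) pairs directly,
# so there is no need to call get_group() once per key.
groups_dict = {k: list(sub["A"]) for k, sub in grouped}
print(groups_dict)  # {'B1': ['A1'], 'B2': ['A1', 'A2', 'A3']}
```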
And I finally have the dataframe.
But that was less than trivial, and I am dubious about the last part: it works, but it is slow and could easily break.
Is there no operation for such a matching-reversal process? And for retrieving group contents, as I showed, is there any way to grab the actual values instead of the indexes without having to call `get_group` repeatedly?
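(For what it's worth, in newer pandas (0.25+) the whole reversal above can be sketched with the built-in `DataFrame.explode`; the frame below mirrors the example from this thread:)

```python
import pandas as pd

df = pd.DataFrame({"A": ["A1", "A2", "A3"],
                   "B": ["B1,B2,B3", "B2,B4,B3", "B2,B3,B5"]})

# Split "B" into lists, explode to one row per (A, B) pair,
# then group by B and re-join the matching A values.
out = (df.assign(B=df["B"].str.split(","))
         .explode("B")
         .groupby("B")["A"]
         .agg(",".join)
         .reset_index())
print(out)
```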
Comment From: jreback
better to ask this on Stack Overflow.
That should be your first stop for usage-type questions. If it's deemed a bug/enhancement then certainly open an issue here.
Comment From: BLMeltdown
Ok, sorry