Problem description
I have been searching without success for a way to sort the results of pd.Series.str.split.
I was expecting a str method, like the one we have for accessing slices of the results, but it seems that none exists (I tried sorted, sort, and sort_values, and googled).
As a workaround, I'm using apply on the split result, but it is not vectorized, so I guess it could be made faster with vectorized methods.
Unfortunately, I don't see how the additional arguments of Python's sorted, especially key and reverse, could be carried over to such a vectorized version. It would be great to have that possibility as well.
Thanks!
Code Sample
import pandas as pd
###########
# Desired #
###########
pd.Series(["a b:d c"]).str.replace(':', ' ').str.split(' ').str.sorted.str.join(' ')
# ---------------------------------------------------------------------------
# AttributeError Traceback (most recent call last)
# <ipython-input-2-6fb06dcfcab2> in <module>()
# ----> 1 pd.Series(["a b:d c"]).str.replace(':', ' ').str.split(' ').str.sorted.str.join(' ')
#
# AttributeError: 'StringMethods' object has no attribute 'sorted'
##############
# Workaround #
##############
pd.Series(["a b:d c"]).str.replace(':', ' ').str.split(' ').apply(sorted).str.join(' ')
# 0 a b c d
# dtype: object
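For reference, the apply workaround can also carry the key and reverse arguments of sorted by wrapping the call in a lambda. This is only a minimal sketch of the non-vectorized approach, not a proposed pandas API; the sample data and the variable names (tokens, descending, case_insensitive) are made up for illustration.

```python
import pandas as pd

# Illustrative data only; the second row mixes upper and lower case.
s = pd.Series(["a b:d c", "B a:C"])

# Same preprocessing as the workaround: normalise separators, then split.
tokens = s.str.replace(":", " ").str.split(" ")

# reverse=True: sort each list in descending order.
descending = tokens.apply(lambda parts: sorted(parts, reverse=True)).str.join(" ")
print(descending.tolist())        # ['d c b a', 'a C B']

# key=str.lower: case-insensitive sort of each list.
case_insensitive = tokens.apply(lambda parts: sorted(parts, key=str.lower)).str.join(" ")
print(case_insensitive.tolist())  # ['a b c d', 'a B C']
```

This still runs one Python-level sorted call per row, so it does not address the performance concern.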
Used versions
- Python 3.6.1 (Anaconda)
- Pandas 0.20.3
Comment From: gfyoung
@mhooreman, thanks for reporting this! Sorting array-like elements within a Series is an unusual use case AFAIK, so I'm not sure whether we would find it worthwhile to implement such a vectorization ourselves unless we get more requests for such a feature OR we can identify a very strong (and more common) use case for people.
@jreback : your thoughts?
Comment From: mhooreman
@gfyoung Thanks for your comment. I guess that it is indeed not common, and I'll use the workaround in the meantime. So, no stress, if it can't be done now, that's life. Maybe I'll do it when I'm able to :-) Thanks again.
Comment From: jreback
Embedding lists is not very efficient in pandas. You can try something like this, which is roughly equivalent to explode.
In [14]: pd.Series(["a b:d c"]).str.replace(':', ' ').str.split(' ', expand=True).stack().sort_values()
Out[14]:
0  0    a
   1    b
   3    c
   2    d
dtype: object
I don't think any machinery for sorting in-a-list will be supported. You can convert to a DataFrame/Series and simply use the vectorized routines.
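To get back to one sorted, joined string per original row, the stacked result can be regrouped on the outer index level. The following is a rough sketch of that idea rather than an official recipe; the two-row sample data and the names tokens and result are assumptions for illustration, and the dropna() call is only there to discard the padding that expand=True adds when rows have different numbers of tokens.

```python
import pandas as pd

# Illustrative data only; rows have different token counts on purpose.
s = pd.Series(["a b:d c", "z y:x"])

# One token per row, indexed by (original row, token position).
tokens = (
    s.str.replace(":", " ")
     .str.split(" ", expand=True)
     .stack()
     .dropna()  # drop padding from shorter rows, if any survived stack()
)

# Sort all tokens at once, then rejoin them per original row.
# groupby keeps the sorted order of the rows inside each group.
result = tokens.sort_values().groupby(level=0).agg(" ".join)
print(result.tolist())  # ['a b c d', 'x y z']
```

On pandas 0.25 or later, Series.explode applied to the un-expanded split result gives a similar long format with the original index repeated.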
Comment From: mhooreman
Thanks @jreback. I stupidly hadn't thought of stack.