Problem description

I'm looking, without success, for a way to sort the results of pd.Series.str.split.

I was expecting a str method, like the one we have for accessing slices of the results, but it seems that it doesn't exist (I tried sorted, sort and sort_values, and googled).

As a workaround, I'm using apply on the split result, but it is not vectorized, so I guess a vectorized method could be faster.

Unfortunately, I don't see how a vectorized version could support the additional arguments of Python's sorted, especially key and reverse. It would be great to have that possibility as well.

Thanks!

Code Sample

import pandas as pd

###########
# Desired #
###########
pd.Series(["a b:d c"]).str.replace(':', ' ').str.split(' ').str.sorted.str.join(' ')
# ---------------------------------------------------------------------------
# AttributeError                            Traceback (most recent call last)
# <ipython-input-2-6fb06dcfcab2> in <module>()
# ----> 1 pd.Series(["a b:d c"]).str.replace(':', ' ').str.split(' ').str.sorted.str.join(' ')
# 
# AttributeError: 'StringMethods' object has no attribute 'sorted'

##############
# Workaround #
##############
pd.Series(["a b:d c"]).str.replace(':', ' ').str.split(' ').apply(sorted).str.join(' ')
# 0    a b c d
# dtype: object
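
For reference, the apply workaround can already carry the key and reverse arguments of sorted by wrapping the call in a lambda. The following is only a minimal sketch: the second input row and the variable names are made up for illustration, and it is still an element-wise Python loop, not a vectorized operation.

```python
import pandas as pd

# Illustrative data; the second row is an assumption added to show case handling.
s = pd.Series(["a b:d c", "B a:C"])

# Same preprocessing as in the report: normalise the separator, then split.
tokens = s.str.replace(":", " ").str.split(" ")

# Plain ascending sort of each list (uppercase sorts before lowercase).
ascending = tokens.apply(sorted).str.join(" ")

# Forward sorted()'s keyword arguments through a lambda.
case_insensitive_desc = tokens.apply(
    lambda parts: sorted(parts, key=str.lower, reverse=True)
).str.join(" ")

print(ascending.tolist())
print(case_insensitive_desc.tolist())
```

Any callable accepted by sorted as key can be used in the same way.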

Used versions

  • Python 3.6.1 (Anaconda)
  • Pandas 0.20.3

Comment From: gfyoung

@mhooreman , thanks for reporting this! Sorting arraylike-elements within Series is an unusual use-case AFAIK, so I'm not sure whether we would find it worthwhile to implement such a vectorization ourselves unless we get more requests for such a feature OR we can identify a very strong (and more common) use-case for people.

@jreback : your thoughts?

Comment From: mhooreman

@gfyoung Thanks for your comment. I guess it is indeed not common, and I'll use the workaround in the meantime. So no stress: if it can't be done now, that's life. Maybe I'll do it myself when I'm able to :-) Thanks again.

Comment From: jreback

Embedding lists is not very efficient in pandas. You can try something like this, which is roughly equivalent to explode:

In [14]:  pd.Series(["a b:d c"]).str.replace(':', ' ').str.split(' ', expand=True).stack().sort_values()
Out[14]: 
0  0    a
   1    b
   3    c
   2    d
dtype: object

I don't think any machinery for sorting in-a-list will be supported. You can convert to a DataFrame/Series and simply use the vectorized routines.
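
To get back to one sorted string per original row, the long-format result can be regrouped by the original index. The sketch below swaps in Series.explode for split(expand=True).stack(); that is an assumption about the environment, since explode needs pandas 0.25 or later, which is newer than the 0.20.3 reported in this issue. The second input row and the variable names are illustrative only.

```python
import pandas as pd

# Illustrative data; the second row is an assumption added for the example.
s = pd.Series(["a b:d c", "z:y x"])

# One token per row; the original row label is repeated in the index.
# Requires pandas >= 0.25 for Series.explode.
long = s.str.replace(":", " ").str.split(" ").explode()

# Sort all tokens with the vectorized routine, then rebuild one string per
# original row. groupby keeps the sorted order of the rows inside each group.
result = long.sort_values().groupby(level=0).agg(" ".join)

# Descending order only needs a different sort_values argument.
result_desc = long.sort_values(ascending=False).groupby(level=0).agg(" ".join)

print(result.tolist())
print(result_desc.tolist())
```

The final join is still a Python-level call per group, so whether this beats apply(sorted) depends on the data and should be measured.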

Comment From: mhooreman

Thanks @jreback . I was stupidly not thinking about the stack.