related #2802

It seems that .str[1] is significantly slower than .apply(lambda x: x[1])

See this SO answer: http://stackoverflow.com/a/18473330/1240268

Comment From: cpcloud

Couple of reasons:

1. the mapped function is actually lambda x: x[i] if len(x) > i else np.nan
2. isnull is called to compute a mask for mapping over in lib.map_infer_mask (which is in inference.pyx)
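A rough sketch of what those two points mean in practice. The helper name guarded_get and the list-comprehension emulation are illustrative assumptions, not the actual pandas internals; the point is only that .str[1] does a length check and a null mask per element, while a plain apply does neither and simply raises on short or missing values.

```python
import numpy as np
import pandas as pd


def guarded_get(x, i=1):
    # Hypothetical stand-in for the guarded function described above:
    # return NaN instead of raising when the string is too short.
    return x[i] if len(x) > i else np.nan


s = pd.Series(["ab", "c", None], dtype=object)

# The string accessor tolerates both the short string and the missing value.
via_str = s.str[1]

# Emulate the two steps by hand: compute a null mask first,
# then map the guarded function over the non-null entries only.
mask = s.isna()
emulated = pd.Series(
    [np.nan if m else guarded_get(v) for v, m in zip(s, mask)],
    index=s.index,
    dtype=object,
)

# The plain apply has no guard and no mask, so it fails on this data.
try:
    s.apply(lambda x: x[1])
    plain_apply_raised = False
except (IndexError, TypeError):
    plain_apply_raised = True
```

So the two spellings are only interchangeable when every value is a string that is long enough, which is the case where the extra work in .str[1] buys nothing.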

looks like the perf hit is about 2x, might be able to squash that by moving string methods to cython

Comment From: hayd

Ah, that'll do it (I guess it's only sometimes apply doesn't care about errors?).

maybe cythonizing these is the way forward, I guess even with object dtype you get some perf improvement.

Comment From: jreback

these methods could be much faster (there is an issue out there about this) if you basically push everything to use native c calls (eg stuff like strcmp and such) or maybe add a nice c library in the mix

just cythonizing doesn't help much

but this would be a bit of work

Comment From: cpcloud

wonder if this is worth looking into: http://bstring.sourceforge.net/

Comment From: jtratner

If you're going to c level, better to use a c library that handles strings / unicode for you so we don't have to worry as much about all the gotchas with c strings.

Comment From: cpcloud

darn, bstring doesn't support Unicode

Comment From: cpcloud

Converting these functions to C without breakage is going to be very difficult. You'll probably have to use ICU and have a compatibility layer between Cython (PyICU might make this a bit easier) and ICU.

We definitely cannot use the C standard library string functions since they don't handle Unicode.

Comment From: brandon-rhodes

Is anyone working on this currently? Would I be duplicating effort if I were to look into possible quick wins for at least getting these slow str routines a bit faster?

Comment From: jreback

@brandon-rhodes don't think so. would be great!

prob DO need some asv benchmarks for these.
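For reference, a minimal sketch of what such a benchmark might look like. The class name, data, and sizes are made up for illustration; the only asv conventions assumed are that setup() runs before each timed method and that methods prefixed with time_ are timed.

```python
import pandas as pd


class StrGetItem:
    # Hypothetical asv-style benchmark class; asv discovers methods
    # whose names start with "time_" and calls setup() beforehand.

    def setup(self):
        # Every string has at least two characters, so the unguarded
        # apply below does not raise.
        self.s = pd.Series(["abc", "de", "fghi"] * 10_000, dtype=object)

    def time_str_getitem(self):
        self.s.str[1]

    def time_apply_getitem(self):
        self.s.apply(lambda x: x[1])
```

Having both spellings side by side would make it easy to see whether any change actually narrows the gap.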

Comment From: 3vts

Is this still an ongoing effort? I would like to give it a try

Comment From: brandon-rhodes

@3vts Feel free to give it a try! I did not wind up with time to make progress on it, and my guess is that the project I was on that needed the extra performance found a workaround. To be honest, I had, alas, forgotten all about it in the intervening years.

Comment From: mroeschke

This is fairly fast with the new pyarrow string type, which has a lot of benefits over the string object implementation, so closing for now. Can reopen if there are specific hotspots that are addressable.