related #2802
It seems that str[1] is significantly slower than .apply(lambda x: x[1])
See this SO answer: http://stackoverflow.com/a/18473330/1240268
Comment From: cpcloud
Couple of reasons:
1. the mapped function is actually lambda x: x[i] if len(x) > i else np.nan
2. isnull is called to compute a mask for mapping over in lib.map_infer_mask (which is in inference.pyx)
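The two reasons above can be sketched in plain Python. This is only an emulation of the behaviour described, not the pandas source: `get_with_guard` and `get_plain` are hypothetical names, and `math.nan` stands in for `np.nan` so the snippet has no dependencies.

```python
import math


def get_with_guard(values, i):
    """Emulate a null-aware, length-guarded lookup (roughly what str[i] does)."""
    # Reason 2: a null mask is computed first (stand-in for isnull).
    mask = [v is None or (isinstance(v, float) and math.isnan(v)) for v in values]

    # Reason 1: the mapped function guards against strings that are too short.
    def f(x):
        return x[i] if len(x) > i else math.nan

    # Only non-null entries are passed to the mapped function.
    return [math.nan if is_null else f(v) for v, is_null in zip(values, mask)]


def get_plain(values, i):
    """Emulate .apply(lambda x: x[i]): no mask and no length guard."""
    return [v[i] for v in values]


data = ["abc", "d", None, "xyz"]
result = get_with_guard(data, 1)
```

The guarded version does extra work per element but tolerates missing and short values; the plain version is cheaper but raises on `"d"` (too short) or `None`.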
looks like the perf hit is about 2x, might be able to squash that by moving string methods to cython
Comment From: hayd
Ah, that'll do it (I guess it's only sometimes apply doesn't care about errors?).
maybe cythonizing these is the way forward, I guess even with object dtype you get some perf improvement.
Comment From: jreback
these methods could be much faster (there is an issue out there about this) if you basically push everything to use native c calls (eg stuff like strcmp and such) or maybe add a nice c library in the mix
just cythonizing doesn't help much
but this would be a bit of work
Comment From: cpcloud
wonder if this is worth looking into: http://bstring.sourceforge.net/
Comment From: jtratner
If you're going to c level, better to use a c library that handles strings / unicode for you so we don't have to worry as much about all the gotchas with c strings.
Comment From: cpcloud
darn bstring doesn't support unicode
Comment From: cpcloud
Converting these functions to C without breakage is going to be very difficult. You'll probably have to use ICU and have a compatibility layer between Cython (PyICU might make this a bit easier) and ICU.
We definitely cannot use the C standard library string functions since they don't handle Unicode.
Comment From: brandon-rhodes
Is anyone working on this currently? Would I be duplicating effort if I were to look into possible quick wins for at least getting these slow str routines a bit faster?
Comment From: jreback
@brandon-rhodes don't think so. would be great!
prob DO need some asv benchmarks for these.
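A minimal sketch of what such a benchmark might look like, assuming the usual asv convention of a class with a `setup` method and `time_`-prefixed methods. The class name, data size, and string contents are made up for illustration and are not taken from the pandas benchmark suite.

```python
import pandas as pd


class StrGetVersusApply:
    # asv calls setup() before timing each time_* method.
    def setup(self):
        # Hypothetical data: 100_000 strings of length 2 to 9,
        # so index 1 is always valid for both approaches.
        self.s = pd.Series(["a" * (n % 8 + 2) for n in range(100_000)])

    def time_str_getitem(self):
        # The accessor path discussed in this issue.
        self.s.str[1]

    def time_apply_lambda(self):
        # The unguarded alternative it is being compared against.
        self.s.apply(lambda x: x[1])
```

Keeping every string at least two characters long means both methods return the same values, so the timing difference reflects overhead rather than different handling of short strings.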
Comment From: 3vts
Is this still an ongoing effort? I would like to give it a try
Comment From: brandon-rhodes
@3vts Feel free to give it a try! I did not wind up with time to make progress on it, and my guess is that the project I was on that needed the extra performance found a workaround. To be honest, I had, alas, forgotten all about it in the intervening years.
Comment From: mroeschke
This is fairly fast with the new pyarrow string type, which has a lot of benefits over the string object implementation, so closing for now. Can reopen if there are specific hotspots that are addressable.