Not a night-and-day improvement, since all we're doing is removing some Python overhead, but there does seem to be 2x+ performance to be picked up. We could possibly use some of the template machinery to make these easy to write (see the sketch after the timings below).

I wouldn't consider this high priority given the long-term plans to replace the string dtype, but it could be worth it.
```python
%load_ext cython

import numpy as np
import pandas as pd

s = pd.Series(np.random.choice(['aaaaaaaaaa', 'bbbbbbbb', 'ccccc',
                                'dddd'], size=20000).astype('O'))
```
```python
%%cython
import numpy as np
from numpy cimport ndarray

def fast_upper(ndarray values):
    cdef:
        Py_ssize_t i, n = values.shape[0]
        ndarray output = np.empty_like(values)
        str val
    for i in range(n):
        # no per-element NA check here, unlike s.str.upper()
        val = values[i]
        output[i] = val.upper()
    return output
```
```python
%timeit s.str.upper()
100 loops, best of 3: 4.94 ms per loop

%timeit pd.Series(fast_upper(s.values), index=s.index)
100 loops, best of 3: 2.02 ms per loop
```
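As a rough illustration of the "template machinery" idea mentioned above, a small Python factory could stamp out one such loop per string method. This is a hypothetical sketch (`make_fast_str_method` is not an existing pandas helper), not pandas' actual template machinery:

```python
import numpy as np

def make_fast_str_method(method_name):
    # hypothetical helper: build a plain element-wise looper for any
    # str method, with no overhead beyond the method call itself
    def fast_method(values):
        out = np.empty(len(values), dtype=object)
        for i in range(len(values)):
            out[i] = getattr(values[i], method_name)()
        return out
    return fast_method

fast_lower = make_fast_str_method('lower')
fast_strip = make_fast_str_method('strip')
```

A Cython version could presumably be generated the same way from a string template, which is what the template machinery would buy.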
**Comment From: jreback**
You can actually get even better perf by using C functions and maybe even releasing the GIL (though this is a bit trickier code).
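A minimal sketch of that suggestion, assuming ASCII-only data: copy each string into a mutable byte buffer and uppercase it with the C `toupper()` while the GIL is released. `fast_upper_ascii` is a hypothetical name, and for short strings the per-element GIL round-trip may cost more than it saves, so treat this as illustrative only:

```python
%%cython
import numpy as np
cimport numpy as cnp

cdef extern from "ctype.h":
    int toupper(int c) nogil

def fast_upper_ascii(cnp.ndarray values):
    cdef:
        Py_ssize_t i, j, n = values.shape[0]
        cnp.ndarray output = np.empty_like(values)
        bytearray ba
        char* buf
        Py_ssize_t length
    for i in range(n):
        # encoding/decoding still needs the GIL; only the byte loop is pure C
        ba = bytearray(values[i].encode('ascii'))
        length = len(ba)
        buf = ba
        with nogil:
            for j in range(length):
                buf[j] = <char>toupper(buf[j])
        output[i] = ba.decode('ascii')
    return output
```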
**Comment From: jreback**
xref to #4694
**Comment From: chris-b1**
Yeah, it looks like the Cythonization isn't really what's helping in my example; it's the avoidance of NA checks.
```python
In [27]: %timeit pd.Series([x.upper() for x in s], index=s.index)
100 loops, best of 3: 2.74 ms per loop
```
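For comparison, a simplified sketch of the extra per-element work the accessor does (`upper_with_na_check` is illustrative, not the actual pandas internals):

```python
import numpy as np
import pandas as pd

def upper_with_na_check(values):
    out = np.empty(len(values), dtype=object)
    for i, v in enumerate(values):
        # one NA test per element is the overhead the bare
        # list comprehension above skips
        out[i] = v.upper() if not pd.isna(v) else v
    return out
```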
**Comment From: jorisvandenbossche**
Now that users have the option to use the Arrow-backed string dtype if they want better performance, it might not be necessary to keep this issue open?
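For example, assuming a recent pandas with pyarrow installed:

```python
import pandas as pd

# .str ops on the Arrow-backed dtype dispatch to pyarrow's vectorized
# compute kernels instead of a per-element Python loop
s_arrow = pd.Series(['aaaaaaaaaa', 'bbbbbbbb', 'ccccc', 'dddd'],
                    dtype='string[pyarrow]')
s_arrow.str.upper()
```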
**Comment From: jbrockmendel**
I agree with Joris, closing as "supported via pyarrow"