Code Sample, a copy-pastable example if possible

>>> import numpy as np
>>> import pandas as pd
>>> a_list = ['string'] * 1000000
>>> arr = np.array(a_list)

>>> %timeit pd.Series(a_list)
15.5 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit pd.Series(arr)
83.8 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> arr2 = np.random.rand(1000000)
>>> %timeit pd.Series(arr2)
24.8 µs ± 970 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Problem description

Usually converting a NumPy array to a Series is much faster than converting the same data in a list, but with strings it's much slower (about 5x in this example).

Converting an array of 1 million random floats to a Series is roughly 3000x faster than converting the string array.

Comment From: jschendel

I think this is mostly due to the array you're creating needing a dtype conversion. In numpy, strings get a fixed-width unicode dtype, but for various reasons pandas uses object dtype.

Using the following setup:

In [3]: a_list = ['string'] * 1000000
   ...: arr = np.array(a_list)
   ...: arr2 = np.array(a_list, dtype=object)
   ...:

In [4]: arr.dtype
Out[4]: dtype('<U6')

In [5]: arr2.dtype
Out[5]: dtype('O')
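
A quick sanity check (a minimal sketch, not timed) that the resulting Series ends up with object dtype regardless of which array it is built from, so the unicode array has to be cast:

>>> import numpy as np
>>> import pandas as pd
>>> arr = np.array(['string'] * 1000000)                 # fixed-width unicode, '<U6'
>>> arr2 = np.array(['string'] * 1000000, dtype=object)  # already object dtype
>>> pd.Series(arr).dtype                                  # pandas casts the unicode array to object
dtype('O')
>>> pd.Series(arr2).dtype                                 # no cast needed
dtype('O')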

Creating a Series from the object dtype array is by far the fastest:

In [6]: %timeit pd.Series(a_list)
17.4 ms ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit pd.Series(arr)
141 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit pd.Series(arr2)
56.1 µs ± 163 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

And the bulk of the time for creating the Series from your original array appears to be spent casting to object dtype:

In [9]: %timeit arr.astype(object)
138 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Comment From: tdpetrou

Thanks for going through this @jschendel. The list of strings still needs to be converted to object, correct? So shouldn't the performance be on par with the array-to-Series conversion?

Basically, Python string --> NumPy object is much faster than NumPy unicode --> NumPy object? A 5x discrepancy seems really large.

Comment From: jschendel

I'm not 100% sure what I'm about to say is correct, so keep that in mind until someone more knowledgeable confirms/refutes, but it makes sense to me that python string -> numpy object is faster than numpy unicode -> numpy object.

Per my understanding, for most dtypes numpy arrays are stored as contiguous blocks of memory that have a single datatype (fixed length unicode in this case). However, arrays with object dtype are a bit different; the array is filled with pointers to python objects that are stored elsewhere in memory. When you start with a list of python strings you already have the python objects, but starting with a numpy unicode array you don't, so extra steps are needed to get python objects.
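
A small sketch of that difference (the 8-byte itemsize for object dtype assumes a 64-bit build):

>>> import numpy as np
>>> arr_u = np.array(['string'] * 3)                # fixed-width unicode, data stored inline
>>> arr_o = np.array(['string'] * 3, dtype=object)  # pointers to Python objects stored elsewhere
>>> arr_u.dtype, arr_u.itemsize                     # 6 characters * 4 bytes each
(dtype('<U6'), 24)
>>> arr_o.dtype, arr_o.itemsize                     # one pointer per element
(dtype('O'), 8)
>>> type(arr_u[0])                                  # a new Python object is created on access
<class 'numpy.str_'>
>>> type(arr_o[0])                                  # the existing Python str is returned
<class 'str'>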

Note that converting the unicode array to a python list takes about the same time as astype(object), which makes sense in light of the above:

In [2]: arr = np.array(['string'] * 1000000)

In [3]: %timeit arr.tolist()
142 ms ± 2.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit arr.astype(object)
139 ms ± 263 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I'm not sure how to quantify how long a process like this should take, or whether a 5x discrepancy is indeed large; it could well be that something suboptimal is happening here, but at the very least I don't find the discrepancy especially surprising.

Comment From: tdpetrou

Thanks @jschendel. That explanation makes sense and sounds correct.

Comment From: jorisvandenbossche

Yes, I think that is a perfect explanation. This can then be closed, I think; even if you want to look into why it is 5x slower, that would be on the numpy side, since almost all of the time to convert a string array to a Series is spent in the string-array-to-object-array conversion.
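
For reference, a possible workaround sketch based on the timings above (not an official recommendation): if the data starts out as Python strings, passing the list directly to pd.Series, or building the array with dtype=object, skips the unicode-to-object cast entirely:

>>> import numpy as np
>>> import pandas as pd
>>> a_list = ['string'] * 1000000
>>> s1 = pd.Series(a_list)                          # pandas builds the object array itself
>>> s2 = pd.Series(np.array(a_list, dtype=object))  # object array, no cast at construction
>>> s3 = pd.Series(np.array(a_list))                # '<U6' array, cast to object (the slow path)
>>> s1.equals(s3)
True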