Code Sample, a copy-pastable example if possible
On master:
In [3]: values = ['a', 'b', np.nan]
In [4]: lib.infer_dtype(np.array(values))
Out[4]: 'string'
However, infer_dtype returns 'mixed' if the list itself or a Series is passed:
In [5]: lib.infer_dtype(values)
Out[5]: 'mixed'
In [6]: lib.infer_dtype(pd.Series(values))
Out[6]: 'mixed'
This is a problem because it affects any function that uses hash tables, since NaN is returned as a string (notice the quotes around 'nan' below):
In [7]: pd.unique(np.array(values))
Out[7]: array(['a', 'b', 'nan'], dtype=object)
Again, this doesn't happen if the list itself or a Series is passed:
In [8]: pd.unique(values)
Out[8]: array(['a', 'b', nan], dtype=object)
In [9]: pd.unique(pd.Series(values))
Out[9]: array(['a', 'b', nan], dtype=object)
This appears to be caused by infer_dtype returning 'string' instead of 'mixed' here:
https://github.com/pandas-dev/pandas/blob/488db6f9a0f19e1d18559e6c2056e9545fe14704/pandas/core/algorithms.py#L207-L211
Problem description
infer_dtype incorrectly returns 'string', causing NaN to be converted to a string by functions that use hash tables.
Expected Output
infer_dtype should return 'mixed'.
Output of pd.show_versions()
Comment From: jreback
This is not a pandas issue; rather, numpy is broken.
In [1]: values = ['a', 'b', np.nan]
...:
...: np.array(values)
...:
Out[1]: array(['a', 'b', 'nan'], dtype='<U3')
But the result is correct if object dtype is specified:
In [2]: values = ['a', 'b', np.nan]
...:
...: np.array(values, dtype=object)
...:
...:
Out[2]: array(['a', 'b', nan], dtype=object)
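The two conversions above can be compared side by side. This is a minimal sketch that uses only NumPy; the variable names `coerced` and `preserved` are illustrative and not part of pandas or NumPy. The exact width of the string dtype may differ between NumPy versions, so only the dtype kind is checked.

```python
import numpy as np

values = ['a', 'b', np.nan]

# Without an explicit dtype, NumPy picks a common string dtype,
# so the float NaN is converted to the text 'nan'.
coerced = np.array(values)

# With dtype=object, each element keeps its original Python type.
preserved = np.array(values, dtype=object)

print(coerced.dtype.kind)            # 'U' (unicode string dtype)
print(repr(coerced[2]))              # the missing value is now text
print(type(preserved[2]).__name__)   # still a float
print(np.isnan(preserved[2]))        # and still NaN
```

Passing `dtype=object` at construction time is the only point at which the missing value can be kept, because the information is already gone once the string array exists.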
Comment From: jreback
This is why we have _ensure_arraylike, FYI. But if something comes to pandas already as an incorrectly converted numpy array, we are out of luck (yes, in theory you could scan for 'nan', but that's not a great idea).
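One plausible reason scanning for 'nan' is unattractive (an interpretation, not something stated in the comment above) is that the converted array is indistinguishable from one that contained the literal text 'nan' to begin with. A small sketch, with illustrative variable names:

```python
import numpy as np

# NaN silently becomes the text 'nan' during conversion.
from_missing = np.array(['a', np.nan])

# Here the text 'nan' was intended as an ordinary string.
from_literal = np.array(['a', 'nan'])

# After conversion the element values are identical, so a scan for 'nan'
# cannot tell a lost missing value from a genuine string.
same_contents = from_missing.tolist() == from_literal.tolist()
print(same_contents)
```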
Comment From: jschendel
Ah, I should have checked the underlying numpy array first. _ensure_arraylike is helpful. I created a 2d-specific version of it (for the df.astype('category') PR), but it looks like I wasn't exactly duplicating the logic when it comes to the NaN-as-a-string problem, which is what ultimately led me to open this issue.
Or should I not have a 2d-specific version, and just try to patch _ensure_arraylike itself? The issue I'm running into is that when a list of lists gets passed, it gives a numpy array of lists instead of a 2d numpy array:
In [22]: values = [['a', 'b', 'c', 'a'], ['b', np.nan, 'd', 'd']]
In [23]: _ensure_arraylike(values)
Out[23]: array([list(['a', 'b', 'c', 'a']), list(['b', nan, 'd', 'd'])], dtype=object)
whereas I'd like to get output along the lines of:
In [24]: np.array([_ensure_arraylike(x) for x in values])
Out[24]:
array([['a', 'b', 'c', 'a'],
       ['b', nan, 'd', 'd']], dtype=object)
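For rectangular input, one way to get the desired 2d object array without touching _ensure_arraylike is to let NumPy build it directly with `dtype=object`. The helper name `to_object_2d` below is hypothetical and is not a pandas function; the sketch assumes every inner list has the same length and raises otherwise.

```python
import numpy as np


def to_object_2d(values):
    """Hypothetical helper: convert a rectangular list of lists into a
    2d object array so that NaN stays a float instead of becoming 'nan'."""
    arr = np.array(values, dtype=object)
    if arr.ndim != 2:
        # Ragged or non-nested input does not produce a 2d array.
        raise ValueError("expected a rectangular list of lists")
    return arr


values = [['a', 'b', 'c', 'a'], ['b', np.nan, 'd', 'd']]
result = to_object_2d(values)

print(result.shape)
print(result.dtype)
print(type(result[1, 1]).__name__)
```

This does not replicate whatever other checks _ensure_arraylike performs; it only addresses the NaN-as-a-string conversion for the list-of-lists case.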