Say I make a category-type Series:
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")
I want to pass this to a function that expects a numpy array, so I use the .values property. According to the documentation:
s.values?
Type: property
String form: <property object at 0x106d84050>
Docstring:
Return Series as ndarray
Returns
-------
arr : numpy.ndarray
However:
s.values
[a, b, c]
Categories (3, object): [a < b < c]
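For contrast, here is a minimal sketch of the behavior I expected, based on how a non-categorical Series behaves; this is illustrative only:

import numpy as np
import pandas as pd

# A plain numeric Series: .values is a real numpy ndarray.
s_num = pd.Series([1, 2, 3])
print(isinstance(s_num.values, np.ndarray))   # True

# The categorical Series from above: .values is a pandas Categorical,
# not an ndarray.
s_cat = pd.Series(["a", "b", "c"], dtype="category")
print(isinstance(s_cat.values, np.ndarray))   # False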
Comment From: mwaskom
Aside from the documentation issue, I thought that using .values was the canonical way to represent a Series as an ndarray, so this is potentially a problematic API change.
Comment From: jreback
s.values gets you an object that has the __array__ protocol defined. This is not an API change; FYI, the same is true of Sparse objects. So you can pass this to a function expecting an ndarray. I suppose the docstring could be updated.
In [10]: s = pd.Series(["a", "b", "c"], dtype="category")
In [11]: s.values.__array__
Out[11]:
<bound method Categorical.__array__ of [a, b, c]
Categories (3, object): [a < b < c]>
In [12]: np.array(s.values)
Out[12]: array(['a', 'b', 'c'], dtype=object)
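To make the __array__ point concrete, a minimal sketch (takes_ndarray is a hypothetical consumer, not pandas API): any code path that funnels its input through np.asarray can accept the Categorical directly.

import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")

def takes_ndarray(arr):
    # Hypothetical consumer: coercing via np.asarray invokes the
    # __array__ protocol on the Categorical and yields an object ndarray.
    arr = np.asarray(arr)
    return arr.dtype, arr.size

print(takes_ndarray(s.values))   # (dtype('O'), 3)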
Comment From: mwaskom
Well, OK, but that's not quite the same as giving you an object with all the methods of an ndarray. I currently have code failing because it tries to use the .size attribute on the object that comes out of Series.values.
Comment From: jreback
A Categorical is NOT an ndarray, though it acts like one. It is simply not representable by a single ndarray (it's more like 2 ndarrays).
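A quick illustration of the "2 ndarrays" backing, using only public Categorical attributes:

import pandas as pd

cat = pd.Series(["a", "b", "c"], dtype="category").values

# The Categorical is backed by an array of integer codes...
print(cat.codes)        # [0 1 2]
# ...plus the categories those codes index into.
print(cat.categories)   # Index(['a', 'b', 'c'], dtype='object')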
Comment From: mwaskom
I completely agree, which is why I'm saying that, from a user perspective, the behavior of Series.values when the dtype of that Series is category is unexpected and breaks an existing API.
Comment From: shoyer
I agree, the documentation here is broken. We should point users toward using np.asarray(s) if they might have categorical or sparse values and want to be sure they get a numpy.ndarray object.
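A minimal sketch of that recommendation: np.asarray(s) always produces a plain numpy.ndarray, including for a categorical Series.

import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")

arr = np.asarray(s)
print(type(arr))   # <class 'numpy.ndarray'>
print(arr)         # ['a' 'b' 'c']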
Comment From: jreback
@mwaskom you can simply use .get_values(), which is an 'internal-ish' method that always returns an ndarray (it may coerce, as you see below). .values has actually always been this way (there are other dtypes that are backed by a non-ndarray).
In [3]: s.get_values()
Out[3]: array(['a', 'b', 'c'], dtype=object)
In [4]: s.astype('object').get_values()
Out[4]: array(['a', 'b', 'c'], dtype=object)
In [5]: Series([1,2,3]).get_values()
Out[5]: array([1, 2, 3])
Comment From: mwaskom
Thanks @jreback, though I will probably not be using an "internal-ish" method in library code.
Comment From: jreback
I'll clarify: I meant unadvertised. It's in the public API.
Comment From: jorisvandenbossche
The docstring of .values has been updated in the meantime (https://github.com/pandas-dev/pandas/pull/10477/files#diff-150fd3c5a732ae915ec47bc54a933c41R328), so closing this.
The different output type is of course still confusing, but that is not something we are going to solve before pandas 2.