Say I make a category-type Series:
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")
I want to pass this to a function that expects a numpy array, so I use the .values property. According to the documentation:
s.values?
Type: property
String form: <property object at 0x106d84050>
Docstring:
Return Series as ndarray
Returns
-------
arr : numpy.ndarray
However:
s.values
[a, b, c]
Categories (3, object): [a < b < c]
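For contrast, here is a minimal sketch of the behavior I expected, based on how a non-categorical Series behaves; this is illustrative only:

import numpy as np
import pandas as pd

# A plain numeric Series: .values is a real numpy ndarray.
s_num = pd.Series([1, 2, 3])
print(isinstance(s_num.values, np.ndarray))   # True

# The categorical Series from above: .values is a pandas Categorical,
# not an ndarray.
s_cat = pd.Series(["a", "b", "c"], dtype="category")
print(isinstance(s_cat.values, np.ndarray))   # False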
Comment From: mwaskom
Aside from the documentation issue, I thought that using .values was the canonical way to represent a Series as an ndarray, so this is potentially a problematic API change.
Comment From: jreback
s.values gets you an object that has the __array__ protocol defined. This is not an API change; FYI, the same is true of Sparse objects. So you can pass this to a function expecting an ndarray. I suppose the docstring could be updated.
In [10]: s = pd.Series(["a", "b", "c"], dtype="category")
In [11]: s.values.__array__
Out[11]:
<bound method Categorical.__array__ of [a, b, c]
Categories (3, object): [a < b < c]>
In [12]: np.array(s.values)
Out[12]: array(['a', 'b', 'c'], dtype=object)
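To make the __array__ point concrete, a minimal sketch (takes_ndarray is a hypothetical consumer, not pandas API): any code path that funnels its input through np.asarray can accept the Categorical directly.

import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")

def takes_ndarray(arr):
    # Hypothetical consumer: coercing via np.asarray invokes the
    # __array__ protocol on the Categorical and yields an object ndarray.
    arr = np.asarray(arr)
    return arr.dtype, arr.size

print(takes_ndarray(s.values))   # (dtype('O'), 3)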
Comment From: mwaskom
Well, OK, but that's not quite the same as giving you an object with all the methods of an ndarray. I currently have code failing because it tries to use the .size attribute on the object that comes out of Series.values.
Comment From: jreback
A Categorical is NOT an ndarray, though it acts like one. It is simply not representable by a single ndarray (it's more like 2 ndarrays).
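A quick illustration of the "2 ndarrays" backing, using only public Categorical attributes:

import pandas as pd

cat = pd.Series(["a", "b", "c"], dtype="category").values

# The Categorical is backed by an array of integer codes...
print(cat.codes)        # [0 1 2]
# ...plus the categories those codes index into.
print(cat.categories)   # Index(['a', 'b', 'c'], dtype='object')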
Comment From: mwaskom
I completely agree, which is why I'm saying that, from a user perspective, the behavior of Series.values when the dtype of that Series is category is unexpected and breaks an existing API.
Comment From: shoyer
I agree, the documentation here is broken. We should point users toward using np.asarray(s) if they might have categorical or sparse values and want to be sure they get a numpy.ndarray object.
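A minimal sketch of that recommendation: np.asarray(s) always produces a plain numpy.ndarray, including for a categorical Series.

import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")

arr = np.asarray(s)
print(type(arr))   # <class 'numpy.ndarray'>
print(arr)         # ['a' 'b' 'c']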
Comment From: jreback
@mwaskom you can simply use .get_values(), which is an 'internal-ish' method that always returns an ndarray (it may coerce, as you see below). .values has actually always been this way (there are other dtypes that are backed by a non-ndarray).
In [3]: s.get_values()
Out[3]: array(['a', 'b', 'c'], dtype=object)
In [4]: s.astype('object').get_values()
Out[4]: array(['a', 'b', 'c'], dtype=object)
In [5]: Series([1,2,3]).get_values()
Out[5]: array([1, 2, 3])
Comment From: mwaskom
Thanks @jreback, though I will probably not be using an "internal-ish" method in library code.
Comment From: jreback
I'll clarify: I meant unadvertised. It's in the public API.
Comment From: jorisvandenbossche
The docstring of .values has been updated in the meantime (https://github.com/pandas-dev/pandas/pull/10477/files#diff-150fd3c5a732ae915ec47bc54a933c41R328), so closing this.
The different output type is of course still confusing, but that is not something we are going to solve before pandas 2.