ATM these are special-cased to return Int64 (i.e. nullable) instead of np.int64. But the result of value_counts will never have any NAs, so there is no benefit. It complicates the code, complicates the API, and prevents us from sharing tests.
These should return np.int64 dtype like everything else.
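For reference, a quick way to see the special-casing being described (a small sketch; the exact reprs may vary across pandas versions):
>>> import pandas as pd
>>> pd.Series([1, 2, 1], dtype="int64").value_counts().dtype   # numpy-backed input
dtype('int64')
>>> pd.Series([1, 2, 1], dtype="Int64").value_counts().dtype   # nullable input, special-cased
Int64Dtype()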
Comment From: jorisvandenbossche
The return value of value_counts indeed never contains NAs, but further operations with this result can still introduce missing values, and then the return type of value_counts does matter.
The nullable dtypes are optional, but once a user opts in, we should IMO keep using them as much as possible for results (e.g. fillna also keeps the nullable dtype, even though the result has no NAs).
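For example (a small sketch of the fillna point, assuming a recent pandas with nullable dtypes):
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3], dtype="Int64")   # no missing values present
>>> s.fillna(0).dtype                         # result still uses the nullable dtype
Int64Dtype()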
Comment From: jreback
I agree with @jbrockmendel here. If we know that the type is int64 by definition / always, I think we should just return it for this operation. The simplification argument is persuasive.
Comment From: jorisvandenbossche
It is by definition an integer dtype with 64-bit width, but whether we choose Int64 or int64 is just an API choice (Int64 is also an int64 dtype).
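For instance, the nullable dtype reports itself as 64-bit integer data (a quick check, assuming a pandas version that exposes these attributes):
>>> import pandas as pd
>>> pd.Int64Dtype().kind          # integer kind, same as numpy int64
'i'
>>> pd.Int64Dtype().numpy_dtype   # the underlying numpy dtype
dtype('int64')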
There are quite a few other operations where we preserve nullable dtypes, even if there are no missing values.
As mentioned above, while the return value of value_counts itself doesn't have missing values, follow-up operations in your pipeline could introduce them, in which case the actual dtype makes a difference. A small example:
>>> s = pd.Series([1, 2, 1, 2, 4], dtype="Int64")
>>> s
0    1
1    2
2    1
3    2
4    4
dtype: Int64
>>> s.value_counts().reindex(list(range(5)))
0    <NA>
1       2
2       2
3    <NA>
4       1
dtype: Int64
>>> s.value_counts().reindex(list(range(5))).fillna(0)
0    0
1    2
2    2
3    0
4    1
dtype: Int64
If we returned numpy.int64, the last two examples would instead be float data: reindex would introduce NaN and upcast to float64, and fillna(0) would keep the float dtype.
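For comparison, a sketch of the same pipeline with a plain numpy-backed series, where reindexing upcasts to float:
>>> import pandas as pd
>>> counts = pd.Series([1, 2, 1, 2, 4]).value_counts()    # numpy int64 counts
>>> counts.reindex(list(range(5))).dtype                   # NaN introduced -> float
dtype('float64')
>>> counts.reindex(list(range(5))).fillna(0).dtype         # stays float64
dtype('float64')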
IMO, when people choose to use a nullable dtype, we should preserve as much as possible the "nullability" in operations, so have type stability for this aspect of the type.
Comment From: jreback
> IMO, when people choose to use a nullable dtype, we should preserve as much as possible the "nullability" in operations, so have type stability for this aspect of the type.
Sure, but this point is not relevant.
What is relevant is that we should just pick a return type.
It's pretty crazy that the output type differs here.
So we need to pick either int64 or Int64 for the return value, always.
Comment From: mroeschke
I could see there being a consistency argument to return Int64 if there's a strong push to make the nullable numeric types the return type for all pandas operations eventually.
If not (or not in the near future), I think the simplicity of returning np.int64 makes sense, and the option of casting with astype("Int64") after value_counts could be left to the user.
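Something like the following is the kind of cast that would be left to the user under that option (hypothetical: it assumes value_counts returned plain numpy int64):
>>> import pandas as pd
>>> s = pd.Series([1, 2, 1, 2, 4], dtype="Int64")
>>> counts = s.value_counts()           # would be numpy int64 under this proposal
>>> counts = counts.astype("Int64")     # user opts back into the nullable dtype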
Comment From: rhshadrach
Once a user has opted into nullable dtypes, it feels expected to me that pandas continues to use nullable dtypes even when it doesn't have to (e.g. fillna). I think this is the way most ops work, although admittedly many of them can have a result with null values and so maybe "shouldn't count". By always returning int64, it seems to me we'd be creating special-cased behavior because of the peculiarities of the op itself, something users may find surprising.
I do agree that we can have users cast back to nullable dtypes if necessary, but I don't think of this as a preferred solution. It makes trying to use nullable dtypes more of a hassle.
Comment From: mroeschke
This looks to be the PR & discussion where the Int64 return type was introduced: https://github.com/pandas-dev/pandas/pull/30824
It appears the motivation was to promote the new nullable dtype back in 1.0.
Comment From: jbrockmendel
closing as never-gonna-happen