ATM these are special-cased to return Int64 (i.e. nullable) instead of np.int64. But the result of value_counts will never have any NAs, so there is no benefit. It complicates the code, complicates the API, and prevents us from sharing tests.
These should return np.int64 dtype like everything else.
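For reference, a quick way to see the special-casing being described (a small sketch; the exact reprs may vary across pandas versions):
>>> import pandas as pd
>>> pd.Series([1, 2, 1], dtype="int64").value_counts().dtype   # numpy-backed input
dtype('int64')
>>> pd.Series([1, 2, 1], dtype="Int64").value_counts().dtype   # nullable input, special-cased
Int64Dtype()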
Comment From: jorisvandenbossche
The return value of value_counts indeed never contains NAs, but further operations with this result can still introduce missing values, and then the return type of value_counts does matter.
The nullable dtypes are optional, but once a user opts in, we should IMO keep using them as much as possible for results (e.g. fillna also keeps the nullable dtype, even though the result has no NAs).
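For example (a small sketch of the fillna point, assuming a recent pandas with nullable dtypes):
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3], dtype="Int64")   # no missing values present
>>> s.fillna(0).dtype                         # result still uses the nullable dtype
Int64Dtype()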
Comment From: jreback
I agree with @jbrockmendel here. If we know that the type is int64 by definition / always, I think we should just return it for this operation. The simplification argument is persuasive.
Comment From: jorisvandenbossche
It is by definition an integer dtype with 64-bit width, but whether we choose Int64 or int64 is just an API choice (Int64 is also an int64 dtype).
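For instance, the nullable dtype reports itself as 64-bit integer data (a quick check, assuming a pandas version that exposes these attributes):
>>> import pandas as pd
>>> pd.Int64Dtype().kind          # integer kind, same as numpy int64
'i'
>>> pd.Int64Dtype().numpy_dtype   # the underlying numpy dtype
dtype('int64')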
There are quite a few other operations where we preserve nullable dtypes, even if there are no missing values.
As mentioned above, while the return value of value_counts itself doesn't have missing values, follow-up operations in your pipeline could introduce them, in which case the actual dtype makes a difference. A small example:
>>> s = pd.Series([1, 2, 1, 2, 4], dtype="Int64")
>>> s
0    1
1    2
2    1
3    2
4    4
dtype: Int64
>>> s.value_counts().reindex(list(range(5)))
0    <NA>
1       2
2       2
3    <NA>
4       1
dtype: Int64
>>> s.value_counts().reindex(list(range(5))).fillna(0)
0    0
1    2
2    2
3    0
4    1
dtype: Int64
If we returned numpy.int64, the last two examples would instead be float data: reindex would introduce NaN and upcast to float64, and fillna(0) would keep the float dtype.
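For comparison, a sketch of the same pipeline with a plain numpy-backed series, where reindexing upcasts to float:
>>> import pandas as pd
>>> counts = pd.Series([1, 2, 1, 2, 4]).value_counts()    # numpy int64 counts
>>> counts.reindex(list(range(5))).dtype                   # NaN introduced -> float
dtype('float64')
>>> counts.reindex(list(range(5))).fillna(0).dtype         # stays float64
dtype('float64')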
IMO, when people choose to use a nullable dtype, we should preserve as much as possible the "nullability" in operations, so have type stability for this aspect of the type.
Comment From: jreback
> IMO, when people choose to use a nullable dtype, we should preserve as much as possible the "nullability" in operations, so have type stability for this aspect of the type.
Sure, but this point is not relevant.
What is relevant is that we should just pick a return type.
It's pretty crazy that the output type differs here.
So we need to pick either int64 or Int64 for the return value, always.
Comment From: mroeschke
I could see there being a consistency argument to return Int64 if there's a strong push to make the nullable numeric types the return type for all pandas operations eventually.
If not (or not in the near future), I think the simplicity of returning np.int64 makes sense, and the option of casting with astype("Int64") after value_counts could be left to the user.
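Something like the following is the kind of cast that would be left to the user under that option (hypothetical: it assumes value_counts returned plain numpy int64):
>>> import pandas as pd
>>> s = pd.Series([1, 2, 1, 2, 4], dtype="Int64")
>>> counts = s.value_counts()           # would be numpy int64 under this proposal
>>> counts = counts.astype("Int64")     # user opts back into the nullable dtype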
Comment From: rhshadrach
Once a user has opted into nullable dtypes, it feels expected to me that pandas continues to use nullable dtypes even when it doesn't have to (e.g. fillna). I think this is the way most ops work, although admittedly many of them can have a result with null values and so maybe "shouldn't count". By always returning int64, it seems to me we'd be creating special-cased behavior because of the peculiarities of the op itself, something users may find surprising.
I do agree that we can have users cast back to nullable dtypes if necessary, but I don't think of this as a preferred solution. It makes trying to use nullable dtypes more of a hassle.
Comment From: mroeschke
This looks to be the PR & discussion where the Int64 return type was introduced: https://github.com/pandas-dev/pandas/pull/30824
It appears the motivation was to promote the new nullable dtype back in 1.0.
Comment From: jbrockmendel
closing as never-gonna-happen