Pandas inconsistent name handling in value_counts, part 2

I was happy to see in the release notes for 0.17.0 that value_counts no longer discards the series name, but the implementation wasn't what I expected.

0.17.0 gives

>>> series = pd.Series([1731, 364, 813, 1731], name='user_id')
>>> series.value_counts()
1731    2
813     1
364     1
Name: user_id, dtype: int64

which doesn't set the index name.

In my opinion the old series name belongs in the index, not in the series name:

>>> series.value_counts()
user_id
1731    2
813     1
364     1
dtype: int64

Why:

- It's logical: the user_id has moved to the index, and the values now represent occurrence counts
- This would be consistent with how .groupby().size() behaves
- Adding a missing index name is cumbersome and requires creating a temporary variable
- In many cases the series name is discarded, while index names tend to stick around: for example, pd.DataFrame({'n': series.value_counts(), 'has_duplicates': series.value_counts() > 1}) should really have user_id as an index name
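To illustrate the last two points, here is a rough sketch of the workaround I mean. The variable names counts and df are only illustrative, and the explicit assignment is harmless on versions that already set the index name:

```python
import pandas as pd

series = pd.Series([1731, 364, 813, 1731], name='user_id')

# A temporary variable is needed because the index name is set by assignment.
counts = series.value_counts()
counts.index.name = series.name

# Both columns share the same named index, so the frame's index keeps the name.
df = pd.DataFrame({'n': counts, 'has_duplicates': counts > 1})
print(df.index.name)
```

With the index name in place, the resulting frame is labelled by user_id without any further renaming.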

There are three options:

- result.name = None and result.index.name = series.name
- result.name = series.name and result.index.name = series.name
- result.name = 'size' or 'count' and result.index.name = series.name

The first option seems more elegant to me but @sinhrks, who reported #10150, apparently expected result.name to be filled, so perhaps there are use cases where the second option is useful.
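To make the three options concrete, they could be emulated on top of the current behavior roughly as follows. This is only a sketch: value_counts_named and its option argument are hypothetical names that I made up for illustration, not part of pandas, and 'count' is picked arbitrarily for the third option.

```python
import pandas as pd


def value_counts_named(series, option):
    """Hypothetical helper emulating the three naming options (not a pandas API)."""
    counts = series.value_counts()
    if option == 1:
        counts.name = None          # option 1: the name lives only in the index
    elif option == 2:
        counts.name = series.name   # option 2: the name lives in both places
    elif option == 3:
        counts.name = 'count'       # option 3: a fixed, descriptive result name
    else:
        raise ValueError("option must be 1, 2 or 3")
    # All three options move the original series name into the index.
    counts.index = counts.index.rename(series.name)
    return counts


series = pd.Series([1731, 364, 813, 1731], name='user_id')
for option in (1, 2, 3):
    result = value_counts_named(series, option)
    print(option, result.name, result.index.name)
```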

Comment From: jreback

Can you show where you think .groupby preserves the name? AFAICT it does not.

In [111]: s = pd.Series([1731, 364, 813, 1731], name='user_id', index=pd.Index(range(4), name='foo'))

In [112]: s
Out[112]: 
foo
0    1731
1     364
2     813
3    1731
Name: user_id, dtype: int64

In [113]: s.groupby(s.values).count()           
Out[113]: 
364     1
813     1
1731    2
dtype: int64

In [114]: s.value_counts()
Out[114]: 
1731    2
813     1
364     1
Name: user_id, dtype: int64

Comment From: sinhrks

@corr724 #10150 isn't the same as your second option. It is:

- result.name = series.name and result.index.name = None

Agreed that the first option is consistent with groupby().count():

s.groupby(s).count().name
# None
s.groupby(s).count().index.name
# 'user_id'

Comment From: corr724

Yes, the three options are all ways to change the current behavior. The second option was intended as a compromise between the first option and the change in #10150.

@jreback: That's because s.values discards all Pandas metadata, but using either s.groupby(s) or df.groupby('column') will set the index name.
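A small sketch of the difference, showing only the index names. The variable names by_values, by_series and by_column are just for illustration:

```python
import pandas as pd

s = pd.Series([1731, 364, 813, 1731], name='user_id')

# Grouping by the raw NumPy array: the grouper carries no name.
by_values = s.groupby(s.values).count()

# Grouping by the Series itself: the grouper's name becomes the index name.
by_series = s.groupby(s).count()

# Grouping a DataFrame by column label behaves the same way.
by_column = s.to_frame().groupby('user_id').size()

print(by_values.index.name)
print(by_series.index.name)
print(by_column.index.name)
```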

Comment From: jreback

ok, that seems reasonable then.

I think option 1 is fine (though I'm also ok with name=size/count, your option 3); we should just make sure that groupby.size/count behaves the same way.

Comment From: lukemanley

closing as this was implemented in #49912