I was happy to see in the release notes for 0.17.0 that value_counts no longer discards the series name, but the implementation wasn't what I expected.
0.17.0 gives
>>> series = pd.Series([1731, 364, 813, 1731], name='user_id')
>>> series.value_counts()
1731 2
813 1
364 1
Name: user_id, dtype: int64
which doesn't set the index name.
In my opinion the old series name belongs in the index, not in the series name:
>>> series.value_counts()
user_id
1731 2
813 1
364 1
dtype: int64
Why:
- It's logical: the user_id has moved to the index, and the values now represent occurrence counts
- This would be consistent with how .groupby().size() behaves
- Adding a missing index name is cumbersome and requires creating a temporary variable
- In many cases the series name is discarded, while index names tend to stick around: for example, pd.DataFrame({'n': series.value_counts(), 'has_duplicates': series.value_counts() > 1}) should really have user_id as an index name
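The workaround mentioned in the list above can be sketched as follows. This is only an illustration of the temporary-variable pattern, not a proposed implementation; the names counts and frame are made up for the example.

```python
import pandas as pd

series = pd.Series([1731, 364, 813, 1731], name='user_id')

# A temporary variable is needed because the names are set after the call.
counts = series.value_counts()
counts.index.name = series.name  # move the old series name onto the index
counts.name = None               # the values are now occurrence counts

# The index name survives the DataFrame construction.
frame = pd.DataFrame({'n': counts, 'has_duplicates': counts > 1})
print(frame.index.name)
```

Because both columns share the same named index, the resulting frame keeps user_id as its index name.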
There are three options:
- result.name = None and result.index.name = series.name
- result.name = series.name and result.index.name = series.name
- result.name = 'size' or 'count' and result.index.name = series.name
The first option seems more elegant to me but @sinhrks, who reported #10150, apparently expected result.name to be filled, so perhaps there are use cases where the second option is useful.
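To make the three options concrete, here is a minimal sketch that applies each naming scheme on top of the existing value_counts. The helper named_value_counts is hypothetical and not part of pandas, and 'count' is picked arbitrarily for the third option, which could equally use 'size'.

```python
import pandas as pd


def named_value_counts(series, option):
    """Hypothetical helper: apply one of the three naming options."""
    # Option number -> name of the resulting Series.
    result_names = {1: None, 2: series.name, 3: 'count'}
    counts = series.value_counts()
    # In all three options the old series name becomes the index name.
    counts = counts.rename_axis(series.name)
    counts.name = result_names[option]
    return counts


series = pd.Series([1731, 364, 813, 1731], name='user_id')
for option in (1, 2, 3):
    result = named_value_counts(series, option)
    print(option, repr(result.name), repr(result.index.name))
```

The options differ only in result.name; the index name is user_id in every case.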
Comment From: jreback
Can you show where you think .groupby preserves the name? AFAICT it does not.
In [111]: s = pd.Series([1731, 364, 813, 1731], name='user_id', index=pd.Index(range(4), name='foo'))
In [112]: s
Out[112]:
foo
0 1731
1 364
2 813
3 1731
Name: user_id, dtype: int64
In [113]: s.groupby(s.values).count()
Out[113]:
364 1
813 1
1731 2
dtype: int64
In [114]: s.value_counts()
Out[114]:
1731 2
813 1
364 1
Name: user_id, dtype: int64
Comment From: sinhrks
@corr724 #10150 isn't the same as your second option. It is:
- result.name = series.name and result.index.name = None
Agreed that the first option is consistent with groupby().count():
s.groupby(s).count().name
# None
s.groupby(s).count().index.name
# 'user_id'
Comment From: corr724
Yes, the three options are all ways to change the current behavior. The second option was intended as a compromise between the first option and the change in #10150.
@jreback: That's because s.values discards all pandas metadata, but using either s.groupby(s) or df.groupby('column') will set the index name.
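A short sketch of the difference, using the same data as above. The variable names are illustrative, and the column name 'column' is just the placeholder from the sentence above.

```python
import pandas as pd

s = pd.Series([1731, 364, 813, 1731], name='user_id')
df = pd.DataFrame({'column': [1731, 364, 813, 1731]})

# Grouping by the raw NumPy array: the name is lost along with the metadata.
by_values = s.groupby(s.values).count()

# Grouping by the Series itself, or by a column label, keeps the name.
by_series = s.groupby(s).count()
by_column = df.groupby('column').size()

print(by_values.index.name)
print(by_series.index.name)
print(by_column.index.name)
```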
Comment From: jreback
ok, that seems reasonable then.
I think option 1 is fine (though I'm ok with name=size/count as well, your option 3); we should just make sure that groupby.size/count is the same.
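One way to check that consistency is to print the names each method produces side by side. This is a rough sketch with illustrative variable names; the printed names depend on the installed pandas version, so no expected output is shown.

```python
import pandas as pd

s = pd.Series([1731, 364, 813, 1731], name='user_id')

results = {
    'value_counts': s.value_counts(),
    'groupby.size': s.groupby(s).size(),
    'groupby.count': s.groupby(s).count(),
}

# Collect (result.name, result.index.name) for each method.
naming = {label: (r.name, r.index.name) for label, r in results.items()}
for label, names in naming.items():
    print(label, names)
```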
Comment From: lukemanley
Closing, as this was implemented in #49912.