Pandas version checks
-
[X] I have checked that this issue has not already been reported.
-
[X] I have confirmed this bug exists on the latest version of pandas.
-
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Let's start with:
ser = pd.Series([1, 2, 3, 4, 5], name='foo')
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 4, 5]}, index=[2,3,4])
Issue Description
ser.value_counts
:name
is from original Series,index
is unnameddf.value_counts
:name
isNone
,index
names from columns of original DataFramedf.groupby('a')['b'].value_counts()
:name
isb
, index names area
andb
df.groupby('a', as_index=False)['b'].value_counts()
: new column is'count'
, index is a newRangeIndex
Expected Behavior
My suggestion is:
ser.value_counts
:name
is'count'
,index
is name of original Seriesdf.value_counts
:name
is'count'
,index
names from columns of original DataFramedf.groupby('a')['b'].value_counts()
:name
is'count'
, index names area
andb
df.groupby('a', as_index=False)['b'].value_counts()
: new column is'count'
, index is a newRangeIndex
[same as now]
NOTE: if normalize=True
, then replace 'count'
with 'proportion'
Bug or API change?
Given that:
- df.groupby('a')['b'].value_counts().reset_index()
errors
- df.groupby('a')[['b']].value_counts().reset_index()
and df.groupby('a', as_index=False)[['b']].value_counts()
don't match
- multi_idx.value_counts()
doesn't preserve the multiindex's name anywhere
I'd be more inclined to consider it a bug and to make the change in 2.0
Installed Versions
Comment From: MarcoGorelli
cc @pandas-dev/pandas-core @pandas-dev/pandas-triage in case anyone has thoughts - hoping we can consider this a bug and change the behaviour in version 2.0
Comment From: Dr-Irv
It could break existing code. In your examples, if you now do:
pd.DataFrame(ser.value_counts())
you get a DataFrame
with the column named foo
, so if someone knew that in there code, and did a rename
, their code would break, since your proposal would be to name the column in this case count
I think that we'd need a deprecation cycle for this???
Comment From: MarcoGorelli
True, it would break part of the API, but it would make the API more consistent
Not sure how to raise a deprecation warning for this - where would the warning be raised?
Comment From: rhshadrach
ser.value_counts
:name
is'count'
,index
is name of original Series
Isn't this inconsistent with other Series methods, e.g.
ser = pd.Series([1, 2, 3], name='foo')
print(ser.cumsum())
# 0 1
# 1 3
# 2 6
# Name: foo, dtype: int64
Comment From: MarcoGorelli
I don't think so, no - in cumsum
, the index is preserved, and the values get cumulatively summed. So it makes sense that the new Series keeps its name
In [5]: ser = pd.Series([1, 2, 3], name='foo', index=['a', 'b', 'c'])
In [6]: ser.cumsum()
Out[6]:
a 1
b 3
c 6
Name: foo, dtype: int64
In value_counts
, the Series unique values become the index - hence why it feels natural that the index should inherit the Series name
Comment From: mroeschke
I guess I could see an argument that value_counts
is an implicit groupby-count, so for the Series
case, the resulting index.name
should be ser.name
.
Not sure how I feel about an operation just introducing a name like "counts"/"normalized"
Comment From: MarcoGorelli
Isn't that what already happens in the groupby case though, where an extra column is inserted with the name 'count'
?
In [11]: df.groupby('a', as_index=False)['b'].value_counts()
Out[11]:
a b count
0 1 4 1
1 2 4 1
2 3 5 1
I find it odd that the corresponding as_index=True
case doesn't have that:
In [19]: df.groupby('a')['b'].value_counts()
Out[19]:
a b
1 4 1
2 4 1
3 5 1
Name: b, dtype: int64
not least because df.groupby('a')['b'].value_counts().reset_index()
will fail with ValueError: cannot insert b, already exists
.
If the name of the resulting series were 'count'
, then
- df.groupby('a')['b'].value_counts().reset_index()
and
- df.groupby('a', as_index=False)['b'].value_counts()
would be aligned
Comment From: Dr-Irv
One thing to consider is to add a parameter to value_counts()
that is called name
that defaults to None
(or "count"
in the one case where "count"
is used today) so that it corresponds to current behavior, and then we deprecate that default parameter changing to the word "count"
using your proposal.
Comment From: lukemanley
For indexes, the resulting Series.name
is taken from the input and the resulting Index.name
is None
.
pd.Index([1, 2, 3], name="foo").value_counts()
1 1
2 1
3 1
Name: foo, dtype: int64
This doesn't seem quite right as it cannot support a MultiIndex
that has multiple names:
pd.MultiIndex.from_tuples([(1, 1), (2, 2), (3, 3)], names=["foo", "bar"]).value_counts()
(1, 1) 1
(2, 2) 1
(3, 3) 1
dtype: int64
In the MultiIndex
case, the resulting Series.name
is None
. If the resulting names remained on the index, this would work for MultiIndex
as well.
(as a side note - I think MultiIndex.value_counts
should return a Series
indexed by a MultiIndex
not a flat index of tuples as shown in the example above. I will open a separate issue/PR for that)
Comment From: MarcoGorelli
That's a good point @lukemanley , thanks, I've added that to the issue description
@Dr-Irv true, there could be an extra parameter, but then - that parameter itself would probably need deprecating - IIUC, it wouldn't help to warn about the change in the index's name
I've added this to the agenda on Wednesday, hope we can touch base on it
Comment From: kuraga
Do I understand correctly that you are discussing .value_counts()
in isolation from functions of the same class?
Comment From: MarcoGorelli
Yes, this discussion is limited to value_counts
- I think it's exactly what your issue https://github.com/pandas-dev/pandas/issues/25795 was about
Comment From: rhshadrach
I don't think so, no - in
cumsum
, the index is preserved, and the values get cumulatively summed. So it makes sense that the new Series keeps its name
Currently all Series methods that return a Series keep the Series name (excluding Series.rename
:laughing:). This includes describe, mode, nlargest, and sample which don't keep the same index (though nlargest and sample are filters). This would be the only one that doesn't.
Comment From: MarcoGorelli
In Series.mode
, the original values still end up being part of the result values:
In [5]: ser
Out[5]:
a 1
b 1
c 3
Name: foo, dtype: int64
In [6]: ser.mode()
Out[6]:
0 1
Name: foo, dtype: int64
whereas in value_counts
, the original values become the index:
In [7]: ser.value_counts()
Out[7]:
1 2
3 1
Name: foo, dtype: int64
Hence why I think the original Series name should propagate to the result Index name, rather than to the result Series name
Comment From: Dr-Irv
That's a good point @lukemanley , thanks, I've added that to the issue description
@Dr-Irv true, there could be an extra parameter, but then
- that parameter itself would probably need deprecating
- IIUC, it wouldn't help to warn about the change in the index's name
I've added this to the agenda on Wednesday, hope we can touch base on it
It might be worth having the name
parameter anyway in case someone doesn’t want to use the word “count”
could also have a parameter for the index name as well.
value_counts()
is a bit of a pivot table operation so renaming the index and resulting column should be something a user can control.
Comment From: rhshadrach
It might be worth having the
name
parameter anyway in case someone doesn’t want to use the word “count”
Isn't this what .rename(...)
is for? In general, I'm against adding arguments to methods when they are equivalent to calling an already existing method. It makes our API surface larger, requires more testing, doesn't provide any additional features, and not significantly easier for users to use.
I think we should prefer method chaining in this case.
Comment From: Dr-Irv
It might be worth having the
name
parameter anyway in case someone doesn’t want to use the word “count”Isn't this what
.rename(...)
is for? In general, I'm against adding arguments to methods when they are equivalent to calling an already existing method. It makes our API surface larger, requires more testing, doesn't provide any additional features, and not significantly easier for users to use.I think we should prefer method chaining in this case.
I see your point. Still might be worth introducing it here to allow the current 1.x behavior to be preserved, then we deprecate the parameter later. Would require two deprecation cycles, one for the values of the default arguments and one for the new behavior.
As I said earlier, this is a kind of pivot table operation, so having control of the names does make sense. I don’t think we have other places in pandas where we create a default name like this.
Comment From: attack68
It might be worth having the
name
parameter anyway in case someone doesn’t want to use the word “count”Isn't this what
.rename(...)
is for? In general, I'm against adding arguments to methods when they are equivalent to calling an already existing method. It makes our API surface larger, requires more testing, doesn't provide any additional features, and not significantly easier for users to use.I think we should prefer method chaining in this case.
I agree. Much prefer reducing arguments in peripheral cases where method chaining does the job. Some actions are so common that arguments are useful to many end users ( and may not be avoidable) but chaining gives a greater understanding of how the api can be used in generality. I think 2.0 gives the freedom to redesign the api, and the principle of preserving the 1.x behaviour is a constraint that should be removed.
Comment From: jorisvandenbossche
Currently all Series methods that return a Series keep the Series name .... This would be the only one that doesn't.
I agree with @MarcoGorelli's point about that all those examples you give, the values in the series "stay" (in some form, filtered or reduced) in the series' values. While for value_counts, the values move to the index, and the values of the series are now counts.
To make this very explicit with a non-numerical example:
>>> df = pd.DataFrame({'species': ["cat", "cat", "dog"]})
>>> df["species"].value_counts()
cat 2
dog 1
Name: species, dtype: int64
>>> df["species"].value_counts().reset_index()
index species
0 cat 2
1 dog 1
After resetting the index, the column with the species names (the values in the original "species" column), is no longer in the "species" column.
I think this makes value_counts()
considerably different from other Series methods like mode
or sample
that preserve the Series' name.
I often tell people that value_counts
is comparable to doing a group by your column and checking the size of the groups (except for value_counts automatically sorting by counts):
>>> df.groupby("species").size().reset_index()
species 0
0 cat 2
1 dog 1
>>> df.groupby("species").size().rename("count").reset_index()
species count
0 cat 2
1 dog 1
With this method the original name of the column you are grouping by actually gets "correctly" set to the index (or grouping column, after a reset of the index).
Comment From: jorisvandenbossche
It might be worth having the
name
parameter anyway in case someone doesn’t want to use the word “count”
That seems like a nice potential intermediate solution. It could provide an easier migration path (although an actual deprecation would still be very noisy given how much value_counts
is used), and also gives easier access to the behaviour we might think is the better one. It could look like this:
# (hypothetical, not actual code with current pandas)
>>> df["species"].value_counts(name="counts").reset_index()
species count
0 cat 2
1 dog 1
Isn't this what
.rename(...)
is for? In general, I'm against adding arguments to methods when they are equivalent to calling an already existing method. It makes our API surface larger, requires more testing, doesn't provide any additional features, and not significantly easier for users to use.I think we should prefer method chaining in this case.
I certainly agree with the general principle. We should prefer and promote composability of methods. But just to be very concrete, using my example from above, I can think of the following options to do the renaming:
>>> df["species"].value_counts().rename("count").rename_axis("species").reset_index()
species count
0 cat 2
1 dog 1
>>> df["species"].value_counts().reset_index(name="count").rename(columns={"index": "species"})
species count
0 cat 2
1 dog 1
>>> df["species"].value_counts().reset_index().rename(columns={"species": "count", "index": "species"})
species count
0 cat 2
1 dog 1
(I am adding a reset_index
for each of the cases to have an easier interpretable dataframe as result)
So it's not just an easy single "rename" in addition to value_counts()
(since the Series' name is preserved, just renaming the Index would give a duplicate name, so you need to rename both)
Given that this is not that straightforward to achieve, I think practicality could beat purity here, and an additional keyword argument could be worth it (I find a potential df["species"].value_counts(name="counts").reset_index()
clearer and more readable than the current options above)
Comment From: rhshadrach
I'm persuaded by the arguments of @MarcoGorelli and @jorisvandenbossche, and find the df.groupby("species").size().rename("count").reset_index()
example particularly compelling. I'm +1 on the desired behavior in the OP.
Regarding adding the name
argument, I would be okay with adding if it is to be deprecated after the change is made. I think we should also consider making this API breaking change in 2.0 without a deprecation. This would be easier from a development perspective, and my experience is that value_counts
is primarily used for ad-hoc analysis where changing the name wouldn't break anything.
Comment From: MarcoGorelli
Thanks all for the discussion, and really nice example from Joris!
Regarding adding a name
parameter to enforce the deprecation - would the idea be to also add index_name
, and have them emit a FutureWarning
the user doesn't pass anything notifying them that the default will change in a future version of pandas? If so, I worry that that would be extremely noisy. So
I think we should also consider making this API breaking change in 2.0 without a deprecation
makes sense to me.
my experience is that value_counts is primarily used for ad-hoc analysis where changing the name wouldn't break anything
Same, just confirmed this by scraping Kaggle notebooks and looking through them, value_counts
is usually part of the EDA rather than part of the model pipeline. One very popular notebook, as part of their EDA, actually does
feature_count = train[i].value_counts(dropna=False)[:40].reset_index().rename(columns={i: 'count', 'index': i})
so it seems that they also find the current behaviour unexpected
Comment From: jbrockmendel
+1 on the suggested behavior, -0.5 on adding a new keyword
Comment From: Dr-Irv
Regarding adding a
name
parameter to enforce the deprecation - would the idea be to also addindex_name
, and have them emit aFutureWarning
the user doesn't pass anything notifying them that the default will change in a future version of pandas? If so, I worry that that would be extremely noisy. SoI think we should also consider making this API breaking change in 2.0 without a deprecation
makes sense to me.
If we do the API breaking change (and it's really a behavior change), I think we should introduce the two new params (name
and index_name
) so that the old behavior can be obtained by using the new arguments, and by following @jorisvandenbossche examples, it makes user code cleaner anyway.
my experience is that value_counts is primarily used for ad-hoc analysis where changing the name wouldn't break anything
I checked some of our code, and we have used value_counts()
in production code outside of notebooks, not just for analysis. I did check it in a few places, and I don't think the proposed changes would break our code, but we shouldn't assume that it is only used for ad-hoc analysis.
Comment From: jbrockmendel
introduce the two new params (name and index_name) so that the old behavior can be obtained by using the new arguments
if we're telling users "Do X to get the old behavior", why can't the "X" be "use rename(...)" so we don't have to add new arguments?
Comment From: Dr-Irv
introduce the two new params (name and index_name) so that the old behavior can be obtained by using the new arguments
if we're telling users "Do X to get the old behavior", why can't the "X" be "use rename(...)" so we don't have to add new arguments?
Fair enough. Having said that, I think that the examples given by @jorisvandenbossche in https://github.com/pandas-dev/pandas/issues/49497#issuecomment-1304566796 suggest that the new arguments would be a good thing.
Comment From: jorisvandenbossche
@Dr-Irv what would be the rationale for also having index_name
? (except for exactly getting back the old behavior, which would be to set it to None (index_name=None
), but I am not sure how useful this is?)
I would think that generally the original Series name is an appropriate name for the resulting index, without a strong need to more easily rename it than using the existing renaming methods (while for a name
argument, that would be to control a new column name, where it seems more logical that the user might want to choose this name)
FYI, I just noticed that R's tally method uses "n" instead of "count" as the resulting column name, so that's also an option. Although I think for us "count" is fine given that this resembles the method name.
Comment From: Dr-Irv
@Dr-Irv what would be the rationale for also having
index_name
? (except for exactly getting back the old behavior, which would be to set it to None (index_name=None
), but I am not sure how useful this is?)
Maybe it is because of symmetry with naming the columns, so if we have name
, it feels right to have index_name
from a symmetry perspective. Having said that, I don't think it is as important of adding name
Comment From: MarcoGorelli
Although I think for us "count" is fine given that this resembles the method name.
and also given that that's what pandas already does in the groupby (as_index=False) case:
In [1]: df = pd.DataFrame({'animal': ['chinchilla', 'lemur', 'quetzel'], 'length': [4.22, 3.08, .34]})
In [2]: df.groupby('animal', as_index=False).value_counts()
Out[2]:
animal length count
0 chinchilla 4.22 1
1 lemur 3.08 1
2 quetzel 0.34 1
Regarding adding extra arguments, I'd be -1 on that: using .rename
and .rename_axis
looks just as clean to me
Comment From: MarcoGorelli
From the last pandas-dev call, it was agreed that this change is desirable, but there's no agreement on how to make the change. I can see the following options:
1. always raise a warning on .value_counts
calls
2. add name
argument, raise warning if name
is None
3. add name
and index_name
arguments, raise warning if either is None
4. change in 2.0.0 without runtime warning
In short, I'd favour option 4. I'll go through the options one-by-one, explaining drawbacks I see:
1. always raise a warning on .value_counts
calls
I tried this in #49640, and it'd be really noisy. In particular, for df.groupby(col).apply(lambda df: df.value_counts)
the warning would be raised repeatedly, with no way (other than warnings.filterwarnings
, which I don't think we should encourage) to turn it off
2. add name
argument, raise warning if name
is None
My issue is - what would the warning be? Warnings are usually of the kind "in a future version of pandas, x
will change behaviour. To accept the new behaviour, please write ...
instead", but what could be written here?
"""
In pandas 2.0.0, the name of the resulting Series will be 'count'
(or 'proportion' if `normalize=True`), and the index will inherit
the original object's name. To accept the new behaviour and
silence this warning, please replace
.value_counts()
with
.pipe(lambda ser: ser.value_counts(name='count').rename_axis(ser.name))
"""
What I'm not keen on is that:
- the "accept the new behaviour" code would be quite complex, and it wouldn't even work in general (for example if the name of the original Series was a tuple)
- ser.value_counts(name='foo')
would silence the warning, but would then still change in 2.0.0 because the index would get renamed
3. add name
and index_name
arguments
Same drawback as above, the FutureWarning
would have to show something like
"""
In pandas 2.0.0, the name of the resulting Series will be 'count'
(or 'proportion' if `normalize=True`), and the index will inherit
the original object's name. To accept the new behaviour and
silence this warning, please replace
.value_counts()
with
.pipe(lambda ser: ser.value_counts(name='count', index_name=ser.name))
"""
Furthermore, this would add even more arguments
4. change in 2.0.0 without runtime warning
Just change. df.groupby(as_index=False).value_counts()
already has the desired behaviour, so this could be considered a bug fix and everything else isn't aligned with this.
For visibility, this could be noted is the "notable bug fixes" section.
The drawback here is that users may need to update their code - however, they'll almost certainly have to do that anyway given the number of changes in version 2.0.0, and if this was documented as one of the most visible changes, then users would probably see it
I'm not opposed to adding warnings / extra arguments, but I think it needs to have a warning of what users can do to make their code future-proof and accept the new behaviour - anyone have ideas for how to do this?
Anyway, this why I'd favour option 4
Comment From: WillAyd
Also +1 to option 4
Comment From: jreback
+1 to option 4 ; option 3 wouldn't object (but also noisy)
Comment From: attack68
+1 option 4. option 2 and 3 I'm somewhat indifferent to. - seeing merits and drawbacks.
Comment From: jorisvandenbossche
Some feedback on the specifics of the different options:
1. always raise a warning on
.value_counts
calls ... In particular, fordf.groupby(col).apply(lambda df: df.value_counts)
the warning would be raised repeatedly,
Small note, for this specific case users can (should?) do df.groupby(col).value_counts()
instead of through apply
, and then you won't have the warning repeatedly.
(it will of course still be noisy in general)
2. add
name
argument, raise warning ifname
is NoneMy issue is - what would the warning be? "To accept the new behaviour and silence this warning, please replace
.value_counts()
with.pipe(lambda ser: ser.value_counts(name='count', index_name=ser.name))
I think this could be simpler: replace it with .value_counts(name="count")
to get the future behaviour.
Assuming we would, if name
is specified, also already use the new behaviour for setting the index name (so this is different from what you were proposing). I think it is perfectly fine to do this, since the default of name=None
wouldn't do this and so that would only affect behaviour if the user actually specifies a name
explicitly.
4. change in 2.0.0 without runtime warning
Just change.
df.groupby(as_index=False).value_counts()
already has the desired behaviour, so this could be considered a bug fix and everything else isn't aligned with this. For visibility, this could be noted is the "notable bug fixes" section.
If we do this (and to be clear I am not opposed to this), this is IMO clearly not a bug fix, rather it's an intentional breaking change of long standing behaviour. But we can decide to do that if we want for a major release, no need to justify this to ourselves as a bug fix I would say ;)
I'm not opposed to adding warnings / extra arguments, but I think it needs to have a warning of what users can do to make their code future-proof and accept the new behaviour - anyone have ideas for how to do this?
See above, I am convinced we can provide an easy way to get the future behaviour with this new keyword (and in a way that could keep working the same in the future). Of course, that doesn't make it less noisy, but IMO certainly doable.
Just to throw out one more variant that I mentioned briefly on the call: kind of a 2b: still add a warning in 1.5.x, but only a DeprecationWarning (and not FutureWarning, so end users won't see it this was relied upon in actual code (while the authors of that code can still see it and fix it), but will only see it in interactive exploration that calls it directly), and change it in 2.0.0. This gives still some chance to code authors to already update their code in advance of 2.0, while keeping the period that users see those noisy warnings relatively short.
Comment From: Dr-Irv
I have another idea. What if we did something like ask users to do from pandas import __future__
when they want the future behavior, which then implements the future behavior, and also silences the warning. Kind of like from __future__ import annotations
that is used in typing.
So then you could do option 2 or 3, and there is an easy way to both get the future behavior AND silence the warnings, and you can use from pandas import __future__
to silence the warnings, and use the new name
and index_name
arguments to get the current behavior.
Comment From: jorisvandenbossche
To enable the behaviour globally, instead of an import, it could also be done as an option (that might be more consistent with other cases where we use options to enable a certain behaviour)
Comment From: MarcoGorelli
I do like the sound of that more, thanks Irv and Joris!
So a warning could be something like
"""
The value_counts API will change in pandas 2.0.0: the result name will be
'count' (or 'proportion') if `normalize=True`, and the index name will be
take from the original object's name(s).
To accept the new behaviour and silence this warning, set:
pd.set_option('mode.value_counts_v2')
"""
And this could go in to 1.5.3
Comment From: jreback
I don't think adding a way to silence this via global option (sure could be done via context manager as well) is actually helpful
The problem as a user is you see a warning and you just want to adapt your code so that it works under the current and new versions. Here you just now make 2 changes - set the option and then remove setting the option later.
I don't believe we have this for any other deprecation (or anything really) and hence would be -1 on 2b
+1 on option 4 (just change it for 2.0)
Comment From: Dr-Irv
The problem as a user is you see a warning and you just want to adapt your code so that it works under the current and new versions. Here you just now make 2 changes - set the option and then remove setting the option later.
But in version 2.0, the setting would be a NO-OP. So you could leave it in your code and nothing would happen.
Comment From: bashtage
Given the breadth of 2, I think it is also reasonable to go with option 4.
Comment From: MarcoGorelli
But in version 2.0, the setting would be a NO-OP. So you could leave it in your code and nothing would happen.
Thinking about this further, I think this is the suggestion I like the most. In pandas 2.0.0, the option becomes a no-op, people can remove it but they can also safely keep it, and there needn't be any hurry from the pandas side to deprecate it.
I'm putting together a PDEP for this (both for visibility, and because - as discussed in the governance meeting - it's the kind of change that probably warrants one)
Comment From: jbrockmendel
I don’t like the precedent that api changes of this size/scope merit a PDEP.
On Thu, Nov 24, 2022 at 12:06 PM Marco Edward Gorelli ***@***.*** wrote:
But in version 2.0, the setting would be a NO-OP. So you could leave it in your code and nothing would happen.
Thinking about this further, I think this is the suggestion I like the most. In pandas 2.0.0, the option becomes a no-op, people can remove it but they can also safely keep it, and there needn't be any hurry from the pandas side to deprecate it.
I'm putting together a PDEP for this (both for visibility, and because - as discussed in the governance meeting - it's the kind of change that probably warrants one)
— Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/49497#issuecomment-1326808506, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB5UM6FE6JD3UQM36XLGPP3WJ7DFZANCNFSM6AAAAAARWCAIS4 . You are receiving this because you are on a team that was mentioned.Message ID: @.***>
Comment From: MarcoGorelli
That's fair - let's keep the discussion here then
Looking back through this thread, the option that's received the most support (and also no strong objections) is to "just change" (option 4), so I'll raise a PR to do that - if people do have strong objections, they can block the PR and we can explore other options