-
[X] I have searched the [pandas] tag on StackOverflow for similar questions.
-
[x] I have asked my usage related question on StackOverflow.
Question about pandas
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
df.A = df.agg('City_{0[A]}'.format, axis=1)
df.index = (df.index)*int(np.random.randint(6,23)**0.5)+int(np.random.randint(2,23))
This is my df:
A | B | C | D |
---|---|---|---|
City_2 | 3 | 7 | 3 |
City_0 | 4 | 8 | 9 |
City_1 | 1 | 2 | 1 |
City_5 | 5 | 0 | 9 |
City_5 | 1 | 6 | 0 |
City_0 | 3 | 8 | 6 |
City_7 | 2 | 6 | 6 |
City_1 | 6 | 0 | 2 |
City_8 | 6 | 2 | 4 |
City_2 | 2 | 5 | 6 |
If I do
df.groupby([df.index//5,df.A], as_index=True).mean()
or
df.groupby([df.index//5,df.A]).mean()
Result:
B | C | D | ||
---|---|---|---|---|
A | ||||
1 | City_2 | 3 | 7 | 3 |
2 | City_0 | 4 | 8 | 9 |
3 | City_1 | 1 | 2 | 1 |
City_5 | 5 | 0 | 9 | |
4 | City_0 | 3 | 8 | 6 |
City_5 | 1 | 6 | 0 | |
5 | City_7 | 2 | 6 | 6 |
6 | City_1 | 6 | 0 | 2 |
City_8 | 6 | 2 | 4 | |
7 | City_2 | 2 | 5 | 6 |
So If do .reset_index()
I will get this (that index with no column name will get name as level_0
):
level_0 | A | B | C | D |
---|---|---|---|---|
1 | City_2 | 3 | 7 | 3 |
2 | City_0 | 4 | 8 | 9 |
3 | City_1 | 1 | 2 | 1 |
3 | City_5 | 5 | 0 | 9 |
4 | City_0 | 3 | 8 | 6 |
4 | City_5 | 1 | 6 | 0 |
5 | City_7 | 2 | 6 | 6 |
6 | City_1 | 6 | 0 | 2 |
6 | City_8 | 6 | 2 | 4 |
7 | City_2 | 2 | 5 | 6 |
The same is not with the as_index=False
df.groupby([df.index//5,df.A], as_index=False).mean()
will give me
A | B | C | D |
---|---|---|---|
City_2 | 3 | 7 | 3 |
City_0 | 4 | 8 | 9 |
City_1 | 1 | 2 | 1 |
City_5 | 5 | 0 | 9 |
City_0 | 3 | 8 | 6 |
City_5 | 1 | 6 | 0 |
City_7 | 2 | 6 | 6 |
City_1 | 6 | 0 | 2 |
City_8 | 6 | 2 | 4 |
City_2 | 2 | 5 | 6 |
I will lose the groupby index info
.
My question:
shouldn't be as_index=False
give df.index//5
column also. Or as_index
is designed to work like this only.
Comment From: ParfaitG
Possibly, the issue is that df.index//5
is not actually a column in the data frame but an external object. Whereas reset_index
runs after groupby
, as_index
interacts with existing columns in data frame that potentially become indexes. To fix, consider assigning calculated object as a new column before groupby
.
(Personally, I always call groupby
on named columns. Additionally, I also avoid open arithmetic operators but use the functional form (Series.add
, Series.mul
, Series.floordiv
, Series.pow
, etc.) and avoid period referencing of columns: df['A']
.)
```python agg = ( df.assign(multi5 = np.floor_divide(df.index, 5)) # functional form of df.index//5 .groupby(['multi5', 'A'], as_index=False) .mean() )
agg # below output to differ w/o np.random.seed(###) ````
multi5 | A | B | C | D | |
---|---|---|---|---|---|
0 | 2 | City_5 | 7 | 2 | 9 |
1 | 3 | City_2 | 0 | 5 | 1 |
2 | 3 | City_3 | 2 | 9 | 3 |
3 | 4 | City_2 | 2 | 9 | 5 |
4 | 4 | City_9 | 1 | 8 | 9 |
5 | 5 | City_9 | 7 | 3 | 1 |
6 | 6 | City_3 | 4 | 0 | 8 |
7 | 6 | City_7 | 9 | 8 | 8 |
8 | 7 | City_6 | 6 | 5 | 8 |
9 | 7 | City_7 | 3 | 3 | 9 |
Comment From: rhshadrach
Thanks @ninjakx for the report! From the groupby userguide docs:
Passing as_index=False will return the groups that you are aggregating over, if they are named columns.
Perhaps this could be added to the API docs as well.
Comment From: jorisvandenbossche
The full note in the documentation says:
Aggregation functions will not return the groups that you are aggregating over if they are named columns, when
as_index=True
, the default. The grouped columns will be the indices of the returned object.Passing
as_index=False
will return the groups that you are aggregating over, if they are named columns.
I find this not very clear. Also the part about as_index=True
uses the phrase "if they are named columns", but in that case the group keys are always set as the index, no? (regardless whether they are present as column or not, or whether they are named or not)
So from that you could assume that the same condition holds for as_index=False
. In addition it is not also not super explicit what a "named column" is (one could think of a Series with a name also as a "named column", it's not clear whether this is about being "present" in the calling dataframe)
Comment From: jorisvandenbossche
And regardless of the (unclear) documentation, I also don't understand the rationale of this behaviour if it is intentional. Why wouldn't we always make the group values a column with as_index=False
?
Comment From: rhshadrach
Closing in favor of #49519