• [X] I have searched the [pandas] tag on StackOverflow for similar questions.

  • [x] I have asked my usage related question on StackOverflow.


Question about pandas

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
df.A = df.agg('City_{0[A]}'.format, axis=1)
df.index = (df.index)*int(np.random.randint(6,23)**0.5)+int(np.random.randint(2,23))

This is my df:

 A B C D
City_2 3 7 3
City_0 4 8 9
City_1 1 2 1
City_5 5 0 9
City_5 1 6 0
City_0 3 8 6
City_7 2 6 6
City_1 6 0 2
City_8 6 2 4
City_2 2 5 6

If I do

df.groupby([df.index//5,df.A], as_index=True).mean() or df.groupby([df.index//5,df.A]).mean()

Result:

B C D
A
1 City_2 3 7 3
2 City_0 4 8 9
3 City_1 1 2 1
City_5 5 0 9
4 City_0 3 8 6
City_5 1 6 0
5 City_7 2 6 6
6 City_1 6 0 2
City_8 6 2 4
7 City_2 2 5 6

So If do .reset_index()

I will get this (that index with no column name will get name as level_0 ):

  level_0 A B C D
1 City_2 3 7 3
2 City_0 4 8 9
3 City_1 1 2 1
3 City_5 5 0 9
4 City_0 3 8 6
4 City_5 1 6 0
5 City_7 2 6 6
6 City_1 6 0 2
6 City_8 6 2 4
7 City_2 2 5 6

The same is not with the as_index=False df.groupby([df.index//5,df.A], as_index=False).mean() will give me

 A B C D
City_2 3 7 3
City_0 4 8 9
City_1 1 2 1
City_5 5 0 9
City_0 3 8 6
City_5 1 6 0
City_7 2 6 6
City_1 6 0 2
City_8 6 2 4
City_2 2 5 6

I will lose the groupby index info.

My question: shouldn't be as_index=False give df.index//5 column also. Or as_index is designed to work like this only.

Comment From: ParfaitG

Possibly, the issue is that df.index//5 is not actually a column in the data frame but an external object. Whereas reset_index runs after groupby, as_index interacts with existing columns in data frame that potentially become indexes. To fix, consider assigning calculated object as a new column before groupby.

(Personally, I always call groupby on named columns. Additionally, I also avoid open arithmetic operators but use the functional form (Series.add, Series.mul, Series.floordiv, Series.pow, etc.) and avoid period referencing of columns: df['A'].)

```python agg = ( df.assign(multi5 = np.floor_divide(df.index, 5)) # functional form of df.index//5 .groupby(['multi5', 'A'], as_index=False) .mean() )

agg # below output to differ w/o np.random.seed(###) ````

multi5 A B C D
0 2 City_5 7 2 9
1 3 City_2 0 5 1
2 3 City_3 2 9 3
3 4 City_2 2 9 5
4 4 City_9 1 8 9
5 5 City_9 7 3 1
6 6 City_3 4 0 8
7 6 City_7 9 8 8
8 7 City_6 6 5 8
9 7 City_7 3 3 9

Comment From: rhshadrach

Thanks @ninjakx for the report! From the groupby userguide docs:

Passing as_index=False will return the groups that you are aggregating over, if they are named columns.

Perhaps this could be added to the API docs as well.

Comment From: jorisvandenbossche

The full note in the documentation says:

Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.

Passing as_index=False will return the groups that you are aggregating over, if they are named columns.

I find this not very clear. Also the part about as_index=True uses the phrase "if they are named columns", but in that case the group keys are always set as the index, no? (regardless whether they are present as column or not, or whether they are named or not) So from that you could assume that the same condition holds for as_index=False. In addition it is not also not super explicit what a "named column" is (one could think of a Series with a name also as a "named column", it's not clear whether this is about being "present" in the calling dataframe)

Comment From: jorisvandenbossche

And regardless of the (unclear) documentation, I also don't understand the rationale of this behaviour if it is intentional. Why wouldn't we always make the group values a column with as_index=False?

Comment From: rhshadrach

Closing in favor of #49519