Pandas Groupby on date component results in corrupt DataFrame

Code Sample, a copy-pastable example if possible

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df1 = pd.DataFrame({c : np.random.random(4) for c in ['a', 'b']})

In [4]: date = pd.DataFrame({'day' : 15, 'month' : df1.index + 1, 'year' : 2009})

In [5]: df1['date'] = pd.to_datetime(date)

In [6]: df1['month'] = df1['date'].dt.month

In [7]: df2 = df1.groupby('month').max()

In [8]: df2.transpose().mean()
Out[8]: 
Series([], dtype: float64)

Problem description

df2 has strictly positive dimensions, so the mean of its transpose should not be an empty Series.

Expected Output

The same as

In [9]: df2.mean(axis=1)
Out[9]: 
month
1     0.649630
2     0.392446
3     0.693638
12    0.381830
dtype: float64

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.7.0-1-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.utf8
LOCALE: it_IT.UTF-8
pandas: 0.19.0+603.g2cad4dd0b
pytest: 3.0.6
pip: 9.0.1
setuptools: 33.1.1
Cython: 0.25.2
numpy: 1.12.0
scipy: 0.18.1
xarray: 0.9.1
IPython: 5.1.0.dev
sphinx: 1.4.9
patsy: 0.3.0-dev
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: 3.7.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: 0.2.1

Comment From: jreback

This is as expected: after the transpose you have an all-object array.

In [5]: df2.transpose()
Out[5]: 
month                    1                    2                    3                    4
a                 0.392964            0.0827719             0.129667             0.333425
b                 0.661218             0.342849             0.503337             0.409731
date   2009-01-15 00:00:00  2009-02-15 00:00:00  2009-03-15 00:00:00  2009-04-15 00:00:00

In [7]: df2.transpose().mean(numeric_only=False)
TypeError: unsupported operand type(s) for +: 'Timestamp' and 'float'
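
To make the dtype point concrete, here is a minimal sketch on a small hypothetical frame (the names `df`, `transposed`, `numeric_means` and the values are illustrative, not taken from the report). It only shows that transposing a frame with float and datetime columns gives object columns, and that restricting to numeric columns before transposing avoids the problem. It deliberately does not call `mean()` on the object frame, since, as the two outputs in this thread show, the result depends on `numeric_only`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df2: two float columns plus a datetime column.
df = pd.DataFrame({
    "a": [0.1, 0.2],
    "b": [0.3, 0.4],
    "date": pd.to_datetime(["2009-01-15", "2009-02-15"]),
})

# Before the transpose each column keeps its own dtype.
print(df.dtypes)

# After the transpose every column holds floats and Timestamps together,
# so pandas falls back to the object dtype for all of them.
transposed = df.T
print(transposed.dtypes)

# Dropping the non-numeric columns first keeps the transposed frame numeric,
# so the column-wise mean is well defined.
numeric_means = df.select_dtypes(include="number").T.mean()
print(numeric_means)
```

This is equivalent in spirit to calling `df.mean(axis=1)` on the numeric columns only.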

Comment From: jreback

By default, this skips the non-numeric values:

In [9]: df2.mean(axis=1)
Out[9]: 
month
1    0.527091
2    0.212811
3    0.316502
4    0.371578
dtype: float64

In [12]: df2.select_dtypes(include=['float']).T.mean()
Out[12]: 
month
1    0.527091
2    0.212811
3    0.316502
4    0.371578
dtype: float64

Comment From: toobaz

Aha! Right, sorry

Comment From: toobaz

While I'm at it... is the following also desired?

In [3]: df = pd.DataFrame([[1,2,3],[4,5,6]])

In [4]: list(df.groupby(1).mean())
Out[4]: [0, 2]

In [5]: list(df.groupby(1).apply(np.mean))
Out[5]: [0, 1, 2]

That is, native groupby operations differ from user-provided functions in that they omit the groupby column.

Comment From: jorisvandenbossche

@toobaz I think that is how apply works (also applying on the key column). You can use .agg if you want the "aggregation behaviour" of not applying on the key column:

In [18]: df.groupby(1).agg(lambda x: np.mean(x))
Out[18]: 
   0  2
1      
2  1  3
5  4  6

(only using the lambda to show that this works on a user defined function, agg(np.mean) would take a fast path the same as groupby().mean())

But, we should have a better specification of those aspects of groupby.
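
As a rough sketch of the distinction above (variable names are illustrative, and how a bare `apply` treats the key column may differ between pandas versions), the following compares the built-in `mean`, `agg` with a user-defined function, and `apply` with the value columns selected explicitly so that the key column is unambiguously left out:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

# Built-in aggregation: the key column (1) becomes the index and is not aggregated.
by_mean = df.groupby(1).mean()

# agg with a user-defined function follows the same "aggregation behaviour".
by_agg = df.groupby(1).agg(lambda x: np.mean(x))

# With apply, selecting the value columns explicitly removes any ambiguity
# about whether the key column takes part in the computation.
by_apply = df.groupby(1)[[0, 2]].apply(lambda g: g.mean())

print(by_mean)
print(by_agg)
print(by_apply)
```

Selecting columns before `apply` is just one way to make the intent explicit; it is not necessarily what pandas does internally.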

Comment From: toobaz

@jorisvandenbossche : right, makes a lot of sense! And I now see the following note I had missed: "apply can act as a reducer, transformer, or filter function, depending on exactly what is passed to it. So depending on the path taken, and exactly what you are grouping. Thus the grouped columns(s) may be included in the output as well as set the indices."
