Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame(columns=['a', 'b', 'c'])
print(df.groupby('a', as_index=False).sum())

Issue Description

Only the first column (the groupby key) is preserved:

Empty DataFrame
Columns: [a]
Index: []

Expected Behavior

All columns of the original DataFrame should be preserved:

Empty DataFrame
Columns: [a, b, c]
Index: []

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python           : 3.7.9.final.0
python-bits      : 64
OS               : Linux
OS-release       : 3.10.0-693.5.2.el7.x86_64
Version          : #1 SMP Fri Oct 20 20:32:50 UTC 2017
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.4
numpy            : 1.19.4
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.2.4
setuptools       : 47.3.1.post20210215
Cython           : 0.29.21
pytest           : 5.4.3
hypothesis       : 5.30.0
sphinx           : 3.0.3
blosc            : None
feather          : None
xlsxwriter       : 1.2.9
lxml.etree       : 4.5.2
html5lib         : 1.1
pymysql          : 0.10.1
psycopg2         : 2.8.6 (dt dec pq3 ext lo64)
jinja2           : 2.11.2
IPython          : 7.17.0
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : 1.3.2
fsspec           : 0.8.3
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.2
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.5
pandas_gbq       : None
pyarrow          : 2.0.0
pytables         : None
pyxlsb           : 1.0.9
s3fs             : None
scipy            : 1.5.4
sqlalchemy       : 1.3.20
tables           : 3.6.1
tabulate         : 0.8.7
xarray           : 0.16.1
xlrd             : 1.2.0
xlwt             : 1.3.0
numba            : 0.51.2

I also checked that the issue is still present in pandas 1.4.1.

Comment From: ryansdowning

Can confirm this issue exists on main. The issue also arises with DataFrameGroupBy.mean(), .sum(), .median(), but interestingly does not occur with .apply(), .min(), .max(), .all(), or .any(). I am looking further into this and will report back if I find anything.

Comment From: ryansdowning

It looks like this has something to do with the numeric_only flag in pandas.core.groupby.generic.DataFrameGroupBy._agg_general. The methods that do not have this issue (min, max) default to numeric_only=False, while the problematic methods (sum, mean, median) use lib.no_default as the default numeric_only argument. Here are some things you can test to see what's going on:

Note: I left off as_index=False from the groupby because it does not make a difference here.

In [1]: import pandas as pd, numpy as np

In [2]: df = pd.DataFrame(columns=['a', 'b', 'c'])

In [3]: df.groupby('a').sum()
Out[3]: 
Empty DataFrame
Columns: []
Index: []

In [4]: df.groupby('a').min()
Out[4]: 
Empty DataFrame
Columns: [b, c]
Index: []

In [5]: df.groupby('a').min(numeric_only=True)
Out[5]: 
Empty DataFrame
Columns: []
Index: []

In [6]: df.groupby('a').sum(numeric_only=False)
Out[6]: 
Empty DataFrame
Columns: [b, c]
Index: []

I am not sure if the default behavior of these methods is intended to be different, so I'll leave it to the maintainers to direct closing this issue.
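
For anyone who wants to rerun the comparison above on their own pandas version, here is a minimal sketch that passes numeric_only explicitly, so the result does not depend on each method's default. The helper name result_columns is hypothetical and not part of pandas. No expected output is listed because it may differ between releases.

```python
import pandas as pd


def result_columns(method, numeric_only):
    """Return the column labels of one groupby aggregation on an empty frame.

    Hypothetical helper for illustration only; not part of pandas.
    """
    # With no data, every column of this frame has object dtype.
    df = pd.DataFrame(columns=["a", "b", "c"])
    grouped = df.groupby("a")
    # Look up the aggregation by name and pass numeric_only explicitly.
    return list(getattr(grouped, method)(numeric_only=numeric_only).columns)


# sum and min are the two methods compared in the session above.
for method in ("sum", "min"):
    for flag in (True, False):
        print(method, "numeric_only =", flag, "->", result_columns(method, flag))
```

Comparing the printed lists for the two flag values shows whether the non-key columns b and c are kept or dropped.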

Comment From: dospix

take

Comment From: hhheidi

take

Comment From: hhheidi

@ryansdowning's examples are very helpful! I definitely agree that this is caused by numeric_only=True. I also found that this issue isn't unique to empty dataframes. Here:

In [0]: df = pd.DataFrame(
            {
                "a": [0, 0, 1, 1],
                "b": [1, "x", 2, "y"],
                "c": [1, 1, 2, 2],
            }
        )
Out [0]:
| a | b | c
  0 | 1 | 1
  0 | x | 1
  1 | 2 | 2
  1 | y | 2

numeric_only=False performs as expected, as does numeric_only=None:

In [1]: df.groupby('a').first(numeric_only=False)
Out [1]:
  | b | c
a _______
0   1 | 1
1   2 | 2

In [2]: df.groupby('a').first(numeric_only=None)
Out [2]:
  | b | c
a _______
0   1 | 1
1   2 | 2

while numeric_only=True drops a column:

In [3]: df.groupby('a').first(numeric_only=True)
Out [3]:
  | c
a ____
0   1
1   2

It looks like any columns whose elements aren't strictly numeric are getting pruned.

Edit: actually, I'm not sure if that's the correct behavior for numeric_only=False, or if it's supposed to raise an exception.
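
A small sketch of the pruning described above, assuming the filter works on column dtype rather than on individual elements. Column d is a hypothetical addition that is not in the example above: it holds only integers but is stored with object dtype, which makes it possible to check whether "not strictly numeric" means the dtype or the values.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [0, 0, 1, 1],
        "b": [1, "x", 2, "y"],  # mixed ints and strings, so object dtype
        "c": [1, 1, 2, 2],  # integer dtype
        # Hypothetical extra column: integers stored as object dtype.
        "d": pd.Series([1, 2, 3, 4], dtype=object),
    }
)

# Columns that survive each setting of numeric_only.
kept = list(df.groupby("a").first(numeric_only=True).columns)
all_cols = list(df.groupby("a").first(numeric_only=False).columns)

print(df.dtypes)
print("numeric_only=True :", kept)
print("numeric_only=False:", all_cols)
```

If d is dropped along with b under numeric_only=True, the pruning is based on the column's dtype, not on whether each element happens to be a number.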

Comment From: simonjayhawkins

Expected Behavior

All columns of original dataframe should be preserved:

Empty DataFrame
Columns: [a, b, c]
Index: []

This was the result in pandas 1.2.5.

first bad commit: [6b94e2431536eac4d943e55064d26deaac29c92b] BUG: DataFrameGroupBy with numeric_only and empty non-numeric data (#41706)

@jbrockmendel

Comment From: jbrockmendel

Will this be fixed automatically in 2.0 when the numeric_only default/behavior changes?

Comment From: jbrockmendel

@rhshadrach is this closed by the numeric_only deprecation?

Comment From: rhshadrach

Yes - closing.