Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame(columns=['a', 'b', 'c'])
print(df.groupby('a', as_index=False).sum())
Issue Description
Only the first column (the groupby key) is preserved:
Empty DataFrame
Columns: [a]
Index: []
Expected Behavior
All columns of the original DataFrame should be preserved:
Empty DataFrame
Columns: [a, b, c]
Index: []
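The expectation above can be turned into a small check that is easy to rerun across pandas versions. This is only a sketch: the helper name lost_columns is made up for illustration and is not part of pandas.

```python
import pandas as pd


def lost_columns(frame, key):
    """Return the columns of `frame` that are missing from the aggregated result.

    `lost_columns` is a hypothetical helper for this report, not a pandas API.
    """
    result = frame.groupby(key, as_index=False).sum()
    return [col for col in frame.columns if col not in result.columns]


df = pd.DataFrame(columns=["a", "b", "c"])
# An empty list means every original column survived the aggregation.
print(pd.__version__, lost_columns(df, "a"))
```

On an affected version this should list b and c as lost; an empty list would match the expected behavior.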
Installed Versions
INSTALLED VERSIONS
------------------
commit : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python : 3.7.9.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-693.5.2.el7.x86_64
Version : #1 SMP Fri Oct 20 20:32:50 UTC 2017
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.4
numpy : 1.19.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 47.3.1.post20210215
Cython : 0.29.21
pytest : 5.4.3
hypothesis : 5.30.0
sphinx : 3.0.3
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : 0.10.1
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.17.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 2.0.0
pytables : None
pyxlsb : 1.0.9
s3fs : None
scipy : 1.5.4
sqlalchemy : 1.3.20
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.51.2
I also checked that the issue is still present in pandas 1.4.1.
Comment From: ryansdowning
Can confirm this issue exists on main. The issue also arises with DataFrameGroupBy.mean(), .sum(), .median(), but interestingly does not occur with .apply(), .min(), .max(), .all(), or .any(). I am looking further into this and will report back if I find anything.
Comment From: ryansdowning
It looks like this has something to do with the numeric_only flag in pandas.core.groupby.generic.DataFrameGroupBy._agg_general. By default, the methods mentioned above that do not have this issue (min, max) use numeric_only=False. The problematic methods (sum, mean, median) use lib.no_default as the default numeric_only argument. Here are some things you can test to see what's going on:
Note: I left off as_index=False from the groupby because it does not make a difference here.
In [1]: import pandas as pd, numpy as np
In [2]: df = pd.DataFrame(columns=['a', 'b', 'c'])
In [3]: df.groupby('a').sum()
Out[3]:
Empty DataFrame
Columns: []
Index: []
In [4]: df.groupby('a').min()
Out[4]:
Empty DataFrame
Columns: [b, c]
Index: []
In [5]: df.groupby('a').min(numeric_only=True)
Out[5]:
Empty DataFrame
Columns: []
Index: []
In [6]: df.groupby('a').sum(numeric_only=False)
Out[6]:
Empty DataFrame
Columns: [b, c]
Index: []
I am not sure if the default behavior of these methods is intended to be different, so I'll leave it to the maintainers to direct closing this issue.
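Based on Out[6] above, one possible workaround until the defaults are settled is to pass numeric_only=False explicitly rather than relying on the default. A minimal sketch, assuming the explicit flag behaves on your pandas version the way it does in the session above:

```python
import pandas as pd

df = pd.DataFrame(columns=["a", "b", "c"])

# Passing the flag explicitly avoids depending on the version-specific default.
explicit = df.groupby("a").sum(numeric_only=False)
print(list(explicit.columns))
```

Whether this is the intended long-term behavior is still for the maintainers to decide.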
Comment From: dospix
take
Comment From: hhheidi
take
Comment From: hhheidi
@ryansdowning's examples are very helpful! I definitely agree that this is caused by numeric_only=True. I also found that this issue isn't unique to empty dataframes. Here:
In [0]: df = pd.DataFrame(
    {
        "a": [0, 0, 1, 1],
        "b": [1, "x", 2, "y"],
        "c": [1, 1, 2, 2],
    }
)
Out [0]:
a | b | c
0 | 1 | 1
0 | x | 1
1 | 2 | 2
1 | y | 2
numeric_only=False performs as expected, as does numeric_only=None:
In [1]: df.groupby('a').first(numeric_only=False)
Out [1]:
| b | c
a _______
0 1 | 1
1 2 | 2
In [2]: df.groupby('a').first(numeric_only=None)
Out [2]:
| b | c
a _______
0 1 | 1
1 2 | 2
while numeric_only=True drops a column:
In [3]: df.groupby('a').first(numeric_only=True)
Out [3]:
| c
a ____
0 1
1 2
It looks like any columns whose elements aren't strictly numeric are getting pruned.
Edit: actually, I'm not sure if that's the correct behavior for numeric_only=False, or if it's supposed to raise an exception.
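A quick way to see which columns are candidates for pruning is to look at the dtypes. This sketch assumes the numeric_only filtering follows column dtype rather than inspecting individual values, which is my reading and not something confirmed in this thread:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [0, 0, 1, 1],
        "b": [1, "x", 2, "y"],
        "c": [1, 1, 2, 2],
    }
)

# Column b mixes ints and strings, so pandas stores it with object dtype.
numeric_cols = list(df.select_dtypes(include="number").columns)
non_numeric_cols = [col for col in df.columns if col not in numeric_cols]
print(df.dtypes)
print(numeric_cols, non_numeric_cols)

# A DataFrame built from column names alone has object dtype everywhere,
# so it has no numeric columns at all.
empty = pd.DataFrame(columns=["a", "b", "c"])
empty_numeric_cols = list(empty.select_dtypes(include="number").columns)
print(empty_numeric_cols)
```

If that assumption holds, it would explain why the empty DataFrame from the original report loses b and c: none of its columns count as numeric.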
Comment From: simonjayhawkins
Expected Behavior
All columns of original dataframe should be preserved:
Empty DataFrame
Columns: [a, b, c]
Index: []
This was the result in pandas-1.2.5.
first bad commit: [6b94e2431536eac4d943e55064d26deaac29c92b] BUG: DataFrameGroupBy with numeric_only and empty non-numeric data (#41706)
@jbrockmendel
Comment From: jbrockmendel
Will this be fixed automatically in 2.0 when the numeric_only default/behavior changes?
Comment From: jbrockmendel
@rhshadrach is this closed by the numeric_only deprecation?
Comment From: rhshadrach
Yes - closing.