Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
pd.DataFrame(columns=['time','value']).astype({'time':'datetime64[ns]','value':'int64'}).set_index('time').groupby([]).idxmin().dtypes
# output:
# value int64
# dtype: object
# compare to the result on a non-empty DataFrame
# pd.DataFrame(data=[['2023-01-26',1]],columns=['time','value']).astype({'time':'datetime64[ns]','value':'int64'}).set_index('time').groupby([0]).idxmin().dtypes
# output:
# value datetime64[ns]
# dtype: object
# the following lines can also reproduce the problem:
pd.DataFrame(columns=['time','value']).astype({'time':'datetime64[ns]','value':'int64'}).set_index('time').groupby([]).idxmax().dtypes
# output:
# value int64
# dtype: object
pd.DataFrame(columns=['time','value']).astype({'time':'datetime64[ns]','value':'int64'}).set_index('time')['value'].groupby([]).idxmin().dtypes
# output:
# dtype('int64')
pd.DataFrame(columns=['time','value']).astype({'time':'datetime64[ns]','value':'int64'}).set_index('time')['value'].groupby([]).idxmax().dtypes
# output:
# dtype('int64')
Issue Description
On a non-empty Series/DataFrame, calling .groupby.idxmin/idxmax returns a Series/DataFrame with the index dtype (datetime64[ns]). But the same operation on an empty Series/DataFrame returns a Series/DataFrame with the dtype of the original column (int64), which is inconsistent.
Expected Behavior
GroupBy.idxmin/idxmax should produce a Series/DataFrame with the index dtype, as in the result shown below:
pd.DataFrame(columns=['time','value']).astype({'time':'datetime64[ns]','value':'int64'}).set_index('time').groupby([]).idxmin().dtypes
# output:
# value datetime64[ns]
# dtype: object
Installed Versions
Comment From: rhshadrach
Thanks for the report! Agreed on the expectation of retaining the index type, and confirmed on main. Further investigations and pull requests to fix are welcome!
Comment From: luke396
take
Comment From: luke396
I think it's mainly because the code here keeps the dtype of the returned value consistent with the user input, rather than returning the dtype of the desired Index.
When idxmin() is called on an empty DataFrame, it raises an error directly. I wonder if it is possible to raise an error directly for DataFrameGroupBy as well, instead of returning a null value. Or is there a better way to handle this? Any suggestions?
df3 = pd.DataFrame(columns=['time','value']).astype({'time':'datetime64[ns]','value':'int64'}).set_index('time')
df3.idxmin()
# ValueError: attempt to get argmin of an empty sequence
Comment From: rhshadrach
I wonder if it is possible to report an error directly for DataFrameGroupBy instead of returning a null value.
I think the mental model for groupby should be:
for group in groups:
    results.append(f(group))
combine(results)
With this, the expectation is that when there are no groups, idxmin is never actually computed, and so this shouldn't raise.
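To make that mental model concrete, here is a minimal pure-Python sketch. It is only an illustration, not pandas internals: `groupby_apply`, `idx_of_min`, and `traced` are hypothetical names, and each group is modelled as a list of `(label, value)` pairs.

```python
def groupby_apply(groups, f, combine):
    """Apply f to every group, then combine the per-group results."""
    results = []
    for group in groups:
        results.append(f(group))
    return combine(results)


def idx_of_min(group):
    """Return the label of the smallest value; raises ValueError on an empty group."""
    return min(group, key=lambda pair: pair[1])[0]


calls = []


def traced(group):
    """Record each call so we can see whether f ever runs."""
    calls.append(group)
    return idx_of_min(group)


# With no groups the loop body never executes, so nothing can raise.
empty_result = groupby_apply([], traced, list)

# With groups present, f runs once per group.
nonempty_result = groupby_apply(
    [[("a", 3), ("b", 1)], [("c", 2)]], idx_of_min, list
)
```

The catch the issue points at is in the `combine` step: with zero results there are no labels to infer a dtype from, so the result dtype has to come from somewhere else, namely the original index.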
Or is there any better way to handle this and any suggestions?
The two easiest solutions I see are to (a) not call _wrap_applied_output or (b) add how as an argument to this method and add in idxmin/max specific logic, but both of these seem to be not good. A much more intensive, but better solution, would be to develop a specific, performant, algorithm for groupby idxmin/max and sidestep calling _wrap_applied_output altogether.
Comment From: luke396
A much more intensive, but better solution, would be to develop a specific, performant, algorithm for groupby idxmin/max and sidestep calling _wrap_applied_output altogether.
Totally agree, I want to add
if f.__name__ in ['idxmin', 'idxmax']:
    return _func_instead_warp_appied_ouput(......)
here.
Does this change break any design rules or have any potential problems?
Comment From: rhshadrach
Instead, I'd look into adding it straight to the methods idxmin and idxmax.
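For anyone hitting this before a fix lands, a user-side workaround might look like the sketch below. This is not the pandas fix and not how the methods are implemented internally. `groupby_idxmin_keep_index_dtype` is a hypothetical helper, and it assumes that casting an empty result to the dtype of the original index is acceptable.

```python
import pandas as pd


def groupby_idxmin_keep_index_dtype(obj, by):
    """Hypothetical helper: groupby(by).idxmin() with the index dtype kept on empty input."""
    result = obj.groupby(by).idxmin()
    if result.empty:
        # No group was reduced, so there are no labels to infer a dtype from;
        # cast the empty result to the dtype of the original index instead.
        result = result.astype(obj.index.dtype)
    return result


# Same empty frame as in the reproducible example above.
empty = (
    pd.DataFrame(columns=["time", "value"])
    .astype({"time": "datetime64[ns]", "value": "int64"})
    .set_index("time")
)

fixed = groupby_idxmin_keep_index_dtype(empty, [])
print(fixed.dtypes)
```

The same pattern would apply to idxmax by swapping the method call. Non-empty results pass through untouched.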