Sample data
>>> import pandas as pd
>>> pd.__version__ # on current master
'1.1.0.dev0+613.g97c0ce962'
>>> df = pd.DataFrame({'key': list('aaabbb'), 'value': [1, 2, 3, 3, 2, 1]})
>>> df
  key  value
0   a      1
1   a      2
2   a      3
3   b      3
4   b      2
5   b      1
Issue
>>> df.groupby('key')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001C0D32FC400>
# DataFrameGroupBy.corrwith has no issues
>>> df.groupby('key').corrwith(pd.Series([1,2,3,1,2,3]))
     value
key
a      1.0
b     -1.0
>>> df.groupby('key')['value']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001C0D279BF28>
# SeriesGroupBy.corrwith is not implemented
>>> df.groupby('key')['value'].corrwith(pd.Series([1, 2, 3, 1, 2, 3]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\users\xxx\xxx\pandas\pandas\core\groupby\groupby.py", line 580, in __getattr__
    f"'{type(self).__name__}' object has no attribute '{attr}'"
AttributeError: 'SeriesGroupBy' object has no attribute 'corrwith'
Problem description
As shown above, DataFrameGroupBy.corrwith works as expected. However, the corresponding SeriesGroupBy.corrwith is not implemented and raises an AttributeError.
Expected Output
>>> df.groupby('key')['value'].corrwith(pd.Series([1, 2, 3, 1, 2, 3]))
     value
key
a      1.0
b     -1.0
Comment From: fujiaxiang
Turns out Series.corr behaves similarly to DataFrame.corrwith, and because SeriesGroupBy.corr and DataFrameGroupBy.corrwith reuse their base counterparts, they behave similarly too.
So this behavior can be achieved by:
>>> import pandas as pd
>>> pd.__version__
'1.1.0.dev0+1712.g1cad9e52e'
>>> df = pd.DataFrame({'key': list('aaabbb'), 'value': [1, 2, 3, 3, 2, 1]})
>>> df.groupby('key')['value'].corr(pd.Series([1, 2, 3, 1, 2, 3]))
key
a    1.0
b   -1.0
Name: value, dtype: float64
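For anyone wondering why this works: as far as I can tell, the groupby version ends up calling Series.corr on each group, and Series.corr aligns the two Series on their index before computing anything. A minimal sketch that spells this out by hand (the loop and the names `other`, `manual` and `result` are only for illustration, not how pandas implements it):

```python
import pandas as pd

df = pd.DataFrame({"key": list("aaabbb"), "value": [1, 2, 3, 3, 2, 1]})
other = pd.Series([1, 2, 3, 1, 2, 3])

# Iterating over a SeriesGroupBy yields (group key, sub-Series) pairs.
# Each sub-Series keeps its original index labels (0-2 for 'a', 3-5 for 'b'),
# so Series.corr only pairs it with the matching rows of `other`.
manual = {key: group.corr(other) for key, group in df.groupby("key")["value"]}
result = pd.Series(manual)

# The one-line workaround from the comment above, for comparison.
shortcut = df.groupby("key")["value"].corr(other)
```

Both `result` and `shortcut` should hold one correlation per key, which also means `other` has to share index labels with `df` for the workaround to give sensible numbers.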
A few thoughts:
1. Why does Series.corr behave like DataFrame.corrwith, while DataFrame.corr behaves differently? The first two compute the correlation with another object (the parameter other), whereas the latter computes the correlation matrix of the frame's own columns.
2. I tried implementing Series.corrwith, but it would behave exactly the same as Series.corr if other is also a Series. This causes confusion, so I don't feel it's a good solution.
3. DataFrame.corrwith can accept both a DataFrame and a Series as other, but Series.corr can only accept a Series and returns a single number.
4. I feel the "best" solution is to rename Series.corr to Series.corrwith, enhance it to accept a DataFrame, and deprecate Series.corr or make it an alias for Series.corrwith, which helps maintain backward compatibility. Finally, we also want to add corrwith to common_apply_whitelist in pandas/core/groupby/base.py so that SeriesGroupBy.corrwith is automatically usable.
@simonjayhawkins what do you think?
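To make the last point a bit more concrete, here is a rough sketch of what an enhanced method could do, written as a free function. `series_corrwith` is a hypothetical name and not an existing pandas API. It only reuses Series.corr and DataFrame.corrwith, on the assumption that correlation is symmetric so the roles can be swapped:

```python
import pandas as pd


def series_corrwith(s, other, method="pearson"):
    """Hypothetical helper (not part of pandas): correlate a Series with
    either another Series or every column of a DataFrame."""
    if isinstance(other, pd.DataFrame):
        # Correlation is symmetric, so the existing DataFrame.corrwith can be
        # reused with the roles swapped: one coefficient per column of `other`.
        return other.corrwith(s, method=method)
    # Series input: same as Series.corr, which returns a single number.
    return s.corr(other, method=method)


s = pd.Series([1, 2, 3, 4])
frame = pd.DataFrame({"up": [2, 4, 6, 8], "down": [4, 3, 2, 1]})

by_column = series_corrwith(s, frame)                  # Series indexed by column name
single = series_corrwith(s, pd.Series([2, 4, 6, 8]))   # a single float
```

This is only meant to show the two return shapes (a Series for DataFrame input, a scalar for Series input), which is the part that would need agreement before any renaming or deprecation.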
Comment From: simonjayhawkins
> 4. I feel the "best" solution is to rename Series.corr to Series.corrwith, enhance it to accept a DataFrame, and deprecate Series.corr or make it an alias for Series.corrwith, which helps maintain backward compatibility. Finally, we also want to add corrwith to common_apply_whitelist in pandas/core/groupby/base.py so that SeriesGroupBy.corrwith is automatically usable.
>
> @simonjayhawkins what do you think?
see also #11260 cc @pandas-dev/pandas-core
Comment From: rhshadrach
DataFrameGroupBy.agg(["corrwith"]) also fails, because it attempts to break the operation up into one SeriesGroupBy per column and call corrwith on each of them.
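Until that is sorted out, a column-by-column result can be assembled by hand with SeriesGroupBy.corr, which does exist. This is only a sketch of a possible workaround, not what agg does internally. The second column `value2` and the names `other`, `grouped` and `per_column` are made up for the example:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": list("aaabbb"),
        "value": [1, 2, 3, 3, 2, 1],
        "value2": [3, 2, 1, 1, 2, 3],  # extra column, for illustration only
    }
)
other = pd.Series([1, 2, 3, 1, 2, 3])

grouped = df.groupby("key")

# Select each column as a SeriesGroupBy, correlate it with `other`, and
# concatenate the per-column results side by side (dict keys become columns).
per_column = pd.concat(
    {col: grouped[col].corr(other) for col in ["value", "value2"]},
    axis=1,
)
```

`per_column` is then a DataFrame indexed by key with one column per value column, which is the same shape the DataFrameGroupBy.corrwith example at the top produces.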