In a nutshell:
groupby
on a single categorical column with prescribed categories incorrectly returns results for all categories, even those that are not actually present in the DataFrame.- In addition, when aggregating after grouping on categoricals (
groupby.sum
and the likes), with both prescribed and non-prescribed categories, we get values for all possible combinations of categories, including those not present in the DataFrame.
Problem description and code samples
Case 1: group by a single column
Consider the following code where we define a DataFrame with a categorical column with prescribed categories.
import pandas as pd
labels = pd.Series(list('abcbabcab'))
df = pd.DataFrame({'label1': labels,
'x': [0, 1, 1, 1, 1, 0, 1, 1, 1]})
df.label1 = labels.astype('category', categories=list('abcdef'))
When we group by the categorical label1
column and aggregate, we incorrectly get results for all prescribed categories, including those that are not present in the DataFrame (see rows below where the value of x is NaN
):
In[6]: df.groupby('label1').sum()
Out[6]:
x
label1
a 2.0
b 3.0
c 2.0
d NaN
e NaN
f NaN
The elements in excess are already present in the groupby
object:
In[7]: df.groupby('label1').groups
Out[7]:
{'a': Int64Index([0, 4, 7], dtype='int64'),
'b': Int64Index([1, 3, 5, 8], dtype='int64'),
'c': Int64Index([2, 6], dtype='int64'),
'd': Int64Index([], dtype='int64'),
'e': Int64Index([], dtype='int64'),
'f': Int64Index([], dtype='int64')}
The above doesn't happen if the categories for the label1
column are not prescribed or if the column is not converted to a categorical at all.
Case 2: group by multiple columns
Consider now the case where we have two categorical columns:
df = pd.DataFrame({'label1': labels,
'label2': labels,
'x': [0, 1, 1, 1, 1, 0, 1, 1, 1]})
df.label1 = labels.astype('category', categories=list('abcdef'))
df.label2 = labels.astype('category', categories=list('abcdef'))
Contrary to the single column case above, we do get the correct group labels when we group by both categorical columns label1
and label2
:
In[11]: df.groupby(['label1', 'label2']).groups
Out[11]:
{('a', 'a'): Int64Index([0, 4, 7], dtype='int64'),
('b', 'b'): Int64Index([1, 3, 5, 8], dtype='int64'),
('c', 'c'): Int64Index([2, 6], dtype='int64')}
But we still get incorrect results if we aggregate:
In[12]: df.groupby(['label1', 'label2']).sum()
Out[12]:
x
label1 label2
a a 2.0
b NaN
c NaN
d NaN
e NaN
f NaN
b a NaN
b 3.0
c NaN
d NaN
e NaN
f NaN
c a NaN
b NaN
c 2.0
d NaN
e NaN
f NaN
d a NaN
b NaN
c NaN
d NaN
e NaN
f NaN
e a NaN
b NaN
c NaN
d NaN
e NaN
f NaN
f a NaN
b NaN
c NaN
d NaN
e NaN
f NaN
Note: The aggregation shows the same inccorect behaviour also when we don't prescribe the categories:
In[13]: df.label1 = labels.astype('category')
...: df.label2 = labels.astype('category')
In[14]: df.groupby(['label1', 'label2']).sum()
Out[14]:
x
label1 label2
a a 2.0
b NaN
c NaN
b a NaN
b 3.0
c NaN
c a NaN
b NaN
c 2.0
Expected Output
For Case 1:
In[6]: df.groupby('label1').sum()
Out[6]:
x
label1
a 2
b 3
c 2
In[7]: df.groupby('label1').groups
Out[7]:
{'a': Int64Index([0, 4, 7], dtype='int64'),
'b': Int64Index([1, 3, 5, 8], dtype='int64'),
'c': Int64Index([2, 6], dtype='int64')}
For Case 2:
In[19]: df.groupby(['label1', 'label2']).sum()
Out[19]:
x
label1 label2
a a 2
b b 3
c c 2
Output of pd.show_versions()
Comment From: jreback
duplicate of https://github.com/pandas-dev/pandas/issues/17594, see expl there.