Code Sample, a copy-pastable example if possible
import pandas as pd
df = pd.DataFrame({'A': ['a','b',None,'a','b',None], 'B': range(6)})
df.groupby('A').sum() # output has two rows
df.groupby('A').first() # output has two rows
df.groupby('A').head(1) # output has three rows, one for the null group
df.groupby('A').nth(1) # output has three rows, but one of them has the wrong value in the 'A' column.
Problem description
My understanding is that when grouping on a column, pandas excludes null values (link). However, when calling the head method on a GroupBy object, the rows of the null group are returned. While I would generally support keeping null values across the board, this inconsistency contradicts the documentation. Looking to see whether similar methods had the same behavior, I discovered that nth has an even worse bug.
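For reference, the documented exclusion of null keys can be seen by comparing the default grouping with one that explicitly keeps null keys. This is only a minimal sketch: the dropna keyword is not part of the report above, and it assumes pandas 1.1 or later.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', None, 'a', 'b', None], 'B': range(6)})

# Default behaviour: rows whose key is null are left out of the aggregation.
default_sums = df.groupby('A').sum()

# Assumption: pandas >= 1.1, where dropna=False keeps the null key as its own group.
with_null_sums = df.groupby('A', dropna=False).sum()

print(default_sums)
print(with_null_sums)
```

The first frame has one row per non-null key; the second has an extra row for the null key.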
Output of pd.show_versions()
Comment From: gfyoung
Yikes! That looks pretty buggy to me. Investigation and patch are more than welcome!
Comment From: paritoshmittal09
Hi,
I am just starting my open source journey, so apologies in advance if I do something wrong.
I wanted some clarification on what the output should be for:
df = pd.DataFrame({'A': ['a','b',None,'a','b',None], 'B':[None,None,None,None,None,None] })
df.groupby('A').head(1)
I am assuming it should return an empty DataFrame. Please correct me if I am wrong.
I would like to take this bug up and fix it.
Comment From: louispotok
@paritoshmittal09
In your example, df.groupby('A').head(1) should return the same dataframe as pd.DataFrame({'A': ['a','b'], 'B': [None, None]}). The issue being discussed here is that the column we're grouping by (A) should not keep null values. Null values in other columns (like B) should be kept.
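One way to build that expected frame without relying on how head treats null keys is to drop the null keys before grouping. This is a sketch of the expected result, not the fix itself; the variable name is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', None, 'a', 'b', None], 'B': [None] * 6})

# Drop only the rows whose grouping key is null; nulls in 'B' are left alone.
expected = df.dropna(subset=['A']).groupby('A').head(1)
print(expected)
```

The rows for 'a' and 'b' survive even though their B values are null, because only the grouping column decides group membership.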
Comment From: louispotok
While I was looking back at this, I noticed one nuance that whoever fixes this should be aware of: head keeps the grouping column (A) as a column and preserves the original index, whereas the other methods here move the grouping column (A) to the index.
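The difference in layout can be checked directly. A small sketch using the frame from the original report (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', None, 'a', 'b', None], 'B': range(6)})

# Aggregations such as first() move the grouping column into the index.
first_rows = df.groupby('A').first()

# head() acts as a filter: it keeps 'A' as a column and the original row labels.
head_rows = df.groupby('A').head(1)

print(first_rows.index.name)    # name of the index after aggregating
print(list(head_rows.columns))  # columns after filtering with head()
```

Any test comparing head with first therefore needs to line up the index and columns before comparing values.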
Comment From: paritoshmittal09
Hi, thanks for the clarification about head. I have worked on the solution as discussed here, and I keep the grouping column as a column.
import pandas as pd
df = pd.DataFrame({'A': ['a','b',None,'a','b',None], 'B':[1,None,3,None,None,None] })
print(df.groupby('A').head(1))
This gives the same output as
pd.DataFrame({'A': ['a','b'], 'B': [1, None]})
I think this is the desired behavior.
Comment From: paritoshmittal09
The same issue also exists in the tail method.
Comment From: adamhooper
There's yet another oddity -- must be related -- that may shed light on this one:
>>> df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', np.nan, 'b'], 'B': ['a', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'b', 'b']})
>>> df # the full table
A B
0 a a
1 a NaN
2 a NaN
3 a NaN
4 a NaN
5 a NaN
6 a NaN
7 a NaN
8 NaN b
9 b b
>>> df.groupby(['A', 'B']).head(1) # picks a row from the (a, NaN) group
A B C
0 a a 1
1 a NaN 2
4 b b 5
>>> # ditto for head(2) .. head(7)
>>> df.groupby(['A', 'B']).head(8)
A B
0 a a
1 a NaN
2 a NaN
3 a NaN
4 a NaN
5 a NaN
6 a NaN
7 a NaN
8 NaN b
9 b b
Strange tidbits:
head(1) picked the first element, (a, NaN, 2).
head(8) picked from the next group.
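For comparison, under the documented rule a row with a null in any grouping key belongs to no group. A sketch of what that implies for the frame defined above, again dropping the null keys up front so the result does not depend on how head handles them:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['a'] * 8 + [np.nan, 'b'],
    'B': ['a'] + [np.nan] * 7 + ['b', 'b'],
})

keys = ['A', 'B']
# Keep only rows where every grouping key is present, then take the first row per group.
complete = df.dropna(subset=keys)
first_per_group = complete.groupby(keys).head(1)
print(first_per_group)
```

Only the (a, a) and (b, b) groups remain, so none of the (a, NaN) or (NaN, b) rows should show up for any head(n).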
Comment From: rhshadrach
All issues reported here appear to be fixed on main. Marking this as needs tests, but I haven't checked if similar tests exist.
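A regression test for the head and tail cases might look roughly like the following. This is a sketch only: the test name is hypothetical, it is not taken from the pandas test suite, and it assumes a pandas version in which the fix has landed.

```python
import pandas as pd
import pandas.testing as tm


def test_head_tail_skip_null_group():
    # Hypothetical test name; data taken from the original report.
    df = pd.DataFrame({'A': ['a', 'b', None, 'a', 'b', None], 'B': range(6)})
    grouped = df.groupby('A')

    # Rows 2 and 5 have a null key, so they should appear in neither result.
    tm.assert_frame_equal(grouped.head(1), df.iloc[[0, 1]])
    tm.assert_frame_equal(grouped.tail(1), df.iloc[[3, 4]])
```

A complete test would presumably also cover nth and the multi-key case reported above.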
Comment From: pbhoopala
take