Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', None, 'a', 'b', None], 'B': range(6)})
df.groupby('A').sum()    # output has two rows
df.groupby('A').first()  # output has two rows
df.groupby('A').head(1)  # output has three rows, one for the null group
df.groupby('A').nth(1)   # output has three rows, but one of them has the wrong value in the 'A' column

Problem description

My understanding is that when grouping on a column, pandas excludes null values (link). However, when calling the head method on a GroupBy object, the null group is returned. While I generally would support keeping null values across the board, this inconsistency contradicts the documentation. Looking to see if similar methods had the same behavior, I discovered that nth has an even worse bug.
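For anyone hitting this on an affected version, here is a minimal sketch of a workaround, assuming it is acceptable to discard rows with a null key before grouping. The names non_null, first_rows and totals are only illustrative. It sidesteps the bug rather than fixing it.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', None, 'a', 'b', None], 'B': range(6)})

# Workaround (not a fix): drop rows whose grouping key is null before grouping,
# so head() can only ever see the non-null groups.
non_null = df[df['A'].notna()]
first_rows = non_null.groupby('A').head(1)

# Aggregations such as sum() already skip the null key on their own,
# so both results now cover the same two groups, 'a' and 'b'.
totals = df.groupby('A').sum()

print(first_rows)
print(totals)
```

With the null keys filtered out by hand, head(1) and sum() agree on which groups exist, which is the behavior I expected from the documentation.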

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: 3.4.2
pip: 10.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: 0.9.0
xarray: None
IPython: 6.2.1
sphinx: 1.7.3
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: 0.4.0
matplotlib: 2.1.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Comment From: gfyoung

Yikes! That looks pretty buggy to me. Investigation and patch are more than welcome!

Comment From: paritoshmittal09

Hi, I am just starting my open-source journey, so apologies in advance if I do something wrong. I wanted to clarify what the output should be for:

df = pd.DataFrame({'A': ['a', 'b', None, 'a', 'b', None], 'B': [None, None, None, None, None, None]})
df.groupby('A').head(1)

I am assuming it should return an empty DataFrame. Please correct me if I am wrong.

I would like to take this bug on and work on a fix.

Comment From: louispotok

@paritoshmittal09

In your example, df.groupby('A').head(1) should return the same dataframe as pd.DataFrame({'A': ['a','b'], 'B': [None, None]}). The issue being discussed here is that the column we're grouping by (A) should not keep null values. Null values in other columns (like B) should be kept.
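To make that expectation concrete, here is a rough sketch that builds the expected result by filtering on the key column only, so it does not rely on head itself handling null keys correctly. The name expected is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', None, 'a', 'b', None],
                   'B': [None] * 6})

# Only the grouping column is checked for nulls; subset=['A'] leaves
# the all-null column B untouched.
expected = df.dropna(subset=['A']).groupby('A').head(1)

print(expected)
```

The result has one row per non-null key, and the null values in B are still there.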

Comment From: louispotok

While I was looking back at this, I noticed one nuance that whoever fixes this should be aware of: head keeps the grouping column (A) as a column and preserves the original index, whereas the other methods here move the grouping column (A) to the index.
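A small sketch of that shape difference, using first as the representative of the other methods. The null keys are dropped up front so that only the layout differs and the bug itself does not get in the way. The names clean, agg and filt are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', None, 'a', 'b', None], 'B': range(6)})

# Remove null keys first so the comparison is about layout only.
clean = df.dropna(subset=['A'])

agg = clean.groupby('A').first()   # 'A' becomes the index of the result
filt = clean.groupby('A').head(1)  # 'A' stays a column, original row labels kept

print(agg.index.name, list(agg.columns))
print(list(filt.index), list(filt.columns))

# Moving 'A' into the index of the head() result lines the two up.
aligned = filt.set_index('A')
print(aligned['B'].tolist() == agg['B'].tolist())
```

Any test comparing head against the other methods would need a step like the set_index call above before checking the values.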

Comment From: paritoshmittal09

Hi, thanks for the clarification about head. I have worked on a solution as discussed here, and it keeps the grouping column as a column.

import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', None, 'a', 'b', None], 'B': [1, None, 3, None, None, None]})
print(df.groupby('A').head(1))

This gives the same output as pd.DataFrame({'A': ['a', 'b'], 'B': [1, None]}).

I think this is the desired behavior.

Comment From: paritoshmittal09

The same issue exists in the tail method as well.

Comment From: adamhooper

There's yet another oddity -- must be related -- that may shed light on this one:

>>> df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', np.nan, 'b'], 'B': ['a', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'b', 'b']})
>>> df  # the full table
     A    B
0    a    a
1    a  NaN
2    a  NaN
3    a  NaN
4    a  NaN
5    a  NaN
6    a  NaN
7    a  NaN
8  NaN    b
9    b    b
>>> df.groupby(['A', 'B']).head(1)  # picks a row from the (a, NaN) group
   A    B  C
0  a    a  1
1  a  NaN  2
4  b    b  5
>>> # ditto for head(2) .. head(7)
>>> df.groupby(['A', 'B']).head(8)
     A    B  C
     A    B
0    a    a
1    a  NaN
2    a  NaN
3    a  NaN
4    a  NaN
5    a  NaN
6    a  NaN
7    a  NaN
8  NaN    b
9    b    b

Strange tidbits:

  • head(1) picked the first element, (a, NaN, 2)
  • head(8) picked from the next group.

Comment From: rhshadrach

All issues reported here appear to be fixed on main. Marking this as needs tests, but I haven't checked if similar tests exist.

Comment From: pbhoopala

take