Code Sample
import pandas as pd
import numpy as np
import time


def test_len_of_groupby(group_labels, size=30000):
    # Build two columns by repeating each label `size` times, giving
    # len(group_labels[0]) unique (A, B) pairs in the frame.
    random_cat1 = np.repeat(group_labels[0], size)
    random_cat2 = np.repeat(group_labels[1], size)
    df = pd.DataFrame({'A': random_cat1, 'B': random_cat2})
    col_A = df['A']
    col_B = df['B']
    print "column A dtype: ", col_A.dtype
    print "column B dtype: ", col_B.dtype
    grouped = df.groupby([col_A, col_B])
    print "getting length of groupby obj ..."
    t0 = time.time()
    print len(grouped)
    t1 = time.time()
    print "time to get length of groupby obj :", t1 - t0, "\n"


print "pandas: ", pd.__version__

print "testing length of groupby object with column B floats and np.nan ..."
group_labels_with_nan = [['A12', 'B1', 'B23', 'C1D0', 'D1', 'D2', 'DD2', 'R2D2'],
                         [1.4, 1.45, 1.5, 1.55, 1.6, 1.65, 1.7, np.nan]]
test_len_of_groupby(group_labels_with_nan)

print "testing length of groupby object with column B floats and no np.nan ..."
group_labels_without_nan = [['A12', 'B1', 'B23', 'C1D0', 'D1', 'D2', 'DD2', 'R2D2'],
                            [1.4, 1.45, 1.5, 1.55, 1.6, 1.65, 1.7, 1.75]]
test_len_of_groupby(group_labels_without_nan)

group_labels_with_nan_and_string_type = [['A12', 'B1', 'B23', 'C1D0', 'D1', 'D2', 'DD2', 'R2D2'],
                                         ['1.4', '1.45', '1.5', '1.55', '1.6', '1.65', '1.7', np.nan]]
print "testing length of groupby object with column B strings and np.nan ..."
test_len_of_groupby(group_labels_with_nan_and_string_type)
Problem description
Calling len() on a groupby object is extremely slow from pandas 0.17.0 onwards (including the latest 0.23.0 release), and it appears to count all groupings in the DataFrame, not just the unique groups, when column A is dtype object and column B is dtype float64 with NumPy NaNs present. The returned length of 30007 is consistent with the 7 unique non-NaN (A, B) pairs plus one group per NaN-keyed row (30000 of them), which suggests that each NaN key is being treated as distinct.
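As a sanity check independent of groupby, the number of unique (A, B) pairs can be counted with drop_duplicates, which treats NaN as equal to NaN. A minimal sketch (not part of the original test code), assuming the same frame construction as above:

import numpy as np
import pandas as pd

# Same construction as the test code: 8 unique (A, B) pairs, each repeated 30000 times.
df = pd.DataFrame({
    'A': np.repeat(['A12', 'B1', 'B23', 'C1D0', 'D1', 'D2', 'DD2', 'R2D2'], 30000),
    'B': np.repeat([1.4, 1.45, 1.5, 1.55, 1.6, 1.65, 1.7, np.nan], 30000),
})

# duplicated()/drop_duplicates() treat NaN as equal to NaN, so this prints 8.
print(len(df[['A', 'B']].drop_duplicates()))

# On the affected versions (0.17.0 through 0.23.0) this is slow and prints 30007.
print(len(df.groupby(['A', 'B'])))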
The following timing output was observed when running the test code above with pandas 0.17.0 (the version of pandas that introduced the issue):
*pandas: 0.17.0*
testing length of groupby object with column B floats and np.nan ...
column A dtype: object
column B dtype: float64
getting length of groupby obj ...
*30007* <-- incorrect length! should be length 8.
time to get length of groupby obj : *48.4648609161*
testing length of groupby object with column B floats and no np.nan ...
column A dtype: object
column B dtype: float64
getting length of groupby obj ...
8
time to get length of groupby obj : 0.261762857437
testing length of groupby object with column B strings and np.nan ...
column A dtype: object
column B dtype: object
getting length of groupby obj ...
8
time to get length of groupby obj : 0.21554684639
With pandas 0.23.0:
*pandas: 0.23.0*
testing length of groupby object with column B floats and np.nan ...
column A dtype: object
column B dtype: float64
getting length of groupby obj ...
*30007* <-- incorrect length! should be length 8.
time to get length of groupby obj : *119.291817188*
testing length of groupby object with column B floats and no np.nan ...
column A dtype: object
column B dtype: float64
getting length of groupby obj ...
8
time to get length of groupby obj : 0.248715162277
testing length of groupby object with column B strings and np.nan ...
column A dtype: object
column B dtype: object
getting length of groupby obj ...
8
time to get length of groupby obj : 0.202826976776
With pandas 0.23.0 and grouped = df.groupby([col_A, col_B], observed=True):
pandas: 0.23.0
testing length of groupby object with column B floats and np.nan ...
column A dtype: object
column B dtype: float64
getting length of groupby obj ...
30007 <-- incorrect length! should be length 8.
time to get length of groupby obj : 109.039965868
testing length of groupby object with column B floats and no np.nan ...
column A dtype: object
column B dtype: float64
getting length of groupby obj ...
8
time to get length of groupby obj : 0.246381998062
testing length of groupby object with column B strings and np.nan ...
column A dtype: object
column B dtype: object
getting length of groupby obj ...
8
time to get length of groupby obj : 0.205888032913
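A possible workaround suggested by the timings above (an untested sketch, not from the original report): since the object-dtype case returns the correct length quickly, column B can be cast to strings before grouping. Note that this changes how missing values are keyed, since np.nan becomes the literal string 'nan':

# Hypothetical workaround: group on a string representation of column B.
# astype(str) maps np.nan to the string 'nan', so that pair survives as an
# ordinary object-dtype key and the fast object-dtype path is taken.
grouped = df.groupby([df['A'], df['B'].astype(str)])
print(len(grouped))  # expected: 8, at object-dtype speed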
Expected Output
With pandas 0.16.0, the correct output is produced for the length of the groupby. Note that there are only ever 8 actual groupings, since columns 'A' and 'B' are repeating sequences of 8 unique (A, B) pairs within the DataFrame.
pandas: 0.16.0
testing length of groupby object with column B floats and np.nan ...
column A dtype: object
column B dtype: float64
getting length of groupby obj ...
8
time to get length of groupby obj : 0.0208721160889
testing length of groupby object with column B floats and no np.nan ...
column A dtype: object
column B dtype: float64
getting length of groupby obj ...
8
time to get length of groupby obj : 0.015851020813
testing length of groupby object with column B strings and np.nan ...
column A dtype: object
column B dtype: object
getting length of groupby obj ...
8
time to get length of groupby obj : 0.0218949317932
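For anyone reproducing this on recent pandas (1.1+): NaN group keys are dropped by default there, so obtaining the expected count of 8 requires dropna=False. A hedged sketch, not part of the original report:

# On pandas >= 1.1, NaN keys are excluded unless dropna=False is passed,
# so the ('R2D2', NaN) pair would otherwise not be counted as a group.
print(df.groupby(['A', 'B'], dropna=False).ngroups)  # 8
print(df.groupby(['A', 'B']).ngroups)                # 7: the NaN pair is dropped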
Output of pd.show_versions()
For pandas 0.17.0 test:
INSTALLED VERSIONS
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.0
nose: 1.3.7
pip: 10.0.1
setuptools: 37.0.0
Cython: 0.25.2
numpy: 1.14.0
scipy: 0.16.0
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2014.4
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.4
matplotlib: 1.4.3
openpyxl: 1.8.6
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.9.6
lxml: 3.5.0
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
Output of pd.show_versions()
For pandas 0.16.0 test:
INSTALLED VERSIONS
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8

pandas: 0.16.0
nose: 1.3.7
Cython: 0.25.2
numpy: 1.14.0
scipy: 0.16.0
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2014.4
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.4
matplotlib: 1.4.3
openpyxl: 1.8.6
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.9.6
lxml: 3.5.0
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
Output of pd.show_versions()
For pandas 0.23.0 tests:
INSTALLED VERSIONS
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: None.None

pandas: 0.23.0
pytest: 3.3.1
pip: 10.0.1
setuptools: 37.0.0
Cython: 0.25.2
numpy: 1.14.0
scipy: 0.16.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2014.4
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 1.4.3
openpyxl: 1.8.6
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.9.6
lxml: 3.5.0
bs4: None
html5lib: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Comment From: gfyoung
Yikes! That performance regression was introduced quite a while ago.
Investigation and PR are welcome!
Comment From: orrery
@gfyoung thanks! I also want to mention that the returned result was incorrect; it should always be a length of 8 with respect to the test code.
Comment From: mroeschke
This is a fairly old benchmark for unsupported versions of Python and pandas, so closing for now, but happy to reopen if the current benchmarks see a similar slowdown.