Code Sample
import pandas as pd
import numpy as np
import time


def test_len_of_groupby(group_labels, size=30000):
    # Build two columns by repeating each label `size` times, giving
    # len(group_labels[0]) unique (A, B) pairs in the frame.
    random_cat1 = np.repeat(group_labels[0], size)
    random_cat2 = np.repeat(group_labels[1], size)
    df = pd.DataFrame({'A': random_cat1, 'B': random_cat2})
    col_A = df['A']
    col_B = df['B']
    print "column A dtype: ", col_A.dtype
    print "column B dtype: ", col_B.dtype
    grouped = df.groupby([col_A, col_B])
    print "getting length of groupby obj ..."
    t0 = time.time()
    print len(grouped)
    t1 = time.time()
    print "time to get length of groupby obj :", t1 - t0, "\n"


print "pandas: ", pd.__version__

print "testing length of groupby object with column B floats and np.nan ..."
group_labels_with_nan = [['A12', 'B1', 'B23', 'C1D0', 'D1', 'D2', 'DD2', 'R2D2'],
                         [1.4, 1.45, 1.5, 1.55, 1.6, 1.65, 1.7, np.nan]]
test_len_of_groupby(group_labels_with_nan)

print "testing length of groupby object with column B floats and no np.nan ..."
group_labels_without_nan = [['A12', 'B1', 'B23', 'C1D0', 'D1', 'D2', 'DD2', 'R2D2'],
                            [1.4, 1.45, 1.5, 1.55, 1.6, 1.65, 1.7, 1.75]]
test_len_of_groupby(group_labels_without_nan)

group_labels_with_nan_and_string_type = [['A12', 'B1', 'B23', 'C1D0', 'D1', 'D2', 'DD2', 'R2D2'],
                                         ['1.4', '1.45', '1.5', '1.55', '1.6', '1.65', '1.7', np.nan]]
print "testing length of groupby object with column B strings and np.nan ..."
test_len_of_groupby(group_labels_with_nan_and_string_type)
Problem description
Calling len() on a groupby object is extremely slow from pandas 0.17.0 onwards (including the latest 0.23.0 release), and it appears to count all groupings in the DataFrame, not just the unique groups, when column A is dtype object and column B is dtype float64 with NumPy NaNs present. The returned length of 30007 is consistent with the 7 unique non-NaN (A, B) pairs plus one group per NaN-keyed row (30000 of them), which suggests that each NaN key is being treated as distinct.
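As a sanity check independent of groupby, the number of unique (A, B) pairs can be counted with drop_duplicates, which treats NaN as equal to NaN. A minimal sketch (not part of the original test code), assuming the same frame construction as above:

import numpy as np
import pandas as pd

# Same construction as the test code: 8 unique (A, B) pairs, each repeated 30000 times.
df = pd.DataFrame({
    'A': np.repeat(['A12', 'B1', 'B23', 'C1D0', 'D1', 'D2', 'DD2', 'R2D2'], 30000),
    'B': np.repeat([1.4, 1.45, 1.5, 1.55, 1.6, 1.65, 1.7, np.nan], 30000),
})

# duplicated()/drop_duplicates() treat NaN as equal to NaN, so this prints 8.
print(len(df[['A', 'B']].drop_duplicates()))

# On the affected versions (0.17.0 through 0.23.0) this is slow and prints 30007.
print(len(df.groupby(['A', 'B'])))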
The following timing output was observed when running the test code above with pandas 0.17.0 (the version of pandas that introduced the issue):
*pandas: 0.17.0*
testing length of groupby object with column B floats and np.nan ...
column A dtype: object
column B dtype: float64
getting length of groupby obj ...
*30007* <-- incorrect length! should be length 8.
time to get length of groupby obj : *48.4648609161*
testing length of groupby object with column B floats and no np.nan ...
column A dtype: object
column B dtype: float64
getting length of groupby obj ...
8
time to get length of groupby obj : 0.261762857437
testing length of groupby object with column B strings and np.nan ...
column A dtype: object
column B dtype: object
getting length of groupby obj ...
8
time to get length of groupby obj : 0.21554684639
With pandas 0.23.0:
*pandas: 0.23.0*
testing length of groupby object with column B floats and np.nan ...
column A dtype: object
column B dtype: float64
getting length of groupby obj ...
*30007* <-- incorrect length! should be length 8.
time to get length of groupby obj : *119.291817188*
testing length of groupby object with column B floats and no np.nan ...
column A dtype: object
column B dtype: float64
getting length of groupby obj ...
8
time to get length of groupby obj : 0.248715162277
testing length of groupby object with column B strings and np.nan ...
column A dtype: object
column B dtype: object
getting length of groupby obj ...
8
time to get length of groupby obj : 0.202826976776
With pandas 0.23.0 and grouped = df.groupby([col_A, col_B], observed=True):
pandas: 0.23.0
testing length of groupby object with column B floats and np.nan ...
column A dtype: object
column B dtype: float64
getting length of groupby obj ...
30007 <-- incorrect length! should be length 8.
time to get length of groupby obj : 109.039965868
testing length of groupby object with column B floats and no np.nan ...
column A dtype: object
column B dtype: float64
getting length of groupby obj ...
8
time to get length of groupby obj : 0.246381998062
testing length of groupby object with column B strings and np.nan ...
column A dtype: object
column B dtype: object
getting length of groupby obj ...
8
time to get length of groupby obj : 0.205888032913
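A possible workaround suggested by the timings above (an untested sketch, not from the original report): since the object-dtype case returns the correct length quickly, column B can be cast to strings before grouping. Note that this changes how missing values are keyed, since np.nan becomes the literal string 'nan':

# Hypothetical workaround: group on a string representation of column B.
# astype(str) maps np.nan to the string 'nan', so that pair survives as an
# ordinary object-dtype key and the fast object-dtype path is taken.
grouped = df.groupby([df['A'], df['B'].astype(str)])
print(len(grouped))  # expected: 8, at object-dtype speed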
Expected Output
With pandas 0.16.0, the correct output is produced for the length of the groupby. Note that there are only ever 8 actual groupings, since columns 'A' and 'B' are repeating sequences of 8 unique (A, B) pairs within the DataFrame.
pandas: 0.16.0
testing length of groupby object with column B floats and np.nan ...
column A dtype: object
column B dtype: float64
getting length of groupby obj ...
8
time to get length of groupby obj : 0.0208721160889
testing length of groupby object with column B floats and no np.nan ...
column A dtype: object
column B dtype: float64
getting length of groupby obj ...
8
time to get length of groupby obj : 0.015851020813
testing length of groupby object with column B strings and np.nan ...
column A dtype: object
column B dtype: object
getting length of groupby obj ...
8
time to get length of groupby obj : 0.0218949317932
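For anyone reproducing this on recent pandas (1.1+): NaN group keys are dropped by default there, so obtaining the expected count of 8 requires dropna=False. A hedged sketch, not part of the original report:

# On pandas >= 1.1, NaN keys are excluded unless dropna=False is passed,
# so the ('R2D2', NaN) pair would otherwise not be counted as a group.
print(df.groupby(['A', 'B'], dropna=False).ngroups)  # 8
print(df.groupby(['A', 'B']).ngroups)                # 7: the NaN pair is dropped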
Output of pd.show_versions()
For pandas 0.17.0 test:
INSTALLED VERSIONS
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.0
nose: 1.3.7
pip: 10.0.1
setuptools: 37.0.0
Cython: 0.25.2
numpy: 1.14.0
scipy: 0.16.0
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2014.4
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.4
matplotlib: 1.4.3
openpyxl: 1.8.6
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.9.6
lxml: 3.5.0
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
Output of pd.show_versions()
For pandas 0.16.0 test:
INSTALLED VERSIONS
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8

pandas: 0.16.0
nose: 1.3.7
Cython: 0.25.2
numpy: 1.14.0
scipy: 0.16.0
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2014.4
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.4
matplotlib: 1.4.3
openpyxl: 1.8.6
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.9.6
lxml: 3.5.0
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
Output of pd.show_versions()
For pandas 0.23.0 tests:
INSTALLED VERSIONS
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: None.None

pandas: 0.23.0
pytest: 3.3.1
pip: 10.0.1
setuptools: 37.0.0
Cython: 0.25.2
numpy: 1.14.0
scipy: 0.16.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2014.4
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 1.4.3
openpyxl: 1.8.6
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.9.6
lxml: 3.5.0
bs4: None
html5lib: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Comment From: gfyoung
Yikes! That performance regression was introduced quite a while ago.
Investigation and PR are welcome!
Comment From: orrery
@gfyoung thanks! I also want to mention that the returned result was incorrect; it should always be a length of 8 with respect to the test code.
Comment From: mroeschke
This is a fairly old benchmark for unsupported versions of Python and pandas, so closing for now, but happy to reopen if the current benchmarks see a similar slowdown.