Code Sample
import pandas as pd
import numpy as np
import time
n_rows = 1000
n_categories = 250
df = pd.DataFrame({'cat1': np.random.choice(np.arange(0, n_categories), size=n_rows, replace=True),
                   'cat2': np.random.choice(np.arange(0, n_categories), size=n_rows, replace=True),
                   'cat3': np.random.choice(np.arange(0, n_categories), size=n_rows, replace=True),
                   'real_numbers': np.random.normal(size=n_rows)})
# Case 1: Pivot table without multi-index columns
now = time.time()
pd.pivot_table(df, values='real_numbers', index='cat1', columns=['cat2'])
print("Elapsed time:", time.time() - now)
# Case 2: Pivot table with multi-index columns
now = time.time()
pd.pivot_table(df, values='real_numbers', index='cat1', columns=['cat2', 'cat3'])
print("Elapsed time:", time.time() - now)
# Typecast a feature to categorical.
df_cast = df.astype({'cat2': 'category'})
# Case 3: Pivot table without multi-index columns - still OK
now = time.time()
pd.pivot_table(df_cast, values='real_numbers', index=['cat1'], columns=['cat2'])
print("Elapsed time:", time.time() - now)
# Case 4: Pivot table with multi-index columns - strong increase in elapsed time/ memory footprint.
now = time.time()
pd.pivot_table(df_cast, values='real_numbers', index=['cat1'], columns=['cat2', 'cat3'])
print("Elapsed time:", time.time() - now)
Problem description
Creating a pivot table with multi-index columns from a relatively small data frame with integer and float columns (case 2) is much faster and uses far fewer resources than creating the same pivot table after one of the columns has been converted to a category (case 4).
On my (fairly old) system I measure roughly a 200x increase in elapsed time between cases 2 and 4 of the code sample. Furthermore, memory consumption in case 2 is negligible, while in case 4 it runs well over 2GB. Increasing the number of categories n_categories in the original data frame increases the resource usage and execution time further.
On the other hand, when no multi-index is used, there seems to be only a very small drop in performance (e.g. from case 1 to case 3).
There are already various issues submitted that involve the memory usage of the pivot_table function, but as far as I can see they all concern large data frames. This issue seems different: it occurs with relatively small data frames, and only in the specific case of a categorical variable in a multi-index.
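If it helps with narrowing this down: my guess (an assumption on my part, not verified against the pandas internals) is that the blow-up comes from groupby materializing the full cartesian product of grouper values whenever one of the groupers is categorical. A small sketch comparing group counts with and without the observed keyword of groupby illustrates this; the row/category counts here are made up for the example:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
n_rows, n_categories = 200, 30
df = pd.DataFrame({'cat1': np.random.choice(n_categories, size=n_rows),
                   'cat2': np.random.choice(n_categories, size=n_rows),
                   'cat3': np.random.choice(n_categories, size=n_rows)})
df_cast = df.astype({'cat2': 'category'})

# With observed=False, a single categorical grouper forces the result onto
# the cartesian product of all grouper values, observed or not.
n_all = len(df_cast.groupby(['cat1', 'cat2', 'cat3'], observed=False).size())
# With observed=True, only combinations that actually occur are kept.
n_obs = len(df_cast.groupby(['cat1', 'cat2', 'cat3'], observed=True).size())
print(n_all, n_obs)
```

On this toy frame n_all is orders of magnitude larger than n_obs, which would explain why the cost grows with n_categories even though the data itself stays small.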
Expected Output
I am unaware of the implementation details, but I would assume that a categorical variable has an integer representation under the hood and I would therefore expect no performance penalty at all, regardless of the number of columns chosen.
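For reference, the integer representation I am alluding to is exposed as .cat.codes; a quick illustration (not tied to the pivot code path):

```python
import pandas as pd

s = pd.Series(list('abca'), dtype='category')
# A categorical stores its values as small integer codes pointing into
# a (deduplicated) categories array.
print(s.cat.codes.tolist())    # -> [0, 1, 2, 0]
print(list(s.cat.categories))  # -> ['a', 'b', 'c']
```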
Output of pd.show_versions()
Comment From: TomAugspurger
Do you mind doing some profiling of the two cases to find the difference? If you run a line profiler (https://github.com/rkern/line_profiler) on pd.pivot_table for the two cases, the difference should stick out.
Comment From: MishaVeldhoen
Yes, I’ll look into it!
Comment From: jreback
duplicate of this issue: https://github.com/pandas-dev/pandas/issues/15217
pivot is just calling groupby.
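For anyone reading along, that equivalence can be sketched as follows (this mirrors, but is not, the actual internal code path; pivot_table's default aggfunc is 'mean'):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'cat1': np.random.choice(5, size=100),
                   'cat2': np.random.choice(5, size=100),
                   'real_numbers': np.random.normal(size=100)})

# pivot_table with the default aggfunc='mean' ...
pt = pd.pivot_table(df, values='real_numbers', index='cat1', columns='cat2')
# ... produces the same frame as a groupby-mean followed by unstack.
gb = df.groupby(['cat1', 'cat2'])['real_numbers'].mean().unstack('cat2')
pd.testing.assert_frame_equal(pt, gb)
```

So whatever groupby does with categorical keys propagates directly into pivot_table.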
Comment From: jreback
welcome to have you look at that one (and I suggest the soln there) :>
Comment From: suokunlong
I can still reproduce this issue with pandas version 0.25.3. It seems it was never fixed at all, so it may not be a duplicate of #15217.
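In case it is useful to others hitting this: in pandas versions where pivot_table exposes the observed keyword, passing observed=True may avoid the unobserved-combination blow-up. Whether it fully restores case-2 performance depends on the pandas version, so treat this as a workaround sketch rather than a fix:

```python
import time
import numpy as np
import pandas as pd

np.random.seed(0)
n_rows, n_categories = 1000, 250
df = pd.DataFrame({'cat1': np.random.choice(n_categories, size=n_rows),
                   'cat2': np.random.choice(n_categories, size=n_rows),
                   'cat3': np.random.choice(n_categories, size=n_rows),
                   'real_numbers': np.random.normal(size=n_rows)})
df_cast = df.astype({'cat2': 'category'})

# observed=True restricts the aggregation to category combinations that
# actually occur in the data, instead of the full cartesian product.
now = time.time()
res = pd.pivot_table(df_cast, values='real_numbers', index=['cat1'],
                     columns=['cat2', 'cat3'], observed=True)
print("Elapsed time:", time.time() - now)
```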