Code Sample
import pandas as pd
import numpy as np
import time
n_rows = 1000
n_categories = 250
df = pd.DataFrame({'cat1': np.random.choice(np.arange(0, n_categories), size=n_rows, replace=True),
                   'cat2': np.random.choice(np.arange(0, n_categories), size=n_rows, replace=True),
                   'cat3': np.random.choice(np.arange(0, n_categories), size=n_rows, replace=True),
                   'real_numbers': np.random.normal(size=n_rows)})
# Case 1: Pivot table without multi-index columns
now = time.time()
pd.pivot_table(df, values='real_numbers', index='cat1', columns=['cat2'])
print("Elapsed time:", time.time() - now)
# Case 2: Pivot table with multi-index columns
now = time.time()
pd.pivot_table(df, values='real_numbers', index='cat1', columns=['cat2', 'cat3'])
print("Elapsed time:", time.time() - now)
# Typecast a feature to categorical.
df_cast = df.astype({'cat2': 'category'})
# Case 3: Pivot table without multi-index columns - still OK
now = time.time()
pd.pivot_table(df_cast, values='real_numbers', index=['cat1'], columns=['cat2'])
print("Elapsed time:", time.time() - now)
# Case 4: Pivot table with multi-index columns - strong increase in elapsed time/ memory footprint.
now = time.time()
pd.pivot_table(df_cast, values='real_numbers', index=['cat1'], columns=['cat2', 'cat3'])
print("Elapsed time:", time.time() - now)
Problem description
Creating a pivot table with multi-index columns from a relatively small data frame with integer and float columns (case 2) is much faster and uses far fewer resources than creating the same pivot table after one of the columns has been converted to a category (case 4).
On my (fairly old) system I measure roughly a 200x increase in elapsed time between cases 2 and 4 of the code sample. Furthermore, memory consumption in case 2 is negligible, while in case 4 it runs well over 2GB. Increasing the number of categories n_categories in the original data frame increases the resource usage and execution time further.
On the other hand, when no multi-index is used, there seems to be only a very small drop in performance (e.g. from case 1 to case 3).
There are already various issues submitted that involve the memory usage of the pivot_table function, but as far as I can see they all concern large data frames. This issue seems different: it occurs with relatively small data frames, and only in the specific case of a categorical variable in a multi-index.
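If it helps with narrowing this down: my guess (an assumption on my part, not verified against the pandas internals) is that the blow-up comes from groupby materializing the full cartesian product of grouper values whenever one of the groupers is categorical. A small sketch comparing group counts with and without the observed keyword of groupby illustrates this; the row/category counts here are made up for the example:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
n_rows, n_categories = 200, 30
df = pd.DataFrame({'cat1': np.random.choice(n_categories, size=n_rows),
                   'cat2': np.random.choice(n_categories, size=n_rows),
                   'cat3': np.random.choice(n_categories, size=n_rows)})
df_cast = df.astype({'cat2': 'category'})

# With observed=False, a single categorical grouper forces the result onto
# the cartesian product of all grouper values, observed or not.
n_all = len(df_cast.groupby(['cat1', 'cat2', 'cat3'], observed=False).size())
# With observed=True, only combinations that actually occur are kept.
n_obs = len(df_cast.groupby(['cat1', 'cat2', 'cat3'], observed=True).size())
print(n_all, n_obs)
```

On this toy frame n_all is orders of magnitude larger than n_obs, which would explain why the cost grows with n_categories even though the data itself stays small.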
Expected Output
I am unaware of the implementation details, but I would assume that a categorical variable has an integer representation under the hood and I would therefore expect no performance penalty at all, regardless of the number of columns chosen.
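For reference, the integer representation I am alluding to is exposed as .cat.codes; a quick illustration (not tied to the pivot code path):

```python
import pandas as pd

s = pd.Series(list('abca'), dtype='category')
# A categorical stores its values as small integer codes pointing into
# a (deduplicated) categories array.
print(s.cat.codes.tolist())    # -> [0, 1, 2, 0]
print(list(s.cat.categories))  # -> ['a', 'b', 'c']
```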
Output of pd.show_versions()
Comment From: TomAugspurger
Do you mind doing some profiling of the two cases to find the difference? If you run a line profiler (https://github.com/rkern/line_profiler) on pd.pivot_table for the two cases, the difference should stick out.
Comment From: MishaVeldhoen
Yes, I’ll look into it!
Comment From: jreback
duplicate of this issue: https://github.com/pandas-dev/pandas/issues/15217
pivot is just calling groupby.
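For anyone reading along, that equivalence can be sketched as follows (this mirrors, but is not, the actual internal code path; pivot_table's default aggfunc is 'mean'):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'cat1': np.random.choice(5, size=100),
                   'cat2': np.random.choice(5, size=100),
                   'real_numbers': np.random.normal(size=100)})

# pivot_table with the default aggfunc='mean' ...
pt = pd.pivot_table(df, values='real_numbers', index='cat1', columns='cat2')
# ... produces the same frame as a groupby-mean followed by unstack.
gb = df.groupby(['cat1', 'cat2'])['real_numbers'].mean().unstack('cat2')
pd.testing.assert_frame_equal(pt, gb)
```

So whatever groupby does with categorical keys propagates directly into pivot_table.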
Comment From: jreback
welcome to have you look at that one (and I suggest the soln there) :>
Comment From: suokunlong
I can still reproduce this issue with pandas version 0.25.3. It seems it was never fixed at all, so it may not be a duplicate of #15217.
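In case it is useful to others hitting this: in pandas versions where pivot_table exposes the observed keyword, passing observed=True may avoid the unobserved-combination blow-up. Whether it fully restores case-2 performance depends on the pandas version, so treat this as a workaround sketch rather than a fix:

```python
import time
import numpy as np
import pandas as pd

np.random.seed(0)
n_rows, n_categories = 1000, 250
df = pd.DataFrame({'cat1': np.random.choice(n_categories, size=n_rows),
                   'cat2': np.random.choice(n_categories, size=n_rows),
                   'cat3': np.random.choice(n_categories, size=n_rows),
                   'real_numbers': np.random.normal(size=n_rows)})
df_cast = df.astype({'cat2': 'category'})

# observed=True restricts the aggregation to category combinations that
# actually occur in the data, instead of the full cartesian product.
now = time.time()
res = pd.pivot_table(df_cast, values='real_numbers', index=['cat1'],
                     columns=['cat2', 'cat3'], observed=True)
print("Elapsed time:", time.time() - now)
```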